Discussion:
The very first result of IA _abbyy.gz parsing & bot uploading into nsPage
Alex Brollo
2017-10-16 15:11:03 UTC
Here:
Pagina:D'Ayala_-_Dizionario_militare_francese_italiano.djvu/46
<https://it.wikisource.org/wiki/Pagina:D%27Ayala_-_Dizionario_militare_francese_italiano.djvu/46>
and on the pages immediately before and after it, both the text and some of
the formatting come from the Internet Archive file bub_gb_lvzoCyRdzsoC_abbyy.gz
<https://archive.org/download/bub_gb_lvzoCyRdzsoC/bub_gb_lvzoCyRdzsoC_abbyy.gz>
(on the earlier pages only some templates have been added and a little bit of
regex manipulation has been done).

Internet Archive _abbyy.gz files are enormous gzipped XML files that export
every detail of the FineReader OCR output. Even though they are enormous and
terribly complex, they can be parsed, and any detail can be used (a little bit
painfully...). So far only bold, italic, small caps and paragraphs have been
explored; they are translated into wiki code by a fairly simple Python
script.
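
A minimal sketch of this kind of parsing follows; it is not the script used
for the upload above. It assumes that the XML nests page > par > line >
formatting > charParams, that the bold, italic and smallcaps flags on
<formatting> are written as "1" or "true", and that small caps map to an
{{Sc|...}} template. All function names and the SAMPLE fragment are
hypothetical and only serve the illustration.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Tiny hand-written fragment in the assumed layout, used only for the demo.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml">
<page><block><text>
<par><line>
<formatting bold="1"><charParams>A</charParams><charParams>b</charParams><charParams> </charParams></formatting>
<formatting italic="true"><charParams>c</charParams></formatting>
</line></par>
<par><line>
<formatting smallcaps="1"><charParams>X</charParams><charParams>y</charParams></formatting>
</line></par>
</text></block></page>
</document>"""


def local(tag):
    """Drop the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def is_on(elem, name):
    """Assumption: style flags are serialised as "1" or "true"."""
    return elem.get(name, "").lower() in ("1", "true")


def run_to_wiki(fmt):
    """Turn one <formatting> run into wiki markup."""
    text = "".join(c.text or "" for c in fmt if local(c.tag) == "charParams")
    core = text.strip()
    if not core:
        return text
    # Keep surrounding whitespace outside the markup.
    lead = text[: len(text) - len(text.lstrip())]
    trail = text[len(text.rstrip()):]
    if is_on(fmt, "smallcaps"):
        core = "{{Sc|" + core + "}}"  # template name is an assumption
    if is_on(fmt, "italic"):
        core = "''" + core + "''"
    if is_on(fmt, "bold"):
        core = "'''" + core + "'''"
    return lead + core + trail


def par_to_wiki(par):
    """Join the lines of one <par> into a single wiki paragraph."""
    lines = []
    for line in par:
        if local(line.tag) != "line":
            continue
        runs = [run_to_wiki(f) for f in line if local(f.tag) == "formatting"]
        lines.append("".join(runs).strip())
    return " ".join(l for l in lines if l)


def pages_to_wiki(stream):
    """Yield one wiki-text string per <page>, streaming to limit memory use."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) == "page":
            pars = [par_to_wiki(p) for p in elem.iter() if local(p.tag) == "par"]
            yield "\n\n".join(p for p in pars if p)
            elem.clear()  # free the page once it has been converted


def pages_from_gz(path):
    """Same as pages_to_wiki, reading directly from a gzipped file."""
    with gzip.open(path, "rb") as fh:
        yield from pages_to_wiki(fh)


if __name__ == "__main__":
    for page in pages_to_wiki(io.BytesIO(SAMPLE)):
        print(page)
```

For a real book, pages_from_gz("bub_gb_lvzoCyRdzsoC_abbyy.gz") would be used
in place of the in-memory sample. Streaming with iterparse and clearing each
page after use is one way to cope with the size of these files; hyphenation
at line ends and merging of adjacent runs with the same style are left out.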

Alex
Asaf Bartov
2017-10-16 15:42:20 UTC
That's really promising!

Thank you for sharing this.

A.

Andrea Zanni
2017-10-16 17:35:06 UTC
Thanks Alex!
I really hope this is a direction that other developers will follow: being
able to harness the full potential of structured data from OCR software is
absolutely crucial for Wikisource.
We could actually automate *a lot* of the formatting work now done by
volunteers, and their time could still be spent formatting, proofreading
and validating, but with much more power than before.
IMO, it changes a lot if a book is formatted ~50% by a machine: we could do
many more books in less time.
Go Alex!

Aubrey

Anika Born
2017-10-16 18:09:01 UTC
Like Aubrey: thank you very much!

I shared this news at the Scriptorium of de.ws.

I also used the opportunity to inform them about your "Visualizzatore".
This is so cool!!!! (especially the search-function)

And because I had some time (and the best things come in threes) I invited
them to your it.WikiCon in Trento (
https://meta.wikimedia.org/wiki/ItWikiCon/2017/Proposte#Wikisource). Have
fun there! My best wishes to the organizers. I co-organized it three times
in a row for the all-German-Community....

https://de.wikisource.org/wiki/Wikisource:Skriptorium#Italien:_17._bis_19._November_WikiCon_in_Trient



Anika

Alex Brollo
2017-10-16 20:07:07 UTC
Thanks for the appreciation; please consider my attempts only as proof that
"it can be done". I'm sharing the test Python code I'm using here:

https://it.wikisource.org/wiki/Progetto:Bot/Programmi_in_Python_per_i_bot/abbyyXml.py

Alex





Alex Brollo
2017-10-16 21:25:55 UTC
@Anika: I'm happy to know that you like the "Visualizzatore" and that you
discovered the search function, which is perhaps its most useful trick,
together with the OCR preview for "red" pages; the latter makes it possible
to refine a shared, book-specific regex set.

Alex
