Discussion:
[Wikisource-l] OCR as a service?
Asaf Bartov
2015-07-11 10:04:13 UTC
Permalink
Hi.

Speaking of Wikisource software, do we already have any instance set up for
OCR as a service? (I'm thinking of OpenOCR[1] hosted in a Docker[2]
container somewhere, perhaps on Labs.)

If yes, where is it and who maintains it, and can I use it as a client?
If not, I am prepared to set this up.

A.

[1] http://www.openocr.net/
[2] https://www.docker.com/
--
Asaf Bartov
Wikimedia Foundation <http://www.wikimediafoundation.org>

Imagine a world in which every single human being can freely share in the
sum of all knowledge. Help us make it a reality!
https://donate.wikimedia.org
Alex Brollo
2015-07-11 15:32:51 UTC
Permalink
Very, very interesting! I can't help you, since my skills are very limited, but I'm very interested in this and I hope my interest will be widely shared.

Alex
Nicolas VIGNERON
2015-07-11 15:44:55 UTC
Permalink
Hi,

I'm not a techie, so I'm not sure what OCR-as-a-service means, but you should ask Tpt and Phe, who have OCR tools on Tool Labs (they would know what is behind tools like http://tools.wmflabs.org/phetools/ocr.php ).

Cdlt, ~nicolas
Andrea Zanni
2015-07-11 16:59:10 UTC
Permalink
Uh, that sounds very interesting.
Right now we mainly use the OCR from the djvus from the Internet Archive (that means ABBYY FineReader, which is very nice).

But ideally we could think of "customizable" OCR software that gets trained language by language: that would be extremely useful for the Wikisources.

(I can also imagine dividing, within every language, by century, because languages change over time too ;-)

Aubrey

Asaf Bartov
2015-07-12 09:25:27 UTC
Permalink
Post by Andrea Zanni
Uh, that sounds very interesting.
Right now we mainly use the OCR from the djvus from the Internet Archive (that means ABBYY FineReader, which is very nice).
Yes, the output is generally good. But as far as I can tell, the archive's
Open Library API does not offer a way to retrieve the OCR output
programmatically, and certainly not for an arbitrary page rather than the
whole item. What I'm working on requires the ability to OCR a single page
on demand.

Post by Andrea Zanni
But ideally we could think of "customizable" OCR software that gets trained language by language: that would be extremely useful for the Wikisources.
(I can also imagine dividing, within every language, by century, because languages change over time too ;-)
Indeed.

A.
--
Asaf Bartov
Wikimedia Foundation <http://www.wikimediafoundation.org>

Imagine a world in which every single human being can freely share in the
sum of all knowledge. Help us make it a reality!
https://donate.wikimedia.org
Andrea Zanni
2015-07-12 10:50:03 UTC
Permalink
Post by Asaf Bartov
Post by Andrea Zanni
Uh, that sounds very interesting.
Right now we mainly use the OCR from the djvus from the Internet Archive (that means ABBYY FineReader, which is very nice).
Yes, the output is generally good. But as far as I can tell, the
archive's Open Library API does not offer a way to retrieve the OCR output
programmatically, and certainly not for an arbitrary page rather than the
whole item. What I'm working on requires the ability to OCR a single page
on demand.
True.
I've recently met Giovanni, a new (Italian) guy who's now working with the Internet Archive and Open Library.
We discussed a number of possible partnerships/projects; this is definitely one to bring up.

But if we manage to do it directly in the Wikimedia world, even better.

Aubrey
Alex Brollo
2015-07-12 22:27:56 UTC
Permalink
I explored the abbyy.gz files, the full XML output from the ABBYY OCR engine running at the Internet Archive, and I've been astonished by the amount of data they contain - they are stored at XCA_Extended detail (as documented at http://www.abbyy-developers.com/en:tech:features:xml ).

Something that Wikisource's best developers should explore; comparing those data with the little bit of data in the mapped text layer of djvu files is impressive and should be inspiring.
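
As a rough sketch of what that exploration could look like in Python (the FineReader 6 namespace string and the filename below are assumptions; newer files may use a different schema version):

import gzip
import xml.etree.ElementTree as ET

# Reassemble each recognized line of an Internet Archive *_abbyy.gz
# file from its per-character <charParams> elements, which also carry
# coordinates and per-character confidence data.
NS = "{http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml}"

with gzip.open("item_abbyy.gz") as f:  # hypothetical filename
    tree = ET.parse(f)

for line in tree.iter(NS + "line"):
    print("".join(c.text or "" for c in line.iter(NS + "charParams")))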

But they are static data coming from a standard setting... nothing like a service with simple, shared, deep-learning features for difficult and ancient texts. I tried the "ancient Italian" tesseract dictionary, with very poor results.

So Asaf, I can't wait for good news from you. :-)

Alex
Asaf Bartov
2015-07-29 06:23:25 UTC
Permalink
Hello again.

So, I've set up an OpenOCR instance on Labs that's available for use as a service. Just call it and point it to an image. Example:

curl -X POST -H "Content-Type: application/json" \
  -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract"}' \
  http://openocr.wmflabs.org/ocr

should yield:

"You can create local variables for the pipelines within the template by
prefixing the variable name with a “$" sign. Variable names have to be
composed of alphanumeric characters and the underscore. In the example
below I have used a few variations that work for variable names."
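
For scripting against the service, here is a minimal Python sketch of the same call (assuming the `requests` library; the service returns the recognized text in the response body):

import requests

# Minimal sketch: POST an image URL to the Labs OpenOCR instance
# and print the text that comes back in the response body.
resp = requests.post(
    "http://openocr.wmflabs.org/ocr",
    json={"img_url": "http://bit.ly/ocrimage", "engine": "tesseract"},
    timeout=60,
)
resp.raise_for_status()
print(resp.text)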

If we see evidence of abuse, we might have to protect it with API keys, but
for now, let's AGF. :)

I'm working on something that would be a client of this service, but don't
have a demo yet. Stay tuned! :)

A.
--
Asaf Bartov
Wikimedia Foundation <http://www.wikimediafoundation.org>

Imagine a world in which every single human being can freely share in the
sum of all knowledge. Help us make it a reality!
https://donate.wikimedia.org
Jane Darnell
2015-07-29 06:33:30 UTC
Permalink
Nice! I will wait for the client though, thx. Where will the source images be stored? Labs or Commons? It would be nice if you could somehow make a client that builds a djvu file locally from the page image and the OCR text, so you can clean up the text before it goes into the djvu file. Right now there seem to be so many hurdles on Wikisource that it's quicker to post pages to Commons and add the text in the template there.
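
A very rough Python sketch of that local step (assuming the djvulibre command-line tools are installed; the filenames and page coordinates are placeholders):

import subprocess

# Encode a bitonal page scan and bundle it into a djvu document.
subprocess.check_call(["cjb2", "page1.pbm", "page1.djvu"])
subprocess.check_call(["djvm", "-c", "book.djvu", "page1.djvu"])

# djvused expects the text layer as an s-expression; wrap the cleaned
# OCR text as a single whole-page zone (placeholder coordinates).
with open("page1.txt") as f:
    text = f.read().replace("\\", "\\\\").replace('"', '\\"')
with open("page1.sexp", "w") as f:
    f.write('(page 0 0 2550 3300 "%s")' % text)

# -e runs the editing commands, -s saves the result in place.
subprocess.check_call(
    ["djvused", "-e", "select 1; set-txt page1.sexp", "-s", "book.djvu"]
)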
Asaf Bartov
2015-07-12 09:23:17 UTC
Permalink
Post by Nicolas VIGNERON
Hi,
I'm not a techie, so I'm not sure what OCR-as-a-service means, but you should ask Tpt and Phe, who have OCR tools on Tool Labs (they would know what is behind tools like http://tools.wmflabs.org/phetools/ocr.php ).
Thanks for the pointer! I don't see any documentation on how to feed
images to it, though, and no pointer to the source code to figure it out on
my own. Help?

A.
--
Asaf Bartov
Wikimedia Foundation <http://www.wikimediafoundation.org>

Imagine a world in which every single human being can freely share in the
sum of all knowledge. Help us make it a reality!
https://donate.wikimedia.org
billinghurst
2015-07-12 10:21:24 UTC
Permalink
OCR is available via JavaScript. A number of Wikisources have it enabled as a gadget, though I cannot speak for all the wikis. I presume it relates to the languages available in the OCR.

Script is noted at
https://wikisource.org/wiki/Wikisource:Shared_Scripts

Regards, Billinghurst