[Wikisource-l] Djvu OCR layer

Alex Brollo

2018-01-23 15:07:06 UTC

Only a small part of the DjVu OCR/text layer content is currently used; I
think we can do more:
1. The XML and dsed (LISP-like) representations each have pros and cons that
should be carefully considered;
2. The DjVu text layer can host any amount of metadata and free-text
content, independent of the mapped OCR;
3. hOCR (from Tesseract) can be translated into dsed; a conversion script
would be very useful for injecting Tesseract output into the DjVu OCR layer;
4. IA (Internet Archive) shares a terrible gzipped XML file, _abbyy.gz, where
every possible detail about the OCR recognition can be found; a tool
converting it to dsed (perhaps recovering too many formatting details!)
would be very useful.
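
To make point 3 concrete, here is a minimal sketch of an hOCR-to-dsed
conversion for a single page. It is an illustration under assumptions, not a
finished tool: the function names are hypothetical, the hOCR is assumed to be
well-formed XML with ocr_page, ocr_line and ocrx_word elements, only
backslashes and double quotes are escaped, and only line and word zones are
emitted. The one non-obvious step is the vertical flip: hOCR measures y from
the top of the page, while the DjVu text layer measures it from the bottom.

```python
import re
import xml.etree.ElementTree as ET

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def hocr_bbox(element):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    match = BBOX_RE.search(element.get("title", ""))
    return tuple(int(v) for v in match.groups()) if match else None


def to_djvu_box(bbox, page_height):
    """Flip the vertical axis: hOCR origin is top-left, DjVu is bottom-left."""
    x0, y0, x1, y1 = bbox
    return (x0, page_height - y1, x1, page_height - y0)


def quote(text):
    """Simplified string escaping (assumption: only \\ and " need care)."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def has_class(element, name):
    return name in element.get("class", "").split()


def hocr_page_to_dsed(hocr_xml):
    """Convert the first ocr_page of an hOCR string to a dsed expression."""
    root = ET.fromstring(hocr_xml)
    page = next(e for e in root.iter() if has_class(e, "ocr_page"))
    page_box = hocr_bbox(page)
    page_height = page_box[3]  # assumes the page bbox starts at y = 0
    out = ["(page %d %d %d %d" % to_djvu_box(page_box, page_height)]
    for line in page.iter():
        if not has_class(line, "ocr_line") or hocr_bbox(line) is None:
            continue
        words = []
        for word in line.iter():
            text = "".join(word.itertext()).strip()
            box = hocr_bbox(word)
            if has_class(word, "ocrx_word") and text and box:
                coords = "%d %d %d %d" % to_djvu_box(box, page_height)
                words.append("  (word %s %s)" % (coords, quote(text)))
        if not words:
            continue  # skip lines with no recognised words
        out.append(" (line %d %d %d %d" % to_djvu_box(hocr_bbox(line), page_height))
        out.extend(words)
        out[-1] += ")"  # close the line zone
    if len(out) == 1:
        out[0] += ' ""'  # page without any text
    out[-1] += ")"  # close the page zone
    return "\n".join(out)
```

If the result is saved to a file (say page1.dsed, a hypothetical name), it
should be possible to load it into page 1 of a DjVu file with something like
djvused book.djvu -e 'select 1; set-txt page1.dsed' -s.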

I'm experimenting with all of these issues, and I'd like to know whether any
other Wikisource contributors are interested.