Discussion:
[Wikisource-l] Do we have tools for offline collaboration?
mathieu stumpf guntz
2018-03-24 14:58:32 UTC
Permalink
Hello,

A person in a local Wikisource workshop asked me if we could download
all the material of a specific work to proofread it offline, that is,
both the pictures and the OCRed text. Additionally, I think it would be
good to provide a tool that at least shows the plain text and the pictures
side by side.

So, are you aware of anything close to such a tool? :)

Cheers
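
As a rough illustration of what such a download could look like, the sketch below builds MediaWiki API request URLs for the wikitext of one proofread page and for a rendered image of the matching scan page. The wiki endpoints, the file name Example.djvu and the helper names are assumptions made for this example, not an existing tool, and the "Page:" namespace prefix may be localized on some wikis.

```python
from urllib.parse import urlencode

# Assumed endpoints: English Wikisource for the text, Commons for the scan.
WIKISOURCE_API = "https://en.wikisource.org/w/api.php"
COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def page_text_url(index_file: str, page: int, api: str = WIKISOURCE_API) -> str:
    """URL returning the wikitext of one proofread page as JSON."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": f"Page:{index_file}/{page}",
        "format": "json",
    }
    return f"{api}?{urlencode(params)}"


def page_image_url(index_file: str, page: int, width: int = 1000,
                   api: str = COMMONS_API) -> str:
    """URL returning, as JSON, a thumbnail link for one page of the scan file."""
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "url",
        "iiurlwidth": width,
        # Handler-specific parameter selecting a page of a multi-page file.
        "iiurlparam": f"page{page}-{width}px",
        "titles": f"File:{index_file}",
        "format": "json",
    }
    return f"{api}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical work; nothing is fetched here, the URLs are only printed.
    print(page_text_url("Example.djvu", 3))
    print(page_image_url("Example.djvu", 3))
```

Looping over the page numbers of a work and saving each response would give a local copy of both the text and the pictures; uploading the corrected text back is a separate step and is not covered here.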
billinghurst
2018-03-24 15:22:46 UTC
Permalink
Though that would defeat the purpose of online proofreading with account
verification. Some of the true value of our online process is that
contribution builds a level of trust and knowledge and that is reflected
in both our patrolling and the allocation of autopatrolled status. Also
how would you have access to templates, and components like that from
off-line?

Also we generally cannot download the images separately as that is
usually part of the later clean-up where people have the technical
skills.

So yes, there is the capacity to have the text and proofread the text,
but actually checking the text against the image is not the sole
component of proofreading, and further it would not be at all helpful
for validation.

-- billinghurst

mathieu stumpf guntz
2018-03-26 08:21:26 UTC
Permalink
Post by billinghurst
Though that would defeat the purpose of online proofreading with
account verification. Some of the true value of our online process is
that contribution builds a level of trust and knowledge and that is
reflected in both our patrolling and the allocation of autopatrolled
status.
How would providing tools for doing batch work offline interfere in any
way with that? Once the work is done, it can be uploaded to Wikisource
with whichever account the user wants.

Actually, to my mind, the main benefit of the online aspect is the
peer-to-peer production model. Also, there is no need for a central node
holding accounts in order to keep track of the trust given to a
particular contributor: there are digital signature technologies, such
as GPG, for example. Having a central node with a web interface just
makes things easier for most users; it doesn't improve the
trustworthiness of the environment. On the contrary, with a single point
of failure, we actually rely on a weaker solution in this regard.
Post by billinghurst
Also how would you have access to templates, and components like that
from off-line?
Well, that just shows how inefficient these tools are for continuing to
contribute while offline. It's always possible to install MediaWiki and
download the required templates, but currently this process seems way
too complicated, doesn't it?
Post by billinghurst
Also we generally cannot download the images separately as that is
usually part of the later clean-up where people have the technical skills.
I'm afraid the term "image" misled your answer. It seems you
interpreted it as picture elements extracted from the files, while I was
talking about the files themselves.
Post by billinghurst
So yes, there is the capacity to have the text and proofread the text,
but actually checking the text against the image is not the sole
component of proofreading, and further it would not be at all helpful
for validation.
There is nothing magic about working directly in a browser. People do
download and upload all the required material anyway, but on a
page-by-page basis. The result is just as valid when the transactions
are operated at the file repository level.

Cheers
Nahum Wengrov
2018-03-26 15:27:18 UTC
Permalink
I frequently work offline on he.wikisource. I download the entire PDF file
from Commons to my hard drive, and OCR the page I need myself. One can use
the OCR of Wikisource and download the text too, I guess, page by page.
Then I proof the text in a Word document, open on the lower half of my
screen, with the PDF open on the upper half of the screen, where I go to
the page I need with Acrobat Reader, and scroll both windows down or up as
needed.
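
The two-window arrangement described above could also be approximated with a single local HTML file. The following is only a minimal sketch, assuming the pages of a work have already been exported to one folder as matching pairs such as 0001.png and 0001.txt; that naming scheme and the function name are hypothetical.

```python
import html
from pathlib import Path


def build_side_by_side(folder: Path) -> str:
    """Return an HTML page pairing each scan image with its OCR text."""
    rows = []
    for image in sorted(folder.glob("*.png")):
        text_file = image.with_suffix(".txt")
        # A missing text file simply yields an empty text box.
        text = text_file.read_text(encoding="utf-8") if text_file.exists() else ""
        rows.append(
            '<div style="display:flex;gap:1em;margin-bottom:2em">'
            f'<img src="{html.escape(image.name)}" style="width:50%">'
            # dir="auto" lets right-to-left text such as Hebrew display correctly.
            f'<textarea dir="auto" style="width:50%" rows="40">'
            f"{html.escape(text)}</textarea>"
            "</div>"
        )
    return (
        '<!DOCTYPE html><html><head><meta charset="utf-8"></head><body>'
        + "\n".join(rows)
        + "</body></html>"
    )


if __name__ == "__main__":
    # Write the viewer next to the images so the relative image paths resolve.
    folder = Path(".")
    (folder / "viewer.html").write_text(build_side_by_side(folder), encoding="utf-8")
```

Text typed into the boxes is not saved back to the .txt files; the page only helps with reading the scan and the text together, so corrections still have to be copied out by hand.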

_______________________________________________
Wikisource-l mailing list
https://lists.wikimedia.org/mailman/listinfo/wikisource-l
mathieu stumpf guntz
2018-04-12 23:22:36 UTC
Permalink
Thank you Nahum,

Could you indicate which OCR solution you are using?
mathieu stumpf guntz
2018-04-13 06:54:49 UTC
Permalink
Good to know. I consulted the ABBYY website and it says one option is
an "Open license for local use on workstations", but I guess it's not a
FLOSS license, unfortunately.

By the way, what is the state of affairs regarding Indic languages?

Do we have a central page documenting the existing OCR pipelines used by
the Wikisource community?

What should I say to a contributor who comes to me asking "I have this
old PD book in my personal library that I would like to digitize,
share and proofread on Wikisource, where should I start?" Do we have an
online service, for example on Tool Labs, which lets one either upload
a facsimile or simply input its URL, and which then launches the OCR,
for example backed by Tesseract?

Shouldn't we update our roadmap[1], or is there a more up-to-date
document elsewhere?

[1] https://meta.wikimedia.org/wiki/Wikisource_roadmap
Post by Nahum Wengrov
I use ABBYY FineReader; I don't remember the exact version (probably 12
or 11). I bought it a few years ago and it works perfectly for my
language (Hebrew).
Nicolas VIGNERON
2018-04-13 09:33:33 UTC
Permalink
Post by mathieu stumpf guntz
Good to know. I consulted the ABBYY website and it says one option is an
"Open license for local use on workstations", but I guess it's not a FLOSS
license, unfortunately.
Not at all, read more carefully: this license is available only when you
have already purchased more than 50 licenses (
https://www.abbyy.com/en-ca/finereader/licensing/ ), so at least 5000 € IIRC.
Post by mathieu stumpf guntz
By the way, what is the state of affairs regarding Indic languages?
I leave that one to people more acquainted with it, but it seems to work
fine.
Post by mathieu stumpf guntz
Do we have a central page documenting the existing OCR pipelines used by
the Wikisource community?
Not that I know of.
And AFAIK, each Wikisource and each Wikisourcerer has a different system
(sometimes with small differences, sometimes with big ones).
Post by mathieu stumpf guntz
What should I say to a contributor who comes to me asking "I have this
old PD book in my personal library that I would like to digitize, share
and proofread on Wikisource, where should I start?" Do we have an online
service, for example on Tool Labs, which lets one either upload a facsimile
or simply input its URL, and which then launches the OCR, for example
backed by Tesseract?
There is BUB https://tools.wmflabs.org/bub/ but only for certain websites.
Post by mathieu stumpf guntz
Shouldn't we update our roadmap[1], or is there a more up-to-date document
elsewhere?
We should write a new document.

Regards, ~nicolas
Bodhisattwa Mandal
2018-04-13 12:19:09 UTC
Permalink
Post by Nicolas VIGNERON
There is BUB https://tools.wmflabs.org/bub/ but only for certain websites.
BUB has not been working for more than a year.
--
Bodhisattwa
Yann Forget
2018-03-25 11:49:18 UTC
Permalink
FYI, Zoé on the French Wikisource works offline, and then copy-pastes the
proofread text back to Wikisource.
Judging by the result, she has quite a good process: fast and good quality.
You might want to ask her how she works:
https://fr.wikisource.org/wiki/Sp%C3%A9cial:Contributions/Zo%C3%A9

Regards,

Yann


--
Jai Jagat 2020 Grand March Coordinator
https://www.jaijagat2020.org/
+91-62 60 140 319
+91-74 34 93 33 58