Discussion:
IA Upload tool — higher-quality DjVus
Sam Wilson
2017-02-02 01:33:40 UTC
I've been tinkering with the ia-upload tool and incorporating Alex
Brollo's better system of DjVu generation (better than converting from
PDF, that is; instead it works from the original Jpeg2000 files and
merges the OCR data in).

I've set up a test installation of the tool at
http://tools.wmflabs.org/ia-upload/test/ and would love anyone to have a
go at it, and to report any bugs at
https://github.com/wikisource/ia-upload/issues

Because DjVu generation can take a while (quite a while if you've got a
crappy slow laptop like me), the tool runs each job on the grid engine,
starting every 5 minutes. The queue is shown on the homepage of the
tool, with a status of each job. (Unless you're just re-using an
existing DjVu file from the IA, in which case it's just uploaded
directly to Commons while you wait, like the tool's always done.)
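
For anyone curious how a queue like this behaves, here is a minimal sketch in Python. It is not the tool's actual code: the names `Job`, `JobQueue`, `run_pending` and the `convert` callback are all hypothetical, and it only assumes that a scheduler (cron, the grid engine, or similar) calls `run_pending` once every 5 minutes.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Tuple


class Status(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


@dataclass
class Job:
    ia_id: str  # Internet Archive item identifier (hypothetical field name)
    status: Status = Status.QUEUED


@dataclass
class JobQueue:
    jobs: List[Job] = field(default_factory=list)

    def add(self, ia_id: str) -> Job:
        """Queue a new conversion job for the given IA item."""
        job = Job(ia_id)
        self.jobs.append(job)
        return job

    def run_pending(self, convert: Callable[[str], None]) -> None:
        """One scheduled pass: process every job that is still queued."""
        for job in self.jobs:
            if job.status is not Status.QUEUED:
                continue
            job.status = Status.RUNNING
            try:
                convert(job.ia_id)  # stand-in for the slow DjVu generation
                job.status = Status.DONE
            except Exception:
                job.status = Status.FAILED

    def summary(self) -> List[Tuple[str, str]]:
        """What a homepage queue listing could show: (item, status) pairs."""
        return [(job.ia_id, job.status.value) for job in self.jobs]
```

Because the slow work happens inside the scheduled pass, the web request that adds a job returns straight away, and the homepage only needs to read `summary()`.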

Thanks!
Sam Wilson
2017-02-09 03:10:57 UTC
This new feature is now live on the ia-upload tool:
http://tools.wmflabs.org/ia-upload/
Please raise any issues on Github:
https://github.com/wikisource/ia-upload/issues

The conversion process takes about 15 minutes for most books, it seems
like. (For books that already have DjVus at IA, it uploads them
immediately though.)

Thanks,
Sam.
Alex Brollo
2017-02-09 07:13:42 UTC
Thanks Sam!
Now we should focus on help pages about the requirements of a good,
Wikisource-oriented IA upload: proper scan quality, good file names and
useful metadata. IMHO it would be great to build a "wikisource collection"
in IA, since collection admins can edit any item detail except its ID, and
so can fix most mistakes.

Alex
Sam Wilson
2017-02-09 07:38:31 UTC
That sounds like a great idea! So it sounds like[1] we need to have 50
items already uploaded before they'll create a collection for us. Then,
maybe we build it into ia-upload: a way of uploading and setting
metadata for a set of scan files? It would upload files to IA and then
do the DjVu-creating thing and upload just the DjVu to Commons?


Or do people upload to Commons first? And then our tool takes a file (or
category of files), uploads it to IA, and then pulls the DjVu back from
there and adds it to the same category?


(I'm sort of thinking aloud...)

Links:

1. https://archive.org/about/faqs.php#Collections
Andrea Zanni
2017-02-12 12:17:19 UTC
Hi everyone,
I made this; hopefully it's helpful:
https://docs.google.com/spreadsheets/d/158GvBrPBW0KfREHRmLFK7EhuB-FQBkLbm9qxJBaJTUY/edit?usp=sharing

It's the list of the files on Commons uploaded from Internet Archive.
The idea, right now, is that every language Wikisource would take care of
its own uploads, and once it has more than 50 it would create an
"Italian/German/Bengali Wikisource" collection on Internet Archive.
The whole set of collections would sit inside one global "Wikisource"
collection.
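
To make the rule concrete, here is a rough Python sketch of the per-language check. It assumes the spreadsheet can be reduced to `(language, filename)` pairs; the function name `languages_ready` and that data layout are made up for illustration, and the threshold of 50 is the one mentioned in this thread.

```python
from collections import Counter
from typing import Iterable, List, Tuple

MIN_ITEMS = 50  # threshold discussed in this thread


def languages_ready(uploads: Iterable[Tuple[str, str]],
                    min_items: int = MIN_ITEMS) -> List[str]:
    """Return the language codes that have more than `min_items` uploads.

    `uploads` is an iterable of (language, filename) pairs, a hypothetical
    flattening of the list of Commons files uploaded from Internet Archive.
    """
    counts = Counter(lang for lang, _filename in uploads)
    return sorted(lang for lang, n in counts.items() if n > min_items)
```

Each language returned would be a candidate for its own collection, nested under the global "Wikisource" one.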

Make sense? Do you agree?
Sam Wilson
2017-02-13 00:59:43 UTC
That's a great idea!

I think we can use Wikidata to build the list:
http://tinyurl.com/zwdbzyq
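
The query behind that short link isn't reproduced here. As a rough sketch of the kind of query that could build such a list, the Python below assembles a SPARQL query for Wikidata items that have an Internet Archive ID and a page on a given Wikisource. It assumes that P724 is the "Internet Archive ID" property and that sitelinks are exposed through `schema:about` / `schema:isPartOf` on the Wikidata Query Service; the function names are hypothetical.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(wikisource_lang: str) -> str:
    """Build a SPARQL query for items with an IA ID and a Wikisource page."""
    site = f"https://{wikisource_lang}.wikisource.org/"
    return (
        "SELECT ?item ?iaId ?page WHERE {\n"
        "  ?item wdt:P724 ?iaId .\n"  # assumption: P724 = Internet Archive ID
        "  ?page schema:about ?item ;\n"
        f"        schema:isPartOf <{site}> .\n"
        "}"
    )


def query_url(wikisource_lang: str) -> str:
    """Return a URL that asks the query service for JSON results."""
    params = {"query": build_query(wikisource_lang), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)
```

Fetching `query_url("en")` with any HTTP client should return the matching items; nothing here actually contacts the service.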


I had been erroneously thinking along the lines that we'd have to be
uploading something to the items before making it part of a Wikisource
collection, but of course that's not necessary. I think your hierarchy
of wikisource collections sounds perfect.


It'd be cool if items with a page on a Wikisource could have a little
footnote like they do for Open Library ones ("[Open Library icon] This
book has an editable web page[1] on Open Library[2].").


—sam

Links:

1. http://openlibrary.org/ia/thatremystre00gaut
2. https://openlibrary.org/
Andrea Zanni
2017-02-13 07:45:20 UTC
Post by Sam Wilson
That's a great idea!
http://tinyurl.com/zwdbzyq
en.source is probably the only one that has filled in all its Wikisource
data in Wikidata... Or have other Wikisources done that too? Do you have a
workflow to share?
Post by Sam Wilson
I had been erroneously thinking along the lines that we'd have to be
uploading something to the items before making it part of a Wikisource
collection, but of course that's not necessary. I think your hierarchy of
wikisource collections sounds perfect.
Post by Sam Wilson
It'd be cool if items with a page on a Wikisource could have a little
footnote like they do for Open Library ones ("[Open Library icon] This
book has an editable web page
<http://openlibrary.org/ia/thatremystre00gaut> on Open Library
<https://openlibrary.org/>.").
We can try to convince them to do that. It would only be for a fraction of
their books: a few thousand out of the millions they have.