[Wikireader] offline Wikipedia, Google Gears, & XO

Samuel Klein meta.sj at gmail.com
Mon Feb 23 20:31:30 EST 2009


We're talking about dumps again on foundation-l -- and I would really
like to see better dumps available, of Commons in particular.

Erik, any advice you can offer in this regard -- can we help move
things along in some way?

Sj

On Tue, Feb 26, 2008 at 3:01 PM, Erik Zachte <erikzachte at infodisiac.com> wrote:
>> Something that can process Wikimedia's XML dumps and then crawl
>> Wikipedia for the associated pictures. The picture dumps are tarred
>> (why aren't they rsync'able?) and, for me, completely unmanageable.
>
> I may have some Perl code that can be helpful.
> I won't have much time for this project right now, but uploading and
> documenting some Perl is something I can do.
>
> I have code that harvests all the images for a given Wikipedia dump by
> downloading them one by one.
> With a one-second interval between downloads (to avoid raising red flags
> at Wikimedia) this admittedly takes many weeks, but the images are
> preserved for the next run (on a huge disk).
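
For illustration, a minimal sketch of such a polite harvesting loop
(Python here rather than Erik's Perl; image_names would come from the
dump, and fetch_image stands for any downloader, e.g. the one sketched
further below):

    import os
    import time

    def harvest(image_names, image_dir, fetch_image):
        """Download each missing image once, with a polite 1-second pause."""
        for name in image_names:
            dest = os.path.join(image_dir, name.replace(' ', '_'))
            if os.path.exists(dest):          # preserved from an earlier run
                continue
            data = fetch_image(name)          # any downloader; see the sketch below
            if data is not None:
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                with open(dest, 'wb') as f:
                    f.write(data)
            time.sleep(1)                     # one-second interval between requests
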
>
> It is part of the WikiToTome(Raider) Perl scripts, but it can be isolated.
>
> The script determines the URL from the filename the same way the
> Wikimedia parser does (subfolder names are derived from the first
> characters of the MD5 hash of the filename).
> It then tries to download the image from the Wikipedia site and, if it
> is not found there, tries again at Commons (if both have it, the local
> version takes precedence).
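
A rough sketch of that hash-to-path scheme and the Commons fallback
(Python rather than Erik's Perl; the upload.wikimedia.org paths are the
standard ones, and title-normalization details are glossed over):

    import hashlib
    import urllib.error
    import urllib.parse
    import urllib.request

    def upload_path(filename):
        """Derive the nested upload path MediaWiki uses for an image file."""
        name = filename.replace(' ', '_')     # MediaWiki stores titles with underscores
        digest = hashlib.md5(name.encode('utf-8')).hexdigest()
        return f"{digest[0]}/{digest[0:2]}/{urllib.parse.quote(name)}"

    def fetch_image(filename, wiki='en'):
        """Try the local wiki's upload area first, then fall back to Commons."""
        path = upload_path(filename)
        for site in (f"wikipedia/{wiki}", "wikipedia/commons"):
            url = f"https://upload.wikimedia.org/{site}/{path}"
            try:
                with urllib.request.urlopen(url) as resp:
                    return resp.read()
            except urllib.error.HTTPError:
                continue                      # not here (e.g. 404); try the next site
        return None
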
>
> Images are stored in a nested folder structure similar to the one on
> the Wikipedia servers.
> On subsequent runs only missing images are downloaded (updated images
> are missed).
> Metadata in the images is removed (my target platform is a Palm/Pocket
> PC with only 4 GB).
> Images are resized as smartly as possible:
> JPEGs above a certain size are scaled down to a maximum size and their
> compression rate is adjusted;
> PNGs are treated similarly, except when their compression ratio is
> better than some factor x, in which case they are probably charts or
> diagrams with text that would become unreadable when downsized.
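
As an illustration of that kind of heuristic (Python/Pillow, not Erik's
actual Perl; MAX_DIM, JPEG_QUALITY and PNG_RATIO_CUTOFF are made-up
stand-ins for his "certain size" and "factor x"):

    import os
    from PIL import Image

    MAX_DIM = 240          # max height/width, as in Erik's 2005 run
    JPEG_QUALITY = 70      # illustrative compression setting
    PNG_RATIO_CUTOFF = 8   # illustrative "factor x": well-compressed PNGs stay as-is

    def shrink(src, dst):
        img = Image.open(src)
        fmt = img.format
        if fmt == 'PNG':
            raw_bytes = img.size[0] * img.size[1] * len(img.getbands())
            if raw_bytes / os.path.getsize(src) > PNG_RATIO_CUTOFF:
                # Compresses very well -> probably a chart or diagram with
                # text, so keep the original size to stay readable.
                img.save(dst)
                return
        if max(img.size) > MAX_DIM:
            img.thumbnail((MAX_DIM, MAX_DIM))   # in-place, keeps aspect ratio
        # Re-encoding without passing metadata also strips it from the copy.
        if fmt == 'JPEG':
            img.save(dst, 'JPEG', quality=JPEG_QUALITY)
        else:
            img.save(dst, fmt)
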
>
> In 2005 I generated the last English Wikipedia for handhelds with
> images; about 2 GB was needed for 317,000 images of at most 240 pixels
> height/width, without ugly compression artifacts, plus quite a few
> larger PNGs kept at original size.
>
> For offline usage 10 MB images are rather over the top and waste a lot
> of bandwidth.
> I would actually favor a solution where the images are collected and
> resized on a Wikimedia server and then put into a tar of only a few GB.
> Technically I can do this; time-wise it is another matter just now, but
> that might change.
>
> I also have code to generate PNGs from <math>..</math> markup.
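
For illustration, one self-contained way to render such formulas today,
using matplotlib's mathtext rather than whatever pipeline Erik's code
uses (mathtext only supports a subset of LaTeX; a full texvc/LaTeX
toolchain would cover more markup):

    from matplotlib import mathtext

    def math_to_png(tex, out_path, dpi=120):
        """Render the contents of a <math>...</math> tag to a PNG file."""
        # mathtext expects the expression wrapped in $...$.
        mathtext.math_to_image(f"${tex}$", out_path, dpi=dpi, format='png')

    math_to_png(r"\frac{a}{b} + \sqrt{x^2 + y^2}", "formula.png")
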
>
> Erik Zachte


More information about the Wikireader mailing list