[Wikireader] offline Wikipedia, Google Gears, & XO

Samuel Klein meta.sj at gmail.com
Mon Feb 23 20:50:55 EST 2009


Erik,

Thank you for the update. Once video (if not audio) becomes a bigger
deal, it may be helpful to separate media dumps by media type as well.

If you know of any older torrents out there, that would be handy.

SJ

On Mon, Feb 23, 2009 at 8:39 PM, Erik Moeller <erik at wikimedia.org> wrote:
> Not particularly; we've already got it on our radar and will make it
> happen, and it's not something that can be easily "volunteerized"
> (except by looking at the dumper code that already exists and
> improving it). Full history dumps are first priority, but full Commons
> and other media dumps (we've talked to Archive.org about setting those
> up) are definitely targeted as well.
>
> Erik
>
> 2009/2/23 Samuel Klein <meta.sj at gmail.com>:
>> We're talking about dumps again on foundation-l -- and I really would
>> like to see better dumps available, of Commons in particular.
>>
>> Erik, any advice you can offer in this regard -- can we help move
>> things along in some way?
>>
>> Sj
>>
>> On Tue, Feb 26, 2008 at 3:01 PM, Erik Zachte <erikzachte at infodisiac.com> wrote:
>>>> Something that can process Wikimedia's XML dumps and then crawl
>>>> Wikipedia for the associated pictures. The picture dumps are tarred
>>>> (why aren't they rsync'able?) and completely (for me) unmanageable.
>>>
>>> I may have some Perl code that can be helpful.
>>> I won't have much time for this project right now, but uploading and
>>> documenting some of that Perl is something I can do.
>>>
>>> I have code that harvests all images for a given Wikipedia dump by
>>> downloading them one by one.
>>> With a one-second interval between downloads (to avoid red flags at
>>> Wikimedia) this admittedly takes many weeks, but the images are preserved
>>> for the next run (on a huge disk).
>>>
>>> It is part of the WikiToTome(Raider) Perl scripts, but it can be isolated.
>>>
>>> The script determines the URL from the filename the same way the MediaWiki
>>> parser does (subfolder names are derived from the first characters of the
>>> md5 hash of the filename).
>>> It then tries to download the image from the Wikipedia site; if it is not
>>> found there, it tries again at Commons (if both have it, the local version
>>> takes precedence).
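>>>
>>> Roughly, in Perl (a simplified sketch rather than the actual script; the
>>> base URLs and the images/ folder here are just examples), the lookup and
>>> download step is:
>>>
>>>   use strict;
>>>   use warnings;
>>>   use Digest::MD5    qw(md5_hex);
>>>   use LWP::Simple    qw(getstore);
>>>   use HTTP::Status   qw(is_success);
>>>   use File::Basename qw(dirname);
>>>   use File::Path     qw(make_path);
>>>
>>>   # Same path scheme as the MediaWiki parser: subfolders come from the
>>>   # first characters of the md5 hash of the (underscored) filename.
>>>   sub image_path {
>>>       my ($name) = @_;
>>>       $name =~ s/ /_/g;                     # spaces are stored as underscores
>>>       my $md5 = md5_hex($name);
>>>       return substr($md5, 0, 1) . '/' . substr($md5, 0, 2) . "/$name";
>>>   }
>>>
>>>   sub fetch_image {
>>>       my ($name) = @_;
>>>       my $path = image_path($name);         # real code would also URI-escape it
>>>       my $dest = "images/$path";            # nested folders, as on the servers
>>>       return if -e $dest;                   # preserved from an earlier run
>>>       make_path(dirname($dest));
>>>       for my $base ('http://upload.wikimedia.org/wikipedia/en',
>>>                     'http://upload.wikimedia.org/wikipedia/commons') {
>>>           last if is_success(getstore("$base/$path", $dest));
>>>       }
>>>       sleep 1;                              # one-second interval between downloads
>>>   }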
>>>
>>> Images are stored in a nested folder structure similar to the one on the
>>> Wikipedia servers.
>>> On subsequent runs only missing images are downloaded (updated images are
>>> missed).
>>> Metadata in the images is removed (my target platform is a Palm/Pocket PC
>>> with only 4 GB).
>>> Images are resized as smartly as possible:
>>> JPEGs above a certain size are scaled down to a target size and their
>>> compression rate is adjusted;
>>> PNGs are treated similarly, except when their compression ratio is better
>>> than a factor x, in which case they are probably charts or diagrams with
>>> text that would become unreadable when downsized.
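>>>
>>> Something like the following (a rough PerlMagick sketch, not the actual
>>> code; the 240-pixel limit, quality 75 and factor 20 are illustrative
>>> thresholds):
>>>
>>>   use strict;
>>>   use warnings;
>>>   use Image::Magick;
>>>
>>>   sub shrink_image {
>>>       my ($file) = @_;
>>>       my $img = Image::Magick->new;
>>>       my $err = $img->Read($file);
>>>       return if "$err";                    # Read returns an error string on failure
>>>       my ($w, $h) = $img->Get('width', 'height');
>>>       if ($file =~ /\.png$/i) {
>>>           # Stand-in for the "compression ratio better than factor x" test:
>>>           # a PNG that compresses far better than raw RGB is probably a
>>>           # chart or diagram with text, so it is kept at original size.
>>>           my $ratio = ($w * $h * 3) / (-s $file);
>>>           return if $ratio > 20;
>>>       }
>>>       $img->Strip();                       # remove embedded metadata to save space
>>>       $img->Resize(geometry => '240x240')  # fit within 240x240, keeping aspect ratio
>>>           if $w > 240 || $h > 240;
>>>       $img->Set(quality => 75) if $file =~ /\.jpe?g$/i;   # adjust JPEG compression
>>>       $img->Write($file);
>>>   }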
>>>
>>> In 2005 I generated the last English Wikipedia for handhelds with images:
>>> about 2 GB was needed for 317,000 images of at most 240 pixels
>>> height/width, without ugly compression artifacts, plus quite a few larger
>>> PNGs at original size.
>>>
>>> For offline usage, 10 MB images are rather over the top and waste a lot
>>> of bandwidth.
>>> Actually, I would favor a solution where images are collected and resized
>>> on a Wikimedia server, then put into a tar of only a few GB.
>>> Technically I can do this; time-wise it is another matter just now, but
>>> that might change.
>>>
>>> I also have code to generate PNGs from <math>..</math> markup.
>>>
>>> Erik Zachte
>>>
>> _______________________________________________
>> Wikireader mailing list
>> Wikireader at lists.laptop.org
>> http://lists.laptop.org/listinfo/wikireader
>>
>
>
>
> --
> Erik Möller
> Deputy Director, Wikimedia Foundation
>
> Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate
>


More information about the Wikireader mailing list