[Wikireader] offline Wikipedia, Google Gears, & XO

Erik Moeller erik at wikimedia.org
Mon Feb 23 20:53:33 EST 2009


I'm not really aware of the ad hoc mirroring / torrenting that's been
happening; I would suggest checking in with Brion & Gregory who should
have the relevant pointers.

Erik

2009/2/23 Samuel Klein <meta.sj at gmail.com>:
> Erik,
>
> Thank you for the update.  Once video (if not audio) becomes a bigger
> deal, it may be helpful to separate media dumps by media type as well.
>
> If you know of any older torrents out there, that would be handy.
>
> SJ
>
> On Mon, Feb 23, 2009 at 8:39 PM, Erik Moeller <erik at wikimedia.org> wrote:
>> Not particularly; we've already got it on our radar and will make it
>> happen, and it's not something that can be easily "volunteerized"
>> (except by looking at the dumper code that already exists and
>> improving it). Full history dumps are first priority, but full Commons
>> and other media dumps (we've talked to Archive.org about setting those
>> up) are definitely targeted as well.
>>
>> Erik
>>
>> 2009/2/23 Samuel Klein <meta.sj at gmail.com>:
>>> We're talking about dumps again on foundation-l -- and I really would
>>> like to see better dumps available, of Commons in particular.
>>>
>>> Erik, any advice you can offer in this regard -- can we help move
>>> things along in some way?
>>>
>>> Sj
>>>
>>> On Tue, Feb 26, 2008 at 3:01 PM, Erik Zachte <erikzachte at infodisiac.com> wrote:
>>>>> Something that can process Wikimedia's XML dumps, and then crawl
>>>>> Wikipedia for the associated pictures. The picture dumps are tarred
>>>>> (why aren't they rsync'able?) and completely (for me) unmanageable.
>>>>
>>>> I may have some Perl code that could be helpful.
>>>> I won't have much time for this project right now, but uploading and
>>>> documenting some Perl is something I can do.
>>>>
>>>> I have code that harvests all the images for a given Wikipedia dump by
>>>> downloading them one by one.
>>>> With a one-second interval between downloads (to avoid raising red flags
>>>> at Wikimedia) this admittedly takes many weeks, but the images are kept
>>>> for the next run (on a large disk).
>>>>
>>>> It is part of the WikiToTome(Raider) Perl scripts, but it can be isolated.
>>>>
>>>> The script derives the URL from the filename in the same way the
>>>> MediaWiki parser does (the subfolder names come from the first characters
>>>> of the MD5 hash of the filename).
>>>> It then tries to download the image from the Wikipedia site and, if it is
>>>> not found there, from Commons (if both have it, the local version takes
>>>> precedence).
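A rough Perl sketch of that path scheme (this is not the WikiToTome code; the
helper name and example URLs are just illustrative):

use Digest::MD5 qw(md5_hex);
use URI::Escape qw(uri_escape);

# Derive the "a/ab/Name.jpg" style path MediaWiki uses for uploaded files.
# (Non-ASCII names would need UTF-8 encoding before hashing, and most wikis
# capitalize the first letter of the title.)
sub upload_path {
    my ($name) = @_;
    $name =~ s/ /_/g;                      # titles are stored with underscores
    my $hash = md5_hex($name);             # hash of the normalized filename
    return substr($hash, 0, 1) . '/' . substr($hash, 0, 2) . '/' . uri_escape($name);
}

# Try the project's own upload area first, then Commons:
my @candidates = map { "$_/" . upload_path('Example.jpg') }
    'http://upload.wikimedia.org/wikipedia/en',
    'http://upload.wikimedia.org/wikipedia/commons';
print "$_\n" for @candidates;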
>>>>
>>>> Images are stored in a nested folder structure similar to the one on the
>>>> Wikimedia servers.
>>>> On subsequent runs only missing images are downloaded (updated images are
>>>> missed).
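The skip-existing / fallback logic could look roughly like this (again only a
sketch; the file layout and error handling are simplified):

use LWP::Simple qw(getstore is_success);
use File::Path qw(make_path);
use File::Basename qw(dirname);

# Fetch one image into the local mirror of the nested folder structure.
# Returns true if the file is already present or was downloaded successfully.
sub fetch_image {
    my ($local_file, @urls) = @_;          # project URL first, Commons second
    return 1 if -e $local_file;            # already harvested on an earlier run
    make_path(dirname($local_file));       # recreate the a/ab/ subfolders locally
    for my $url (@urls) {
        return 1 if is_success(getstore($url, $local_file));
    }
    unlink $local_file;                    # don't keep a failed download around
    return 0;
}

# In the main loop, something like:
#   fetch_image($file, $project_url, $commons_url) or warn "missing: $file\n";
#   sleep 1;    # the one-second pause between downloads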
>>>> Metadata in the images is removed (my target platform is a Palm/Pocket PC
>>>> with only 4 GB).
>>>> Images are resized as smartly as possible:
>>>> JPEGs above a certain size are resized down to a certain size and their
>>>> compression rate is adjusted;
>>>> PNGs are treated the same, except when their compression ratio is better
>>>> than some factor x, in which case they are probably charts or diagrams
>>>> with text that would become unreadable if downsized.
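With PerlMagick the resize step might look something like this; the
"compression ratio" test and all thresholds are my guesses at the idea, not
the original values:

use Image::Magick;

# Shrink one image in place to at most $max pixels in either dimension,
# dropping metadata; leave well-compressed PNGs (likely charts/diagrams) alone.
sub shrink_image {
    my ($file, $max) = @_;                 # e.g. $max = 240
    my $img = Image::Magick->new;
    my $err = $img->Read($file);
    if ($err) { warn "$file: $err\n"; return }
    my ($w, $h) = $img->Get('width', 'height');

    if ($file =~ /\.png$/i) {
        # If the PNG compresses far better than ~3 bytes per pixel, it is
        # probably a chart or diagram with text: keep it at original size.
        my $ratio = ($w * $h * 3) / ((-s $file) || 1);
        return if $ratio > 10;             # "factor x" -- tune to taste
    }
    if ($w > $max || $h > $max) {
        $img->Resize(geometry => "${max}x${max}");   # keeps the aspect ratio
        $img->Set(quality => 75) if $file =~ /\.jpe?g$/i;
    }
    $img->Strip();                         # remove EXIF and other metadata
    $img->Write($file);
}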
>>>>
>>>> In 2005 I generated the last English Wikipedia for handhelds with images:
>>>> about 2 GB was needed for 317,000 images of at most 240 pixels in height
>>>> or width, without ugly compression artifacts, and quite a few larger PNGs
>>>> at their original size.
>>>>
>>>> For offline usage, 10 MB images are rather over the top and waste a lot
>>>> of bandwidth.
>>>> Actually I would favor a solution where the images are collected and
>>>> resized on a Wikimedia server, then put into a tar of only a few GB.
>>>> Technically I can do this; time-wise it is another matter just now, but
>>>> that might change.
>>>>
>>>> I also have code to generate PNGs from <math>..</math>.
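I don't know how that code does it; one common way is to shell out to latex
and dvipng, roughly like this (the wrapper document and the options are just a
sketch):

use File::Temp qw(tempdir);

# Render a TeX math expression (the content of <math>..</math>) to a PNG.
sub math_to_png {
    my ($tex, $out_png) = @_;
    my $dir = tempdir(CLEANUP => 1);
    open my $fh, '>', "$dir/eq.tex" or die "cannot write eq.tex: $!";
    print $fh "\\documentclass{article}\\pagestyle{empty}\n",
              "\\begin{document}\\(", $tex, "\\)\\end{document}\n";
    close $fh;
    system('latex', '-interaction=batchmode', "-output-directory=$dir", "$dir/eq.tex") == 0
        or return 0;
    system('dvipng', '-T', 'tight', '-D', '120', '-o', $out_png, "$dir/eq.dvi") == 0
        or return 0;
    return 1;
}

# Example: math_to_png('E = mc^2', 'emc2.png');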
>>>>
>>>> Erik Zachte
>>>>
>>> _______________________________________________
>>> Wikireader mailing list
>>> Wikireader at lists.laptop.org
>>> http://lists.laptop.org/listinfo/wikireader
>>>
>>
>>
>>
>> --
>> Erik Möller
>> Deputy Director, Wikimedia Foundation
>>
>> Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate
>>
>



-- 
Erik Möller
Deputy Director, Wikimedia Foundation

Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate

