[Wikireader] english wikireaders and 0.7
Madeleine Ball
meprice at gmail.com
Sun Sep 7 02:04:49 EDT 2008
Starting with a list of 27,835 unique article names Martin Walker provided:
http://toolserver.org/~cbm/release-data/2008-9-4/HTML/index.html
I found six pages in the list that were redirects to other pages:
"Abrahamic religion" redirects to "Abrahamic religions"
"Turkcell Super League" redirects to "Süper Lig"
"Levodopa" redirects to "L-DOPA"
"City & South London Railway" redirects to "City and South London
Railway"
"Matilda d'Anjou" redirects to "Empress Matilda"
"Solidarity" redirects to "Solidarity (trade union)"
143 pages on this list are "orphaned" in the sense that they are not
linked to by any others on the list.
There are 49,299 redirects to selected pages linked to by other
selected pages. For example, "1001 Arabian Nights", "1001 nights",
and "1001 Nights" are all linked to by selected pages ("Caliph",
"Harun al-Rashid", and "Sassanid Empire") and would redirect to "One
Thousand and One Nights", also a selected page. Including these and
excluding the internally orphaned list results in a list of 76,991
pages.
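
A rough sketch of how a list like this could be assembled, in Python. This
is not the script actually used; build_page_list, the input shapes, and the
toy link data (including which page links to which redirect) are assumptions
made for illustration. It also assumes that a link through a redirect counts
as a link to the redirect's target when deciding what is orphaned.

```python
def build_page_list(selected, links, redirects):
    """Return (orphans, kept_redirects, final_list).

    selected  -- set of selected article titles
    links     -- dict: selected title -> set of link targets as written
    redirects -- dict: redirect title -> title it redirects to
    All three structures are hypothetical stand-ins for the real data.
    """
    linked_titles = set()  # link targets exactly as written in the source page
    linked_pages = set()   # link targets after following one redirect hop
    for source in selected:
        for target in links.get(source, ()):
            resolved = redirects.get(target, target)
            if resolved == source:
                continue  # a page linking to itself does not rescue it
            linked_titles.add(target)
            linked_pages.add(resolved)

    # Orphans: selected pages that no other selected page links to.
    orphans = selected - linked_pages

    # Redirects worth shipping: linked from a selected page and
    # pointing at a selected page.
    kept_redirects = {
        title for title in linked_titles
        if title in redirects and redirects[title] in selected
    }

    final_list = (selected - orphans) | kept_redirects
    return orphans, kept_redirects, final_list


# Toy data; the pairing of pages and redirects is assumed.
selected = {"Caliph", "Harun al-Rashid", "Sassanid Empire",
            "One Thousand and One Nights", "Süper Lig"}
links = {
    "Caliph": {"1001 Arabian Nights", "Harun al-Rashid"},
    "Harun al-Rashid": {"1001 nights", "Caliph"},
    "Sassanid Empire": {"1001 Nights", "Caliph"},
}
redirects = {
    "1001 Arabian Nights": "One Thousand and One Nights",
    "1001 nights": "One Thousand and One Nights",
    "1001 Nights": "One Thousand and One Nights",
    "Turkcell Super League": "Süper Lig",
}

orphans, kept_redirects, final_list = build_page_list(selected, links, redirects)
```

The same bookkeeping applied to the real counts gives
27,835 - 143 + 49,299 = 76,991.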
Compressed into bz2, the size of this is 230MB without redirects and
233MB with them, about 3x our target size (which is about 80MB,
hoping to use 20MB for images for a total of 100MB). This is
disappointing... we wanted to be able to fit more, but not surprising
(SJ predicted it) given the greater amount of content on English
Wikipedia.
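
For anyone who wants to check a candidate list against the budget, here is
a minimal sketch using Python's standard bz2 module. The 80MB figure is the
text target mentioned above; compressed_size_mb and the sample text are
made up for illustration, and treating 1MB as 1024*1024 bytes is an
assumption.

```python
import bz2

TEXT_BUDGET_MB = 80  # text target from this message; images get the rest


def compressed_size_mb(texts):
    """Stream article texts through one bz2 compressor and return the
    total compressed size in MB (1 MB taken as 1024 * 1024 bytes)."""
    compressor = bz2.BZ2Compressor()
    total_bytes = 0
    for text in texts:
        total_bytes += len(compressor.compress(text.encode("utf-8")))
    total_bytes += len(compressor.flush())  # emit whatever is still buffered
    return total_bytes / (1024 * 1024)


# Toy input: one highly repetitive "article".
sample_articles = ["One Thousand and One Nights " * 1000]
size_mb = compressed_size_mb(sample_articles)
fits_budget = size_mb <= TEXT_BUDGET_MB
```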
We tried starting with the top 10k set, using the highest "overall
score" for each page (many pages are listed more than once, with
different scores); this came out to 112MB. Starting with an 8k
set, we got it down to 97MB; we'll probably be using this one.
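
A small sketch of that selection step: keep the highest score seen for each
title, then take the top N. The function name, the (title, score) row
format, and the scores are assumptions for illustration, not the real
release data.

```python
def top_n_by_best_score(rows, n):
    """rows is an iterable of (title, overall_score) pairs; a title may
    appear more than once. Return the n titles with the highest best score."""
    best = {}
    for title, score in rows:
        # Keep only the highest score seen for each title.
        if title not in best or score > best[title]:
            best[title] = score
    # Sort by score descending, breaking ties by title for stable output.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:n]]


# Made-up rows; "Dinosaur" and "Bird" are each listed twice.
rows = [
    ("Dinosaur", 1500),
    ("Bird", 1200),
    ("Dinosaur", 1800),
    ("Uganda", 900),
    ("Bird", 1100),
]
top_two = top_n_by_best_score(rows, 2)
```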
I've tried looking at the raw top-N articles using traffic, and
there's a lot of noise from having traffic stats for just one month.
The combined list looks to be clearly superior. Thank you!
Chris will soon send out a link to the new activity for people to
download onto their XO.
- Madeleine
On Sep 4, 2008, at 5:01 PM, Martin Walker wrote:
> We found a bug in the SelectionBot script that was affecting some
> unassessed articles. That has now been fixed, and there is now an
> updated set of results, with about 28,000 articles selected.
>
> http://toolserver.org/~cbm/release-data/2008-9-4/HTML/index.html
>
>
> As for the small detailed fixes, we'll have to work on those at the
> weekend.
>
> Martin
> Walkerma on Wikipedia
>
> Samuel Klein wrote:
>> ok, let's meet friday at 1500 EST on #kiwix on freenode,
>> for those who can make it, to discuss making a main page for an
>> english 0.7 wikipedia bundle.
>>
>> SJ
>>
>> On Thu, Aug 28, 2008 at 12:20 PM, Martin Pascal
>> <pmartin at linterweb.com> wrote:
>>
>> Yes Sj ,
>>
>> you could join #kiwix on irc.freenode.net
>> Cordialement
>> Martin Pascal
>> tel : 02 32 40 23 69, fax : 02 32 61 45 26
>> gsm : 06 13 89 77 32
>> ----- Original Message ----- From: "Martin Walker"
>> <walkerma at potsdam.edu>
>>
>> To: "Samuel Klein" <sj at laptop.org>
>> Cc: "Madeleine Ball" <mad at printf.net>;
>> "Offline Wikireaders" <wikireader at lists.laptop.org>
>> Sent: Thursday, August 28, 2008 6:16 PM
>>
>> Subject: Re: [Wikireader] english wikireaders and 0.7
>>
>>
>> SJ,
>>
>> I can manage an IRC meeting on Friday - say at 3pm EDT (1900h UTC)?
>> If this is difficult for others, I will be around next week. We have
>> the #wikipedia-1.0 channel ( irc://irc.freenode.net/#wikipedia-1.0 )
>> if you wish, but perhaps you have a wikireader channel that may be
>> more appropriate?
>>
>> Martin
>>
>>
>> Samuel Klein wrote:
>>
>> @martin -- How about having a Friday afternoon wikireader meeting?
>> For this week, whether or not we meet, a pressing question is:
>> Generating the main page. For the spanish WP, Madeleine did most of
>> the main page by hand with a bit of help. We may have to do the same
>> here until better scripts are set up.
>>
>> A couple people built the main page for our spanish-language bundle
>> more or less by hand from a portal template.
>>
>> Metadata :
>>
>> 1. metadata that is currently particularly useful for us is:
>> - a blacklist of article titles, and a blacklist of images, for the
>> very few that we explicitly leave out despite other metadata
>> - a whitelist of both, again to ensure inclusion.
>>
>> 2. In a general system, I'd like to see this tagged with the name of
>> the group associated; say olpc-peru-blacklist and
>> olpc-peru-whitelist.
>>
>> @cfabian -- testing this on bee units sounds like a fun test of the
>> metadata slimming!
>>
>> SJ
>>
>> ps - any news from the offline spanish wp project that got started a
>> while back?
>>
>>
>> On Sun, Aug 24, 2008 at 6:12 PM, Martin Walker
>> <walkerma at potsdam.edu> wrote:
>>
>> Things are looking very promising for the Version 0.7 selection - we
>> should have a complete article list within a week or so, containing
>> about 30,000 articles organized by a combination of quality and
>> importance. With our basic system of compression (using, I think,
>> probably Zeno format), I believe we should be able to include 30,000
>> long-ish articles with thumbnails on one DVD, along with Kiwix and
>> some index pages. I'd be interested to see how it would work with
>> your compression system - we could get a few people to test that, I
>> think.
>>
>> I know how you love metadata, SJ, and we now have loads of it (from
>> 1.4 million articles) - so we can customize the selection for you at
>> will using quality, wikiproject, or the four importance parameters.
>> Since this is for kids in specific places, we can emphasize
>> dinosaurs or birds, exclude serial killers, or include all articles
>> from (say) Uganda, all as requested. Let me know if this feature is
>> useful. We don't have an equivalent ranking for images, I'm afraid -
>> for V0.7 we just include all legal images (as thumbnails). As for a
>> "main page", the plan is to have a set of index pages generated by
>> bot and then corrected by a manual "reality check", but that will
>> take another month or two.
>>
>> I'd really like to make sure that we work together in the coming
>> months, because I think we can avoid a lot of duplicate work if we
>> share our best resources, scripts, etc. Once the selection is done
>> (~ 1st Sept), should we hold an IRC discussion on how we can best
>> collaborate?
>>
>> Martin
>>
>>
>> Samuel Klein wrote:
>>
>> There's lots of motivation to get an english wikireader, say, taking
>> advantage of the article selection and processing of 0.7. OLPC could
>> include this in the upcoming G1G1 machines this winter / early next
>> year. Other users could test wikireaders that read this zipped
>> format on their own machines, which would flesh out the reader code.
>>
>> Martin -- what's the status on the 0.7 articlelist? Do you have a
>> similar imagelist that ranks images by importance to that set of
>> articles?
>> How is work on a 0.7 main page? I'd love to see how large a snapshot
>> is with our current wikireader code (without even moving to 7z, or
>> trimming the list).
>>
>> SJ