[Wikireader] english wikireaders and 0.7
Samuel Klein
sj at laptop.org
Sun Sep 7 16:05:33 EDT 2008
Andrew, what is that list for precisely? Is it a list of good
revisions, or a list of the revision IDs checked at the time and any
comments made? We won't be able to apply comments by hand, but might
be able to use a specific revision rather than the latest one.
I checked a handful of the revs listed, and in general the latest
revision seems just as appropriate and usually more detailed or better
sourced.
@cjb and mad -- I'm compiling a list of mods; can you post the 2000
articles between 8k and 10k?
I'm both creating a tiny blacklist and adding some articles clearly
needed to complete various sets (which must have fallen a bit
above/below the cutoff). Note for the future: we need to have
categories/groups with aggregate priorities, and quantized priorities
within each category, so that a selection can be fit into a given size
without including 88 of 90 elements or 8 of 10 world cups.
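
A minimal sketch of that idea, assuming each group carries one aggregate
priority and its articles are bucketed into a few priority tiers. The
names Group, tiers and select are hypothetical and not part of any
existing selection script; the sizes are made-up illustration values.
A tier is taken whole or not at all, so a set never ends up with a few
members missing.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """A set of related articles (hypothetical model, for illustration)."""
    name: str
    priority: float  # aggregate priority of the whole group
    # tier number -> list of (title, size); tier 0 is the most important
    tiers: dict = field(default_factory=dict)


def select(groups, budget):
    """Greedily take whole tiers until the size budget is used up.

    Tiers are visited lowest tier first, and within a tier the
    highest-priority group first. If a tier does not fit, the rest of
    that group is skipped so its tiers stay nested.
    """
    chosen, used = [], 0
    blocked = set()
    units = sorted(
        ((g, t) for g in groups for t in g.tiers),
        key=lambda unit: (unit[1], -unit[0].priority),
    )
    for group, tier in units:
        if group.name in blocked:
            continue
        size = sum(nbytes for _, nbytes in group.tiers[tier])
        if used + size > budget:
            blocked.add(group.name)  # never include a partial set
            continue
        chosen.extend(title for title, _ in group.tiers[tier])
        used += size
    return chosen, used


# Illustrative data only: sizes are arbitrary units, not measurements.
elements = Group("elements", 0.9, {
    0: [("Hydrogen", 10), ("Helium", 10), ("Lithium", 10)],
})
world_cups = Group("world cups", 0.5, {
    0: [("1930 FIFA World Cup", 20), ("1934 FIFA World Cup", 20)],
    1: [("2006 FIFA World Cup squads", 30)],
})

small, small_used = select([elements, world_cups], budget=60)
large, large_used = select([elements, world_cups], budget=70)
```

With the smaller budget only the complete elements set is kept; with
the larger one the first world-cup tier also fits in full, and the
second tier is left out as a unit.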
SJ
On Sun, Sep 7, 2008 at 2:26 AM, Andrew Cates <Andrew at soschildren.org> wrote:
> Samuel,
>
> Some of the volunteers listed their comments online like this:
> http://en.wikipedia.org/w/index.php?title=Wikipedia%3AWikipedia_CD_Selection%2Fadditions_and_updates&diff=212207029&oldid=211039948
>
> But although we took out vandalism when it was still visible in the
> current versions, we did not generally remove adult stuff or make other
> editorial changes to the main WP. The database contains a "section
> delete" and a "string delete" specific to each article as well as the
> generic ones (so 1800 births are not taken out but 1985 births are).
> In a few cases where the script was hard to do I actually did do the
> changes and then self-reverted, so there is a child-friendly version
> in the edit history, mainly year pages and serial killers (e.g.
> http://en.wikipedia.org/w/index.php?title=1998&diff=219094725&oldid=219094618)
> but for things like systematic section removal it was not done on WP.
> Births is not taken out in this revision because the script will catch
> it.
>
> This, needless to say, is a lot of work...
>
> Andrew
> =============================
>
> On Sat, Sep 6, 2008 at 9:21 PM, Samuel Klein <sj at laptop.org> wrote:
>> That's great, thank you Andrew. Do you post these changes back to wp
>> proper? I'd like for every article revision we include in our
>> bundle to have a permalink online. (and it makes sense to me that
>> some other people who currently only read wp might like your versions
>> as well...)
>>
>> I will certainly support you in running an SOS-bot that publishes its
>> preferred cleaner revisions to articles, with an edit summary
>> indicating it is posting the version from the latest
>> childrens-wikipedia, and a bot-option to self-revert and leave a
>> message on the talk page (if editors start to get annoyed with it --
>> that way the regulars on any given article can choose to include or
>> not include its changes, but it doesn't change the latest-current
>> version and start what may already be ongoing edit wars).
>>
>> SJ
>>
>> (You know the content-review is overseen by a Wikipedian when... it
>> includes cleaning out 'births' since 1980 and 'trivia' sections in
>> bios. :-)
>>
>> On Sat, Sep 6, 2008 at 2:22 AM, Andrew Cates <Andrew at soschildren.org> wrote:
>>> Hi Samuel
>>>
>>> Just to be clear, we have finished checking our 5400 articles for
>>> vandalism etc and have this list. But as well as choosing versions we
>>> have a cleanup script which removes unsuitable paragraphs within
>>> articles, and editorial notices (e.g. empty sections, "see also" to
>>> articles not on the list, the sections labelled "personal life" in
>>> biographies which tend to be full of speculation about sexual
>>> orientation, the "births" section in years post 1980 which is full of
>>> rubbish, topic boxes where most of them are not included, category
>>> lists from portal pages, editorial notices where the issue is minor
>>> etc.). The remaining two weeks' work is on the script not on finding
>>> the versions.
>>>
>>> The "near current" state of play is at
>>> http://schools-wikipedia-test.soschildren.org/wp/index/subject.htm
>>> which is only a week old.
>>>
>>> Andrew
>>>
>>> On Fri, Sep 5, 2008 at 6:46 PM, Samuel Klein <sj at laptop.org> wrote:
>>>> Thanks for the update. bozmo, it's great to hear your group is
>>>> working on assessments as well... we won't be able to wait another two
>>>> weeks for a revised version list, but may be able to recompile once
>>>> next week. However, I think for olpc's coming release we want a final
>>>> draft bundle this weekend.
>>>>
>>>> Warmly,
>>>> SJ
>>>>
>>>> On Thu, Sep 4, 2008 at 5:01 PM, Martin Walker <walkerma at potsdam.edu> wrote:
>>>>> We found a bug in the SelectionBot script that was affecting some unassessed
>>>>> articles. That has now been fixed, and there is now an updated set of
>>>>> results, with about 28,000 articles selected.
>>>>>
>>>>> http://toolserver.org/~cbm/release-data/2008-9-4/HTML/index.html
>>>>>
>>>>>
>>>>> As for the small detailed fixes, we'll have to work on those at the weekend.
>>>>>
>>>>> Martin
>>>>> Walkerma on Wikipedia
>>>>>
>>>>> Samuel Klein wrote:
>>>>>>
>>>>>> ok, let's meet friday at 1500 EST on #kiwix on freenode,
>>>>>> for those who can make it, to discuss making a main page for an english
>>>>>> 0.7 wikipedia bundle.
>>>>>>
>>>>>> SJ
>>>>>>
>>>>>> On Thu, Aug 28, 2008 at 12:20 PM, Martin Pascal <pmartin at linterweb.com> wrote:
>>>>>>
>>>>>> Yes SJ,
>>>>>>
>>>>>> you could join #kiwix on irc.freenode.net
>>>>>> Regards,
>>>>>> Martin Pascal
>>>>>> tel : 02 32 40 23 69, fax : 02 32 61 45 26
>>>>>> gsm : 06 13 89 77 32
>>>>>> ----- Original Message -----
>>>>>> From: "Martin Walker" <walkerma at potsdam.edu>
>>>>>> To: "Samuel Klein" <sj at laptop.org>
>>>>>> Cc: "Madeleine Ball" <mad at printf.net>; "Offline Wikireaders" <wikireader at lists.laptop.org>
>>>>>> Sent: Thursday, August 28, 2008 6:16 PM
>>>>>> Subject: Re: [Wikireader] english wikireaders and 0.7
>>>>>>
>>>>>>
>>>>>> SJ,
>>>>>>
>>>>>> I can manage an IRC meeting on Friday - say at 3pm EDT (1900h UTC)?
>>>>>> If this is difficult for others, I will be around next week. We have
>>>>>> the #wikipedia-1.0 channel ( irc://irc.freenode.net/#wikipedia-1.0 )
>>>>>> if you wish, but perhaps you have a wikireader channel that may be
>>>>>> more appropriate?
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>>
>>>>>> Samuel Klein wrote:
>>>>>>
>>>>>> @martin -- How about having a Friday afternoon wikireader meeting?
>>>>>> For this week, whether or not we meet, a pressing question is:
>>>>>> Generating the main page. For the spanish WP, Madeleine did most of
>>>>>> the main page by hand with a bit of help. We may have to do the same
>>>>>> here until better scripts are set up.
>>>>>>
>>>>>> A couple people built the main page for our spanish-language bundle
>>>>>> more or less by hand from a portal template.
>>>>>>
>>>>>> Metadata:
>>>>>>
>>>>>> 1. metadata that is currently particularly useful for us is:
>>>>>> - a blacklist of article titles, and a blacklist of images, for the
>>>>>>   very few that we explicitly leave out despite other metadata
>>>>>> - a whitelist of both, again to ensure inclusion.
>>>>>>
>>>>>> 2. In a general system, I'd like to see this tagged with the name of
>>>>>> the group associated; say olpc-peru-blacklist and
>>>>>> olpc-peru-whitelist.
>>>>>>
>>>>>> @cfabian -- testing this on bee units sounds like a fun test of the
>>>>>> metadata slimming!
>>>>>>
>>>>>> SJ
>>>>>>
>>>>>> ps - any news from the offline spanish wp project that got started a
>>>>>> while back?
>>>>>>
>>>>>>
>>>>>> On Sun, Aug 24, 2008 at 6:12 PM, Martin Walker <walkerma at potsdam.edu> wrote:
>>>>>>
>>>>>> Things are looking very promising for the Version 0.7 selection -
>>>>>> we should have a complete article list within a week or so,
>>>>>> containing about 30,000 articles organized by a combination of
>>>>>> quality and importance. With our basic system of compression (using,
>>>>>> I think, probably Zeno format), I believe we should be able to
>>>>>> include 30,000 long-ish articles with thumbnails on one DVD, along
>>>>>> with Kiwix and some index pages. I'd be interested to see how it
>>>>>> would work with your compression system - we could get a few people
>>>>>> to test that, I think.
>>>>>>
>>>>>> I know how you love metadata, SJ, and we now have loads of it (from
>>>>>> 1.4 million articles) - so we can customize the selection for you at
>>>>>> will using quality, wikiproject, or the four importance parameters.
>>>>>> Since this is for kids in specific places, we can emphasize
>>>>>> dinosaurs or birds, exclude serial killers, or include all articles
>>>>>> from (say) Uganda, all as requested. Let me know if this feature is
>>>>>> useful. We don't have an equivalent ranking for images, I'm afraid -
>>>>>> for V0.7 we just include all legal images (as thumbnails). As for a
>>>>>> "main page", the plan is to have a set of index pages generated by
>>>>>> bot and then corrected by a manual "reality check", but that will
>>>>>> take another month or two.
>>>>>>
>>>>>> I'd really like to make sure that we work together in the coming
>>>>>> months, because I think we can avoid a lot of duplicate work if we
>>>>>> share our best resources, scripts, etc. Once the selection is done
>>>>>> (~ 1st Sept), should we hold an IRC discussion on how we can best
>>>>>> collaborate?
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>>
>>>>>> Samuel Klein wrote:
>>>>>>
>>>>>> There's lots of motivation to get an english wikireader, say,
>>>>>> taking advantage of the article selection and processing of 0.7.
>>>>>> OLPC could include this in the upcoming G1G1 machines this winter /
>>>>>> early next year. Other users could test wikireaders that read this
>>>>>> zipped format on their own machines, which would flesh out the
>>>>>> reader code.
>>>>>>
>>>>>> Martin -- what's the status on the 0.7 articlelist? Do you have a
>>>>>> similar imagelist that ranks images by importance to that set of
>>>>>> articles?
>>>>>> How is work on a 0.7 main page? I'd love to see how large a snapshot
>>>>>> is with our current wikireader code (without even moving to 7z, or
>>>>>> trimming the list).
>>>>>>
>>>>>> SJ
>>>>>>
>>>>>> _______________________________________________
>>>>>> Wikireader mailing list
>>>>>> Wikireader at lists.laptop.org
>>>>>> http://lists.laptop.org/listinfo/wikireader
>>>>>>