[sugar] [Wikireader] english wikireaders and 0.7

Samuel Klein sj at laptop.org
Sun Sep 7 19:14:23 EDT 2008


I'm copying the Sugar list, since having an updated English Wikipedia
slice will be awesome, and others may want to get involved.

SJ

ps - The 2% stat referenced below comes from Andrew Cates, who finds
that at any given moment an article has a 2% chance of containing some
sort of vandalism/error/flaw (and a 0.5% chance of the same when
looking at the last trusted editor's contribs).  That makes it helpful
to have specific revision ids for the articles in a snapshot.
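
A rough sketch of the revision-id grab discussed below, assuming the
page lists each article as a link containing title=...&oldid=... (the
real page format may differ; extract_revision_ids is a hypothetical
name, not existing code):

```python
import re
from urllib.parse import unquote

# Assumption: each pinned article appears as a link of the form
#   .../index.php?title=Some_Title&oldid=12345
# ("&amp;" is also accepted in case the page source is HTML).
_OLDID_LINK = re.compile(r"title=([^&\s|\]]+)&(?:amp;)?oldid=(\d+)")


def extract_revision_ids(page_text):
    """Map article title -> pinned revision id for every oldid link found."""
    revisions = {}
    for raw_title, oldid in _OLDID_LINK.findall(page_text):
        # Undo URL escaping and the underscore-for-space convention.
        title = unquote(raw_title).replace("_", " ")
        revisions[title] = int(oldid)
    return revisions
```

If a title appears more than once, the last oldid wins; whether that is
the right policy depends on how the list page is maintained.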


On Sun, Sep 7, 2008 at 7:11 PM, Samuel Klein <sj at laptop.org> wrote:
> Where is the code for this?  Lede-detection code is a priority for me,
> and I'd like to work on it.  It should be easy to sense the start of
> the first H2 and drop the rest of the article.
>
> Is there some way to estimate the size impact on the whole of adding
> one template (given how often it is referenced)?  If we could rank
> templates by their footprint, it would be easier to "fill up" a space
> allocation for them, as we do for images.
>
> SJ
>
> On Sun, Sep 7, 2008 at 7:02 PM, Chris Ball <cjb at laptop.org> wrote:
>> Hi SJ,
>>
>>   > To Andrew -- thank you.  The 2% vandalism stat is very valuable!
>>   > CJB, would it be possible to grab revision ids from this page,
>>   > wherever there is a simple newline/title/oldid= ?
>>
>> Possible, yeah, but I'm not sure it'll be the best use of the time I
>> have remaining to work on this once the work-week starts up again and I
>> get back to blockers for the release.  We'd have to switch over from the
>> "current versions" archive to the "all versions" archive, and then write
>> scripts to create a new archive with the versions we want.
>>
>>   > Other replies inline: I am working on an article list here:
>>
>>   >   http://en.wikipedia.org/wiki/User:Sj/en-g1g1#D
>>
>>   > Agreed.  It seems that removing extraneous references to Harry
>>   > Potter frees up another thousand articles or so...
>>
>> Can't tell whether this was humor.  ;-)
>>
>>   > en:wp articles tend to grow without shrinking.  Like you, I'm
>>   > worried about not having enough articles to make a valuable
>>   > reference work, especially in the sense of having a solid network
>>   > of internal links.  I also see in this snapshot a lot of articles
>>   > that are interesting but don't need to be nearly so detailed for
>>   > our audience (and may simply bore).
>>
>>   > Can we try 6000 articles + 21000 ledes, to include every article in
>>   > Martin's list?
>>
>> In principle, yeah, but like the revisions work it requires new work
>> for detecting leads and putting them into their own articles.  My
>> gut feeling is that this work just isn't important enough for this
>> particular snapshot where our users have access to the net if they
>> need it.  (Given time constraints.)
>>
>>   > I'm also happy with making this larger than 100MB for g1g1, perhaps
>>   > even 150MB.  In the future our goal can be to expand coverage while
>>   > reducing size... with less time pressure.
>>
>> Absolutely.
>>
>>   > We definitely need a template blacklist again.  How about the top
>>   > 5000, excluding certain template categories?
>>
>> Another 5000 (small) articles is going to have a big impact on disk
>> space, I think.  We'll see how it looks.
>>
>> Oh, Mad reminded me that you wanted to see a list of the 2k articles
>> that are in the 10k slice and not the 8k slice.  Here it is:
>>
>>   http://dev.laptop.org/~cjb/enwiki/8k-10k-diff
>>
>> - Chris.
>> --
>> Chris Ball   <cjb at laptop.org>
>>
>
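
A minimal sketch of the lede detection SJ describes above (find the
start of the first H2 and drop the rest), assuming the input is raw
wikitext where an H2 is a line of the form "== Heading ==".  This is not
the snapshot code, and extract_lede is a hypothetical name:

```python
import re

# A level-2 heading: a line that starts with exactly two "=" characters
# (so "=== Sub ===" does not match) and ends with "==".
_H2 = re.compile(r"^==[^=].*?==\s*$", re.MULTILINE)


def extract_lede(wikitext):
    """Return everything before the first H2, or the whole text if none."""
    match = _H2.search(wikitext)
    if match is None:
        return wikitext.rstrip()
    return wikitext[:match.start()].rstrip()
```

Anything that sits above the first heading (infoboxes, hatnotes, images)
stays in the lede, so a template blacklist would still matter here.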


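
For the template question in SJ's message, one possible way to rank
templates by footprint and then fill a space allocation.  The cost model
(uses x expanded bytes per use) and all function names are assumptions
made for illustration; the per-use sizes would have to be measured
separately:

```python
import re
from collections import Counter

# Matches "{{Name|" or "{{Name}}"; skips parser functions such as {{#if:.
_TEMPLATE_CALL = re.compile(r"\{\{\s*([^|{}#:\n]+?)\s*(?:\||\}\})")


def count_template_uses(articles):
    """Count how often each template name is referenced across articles."""
    uses = Counter()
    for text in articles:
        for name in _TEMPLATE_CALL.findall(text):
            uses[name.strip()] += 1
    return uses


def rank_by_footprint(uses, expanded_bytes):
    """Estimate footprint as uses * bytes per use, largest first."""
    footprint = {
        name: count * expanded_bytes.get(name, 0)
        for name, count in uses.items()
    }
    return sorted(footprint.items(), key=lambda item: item[1], reverse=True)


def fill_budget(ranked, budget_bytes):
    """Greedily keep the cheapest templates until the budget is used up."""
    chosen = []
    used = 0
    for name, cost in sorted(ranked, key=lambda item: item[1]):
        if used + cost <= budget_bytes:
            chosen.append(name)
            used += cost
    return chosen, used
```

Cheapest-first is only one policy; ranking by how many articles a
template improves per byte would be another reasonable choice.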