[sugar] [Wikireader] english wikireaders and 0.7
Samuel Klein
sj at laptop.org
Sun Sep 7 19:14:23 EDT 2008
I'm copying the sugar list, since having an updated English Wikipedia
slice will be awesome, and others may want to get involved.
SJ
PS - The 2% stat referenced below comes from Andrew Cates, who finds
that at any given moment there is a 2% chance of some sort of
vandalism/error/flaw in an article (and a 0.5% chance of the same when
looking at the last trusted editor's contribs). That makes it helpful
to have specific revision ids for articles in a snapshot.
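
A minimal sketch of grabbing revision ids from a page of title/oldid
links. It assumes each entry contains a link of the form
"...index.php?title=<Title>&oldid=<digits>"; that layout is a guess,
and the names OLDID_LINK and extract_revision_ids are made up for
illustration.

```python
import re
from urllib.parse import unquote

# Assumed link shape: ...index.php?title=<Title>&oldid=<digits>
# The real page layout may differ; adjust the pattern to match it.
OLDID_LINK = re.compile(r"title=([^&\s|\]]+)&(?:amp;)?oldid=(\d+)")


def extract_revision_ids(page_text):
    """Return {title: revision id} for every title/oldid pair found."""
    revisions = {}
    for raw_title, oldid in OLDID_LINK.findall(page_text):
        # Undo URL escaping and the underscore-for-space convention.
        title = unquote(raw_title).replace("_", " ")
        revisions[title] = int(oldid)  # a later match for a title wins
    return revisions
```

Lines without an oldid link are simply skipped, so the same function
can run over a page that mixes pinned and unpinned titles.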
On Sun, Sep 7, 2008 at 7:11 PM, Samuel Klein <sj at laptop.org> wrote:
> Where is the code for this? Lede-detection code is a priority for me,
> and I'd like to work on it. It should be easy to sense the start of
> the first H2 and drop the rest of the article.
>
> Is there some way to estimate the size impact on the whole of adding
> one template (given how often it is referenced)? If we could rank
> templates by their footprint, it would be easier to "fill up" a space
> allocation for them, as we do for images.
>
> SJ
>
> On Sun, Sep 7, 2008 at 7:02 PM, Chris Ball <cjb at laptop.org> wrote:
>> Hi SJ,
>>
>> > To Andrew -- thank you. The 2% vandalism stat is very valuable!
>> > CJB, would it be possible to grab revision ids from this page,
>> > wherever there is a simple newline/title/oldid= ?
>>
>> Possible, yeah, but I'm not sure it'll be the best use of the time I
>> have remaining to work on this once the work-week starts up again and I
>> get back to blockers for the release. We'd have to switch over from the
>> "current versions" archive to the "all versions" archive, and then write
>> scripts to create a new archive with the versions we want.
>>
>> > Other replies inline: I am working on an article list here:
>>
>> > http://en.wikipedia.org/wiki/User:Sj/en-g1g1#D
>>
>> > Agreed. It seems that removing extraneous references to Harry
>> > Potter frees up another thousand articles or so...
>>
>> Can't tell whether this was humor. ;-)
>>
>> > en:wp articles tend to grow without shrinking. Like you, I'm
>> > worried about not having enough articles to make a valuable
>> > reference work, especially in the sense of having a solid network
>> > of internal links. I also see in this snapshot a lot of articles
>> > that are interesting but don't need to be nearly so detailed for
>> > our audience (and may simply bore).
>>
>> > Can we try 6000 articles + 21000 ledes, to include every article in
>> > Martin's list?
>>
>> In principle, yeah, but like the revisions work it requires new work
>> for detecting leads and putting them into their own articles. My
>> gut feeling is that this work just isn't important enough for this
>> particular snapshot where our users have access to the net if they
>> need it. (Given time constraints.)
>>
>> > I'm also happy with making this larger than 100MB for g1g1, perhaps
>> > even 150MB. In the future our goal can be to expand coverage while
>> > reducing size... with less time pressure.
>>
>> Absolutely.
>>
>> > We definitely need a template blacklist again. How about the top
>> > 5000, excluding certain template categories?
>>
>> Another 5000 (small) articles is going to have a big impact on disk
>> space, I think. We'll see how it looks.
>>
>> Oh, Mad reminded me that you wanted to see a list of the 2k articles
>> that are in the 10k slice and not the 8k slice. Here it is:
>>
>> http://dev.laptop.org/~cjb/enwiki/8k-10k-diff
>>
>> - Chris.
>> --
>> Chris Ball <cjb at laptop.org>
>>
>
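
The lede detection discussed in the thread above (sense the start of
the first H2 and drop the rest of the article) could be sketched
roughly as follows. This assumes the input is raw wikitext, where an
H2 is a line like "== History =="; for rendered HTML one would look
for the first "<h2" instead. FIRST_H2 and extract_lede are
hypothetical names, not existing wikireader code.

```python
import re

# A level-2 wikitext heading: exactly two '=' at the start of the line,
# e.g. "== History ==". Deeper headings ("=== ... ===") do not match.
FIRST_H2 = re.compile(r"^==[^=\n][^\n]*?==[ \t]*$", re.MULTILINE)


def extract_lede(wikitext):
    """Return everything before the first level-2 heading."""
    match = FIRST_H2.search(wikitext)
    if match is None:
        # No sections at all: treat the whole article as the lede.
        return wikitext.rstrip()
    return wikitext[:match.start()].rstrip()
```

Templates and infoboxes that appear before the first heading are kept
as-is, so any template filtering would still have to run separately.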