[Wikireader] english wikireaders and 0.7

Samuel Klein sj at laptop.org
Sun Sep 7 18:17:13 EDT 2008


To Andrew -- thank you. The 2% vandalism stat is very valuable!
CJB, would it be possible to grab revision ids from this page,
wherever there is a simple newline/title/oldid= pattern?

http://en.wikipedia.org/w/index.php?title=Wikipedia%3AWikipedia_CD_Selection%2Fadditions_and_updates&diff=212207029&oldid=211039948
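Roughly, that scrape could look like the following -- an illustrative sketch only, assuming the page wikitext has already been fetched (the function name and sample text are mine, not from any existing script):

```python
import re

def extract_oldids(wikitext):
    """Collect revision ids from every 'oldid=<digits>' occurrence
    in the page text (e.g. lines of the form 'Title oldid=211039948')."""
    return [int(m) for m in re.findall(r"oldid=(\d+)", wikitext)]

sample = "Anarchism oldid=211039948\nApple oldid=212207029"
```

This just harvests every oldid on the page; pairing each id with its title line would be a small extension.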


Other replies are inline. I'm working on an article list here:

  http://en.wikipedia.org/wiki/User:Sj/en-g1g1#D

I'm pulling out articles I'd like to exclude at the start of each section.

On Sun, Sep 7, 2008 at 4:57 PM, Chris Ball <cjb at laptop.org> wrote:
> Hi SJ,
>
>   > Can you say more about this aggression against images?  I should
>   > think the smaller the # of articles the greater the impact of a
>   > nice selection of them :-) But I know what you mean -- you'd rather
>   > spend space on text.
>
> The nice selection of them will cause a smaller number of articles
> still, though, and 8000 is already far smaller than we'd hoped for.

Agreed.  It seems that removing extraneous references to Harry Potter
frees up another thousand articles or so...

> Where should we draw a line and say that fewer than n-thousand
> articles is not useful enough regardless of how many images we have?

en:wp articles tend to grow without shrinking.  Like you, I'm worried
about not having enough articles to make a valuable reference work,
especially in the sense of having a solid network of internal links.
I also see in this snapshot a lot of articles that are interesting but
don't need to be nearly so detailed for our audience (and may simply
bore them).

Can we try 6000 articles + 21000 ledes, to include every article in
Martin's list?

I'm also happy with making this larger than 100MB for g1g1, perhaps even 150MB.
In the future our goal can be to expand coverage while reducing
size... with less time pressure.


>   > What is it that takes time in the prep and review process?  How
>   > about one image per article linked from the main page?
>
> It's semi-automated (by running a bunch of scripts a couple of times),
> but the tasks are:
>
> * Find out what the largest size an image is used at on a page that
>  we include is, and reduce it to that size
> * Scale back the quality to reduce disk space
> * Review every image chosen for inappropriateness
>
> We're also missing all templates, I'm afraid, and we don't have a
> perfect method for determining which templates are used in our pageset.
> We're thinking about including the top-500 templates or so and leaving
> it at that; the full template collection is 174000 template articles.
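The first of those image tasks -- finding the largest size each image is used at across the pageset -- could be approximated with a naive pass over the wikitext. A sketch only (names are illustrative; it only sees explicit NNNpx sizes and ignores default thumbnail widths and galleries):

```python
import re
from collections import defaultdict

def max_display_widths(pages):
    """For each image, find the largest explicit pixel width it is shown
    at across the pageset, read from markup like
    [[Image:Foo.jpg|thumb|250px|caption]]."""
    widths = defaultdict(int)
    for text in pages:
        for m in re.finditer(r"\[\[Image:([^|\]]+)([^\]]*)\]\]", text):
            name = m.group(1).strip()
            for px in re.findall(r"\|(\d+)px", m.group(2)):
                widths[name] = max(widths[name], int(px))
    return dict(widths)

sample_pages = [
    "[[Image:Foo.jpg|thumb|250px|A caption]]",
    "See [[Image:Foo.jpg|100px]] and [[Image:Bar.png|50px]].",
]
```

Each image could then be downscaled to its recorded maximum before the quality-reduction pass.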

We definitely need a template blacklist again.
How about the top 5000, excluding certain template categories?
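Picking a top-N could start from a usage count over the pageset -- again an illustrative sketch, not the existing scripts: a naive regex pass that skips parser functions and won't fully handle nested or multi-line template calls.

```python
import re
from collections import Counter

def count_templates(pages):
    """Count {{Template}} invocations across page texts.
    Skips parser functions like {{#if: ...}}; nested templates
    are only partially counted."""
    counts = Counter()
    for text in pages:
        for m in re.finditer(r"\{\{\s*([^{}|#][^{}|]*?)\s*[|}]", text):
            counts[m.group(1)] += 1
    return counts

sample_pages = [
    "{{stub}} Some text {{Infobox country|name=Foo}}",
    "{{stub}} more text",
]
```

`counts.most_common(5000)` would then give the top-5000 candidates, which a category-based blacklist could filter afterwards.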

SJ

