an interesting filesystem challenge: static pull of wiki.laptop.org

Wed Nov 12 15:23:11 EST 2008

I've been preparing a static pull of wiki.laptop.org to send to
bandwidth-challenged regions, as well as to use as a failover in case
of high load.

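It's basically a simple
  wget -EkKm http://wiki.laptop.org
of the site: -E adds .html extensions, -k converts links for local
viewing, -K keeps the pre-conversion originals as .orig files, and -m
mirrors recursively.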

Interesting fact: the root directory contains 1,061,633 separate
files, and an 'ls' of that directory takes 9m24s.  This is an
ext3-formatted partition.  Repeating the ls takes only 10s; Linux's
dcache is a marvel.
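
For anyone who wants to reproduce the count, note that ls sorts its
output by default, so part of the time may go to sorting rather than to
the filesystem.  A minimal sketch in Python (3.6 or later, for
os.scandir as a context manager) that streams the entries without
sorting or stat()ing them follows.  The function name count_entries and
the "." path are placeholders invented for illustration, and the sketch
does not settle how much of the 9m24s is sorting versus disk seeks.

```python
import os
import time


def count_entries(path):
    """Count directory entries by streaming them, without sorting or stat()."""
    count = 0
    with os.scandir(path) as entries:
        for _ in entries:
            count += 1
    return count


if __name__ == "__main__":
    # "." is a placeholder; point it at the root of the static pull.
    start = time.time()
    total = count_entries(".")
    print("%d entries in %.1fs" % (total, time.time() - start))
```

Comparing its cold-cache timing against plain ls would give a rough idea
of where the time goes.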

Apache seems to perform reasonably well serving files from such huge
directories.  Should I be concerned?  Can anyone suggest:
  a) a patched wget, or some other tool, that would create a sensible
directory structure instead of throwing everything together in the
root or /go/ directories?
  b) whether reformatting with reiserfs or some other filesystem is
worth the trouble?  ext3 already has hashed-btree (htree) directory
indexes when dir_index is enabled, so reiserfs isn't quite the obvious
win it used to be.
  c) a patched wget or other tool that will actually honor robot
exclusion directives in <meta> tags in page headers?  wget seems to
honor 'nofollow', but MediaWiki uses <meta name="robots"
content="noindex,nofollow" /> in the <head> of edit and printable
pages, and 'noindex' isn't enough to convince wget to delete the file
it just downloaded.  We really don't need those pages in the static
pull; they just bloat our directories.  I could rig a find script
after the fact, but I'd prefer not to go through the stage of having
a bazillion files in the directory before it's cleaned up.
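
For reference, the after-the-fact cleanup mentioned in (c) could look
roughly like the sketch below.  It is only a sketch under assumptions:
the names is_noindex and prune_noindex are invented here, the regex
expects name= before content= as in the MediaWiki tag quoted above, and
only the first 8 KB of each file is read on the guess that the <meta>
tag sits near the top of the page.  It defaults to a dry run that only
lists candidates.

```python
import os
import re

# Matches <meta name="robots" content="...">; assumes name precedes content,
# as in the MediaWiki output quoted in the message.
ROBOTS_META = re.compile(
    r'<meta\s+name=["\']robots["\']\s+content=["\']([^"\']*)["\']',
    re.IGNORECASE,
)


def is_noindex(html_text):
    """Return True if the page's robots <meta> tag lists 'noindex'."""
    match = ROBOTS_META.search(html_text)
    if match is None:
        return False
    directives = [d.strip().lower() for d in match.group(1).split(",")]
    return "noindex" in directives


def prune_noindex(root, head_bytes=8192, dry_run=True):
    """Walk root and list (or, if dry_run is False, delete) noindex pages.

    Only the first head_bytes of each file are inspected, on the
    assumption that the <meta> tag appears in the <head>.
    """
    doomed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                head = handle.read(head_bytes).decode("utf-8", "replace")
            if is_noindex(head):
                doomed.append(path)
                if not dry_run:
                    os.remove(path)
    return doomed
```

Calling prune_noindex("/path/to/pull") returns the list of candidates;
passing dry_run=False actually removes them.  Since wget -K leaves .orig
copies carrying the same <meta> tag, those would be caught as well.  This
still requires the full download first, which is exactly the stage I'd
like to avoid.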

I also tweaked the language settings on the wiki slightly to suppress
the &setlang=<LANG> links in the sidebar, which should reduce the
number of files by a factor of 5 or so; the existing "In other
languages" links are preferable for this purpose.  But maybe
the combined wisdom of devel@ can suggest other things I could be
trying.
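
The factor-of-5 estimate can be checked against the existing pull by
counting how many saved files carry setlang= in their names.  The sketch
below assumes wget stored the query string in each filename (its usual
behavior on Unix); setlang_fraction is a name invented for illustration.
If the estimate holds, roughly four files in five should match.

```python
import os


def setlang_fraction(path, marker="setlang="):
    """Return (matching, total): entries whose name contains marker, and all entries."""
    matching = 0
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            total += 1
            if marker in entry.name:
                matching += 1
    return matching, total


if __name__ == "__main__":
    # "." is a placeholder; point it at the root of the static pull.
    matching, total = setlang_fraction(".")
    print("%d of %d entries contain 'setlang='" % (matching, total))
```
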
 --scott

-- 
                         ( http://cscott.net/ )