an interesting filesystem challenge: static pull of wiki.laptop.org
C. Scott Ananian
cscott at cscott.net
Wed Nov 12 15:23:11 EST 2008
I've been preparing a static pull of wiki.laptop.org to send to
bandwidth-challenged regions, as well as to use as a failover in case
of high load.
It's basically a simple:
    wget -EkKm http://wiki.laptop.org
of the site.
Interesting fact: the root directory contains 1,061,633 separate
files, and an 'ls' of that directory takes 9m24s. This is an
ext3-formatted partition. Repeating the ls takes only 10s; Linux's
dcache is a marvel.
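For anyone wanting to reproduce that cold-versus-warm comparison, here is a
minimal sketch, assuming Python 3 is available on the host. It streams the
directory entries and counts them without sorting the names, which plain 'ls'
does by default. The function name time_listing is a placeholder of mine, not
something from the pull setup.

```python
import os
import time


def time_listing(path):
    """Count the entries in `path` and report how long the scan took.

    os.scandir yields entries lazily, so nothing is sorted and no
    per-file stat() call is made here.
    """
    start = time.perf_counter()
    count = 0
    with os.scandir(path) as entries:
        for _ in entries:
            count += 1
    elapsed = time.perf_counter() - start
    return count, elapsed
```

Calling time_listing twice in a row on the root of the pull should show the
same kind of gap between the first (cold) and second (cached) run, though the
exact numbers will depend on the disk and on memory pressure.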
Apache seems to perform reasonably well serving files from such huge
directories. Should I be concerned? Can anyone suggest:
a) a patched wget or a tool other than wget which would fabricate
appropriate directory structure to prevent everything from being
thrown together in the root or /go/ directories?
b) whether reformatting with reiserfs or some other filesystem is
worth the trouble? ext3 already has hashed B-tree (htree) directory
indexing, so reiserfs isn't quite the obvious win it used to be.
c) a patched wget or other tool that will actually honor robot
exclusion directives in <meta> tags in page headers? wget seems to
honor 'nofollow', but mediawiki uses <meta name="robots"
content="noindex,nofollow" /> in the <head> of edit and printable
pages, which isn't sufficient to convince wget to delete the file it
just downloaded. We really don't need those pages in the static pull;
they just bloat our directories. I could rig a find script after the
fact, but I'd prefer not to have to go through the stage of having a
bazillion files in the directory before it's cleaned up.
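As a fallback for (c), the after-the-fact cleanup could look roughly like the
sketch below. It is only an illustration, assuming Python 3 and assuming that
every page of interest ends in .html (which wget's -E option should arrange).
It parses each page's <meta name="robots"> tag and treats a "noindex"
directive as the signal to drop the file. The names RobotsMetaParser,
has_noindex and prune_noindex are hypothetical, and only the first 16 KB of
each file is read on the assumption that the <head> fits in that.

```python
import os
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <meta ... /> tags.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def has_noindex(html_text):
    """Return True if the page carries a robots 'noindex' directive."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return "noindex" in parser.directives


def prune_noindex(root, dry_run=True):
    """List the .html files under `root` marked noindex.

    The files are only deleted when dry_run is False.
    """
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(".html"):
                continue
            path = os.path.join(dirpath, name)
            # Assumption: the <head> fits in the first 16 KB of the page.
            with open(path, encoding="utf-8", errors="replace") as f:
                head = f.read(16384)
            if has_noindex(head):
                hits.append(path)
                if not dry_run:
                    os.remove(path)
    return hits
```

Running prune_noindex(root) first with the default dry_run=True gives a list
to sanity-check before deleting anything. The .orig backups that -K leaves
behind are not touched by this sketch and would need the same treatment. It
also does nothing about the real complaint, which is that the files land on
disk in the first place.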
I also tweaked the language settings on the wiki slightly, which
should reduce the number of files by a factor of 5 or so, by
suppressing the &setlang=<LANG> links in the side bar; the existing
"In other languages" links are preferable for this purpose. But maybe
the combined wisdom of devel@ can suggest other things I could be
( http://cscott.net/ )