Compression of HTML files
Ian Bicking
ianb at colorstudy.com
Wed Jul 25 14:29:41 EDT 2007
Arael8 (Petr?) posted a zip file of HTML files,
http://dictionary.110mb.com/files/short-wiki.zip
The files are pretty small, almost all from 500 bytes to 1.5K.
Here are the compressed sizes:
4.6M raw
1.3M short-wiki.zip
704K short-wiki.tar.gz
496K short-wiki.tar.bz2
3.8M gzipped
3.8M gzipped-1
3.8M bzipped
raw is the uncompressed files. gzipped is a directory where each
*individual* file is gzipped. gzipped-1 uses gzip -1 (faster/worse
compression), which apparently has a negligible impact here. bzip2 is a
better compression algorithm, but substantially slower and fairly CPU
intensive (on my relatively fast machine, bzip2 takes a noticeable
amount of time, whereas gzip takes much less).
I did this using "gzip -r *" to compress the directory (or "bzip2 `find
. ! -type d`" for bzip2), and "du -sh *" to get the sizes.
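As a rough cross-check, here's a minimal Python sketch of the same
measurement (not what I ran; the "short-wiki" directory name is assumed,
and du counts disk blocks, so the numbers won't match exactly):

    import gzip, io, os, tarfile

    SRC = "short-wiki"  # directory unpacked from short-wiki.zip (assumed name)

    # Sum of each file compressed on its own, like the gzipped directory.
    per_file = 0
    for root, dirs, files in os.walk(SRC):
        for name in files:
            with open(os.path.join(root, name), "rb") as f:
                per_file += len(gzip.compress(f.read()))

    # One compressed archive of the whole tree, like short-wiki.tar.gz.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        tar.add(SRC)

    print("per-file gzip total:", per_file)
    print("single .tar.gz:     ", len(buf.getvalue()))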
As you can see, compressing very small files individually works really
poorly. I *believe* this is what JFFS does. Compressing the files into
a single archive works much better, since the compressor can exploit
similarities between files.
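Here's a tiny illustration of that cross-file effect with zlib in
Python; the two example documents are made up, but two similar pages
compressed together should come out smaller than the sum of compressing
them separately:

    import zlib

    doc1 = b"<html><body><h1>apple</h1><p>A fruit.</p></body></html>"
    doc2 = b"<html><body><h1>banana</h1><p>A fruit.</p></body></html>"

    separate = len(zlib.compress(doc1)) + len(zlib.compress(doc2))
    together = len(zlib.compress(doc1 + doc2))
    print(separate, together)  # 'together' should be noticeably smaller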
Notably Lector (http://openberg.sourceforge.net) loads files from zip
files. I am pretty sure that zip files are considerably faster for
random access than .tar.gz files, which is why they are used in lots of
places despite being less compact. .tar.gz is really just suitable for
passing around archives that get completely unpacked on arrival.
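For what it's worth, here's a sketch of what that random access looks
like in Python: the zip's central directory lets you pull out a single
member without decompressing everything else, whereas a .tar.gz has to
be streamed through from the start:

    import zipfile

    with zipfile.ZipFile("short-wiki.zip") as z:
        names = z.namelist()      # cheap: just reads the central directory
        data = z.read(names[0])   # decompresses only this one member
        print(names[0], len(data), "bytes")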
We should probably ask over on the dev list how to see what the real
JFFS sizes are on the laptop, and how to estimate them accurately. What
I'm suggesting here is more of a guess than something based on anything
specific I know about JFFS.
On larger files (e.g., Wikipedia-sized versus dictionary-sized) these
numbers will probably look different, and gzipping individual files
won't look quite as bad.
--
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org