Compression of HTML files

Ian Bicking ianb at colorstudy.com
Wed Jul 25 14:29:41 EDT 2007


Arael8 (Petr?) posted a zip file of HTML files:
http://dictionary.110mb.com/files/short-wiki.zip

The files are pretty small, almost all from 500 bytes to 1.5K.

Here are the resulting sizes:

4.6M	raw
1.3M	short-wiki.zip
704K	short-wiki.tar.gz
496K	short-wiki.tar.bz2
3.8M	gzipped
3.8M	gzipped-1
3.8M	bzipped


raw is the uncompressed files.  gzipped is a directory where each 
*individual* file is gzipped, and bzipped is the same thing using bzip2.  
gzipped-1 uses gzip -1 (faster/worse compression), which apparently has 
negligible impact.  bzip2 is a better compression algorithm, but 
substantially slower and fairly CPU intensive (on my relatively fast 
machine, bzip2 takes a noticeable amount of time, where gzip takes much 
less time).

I did this using "gzip -r *" to compress the directory (or 
"bzip2 `find . ! -type d`" for bzip2), and "du -sh *" to get the amounts.

As you can see, compressing very small files individually works really 
poorly.  I *believe* per-file compression like this is what JFFS does.  
Compressing the files into a single archive works well, as the 
compressor can use similarities between files to increase the 
compression.
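
To illustrate the point above, here is a minimal Python sketch.  It is 
not what I ran to get the numbers in this message; the pages are 
made-up stand-ins generated by a hypothetical make_page() helper, and 
it uses the standard gzip module rather than the command-line tools.  
It compares gzipping many small, similar pages one at a time against 
gzipping them as a single stream.

```python
import gzip


def make_page(i):
    # Hypothetical stand-in for one small dictionary-style HTML page.
    return (
        "<html><head><title>Entry %d</title></head>\n"
        "<body><h1>Entry %d</h1>\n"
        "<p>This is the definition text for entry number %d.</p>\n"
        "</body></html>\n" % (i, i, i)
    ).encode("utf-8")


pages = [make_page(i) for i in range(200)]

# Total size with no compression at all.
raw_total = sum(len(p) for p in pages)

# Each page compressed on its own: every file pays the gzip header cost
# and the compressor never sees the markup shared with other pages.
individual_total = sum(len(gzip.compress(p)) for p in pages)

# All pages compressed as one stream: repeated markup is stored once
# and later pages are mostly back-references to earlier ones.
together = len(gzip.compress(b"".join(pages)))

print("raw:          %d bytes" % raw_total)
print("individually: %d bytes" % individual_total)
print("as one stream: %d bytes" % together)
```

The exact numbers depend entirely on the made-up pages, so treat them 
as an illustration of the effect rather than an estimate for the real 
files.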

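Notably Lector (http://openberg.sourceforge.net) loads files from zip 
files.  I am pretty sure that zip files are considerably faster for 
random access than .tar.gz files: a zip file has an index and compresses 
each member separately, so one file can be pulled out without reading 
the rest, while a .tar.gz is a single compressed stream that has to be 
decompressed from the start.  That is why zip files are used in lots of 
places despite being less compact.  .tar.gz is really just suitable for 
passing around archives that get completely unpacked on arrival.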
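
As a rough sketch of what loading a single page from a zip file looks 
like, here is an example using Python's standard zipfile module.  The 
member names and page contents are invented for the example, and this 
is not taken from Lector's code.

```python
import io
import zipfile

# Build a small zip archive in memory (a real one would live on disk).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for i in range(100):
        # Hypothetical member names and contents.
        zf.writestr("page%03d.html" % i,
                    "<html><body>Entry %d</body></html>" % i)

# Reopen the archive and read one member by name.  zipfile looks the
# name up in the archive's central directory and decompresses only
# that member, not the 99 others.
with zipfile.ZipFile(buf) as zf:
    data = zf.read("page042.html")

print(data.decode("utf-8"))
```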

We should probably ask over on the dev list about how to see what real 
JFFS sizes are on the laptop, and how to accurately estimate them.  What 
I'm suggesting here is more of a guess than based on anything specific I 
know about JFFS.

On larger files (e.g., Wikipedia-sized, versus dictionary-sized) these 
numbers will probably look different, and gzipping individual files 
won't look quite as bad.
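
A quick way to see why file size matters is to gzip a short text and a 
long text of the same kind and compare the ratios.  This is only a 
sketch: the "text" is random words from a made-up vocabulary, generated 
by a hypothetical make_text() helper, so it shows the trend and says 
nothing about real Wikipedia or dictionary pages.

```python
import gzip
import random

rng = random.Random(0)  # fixed seed so runs are repeatable
vocabulary = ["word%d" % i for i in range(500)]  # made-up vocabulary


def make_text(n_bytes):
    # Build roughly n_bytes of space-separated words from the vocabulary.
    words = []
    size = 0
    while size < n_bytes:
        w = rng.choice(vocabulary)
        words.append(w)
        size += len(w) + 1
    return " ".join(words).encode("ascii")


def ratio(data):
    # Compressed size divided by original size (smaller is better).
    return len(gzip.compress(data)) / len(data)


small_ratio = ratio(make_text(1000))     # roughly dictionary-entry sized
large_ratio = ratio(make_text(100000))   # roughly long-article sized

print("small file ratio: %.2f" % small_ratio)
print("large file ratio: %.2f" % large_ratio)
```

The larger input gives the compressor more history to match against 
and spreads the fixed per-file overhead over more bytes, so its ratio 
comes out lower.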


-- 
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org


