Compression of HTML files

Jim Gettys jg at laptop.org
Thu Jul 26 10:32:22 EDT 2007


May I also note the following:

Much larger gains in HTML and web content come from careful use of style
sheets, getting rid of junk in the HTML, and care in the use of images
(particularly where CSS can be used instead).  Images usually dominate
the text in size, don't compress well (as they are already compressed),
and can/should be transformed knowing the exact quality of our display.
Go look to your own house first....

Even ensuring that all tags use the same case (preferably lower case)
makes a significant difference in compressed size, independent of
whether gzip, bzip2, or jffs2's compression is used.
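
A rough way to check this for yourself (a sketch only; foo.html is a
placeholder, and it relies on GNU sed's \L to lower-case just the tag
names):

  # Normalize tag names to lower case, then compare gzipped sizes.
  sed -E 's|</?[A-Za-z][A-Za-z0-9]*|\L&|g' foo.html > foo-lc.html
  gzip -c foo.html    | wc -c
  gzip -c foo-lc.html | wc -c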

Somewhat dated, but the following will give some information and tips:
http://www.w3.org/Protocols/HTTP/Performance/
http://www.w3.org/Protocols/HTTP/Performance/Pipeline

I would claim that the first thing to do is to ensure our content is
done well, before worrying about different compression algorithms.  The
compression can/will be improved with time, so our effort should go
toward ensuring efficient content.
                           - Jim






On Wed, 2007-07-25 at 13:29 -0500, Ian Bicking wrote:
> Arael8 (Petr?) posted a zip file of HTML files, 
> http://dictionary.110mb.com/files/short-wiki.zip
> 
> The files are pretty small, almost all from 500 bytes to 1.5K.
> 
> Here are the compressed sizes:
> 
> 4.6M	raw
> 1.3M	short-wiki.zip
> 704K	short-wiki.tar.gz
> 496K	short-wiki.tar.bz2
> 3.8M	gzipped
> 3.8M	gzipped-1
> 3.8M	bzipped
> 
> 
> raw is the uncompressed files.  gzipped is a directory where each 
> *individual* file is gzipped.  gzipped-1 uses gzip -1 (faster/worse 
> compression), which apparently has negligible impact.  bzip2 is a better 
> compression algorithm, but substantially slower and fairly CPU-intensive 
> (on my relatively fast machine, bzip2 takes a noticeable amount of time, 
> whereas gzip takes much less).
> 
> I did this using "gzip -r *" to compress the directory (or "bzip2 `find 
> . ! -type d`" for the bzip2 case), and "du -sh *" to get the sizes.
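> 
> Roughly, the whole comparison can be re-run with something like this
> (a sketch; short-wiki/ stands in for the unpacked directory):
> 
>   cp -r short-wiki gzipped   && gzip -r gzipped         # per-file gzip
>   cp -r short-wiki gzipped-1 && gzip -r -1 gzipped-1    # per-file gzip -1
>   cp -r short-wiki bzipped   && bzip2 `find bzipped ! -type d`
>   tar czf short-wiki.tar.gz  short-wiki                 # one gzip'd archive
>   tar cjf short-wiki.tar.bz2 short-wiki                 # one bzip2'd archive
>   du -sh short-wiki gzipped gzipped-1 bzipped short-wiki.tar.*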
> 
> As you can see, compressing very small files individually works really 
> poorly.  I *believe* this is what JFFS does.  Compressing the files into 
> a single archive works much better, as the compressor can exploit 
> similarities between files.
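> 
> One quick way to see that cross-file effect (again just a sketch, with
> short-wiki/ as a placeholder): compare the total of per-file gzips with
> one gzip run over all the files concatenated:
> 
>   find short-wiki -type f -exec gzip -c {} \; | wc -c    # per-file, total bytes
>   find short-wiki -type f | xargs cat | gzip -c | wc -c  # one stream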
> 
> Notably Lector (http://openberg.sourceforge.net) loads files from zip 
> files.  I am pretty sure that zip files are considerably faster for 
> random access than .tar.gz files, which is why they are used in lots of 
> places despite being less compact.  .tar.gz is really just suitable for 
> passing around archives that get completely unpacked on arrival.
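> 
> A crude way to compare the random-access cost (a sketch; the entry name
> is just a placeholder for any one file inside each archive):
> 
>   time unzip -p short-wiki.zip some-entry.html > /dev/null
>   time tar -xzOf short-wiki.tar.gz some-entry.html > /dev/null
> 
> unzip can seek straight to the entry via the zip central directory, while
> the .tar.gz has to be decompressed from the start of the stream to reach it.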
> 
> We should probably ask over on the dev list about how to see what real 
> jffs sizes are on the laptop, and how to accurately estimate them.  What 
> I'm suggesting here is more of a guess than based on anything specific I 
> know about jffs.
> 
> On larger files (e.g., Wikipedia-sized versus dictionary-sized) these 
> numbers will probably look different, and gzipping individual files 
> won't look quite as bad.
> 
> 
-- 
Jim Gettys
One Laptop Per Child



