Compression of HTML files

Wed Jul 25 14:59:41 EDT 2007

Hi Ian,

Arael is Zdenek Broz (doing some great work with HTML parsing and
generation).  A comparison of read performance from tar.gz vs. zip vs.
uncompressed files for various readers would be interesting.  We are
looking into zip or xar formats for other aspects of the system as well,
so there might be extra benefits to using one of them as a reader format.

Asking devel about real jffs sizes is a good idea; there too, if there are 
specific file types or sizes that compress well, this is useful to know.

On Wed, 25 Jul 2007, Ian Bicking wrote:

> Arael8 (Petr?) posted a zip file of HTML files,
> http://dictionary.110mb.com/files/short-wiki.zip
>
> The files are pretty small, almost all from 500 bytes to 1.5K.
>
> Here's the compressed sizes:
>
> 4.6M	raw
> 1.3M	short-wiki.zip
> 704K	short-wiki.tar.gz
> 496K	short-wiki.tar.bz2
> 3.8M	gzipped
> 3.8M	gzipped-1
> 3.8M	bzipped

Are the gzipped and bzipped files really the same size?

SJ

> raw is the uncompressed files.  gzipped is a directory where each
> *individual* file is gzipped.  gzipped-1 uses gzip -1 (faster/worse
> compression), which apparently has negligible impact.  bzip2 is a better
> compression algorithm, but substantially slower and fairly CPU intensive
> (on my relatively fast machine, bzip2 takes a noticeable amount of time,
> where gzip takes much less time).
>
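
A minimal sketch of the level comparison described above, assuming a made-up page of under 1 KB rather than the actual short-wiki files. The `page` content is hypothetical and far more repetitive than real dictionary entries, so only the relative sizes are of interest.

```python
import bz2
import gzip

# Hypothetical small HTML page standing in for one dictionary entry.
page = (
    "<html><head><title>example</title></head><body>"
    + "".join(
        f"<p>Definition {i}: a short sentence about word {i}.</p>"
        for i in range(15)
    )
    + "</body></html>"
).encode("utf-8")

# Compress the same bytes with the settings discussed in the thread.
sizes = {
    "raw": len(page),
    "gzip -1": len(gzip.compress(page, compresslevel=1)),
    "gzip -9": len(gzip.compress(page, compresslevel=9)),
    "bzip2": len(bz2.compress(page)),
}

for name, size in sizes.items():
    print(f"{name:8} {size:5d} bytes")
```

On inputs this small there is little data for a higher compression level to work with, which would be consistent with gzip -1 having negligible impact.
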
> I did this using "gzip -r *" to compress the directory (or "bzip2 `find
> . ! -type d`" for bzip2), and "du -sh *" to get the amounts.
>
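
One thing that might be worth checking: by default du reports allocated disk space, not the sum of file lengths, so directories full of very small files can show similar totals even when the byte counts differ. The sketch below totals both figures for a directory tree. The function name `apparent_and_allocated` is made up for illustration, and it assumes a POSIX system where `st_blocks` is available.

```python
import os


def apparent_and_allocated(root):
    """Return (sum of file lengths, sum of allocated bytes) under root."""
    apparent = 0
    allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            apparent += st.st_size
            # st_blocks counts 512-byte units on POSIX; it is absent on
            # Windows, in which case this sketch counts 0 for the file.
            allocated += getattr(st, "st_blocks", 0) * 512
    return apparent, allocated
```

Calling `apparent_and_allocated("gzipped")` and `apparent_and_allocated("bzipped")` would show whether the two directories really hold the same number of bytes. GNU du can report the first figure directly with `--apparent-size`.
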
> As you can see, compressing very small files individually works really
> poorly.  I *believe* this is what JFFS does.  Compressing the files into
> an archive works well, as the compression can use similarities between
> files to increase the compression.
>
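
The per-file versus archive difference can be reproduced in a few lines. This is a rough sketch using a hypothetical corpus of 200 tiny, similar pages (not the short-wiki data), so the absolute numbers mean nothing; it only shows the direction of the effect.

```python
import gzip
import io
import tarfile

# Hypothetical corpus: 200 small, similar HTML pages.
pages = {
    f"entry{i:03d}.html": (
        f"<html><head><title>Entry {i}</title></head>"
        f"<body><h1>Entry {i}</h1><p>Short definition number {i}.</p>"
        f"</body></html>"
    ).encode("utf-8")
    for i in range(200)
}

raw_total = sum(len(data) for data in pages.values())

# Each page compressed on its own, roughly what "gzip -r" does.
per_file_total = sum(len(gzip.compress(data)) for data in pages.values())

# All pages in one tar stream, compressed as a whole (like .tar.gz).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for name, data in pages.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
archive_total = len(buf.getvalue())

print("raw:          ", raw_total)
print("gzip per file:", per_file_total)
print("tar.gz:       ", archive_total)
```

Each individually gzipped file carries its own header and trailer and cannot reuse anything seen in the other files, while the single stream can.
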
> Notably Lector (http://openberg.sourceforge.net) loads files from zip
> files.  I am pretty sure that zip files are considerably faster for
> random access than .tar.gz files, which is why they are used in lots of
> places despite being less compact.  .tar.gz is really just suitable for
> passing around archives that get completely unpacked on arrival.
>
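
For reference, a small sketch of what random access into a zip looks like from Python's standard library. The archive contents here are invented for the example. A zip compresses each member separately and keeps a central directory of offsets, so a reader can seek to one member; a .tar.gz is a single compressed stream, so reaching a member means decompressing everything before it.

```python
import io
import zipfile

# Build a small zip in memory from hypothetical pages.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for i in range(100):
        zf.writestr(
            f"entry{i:03d}.html",
            f"<html><body>Entry {i}</body></html>",
        )

# Read a single member by name; the central directory gives its offset,
# so the other 99 members are not decompressed.
with zipfile.ZipFile(buf) as zf:
    page = zf.read("entry042.html")

print(page.decode("utf-8"))
```
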
> We should probably ask over on the dev list about how to see what real
> jffs sizes are on the laptop, and how to accurately estimate them.  What
> I'm suggesting here is more of a guess than based on anything specific I
> know about jffs.
>
> On larger files (e.g., Wikipedia-sized, versus dictionary-sized) these
> numbers will probably look different, and gzipping individual files
> won't look quite as bad.
>
>
> -- 
> Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org
>
> _______________________________________________
> Library mailing list
> Library at lists.laptop.org
> http://lists.laptop.org/listinfo/library
>