[Http-crcsync] Literal blocks in crcsync protocol are now deflated, and some further discussion on the protocol

Mon Apr 27 14:05:09 EDT 2009

Hi,

I have updated the crccache client and server to compress the literal blocks
(i.e. blocks of non-matched data) with the zlib deflate algorithm, to keep
the total response size as small as possible.
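
To make the idea concrete, here is a minimal sketch in Python (the real
crccache code is C inside Apache, so this is only an illustration). The
function names are hypothetical, and the use of a zlib-wrapped stream rather
than raw deflate is an assumption:

```python
import zlib


def deflate_literal(data: bytes, level: int = 9) -> bytes:
    """Compress one literal (non-matched) block before it goes on the wire.

    Assumption: a plain zlib stream per literal block; the actual framing
    used by crccache may differ.
    """
    return zlib.compress(data, level)


def inflate_literal(packed: bytes) -> bytes:
    """Reverse of deflate_literal, as the client would do on receipt."""
    return zlib.decompress(packed)


if __name__ == "__main__":
    literal = b"<div class='row'>hello</div>" * 50
    packed = deflate_literal(literal)
    print(len(literal), "bytes of literal data ->", len(packed), "bytes deflated")
```

Markup-heavy literal data like the sample above is very repetitive, which is
where deflating the literal blocks should pay off most.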

I have also made a few small changes with respect to how to deal with 
gzip-encoded content:

1) The crccache-server now dynamically uses mod_deflate to inflate
gzip-encoded content; it only inserts the inflate filter (just before itself)
when the request contains the crcblocks header, i.e. when it is going to
crcsync-encode the response itself.
2) On the client side, mod_deflate is now always invoked in inflate mode, so
that fresh pages (i.e. pages that have not yet been crcsync-encoded) are
stored uncompressed in the cache. That way they can serve as a good basis for
calculating the crcblocks on the next request for the same page.
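
The two rules above can be sketched roughly as follows. This is plain Python
rather than Apache filter code; the function names are hypothetical, and the
exact spelling of the crcblocks request header is an assumption:

```python
import gzip

# Assumption: the request header is spelled exactly like this.
CRCBLOCKS_HEADER = "crcblocks"


def server_body_for_encoding(request_headers, response_headers, body):
    """Rule 1: inflate on the server only when we will crcsync-encode.

    Returns (body, inflated). Without the crcblocks request header the
    gzip-encoded body is passed through untouched.
    """
    wants_crcsync = CRCBLOCKS_HEADER in {k.lower() for k in request_headers}
    is_gzip = response_headers.get("Content-Encoding", "").lower() == "gzip"
    if wants_crcsync and is_gzip:
        return gzip.decompress(body), True
    return body, False


def client_body_for_cache(response_headers, body):
    """Rule 2: the client always stores the inflated page in its cache."""
    if response_headers.get("Content-Encoding", "").lower() == "gzip":
        return gzip.decompress(body)
    return body
```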

This whole deflation/inflation setup is required because the principle of
crcsync does not work well with compressed files/pages; it only works well
with uncompressed data. After all, as soon as a single byte changes in the
original page, the entire compressed stream after that byte changes
completely, so no further blocks would match if we were working with
gzip-encoded streams and gzip-encoded cache entries.
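
A small experiment along these lines shows the effect. It is only a sketch:
the page content is made up, the helper names are hypothetical, and a plain
substring search stands in for the rolling-CRC block matching that crcsync
actually performs:

```python
import zlib


def blocks_still_found(old: bytes, new: bytes, block_size: int = 64):
    """Count full blocks of `old` that occur at any offset in `new`.

    Substring search is a stand-in for rolling-checksum matching.
    Returns (blocks found, total blocks).
    """
    blocks = [old[i:i + block_size]
              for i in range(0, len(old) - block_size + 1, block_size)]
    found = sum(1 for block in blocks if block in new)
    return found, len(blocks)


def make_page() -> bytes:
    """Build a made-up, moderately repetitive HTML-like page."""
    lines = [f"<li>item {i} costs {i * 7 % 101} euros</li>" for i in range(800)]
    return "\n".join(lines).encode("ascii")


def change_one_byte(page: bytes, position: int = 10) -> bytes:
    """Return a copy of `page` with exactly one byte altered."""
    changed = bytearray(page)
    changed[position] ^= 0x01  # flip one bit so the byte is guaranteed to differ
    return bytes(changed)


if __name__ == "__main__":
    old_page = make_page()
    new_page = change_one_byte(old_page)
    print("plain:     ", blocks_still_found(old_page, new_page))
    print("compressed:", blocks_still_found(zlib.compress(old_page),
                                            zlib.compress(new_page)))
```

On the plain pages only the block containing the changed byte is lost; on the
compressed streams the fraction of blocks that can still be found should be
noticeably lower.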

This has also made me think a lot about the whole protocol story. I have
studied the original delta-http RFC that has already been mentioned a few
times in this thread, and have also slept on it for a few nights. I now have
an idea of how we can make a clean protocol, but I must still work it out on
paper. Several ideas from the original delta-http protocol RFC are not really
applicable to our situation; that RFC is entirely based on the assumption
that the server works with semi-static pages and keeps a history of those
pages, while our implementation is based on the idea that servers will *not*
know the page that was previously served to the client. The latter is indeed
more realistic, given the dynamic nature of most web sites. So we need a
slightly different approach than the one described in that RFC.

I will put my ideas on paper tomorrow and submit them to this list for
further discussion.

Kind regards,
Alex