[Http-crcsync] Fwd: Re: [Fwd: Re: Google summer of code : 'web pages over rsync']

Wed Jun 17 11:11:04 EDT 2009

Hey guys,

Here are some thoughts on the protocol from a Mozilla contributor who 
has worked on similar things in the past. I thought they might be very 
useful when considering the protocol design, particularly the bit about 
packet sizes.

Gerv

-------- Original Message --------
Subject: Re: [Fwd: Re: Google summer of code : 'web pages over rsync']
Date: Wed, 01 Apr 2009 14:14:40 -0400
From: Patrick McManus <mcmanus at ducksong.com>
To: Gervase Markham <gerv at mozilla.org>

Hi Guys,

I did a commercial implementation of this back in the late 90's -
DeltaEdge was part of AppliedTheory's CDN offering. (later clearblue,
now navisite - I think they are still running it somewhere at navisite -
I'm not affiliated with them anymore).

I am the inventor (though not the assignee) of a related patent -
http://www.wikipatents.com/6826626.html

The CDN introduced 2 proxies - one "close" to the UA and one colocated
with the origin server. The delta took place between them - hopefully on
the high latency hop. It was all done with TE and Transfer-Encoding so
any client or server that wanted to participate could - it wasn't a
proprietary tunnel. The proxies are closed source software written from
scratch.
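
To make the negotiation concrete, here is a minimal sketch of how a server-side proxy might check a request's TE header before applying a delta transfer-coding. The token "x-delta" and the function name are made up for illustration; they are not the token DeltaEdge used.

```python
def accepts_coding(te_header: str, coding: str) -> bool:
    """Return True if the TE header lists `coding` with a non-zero q-value."""
    for item in te_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if parts[0].lower() != coding.lower():
            continue
        q = 1.0  # a coding listed without a q parameter is fully acceptable
        for param in parts[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    return False  # malformed q-value: treat as not acceptable
        return q > 0
    return False


# "x-delta" is a hypothetical transfer-coding token used only for this example.
print(accepts_coding("trailers, x-delta;q=0.8", "x-delta"))
```

A hop that does not list the coding simply gets the normal response, which is what lets any client or server opt in without a tunnel.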

Akamai and Limelight have similar technology they offer. iirc (it's been
a long while) those do use proprietary tunnels and might require special
APIs.. DeltaEdge was just an encoding which I think is what you're
talking about here.

Anyhow - the big papers on the topic were a SIGCOMM paper that inspired
me (97 perhaps) by Jeff Mogul, and some work by Jim Douglis at ATT..
worth googling them.

Jason's question was about metrics.. the first cut was media-type.. I
only dealt with text types as content-aware algorithms are typically
needed to get useful delta results with binary types.. does the rsync
proposal address that?

as for text types, the issue is really the # of rtts needed to move the
data.. all text is generally relatively small, but you can make it
really snappy if you get it below the magic 3000 byte mark. 3000
generally turns into 2 pkts which will almost always fit in one
congestion window - which in turn means you can do the whole req/resp in
one rtt and that's a HUGE win. there's a fair degree of exceptions, but
iirc the measurements (and I could take a million of them on a very busy
application server running this) bore this out very well - many of the
transactions were either super fast or just marginally improved
depending on which side of this magic line they fell on.
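
A rough way to see the effect of that line is to count round trips under an idealized slow start. The sketch below assumes a 1460-byte MSS (a 1500-byte packet minus 40 bytes of TCP/IP headers), an initial congestion window of 2 segments, no loss, and a window that doubles every round trip. Those values and the function name are illustrative assumptions, not DeltaEdge measurements, and real stacks vary.

```python
import math


def rtts_to_deliver(body_bytes: int, mss: int = 1460, initial_cwnd: int = 2) -> int:
    """Estimate round trips needed to send a response under idealized slow start.

    Ignores connection setup, packet loss, and delayed ACKs.
    """
    segments = max(1, math.ceil(body_bytes / mss))
    cwnd = initial_cwnd
    rtts = 0
    while segments > 0:
        segments -= cwnd  # send one full window per round trip
        cwnd *= 2         # window doubles after each round trip
        rtts += 1
    return rtts


for size in (2900, 12000):
    print(size, "bytes ->", rtts_to_deliver(size), "round trip(s)")
```

Under these assumptions a 2900-byte response fits in the first window, while a 12000-byte response needs three round trips, so shrinking a page across the line pays off far more than shrinking it by the same amount elsewhere.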

And yes, you can jam a lot of compressed delta info in 3KB - so this is
a pretty effective strategy. The client CPU requirements are pretty
simple - it's generally a one pass process to reassemble a lot of
pointer/length references and do some Huffman decoding - though that's
going to depend on the delta algorithm of course.
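
As an illustration of that one-pass reassembly, here is a minimal sketch that rebuilds a document from a base copy plus a list of instructions. The ("copy", offset, length) and ("literal", bytes) tuple format is a hypothetical stand-in for whatever a real delta algorithm emits, and the Huffman decoding step is left out.

```python
def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild a document in one pass from copy/literal instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            if offset < 0 or length < 0 or offset + length > len(base):
                raise ValueError("copy range outside base document")
            out += base[offset:offset + length]  # reuse bytes the client already has
        elif op[0] == "literal":
            out += op[1]                         # new bytes carried in the delta
        else:
            raise ValueError(f"unknown op {op[0]!r}")
    return bytes(out)


base = b"<html><body>Hello, world</body></html>"
ops = [("copy", 0, 12), ("literal", b"Goodbye"), ("copy", 17, 21)]
print(apply_delta(base, ops))
```

Only the literal bytes and a few small integers cross the wire here, which is why a lot of change information can fit in a couple of packets.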

I always thought the hard part was "delta against what?". i.e. what is
the base document.. that's easy from the client's pov - it should be the
last copy of the document that it has. But the server cannot keep around
every freaking revision of a dynamic resource (or even 1 per client)..
so it has to have a canonical document to delta against.. in deltaedge
we called that the reference-entity. A lot of time was spent on figuring
out how to manage a set of REs, when to update them, etc.. It's
probably not a big deal for a mozilla client implementation, but it's
worth thinking about what it might mean in terms of overall uptake of
the system. (i.e. it can make the server side pretty resource hungry.)
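
One simple policy, sketched below, is to keep a single reference entity per URL and only send a delta when the client says it holds that exact copy. The class, the method names, and the use of a SHA-1 digest as the identifier are assumptions made for illustration; this is not how DeltaEdge managed its REs.

```python
import hashlib


class ReferenceStore:
    """Keeps at most one reference entity per URL (hypothetical policy)."""

    def __init__(self):
        self._refs = {}  # url -> (digest, body)

    def publish(self, url: str, body: bytes) -> str:
        """Make `body` the reference entity for `url` and return its digest."""
        digest = hashlib.sha1(body).hexdigest()
        self._refs[url] = (digest, body)
        return digest

    def base_for(self, url: str, client_digest: str):
        """Return the reference body if the client holds the same copy, else None."""
        entry = self._refs.get(url)
        if entry is not None and entry[0] == client_digest:
            return entry[1]
        return None  # caller falls back to sending the full response


store = ReferenceStore()
tag = store.publish("/news", b"<html>monday</html>")
print(store.base_for("/news", tag) is not None)
```

Memory stays bounded at one copy per resource, at the cost of falling back to full responses for any client whose cached copy predates the latest update.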

my two cents.

-Patrick