[Http-crcsync] Fwd: Re: [Fwd: Re: Google summer of code : 'web pages over rsync']

Alex Wulms alex.wulms at scarlet.be
Sun Jun 21 18:00:32 EDT 2009


Hi,

I have analyzed my logs. Here are the statistics:
Number of responses containing some delta-data: 233
Average size of response body: 3191 bytes
Number of responses with a body smaller than 3000 bytes: 131
Number of responses with a body smaller than 2500 bytes: 127

I currently don't have statistics on the response header size. I realize that 
the body and header together should be less than 3000 bytes. That is why I also 
included the number of responses with a body smaller than 2500 bytes, 
assuming the header will often be smaller than 500 bytes.
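
For reference, here is a minimal sketch of how such numbers can be derived from a
list of logged delta-response body sizes. The file name and the ~500 byte header
allowance are assumptions for illustration, not part of the actual log analysis:

    # Hypothetical: sizes.txt holds one delta-response body size (in bytes) per line.
    with open('sizes.txt') as f:
        sizes = [int(line) for line in f if line.strip()]

    average = sum(sizes) / len(sizes)
    below_3000 = sum(1 for s in sizes if s < 3000)
    below_2500 = sum(1 for s in sizes if s < 2500)  # leaves ~500 bytes for headers

    print('responses with delta data:', len(sizes))
    print('average body size: %.0f bytes' % average)
    print('bodies < 3000 bytes:', below_3000)
    print('bodies < 2500 bytes:', below_2500)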

These statistics are with the default compression level of zlib. I have not yet 
adapted the code to use the strongest compression level, but given these 
numbers it might be worth the effort. It will probably push a significant 
number of responses from just above the 3000-byte mark to just below it.
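
A minimal sketch of what that server-side change could look like, assuming plain
zlib is used to compress the delta body (the function name and variables are
illustrative, not the actual http-crcsync code):

    import zlib

    def compress_delta(delta_bytes):
        # Level 9 (Z_BEST_COMPRESSION) instead of the default level 6;
        # trades a bit of extra server CPU for a slightly smaller body.
        compressor = zlib.compressobj(zlib.Z_BEST_COMPRESSION)
        return compressor.compress(delta_bytes) + compressor.flush()

Whether the few hundred bytes gained are worth the extra CPU is exactly what the
measurements above should tell us.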

Though, I don't know how realistic my tests are. Most tests are based on 
refetching the Slashdot home page several times in a row. It is ideal for 
testing the protocol development (the Slashdot home page changes frequently in 
small areas due to advertising) but maybe not a good candidate for getting 
statistics on real-world behaviour. After all, I don't believe that the 
school kids, who are the primary target audience of this project, will be 
reading Slashdot that much ;-)

Also please be aware that in total I have found 412 requests/responses in my 
logs, but the statistics above don't include the 79 responses indicating 'page 
was exactly the same'. In such cases, the delta-response body was only 82 
bytes. That is obviously also a major bandwidth saving, because the actual 
response returned by Slashdot was typically around 80 kB instead of a mere 82 
bytes, but for the statistics above I'm mainly interested in the response 
size in cases where there is an actual delta.

Cheers,
Alex



Op woensdag 17 juni 2009, schreef Alex Wulms:
> Hi,
>
> Interesting info. I was almost concerned when I noticed the patent
> reference, but after reading the patent description it turns out that we are
> fortunately following a different approach, if I understand it correctly.
>
> In their implementation it is the server that keeps track of the reference
> resources (the canonical document) while in our implementation it is the
> client (the cached page, over which the crcblocks are calculated).
>
> The advantage of the server managing the reference resources is that they
> can generate a really minimal delta. In our case, the delta is still small
> but, once uncompressed, it is always a multiple of the blocksize
> (client-page-size/40, assuming we stick to 40 blocks).
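
A rough illustration of that block granularity, assuming the fixed 40-block scheme
mentioned above (the page size used here is just an example, not a measurement):

    BLOCK_COUNT = 40

    def min_uncompressed_delta(cached_page_size, changed_blocks):
        # With the cached page split into 40 equal blocks, any change, however
        # small, costs at least one whole block in the raw (uncompressed) delta.
        block_size = cached_page_size // BLOCK_COUNT
        return changed_blocks * block_size

    # e.g. an 80 kB cached page: one changed block already means ~2 kB of raw
    # delta, which zlib then has to compress back down.
    print(min_uncompressed_delta(80 * 1024, 1))   # -> 2048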
>
> We would have to do some measurements to see how often we stay below the
> magical 3000-byte threshold. I will analyze my test logs this coming weekend.
> I will also see if, on the server side, I can set the compression level of
> zlib more aggressively and analyze the impact of that.
>
> Thanks and brs,
> Alex
>
> PS: so far, the http-crcsync protocol only deals properly with text-based
> resources. It can't do a proper delta on multimedia resources.
>
> Op woensdag 17 juni 2009, schreef Gervase Markham:
> > Hey guys,
> >
> > Here are some thoughts on the protocol from a Mozilla contributor who
> > has worked on similar things in the past. I thought they might be very
> > useful when considering the protocol design, particularly the bit about
> > packet sizes.
> >
> > Gerv
> >
> > -------- Original Message --------
> > Subject: Re: [Fwd: Re: Google summer of code : 'web pages over rsync']
> > Date: Wed, 01 Apr 2009 14:14:40 -0400
> > From: Patrick McManus <mcmanus at ducksong.com>
> > To: Gervase Markham <gerv at mozilla.org>
> >
> > Hi Guys,
> >
> > I did a commercial implementation of this back in the late 90's -
> > DeltaEdge was part of AppliedTheory's CDN offering. (later clearblue,
> > now navisite - I think they are still running it somewhere at navisite -
> > I'm not affiliated with them anymore).
> >
> > I am the inventor (though not the assignee) of a related patent -
> > http://www.wikipatents.com/6826626.html
> >
> > The CDN introduced 2 proxies - one "close" to the UA and one colocated
> > with the origin server. The delta took place between them - hopefully on
> > the high latency hop. It was all done with TE and Transfer-Encoding so
> > any client or server that wanted to participate could - it wasn't a
> > proprietary tunnel. The proxies are closed source software written from
> > scratch.
> >
> > Akamai and Limelight offer similar technology. IIRC (it's been
> > a long while) those do use proprietary tunnels and might require special
> > APIs.. DeltaEdge was just an encoding, which I think is what you're
> > talking about here.
> >
> > Anyhow - the big papers on the topic were a SIGCOMM paper that inspired
> > me (97 perhaps) by Jeff Mogul, and some work by Jim Douglis at AT&T..
> > worth googling them.
> >
> > Jason's question was about metrics.. the first cut was media-type.. I
> > only dealt with text types as content-aware algorithms are typically
> > needed to get useful delta results with binary types.. does the rsync
> > proposal address that?
> >
> > as for text types, the issue is really the # of rtts needed to move the
> > data.. all text is generally relatively small, but you can make it
> > really snappy if you get it below the magic 3000-byte mark. 3000 bytes
> > generally turns into 2 pkts, which will almost always fit in one
> > congestion window - which in turn means you can do the whole req/resp in
> > one rtt and that's a HUGE win. there are a fair number of exceptions, but
> > iirc the measurements (and I could take a million of them on a very busy
> > application server running this) bore this out very well - many of the
> > transactions were either super fast or just marginally improved
> > depending on which side of this magic line they fell on.
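
A back-of-the-envelope version of that argument, assuming a typical ~1460-byte TCP
payload per segment and an initial congestion window of about 3 segments (both
values vary per stack, so treat this as an illustration only):

    import math

    MSS = 1460          # assumed TCP payload bytes per segment
    INITIAL_CWND = 3    # assumed initial congestion window, in segments

    def fits_in_one_rtt(response_bytes):
        segments = math.ceil(response_bytes / MSS)
        return segments <= INITIAL_CWND

    print(fits_in_one_rtt(3000))    # 3 segments -> True: whole response in one round trip
    print(fits_in_one_rtt(12000))   # 9 segments -> False: needs further cwnd growth, more rtts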
> >
> > And yes, you can jam a lot of compressed delta info in 3KB - so this is
> > a pretty effective strategy. The client CPU requirements are pretty
> > simple - it's generally a one pass process to reassemble a lot of
> > pointer/length references and do some huffman decoding - though that's
> > going to depend on the delta algorithm of course.
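
A minimal sketch of what that one-pass reassembly could look like for a generic
copy/literal delta format (an illustration of the idea only, not the encoding that
DeltaEdge or http-crcsync actually uses):

    def apply_delta(cached, instructions):
        # Single pass over the delta: each instruction is either a literal byte
        # string to emit as-is, or an (offset, length) reference into the
        # client's cached copy of the page.
        out = bytearray()
        for instr in instructions:
            if instr[0] == 'literal':
                out += instr[1]
            else:                               # ('copy', offset, length)
                _, offset, length = instr
                out += cached[offset:offset + length]
        return bytes(out)

    cached = b'<html><body>old news</body></html>'
    delta = [('copy', 0, 12), ('literal', b'fresh news'), ('copy', 20, 14)]
    print(apply_delta(cached, delta))   # b'<html><body>fresh news</body></html>'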
> >
> > I always thought the hard part was "delta against what?". i.e. what is
> > the base document.. that's easy from the client's pov - it should be the
> > last copy of the document that it has. But the server cannot keep around
> > every freaking revision of a dynamic resource (or even 1 per client)..
> > so it has to have a canonical document to delta against.. in deltaedge
> > we called that the reference-entity. A lot of time was spent on figuring
> > out how to manage a set of REs, when to update them, etc.. It's
> > probably not a big deal for a mozilla client implementation, but it's
> > worth thinking about what it might mean in terms of overall uptake of
> > the system. (i.e. it can make the server side pretty resource hungry.)
> >
> > my two cents.
> >
> > -Patrick



