[Http-crcsync] General comments on crcsync document

Toby Collett toby.collett at gmail.com
Wed Oct 14 04:09:04 EDT 2009


I have spent some time reading over the mailing list threads and have
updated the spec document to reflect the discussions there. Attached is the
latest version for review; most of the changes are in the protocol details
section at the end of the document.

These changes are not reflected in the implementation yet; that will happen
over the next couple of weeks.

Regards,
Toby

2009/7/30 Alex Wulms <alex.wulms at scarlet.be>

> On Saturday 18 July 2009, Patrick McManus wrote:
> > On Sat, 2009-07-18 at 10:51 +0200, Toby Collett wrote:
> > > The problem here is that a non-buffering implementation of a chunked
> > > transfer doesn't know it has only got one hash match before it starts
> > > sending the data...
> >
> > what I was thinking was more along the lines of "I haven't seen a hash
> > match even after calculating 4(?) full windows' worth of data - I'll stop
> > looking because even if this is a legitimate delta it is going to be a
> > badly performing one".
> >
> > that bounds the number of hashes so should really help with the false
> > positive problem and doesn't pose a problem for streaming the response.
> I have been thinking about this and, to be honest, I don't like this
> approach much, for two reasons.
>
> The first reason is that it will make the server implementation needlessly
> complex. It would have to deal, for example, with the situation where the
> first few blocks match (e.g. because of common page headers, menus, etc.)
> and then the next X blocks mismatch. In such a case, when X is high enough
> (4 for example) we would have to decide to send a full response after all.
> This can of course be dealt with by buffering in the server (block
> references are only small) but still, it will make the code more fragile
> and more susceptible to subtle bugs.
>
> The second -and maybe even more important- reason is that we have to keep
> the threshold X pretty low if we want to keep the chance of false positives
> low enough that the protocol won't get the perception of being unreliable.
>
> E.g. with a 32-bit CRC, if a 100kB page were 50% the same, we would already
> have a chance of about 1/2000 of a false positive: roughly 50,000 byte
> positions to slide the rolling hash over, each compared against ~40 block
> hashes, gives about 2 million chances at odds of 2^-32 each. That is in my
> opinion unacceptably high. So we would have to put the threshold at 4 or 5
> blocks or so, meaning that we would often be sending full pages instead of
> deltas where a delta could have been sent when using a stronger checksum.
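>
> To make the arithmetic explicit, here is a minimal Python sketch of that
> estimate (the block count of ~40 and the 32-bit hash width are assumptions
> matching the example above):
>
>     positions = 50 * 1024      # ~50kB of changed data to slide the hash over
>     hashes = 40                # number of block hashes sent by the client
>     p_match = 2.0 ** -32       # chance that one random 32-bit CRC matches
>
>     # Expected number of false matches over all comparisons
>     expected = positions * hashes * p_match
>     print("%.6f (about 1/%d)" % (expected, round(1 / expected)))
>     # prints: 0.000477 (about 1/2097)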
>
> By choosing a larger CRC (e.g. CRC-64-ISO, which is used for HDLC, the
> High-Level Data Link Control protocol), we avoid all these issues.
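>
> For reference, a minimal bit-by-bit definition of CRC-64-ISO in Python
> (MSB-first with a zero initial value and no final XOR are assumptions; a
> real implementation would be table-driven and rolling):
>
>     POLY = 0x000000000000001B   # x^64 + x^4 + x^3 + x + 1 (ISO 3309)
>     MASK = (1 << 64) - 1
>
>     def crc64_iso(data, crc=0):
>         for byte in data:
>             crc ^= byte << 56
>             for _ in range(8):
>                 crc = ((crc << 1) ^ POLY if crc & (1 << 63) else crc << 1) & MASK
>         return crc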
>
> So we must balance the extra cost of a software-based 64-bit CRC instead
> of a hardware-based 32-bit CRC against the cost of transmitting full
> responses instead of delta responses in cases where the non-delta part
> would be larger than, say, 20% or so. And this taking into consideration
> that one of the main drivers for this project is the needs of OLPC, many
> of which are deployed in rural areas with dial-up connections. Which I
> assume means that the bandwidth is 64 kbit/s at most, or more likely
> 56.6 kbps.
>
> We also have to look at how much time the client and the server spend on
> the CRC calculation versus the other parts of the protocol handling:
> - For the client, it means decompressing literal data and copying matching
> blocks from the base page to the reconstructed page. And obviously
> calculating the SHA256 over the reconstructed page to detect corruptions
> (a sketch of this follows below).
> - For the server, it means matching the rolling CRC against the list of
> CRC numbers, compressing the non-matched literal data and calculating the
> SHA256 over the original response.
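>
> To illustrate the client side, a minimal Python sketch of the
> reconstruction loop (the instruction framing here is hypothetical; the
> real encoding is defined in the spec document):
>
>     import hashlib, zlib
>
>     def reconstruct(instructions, base_page, block_size, expected_sha256):
>         # instructions: list of ('copy', block_nr) or ('literal', zlib_data)
>         out = []
>         for kind, arg in instructions:
>             if kind == 'copy':
>                 start = arg * block_size
>                 out.append(base_page[start:start + block_size])
>             else:
>                 out.append(zlib.decompress(arg))
>         page = b''.join(out)
>         # Detect corruptions (e.g. a false CRC match) before using the page
>         if hashlib.sha256(page).digest() != expected_sha256:
>             raise ValueError("reconstructed page failed the SHA256 check")
>         return page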
>
> So in order to balance this, I have made a benchmark program that measures
> the time required for the various parts and executed it on different
> machines:
> - A recent laptop with a 64-bit Intel CPU
> - An old laptop with a 32-bit AMD Athlon XP
> - An XO, with the 32-bit AMD Geode processor
>
> Each time I tested with a copy of the slashdot homepage from a few months
> ago. It concerns a page of 84240 bytes, which compresses to 19684 bytes
> (using zlib with the default compression ratio, as is done in the
> crcsync-proxy code).
>
> The benchmark was done by simulating that the page is split into 40
> complete blocks and one tail block.
>
> Here are the timing results for the slowest of the machines (the XO). All
> times are in milliseconds:
> Calculate the 64-bit CRC blocks: 4.9 ms
> Decompression: 6.95 ms
> Copy blocks: 0.4 ms
> Calculate SHA256: 8.85 ms
> Compression: 60.9 ms
> Validate page with exact match: 55.4 ms
> Validate page, everything mismatches: 145.15 ms
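>
> For anyone who wants to reproduce this, the individual steps can be timed
> along these lines in Python (the file name and run count are placeholders;
> the checked-in benchmark program is the authoritative version):
>
>     import hashlib, time, zlib
>
>     page = open('slashdot.html', 'rb').read()   # the 84240-byte test page
>
>     def bench(label, fn, runs=100):
>         t0 = time.time()
>         for _ in range(runs):
>             fn()
>         print("%s: %.2f ms" % (label, (time.time() - t0) * 1000 / runs))
>
>     compressed = zlib.compress(page)   # default level, as in crcsync-proxy
>     bench("Compression", lambda: zlib.compress(page))
>     bench("Decompression", lambda: zlib.decompress(compressed))
>     bench("Calculate SHA256", lambda: hashlib.sha256(page).digest())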
>
> So it means that if the client receives a full match (only block numbers),
> the client will spend 4.9+0.4+8.85 = 14.15 ms of CPU time.
> Likewise, if it receives a full mismatch (only compressed literal data),
> the CPU time will be 4.9+6.95+8.85 = 20.7 ms.
> And in case of a 50% match, the time will be
> 4.9+0.5*6.95+0.5*0.4+8.85 = 17.4 ms.
>
> Now, let's look at the time required to transport the response body over a
> 56.6 kbps dial-up line:
> full match: 41*2*10/56600 = 14 milliseconds
> Note that I multiply by 10 to go from bytes to bits. I realize that there
> are only 8 bits in a byte, but by taking a factor of 10 I compensate for
> TCP/IP and PPP protocol overhead. This worked very well to estimate
> download times when I still had a dial-up line myself...
> full mismatch: 19684*10/56600 = 3.5 seconds
> 50% match: (20*2+0.5*19684)*10/56600 = 1.7 seconds
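>
> The same arithmetic in a few lines of Python (the 2-byte block reference
> size is taken from the calculation above):
>
>     line_bps = 56600        # dial-up line speed in bit/s
>     bits_per_byte = 10      # 8 data bits plus TCP/IP and PPP overhead
>     compressed = 19684      # compressed size of the test page in bytes
>
>     def transfer_ms(nbytes):
>         return nbytes * bits_per_byte * 1000.0 / line_bps
>
>     print("full match:    %.0f ms" % transfer_ms(41 * 2))       # ~14 ms
>     print("full mismatch: %.0f ms" % transfer_ms(compressed))   # ~3478 ms
>     print("50%% match:    %.0f ms" % transfer_ms(20 * 2 + compressed / 2.0))  # ~1746 ms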
>
> So if we take a low-end machine like the XO, being able to return a 50%
> match reduces the transfer time by 1.7 seconds. And that at the cost of
> spending 4.9 milliseconds to calculate the 64-bit CRC in software instead
> of ? milliseconds with a hardware-based 32-bit CRC.
>
> When it comes to the server, the numbers game obviously becomes different,
> as the server will probably have to handle requests from many clients in
> parallel, so the CPU time spent on the calculations is more important. But
> as the numbers show, the time the server spends calculating the CRCs is
> very small compared to the time spent on the other parts, like compressing
> the non-matched data and validating the CRC numbers against the list of
> numbers received from the client. So the performance benefit of calculating
> CRC numbers in hardware would be minimal. The server can benefit more from
> implementing the CRC validations in hardware, for example by storing the
> list of numbers in an associative memory which can validate all numbers in
> one clock cycle. And I'm pretty sure that a hardware manufacturer who can
> implement that can also add a 64-bit rolling CRC calculator to the
> hardware. After all, CRC was invented in 1961 exactly because it was (and
> still is) so easy to implement in hardware.
>
> Given these numbers, I strongly believe that we should settle for a 64-bit
> CRC after all, but indeed for a standard one (like the example mentioned
> above) instead of a non-standard 60-bit CRC.
>
> Please note that the spreadsheet with the full results can be found in the
> Git repository, in the documentation subfolder. I have also checked in the
> benchmark program.
>
> Also note that on the 'normal' laptops, the absolute performance of the
> various steps is significantly better than on the XO, although the relative
> times seem to be more or less the same (e.g. CRC calculation time versus
> compression time versus validation time).
>
> Cheers,
> Alex
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: http_crcsync_protocol.odt
Type: application/vnd.oasis.opendocument.text
Size: 39186 bytes
Desc: not available
URL: http://lists.laptop.org/pipermail/http-crcsync/attachments/20091014/e0e0018b/attachment-0001.odt

