[Http-crcsync] General comments on crcsync document
Alex Wulms
alex.wulms at scarlet.be
Thu Jul 30 17:08:57 EDT 2009
On Saturday 18 July 2009, Patrick McManus wrote:
> On Sat, 2009-07-18 at 10:51 +0200, Toby Collett wrote:
> > The problem here is that a non-buffering implementation of a chunked
> > transfer doesn't know it has only got one hash match before it starts
> > sending the data...
>
> what I was thinking was more along the lines of "I haven't seen a hash
> match even after calculating 4(?) full windows worth of data - I'll stop
> looking because even if this is a legitimate delta it is going to be a
> badly performing one"..
>
> that bounds the number of hashes so should really help with the false
> positive problem and doesn't pose a problem for streaming the response..
I have been thinking about this and, to be honest, I don't like this approach
much, for two reasons.
The first reason is that it would make the server implementation needlessly
complex. It would, for example, have to deal with the situation where the
first few blocks match (e.g. because of common page headers, menus, etc.) and
the next X blocks mismatch. In such a case, when X is high enough (4, for
example), we would have to decide to send a full response after all. This can
of course be dealt with by buffering in the server (block references are only
small), but it still makes the code more fragile and more susceptible to
subtle bugs.
The second (and maybe even more important) reason is that we would have to put
the threshold X pretty low if we want to keep the chance of false positives
low enough that the protocol won't get the perception of being unreliable.
E.g. if a 100 kB page is 50% the same, we would already have a chance of
about 1/2000 of a false positive, which is in my opinion unacceptably high.
So we would indeed have to put the threshold at 4 or 5 blocks or so, meaning
that we would often be sending full pages instead of deltas where a delta
could have been sent when using a stronger checksum.
By choosing a larger CRC (e.g. CRC-64-ISO, which is used for HDLC, the
High-Level Data Link Control protocol), we avoid all these issues.
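To put rough numbers on the false-positive risk, here is a back-of-the-envelope
calculation in Python. It is only a sketch: it assumes the rolling CRC is
compared at every byte offset within the mismatching half of the page, against
all 40 hashes; the exact model depends on the protocol details.

```python
# Rough false-positive estimate for a rolling CRC (illustrative arithmetic).
page_size = 100 * 1024          # 100 kB page
mismatch_fraction = 0.5         # 50% of the page differs from the base
num_hashes = 40                 # hashes sent by the client

# Byte offsets at which the rolling CRC is compared against the hash list.
positions = int(page_size * mismatch_fraction)

# Each comparison collides by chance with probability 2**-n per hash.
p32 = positions * num_hashes / 2**32
p64 = positions * num_hashes / 2**64

print(f"32-bit CRC: ~1 in {1 / p32:,.0f}")   # → 32-bit CRC: ~1 in 2,097
print(f"64-bit CRC: ~1 in {1 / p64:,.0f}")
```

This reproduces the roughly 1/2000 figure for a 32-bit CRC; with 64 bits the
chance drops to about 1 in 9 trillion, which is negligible in practice.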
So we must balance the extra cost of a software-based 64-bit CRC instead of a
hardware-based 32-bit CRC against the cost of transmitting full responses
instead of delta responses in cases where the non-delta part would be larger
than, say, 20% or so. And this taking into consideration that one of the main
drivers for this project is the needs of the OLPC machines, many of which are
deployed in rural areas with dial-up connections. Which I assume means that
the bandwidth is 64 kbit/s at most, or more likely 56.6 kbps.
We also have to look at how much time the client and the server spend on the
CRC calculation versus the other parts of the protocol handling:
- For the client, it means decompressing literal data and copying matching
blocks from the base page to the reconstructed page. And obviously
calculating the SHA256 over the reconstructed page to detect corruptions.
- For the server, it means matching the rolling CRC against the list of
CRC-numbers, compressing the non-matched literal data and calculating the
SHA256 over the original response.
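To make the client-side work concrete, here is a minimal sketch of the
reconstruction step in Python. The delta representation and the names
(`apply_delta`, the "copy"/"literal" instruction tuples) are hypothetical
illustrations, not the actual crcsync wire format.

```python
import hashlib
import zlib

def apply_delta(base_page: bytes, delta, block_size: int) -> bytes:
    """Reconstruct a page from a crcsync-style delta (illustrative sketch).

    `delta` is a list of instructions: ("copy", block_index) copies one
    block from the base page; ("literal", compressed_bytes) inserts
    decompressed literal data.
    """
    out = bytearray()
    for kind, arg in delta:
        if kind == "copy":
            start = arg * block_size
            out += base_page[start:start + block_size]
        else:  # "literal"
            out += zlib.decompress(arg)
    return bytes(out)

# Base page: two 2048-byte blocks. The new page keeps block 0 and replaces
# block 1 with literal data.
base = b"A" * 2048 + b"B" * 2048
delta = [("copy", 0), ("literal", zlib.compress(b"C" * 2048))]
page = apply_delta(base, delta, block_size=2048)

# The client then verifies the reconstruction with SHA-256.
expected = hashlib.sha256(b"A" * 2048 + b"C" * 2048).hexdigest()
assert hashlib.sha256(page).hexdigest() == expected
```

This shows why the client-side cost is dominated by decompression and the
SHA-256 pass: the block copies themselves are plain memory moves.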
So in order to balance this, I have made a benchmark program that measures the
timing required for the various parts and executed it on different machines:
- A recent laptop with a 64-bit Intel CPU
- An old laptop with a 32-bit AMD Athlon XP
- An XO, with the 32-bit AMD Geode processor
In each case I tested with a copy of the Slashdot homepage from a few months
ago. It concerns a page of 84240 bytes, which compresses to 19684 bytes (when
using zlib with the default compression level, as is done in the crcsync-proxy
code). The benchmark simulates splitting the page into 40 complete blocks and
one tail block.
Here are the timing results for the slowest of the machines (the XO). All
times are in milliseconds:
Calculate the 64-bit CRC blocks: 4.9 ms
Decompression: 6.95 ms
Copy blocks: 0.4 ms
Calculate SHA256: 8.85 ms
Compression: 60.9 ms
Validate page with exact match: 55.4 ms
Validate page, everything mismatches: 145.15 ms
So it means that if the client receives a full match (only block numbers), the
client will spend 4.9+0.4+8.85 = 14.15 ms of CPU time.
Likewise, if it receives a full mismatch (only compressed literal data), the
CPU time will be 4.9+6.95+8.85 = 20.7 ms.
And in the case of a 50% match, the time will be
4.9+0.5*6.95+0.5*0.4+8.85 = 17.4 ms.
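These client-side sums can be reproduced directly from the XO measurements
above (all values in milliseconds; only the arithmetic from the mail, no new
measurements):

```python
# Per-step timings measured on the XO, in milliseconds.
crc64  = 4.90   # calculate the 64-bit CRC blocks
decomp = 6.95   # decompress literal data
copy   = 0.40   # copy matching blocks from the base page
sha256 = 8.85   # SHA-256 over the reconstructed page

full_match    = crc64 + copy + sha256                        # 14.15 ms
full_mismatch = crc64 + decomp + sha256                      # 20.70 ms
half_match    = crc64 + 0.5 * decomp + 0.5 * copy + sha256   # ~17.4 ms

print(round(full_match, 2), round(full_mismatch, 2), round(half_match, 2))
```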
Now, let's look at the time required to transport the response body, over a
56.6 kbps dial-up line:
full match: 41*2*10/56600=14 milliseconds
Note that I multiply with 10 to go from bytes to bits. I realize that there
are only 8 bits in a byte, but by taking a factor 10, I compensate for TCP/IP
and PPP protocol overhead. This worked very well to estimate download times
when I still had a dial-up line myself...
full mismatch: 19685*10/56600=3.4 seconds
50% match: (20*2+0.5*19685)*10/56600=1.7 seconds
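The same transfer-time estimates, reproduced in Python (the full-mismatch
figure comes out to about 3.48 s before rounding; the factor 10 bits/byte is
the mail's rule of thumb for TCP/IP and PPP overhead):

```python
# Dial-up transfer-time estimates from the mail's rule of thumb.
line_bps = 56600      # 56.6 kbps dial-up line
bits_per_byte = 10    # 8 data bits plus ~25% protocol overhead

compressed = 19685    # compressed full response, bytes
block_refs = 41       # block references in a full match, 2 bytes each

full_match    = block_refs * 2 * bits_per_byte / line_bps               # ~14.5 ms
full_mismatch = compressed * bits_per_byte / line_bps                   # ~3.48 s
half_match    = (20 * 2 + 0.5 * compressed) * bits_per_byte / line_bps  # ~1.75 s

print(round(full_match, 3), round(full_mismatch, 2), round(half_match, 2))
```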
So if we take a low-end machine like the XO, being able to return a 50% match
reduces the transfer time by 1.7 seconds. And that at the cost of spending
4.9 milliseconds to calculate the 64-bit CRC in software instead of ?
milliseconds with a hardware-based 32-bit CRC.
When it comes to the server, the numbers game obviously becomes different, as
the server will probably have to handle requests from many clients in
parallel, so the CPU time spent on the calculations is more important. But as
the numbers show, the time the server spends calculating the CRCs is very
small compared to the time spent on the other parts, like compressing the
non-matched data and validating the CRC numbers against the list of numbers
received from the client. So the performance benefit of calculating CRC
numbers in hardware would be minimal. The server could benefit more from
implementing the CRC validations in hardware, for example by storing the list
of numbers in an associative memory which can validate all numbers in one
clock cycle. And I'm pretty sure that a hardware manufacturer who can
implement that can also add a 64-bit rolling CRC calculator to the hardware.
After all, CRC was invented in 1961 exactly because it was (and still is) so
easy to implement in hardware.
Given these numbers, I strongly believe that we should settle for a 64-bit CRC
after all, but indeed for a standard one (like the example mentioned above)
instead of a non-standard 60-bit CRC.
Please note that the spreadsheet with the full results can be found in the GIT
repository, in the documentation subfolder. I have also checked in the
benchmark program.
Also note that on the 'normal' laptops, the absolute performance of the
various steps is significantly better than on the XO, although the relative
times seem to be more or less the same (e.g. CRC calculation time versus
compression time versus validation time).
Cheers,
Alex