[Http-crcsync] General comments on crcsync document
Alex Wulms
alex.wulms at scarlet.be
Thu Jul 30 17:08:57 EDT 2009
On Saturday 18 July 2009, Patrick McManus wrote:
> On Sat, 2009-07-18 at 10:51 +0200, Toby Collett wrote:
> > The problem here is that a non-buffering implementation of a chunked
> > transfer doesn't know it has only got one hash match before it starts
> > sending the data...
>
> what I was thinking was more along the lines of "I haven't seen a hash
> match even after calculating 4(?) full windows worth of data - I'll stop
> looking because even if this is a legitimate delta it is going to be a
> badly performing one"..
>
> that bounds the number of hashes so should really help with the false
> positive problem and doesn't pose a problem for streaming the response..
I have been thinking about this and, to be honest, I don't like this approach
much, for two reasons.
The first reason is that it would make the server implementation needlessly
complex. It would, for example, have to deal with the situation where the
first few blocks match (e.g. because of common page headers, menus, etc.) and
the next X blocks mismatch. In such a case, when X is high enough (4, for
example), we would have to decide to send a full response after all. This can
of course be dealt with by buffering in the server (block references are only
small), but it still makes the code more fragile and more susceptible to
subtle bugs.
The second (and maybe even more important) reason is that we would have to put
the threshold X pretty low if we want to keep the chance of false positives
low enough that the protocol won't get the perception of being unreliable.
E.g. if a 100 kB page is 50% the same, we would already have a chance of
about 1/2000 of a false positive, which is in my opinion unacceptably high.
So we would indeed have to put the threshold at 4 or 5 blocks or so, meaning
that we would often be sending full pages instead of deltas where a delta
could have been sent when using a stronger checksum.
By choosing a larger CRC (e.g. CRC-64-ISO, which is used for HDLC, the
High-Level Data Link Control protocol), we avoid all these issues.
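To put rough numbers on the false-positive risk, here is a back-of-the-envelope
calculation in Python. It is only a sketch: it assumes the rolling CRC is
compared at every byte offset within the mismatching half of the page, against
all 40 hashes; the exact model depends on the protocol details.

```python
# Rough false-positive estimate for a rolling CRC (illustrative arithmetic).
page_size = 100 * 1024          # 100 kB page
mismatch_fraction = 0.5         # 50% of the page differs from the base
num_hashes = 40                 # hashes sent by the client

# Byte offsets at which the rolling CRC is compared against the hash list.
positions = int(page_size * mismatch_fraction)

# Each comparison collides by chance with probability 2**-n per hash.
p32 = positions * num_hashes / 2**32
p64 = positions * num_hashes / 2**64

print(f"32-bit CRC: ~1 in {1 / p32:,.0f}")   # → 32-bit CRC: ~1 in 2,097
print(f"64-bit CRC: ~1 in {1 / p64:,.0f}")
```

This reproduces the roughly 1/2000 figure for a 32-bit CRC; with 64 bits the
chance drops to about 1 in 9 trillion, which is negligible in practice.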
So we must balance the extra cost of a software-based 64-bit CRC instead of a
hardware-based 32-bit CRC against the cost of transmitting full responses
instead of delta responses in cases where the non-delta part would be larger
than, say, 20% or so. And this taking into consideration that one of the main
drivers for this project is the needs of the OLPC machines, many of which are
deployed in rural areas with dial-up connections. Which I assume means that
the bandwidth is 64 kbit/s at most, or more likely 56.6 kbps.
We also have to look at how much time the client and the server spend on the
CRC calculation versus the other parts of the protocol handling:
- For the client, it means decompressing literal data and copying matching
blocks from the base page to the reconstructed page. And obviously
calculating the SHA256 over the reconstructed page to detect corruptions.
- For the server, it means matching the rolling CRC against the list of
CRC-numbers, compressing the non-matched literal data and calculating the
SHA256 over the original response.
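To make the client-side work concrete, here is a minimal sketch of the
reconstruction step in Python. The delta representation and the names
(`apply_delta`, the "copy"/"literal" instruction tuples) are hypothetical
illustrations, not the actual crcsync wire format.

```python
import hashlib
import zlib

def apply_delta(base_page: bytes, delta, block_size: int) -> bytes:
    """Reconstruct a page from a crcsync-style delta (illustrative sketch).

    `delta` is a list of instructions: ("copy", block_index) copies one
    block from the base page; ("literal", compressed_bytes) inserts
    decompressed literal data.
    """
    out = bytearray()
    for kind, arg in delta:
        if kind == "copy":
            start = arg * block_size
            out += base_page[start:start + block_size]
        else:  # "literal"
            out += zlib.decompress(arg)
    return bytes(out)

# Base page: two 2048-byte blocks. The new page keeps block 0 and replaces
# block 1 with literal data.
base = b"A" * 2048 + b"B" * 2048
delta = [("copy", 0), ("literal", zlib.compress(b"C" * 2048))]
page = apply_delta(base, delta, block_size=2048)

# The client then verifies the reconstruction with SHA-256.
expected = hashlib.sha256(b"A" * 2048 + b"C" * 2048).hexdigest()
assert hashlib.sha256(page).hexdigest() == expected
```

This shows why the client-side cost is dominated by decompression and the
SHA-256 pass: the block copies themselves are plain memory moves.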
So in order to balance this, I have made a benchmark program that measures the
timing required for the various parts and executed it on different machines:
- A recent laptop with a 64-bit Intel CPU
- An old laptop with a 32-bit AMD Athlon XP
- An XO, with the 32-bit AMD Geode processor
In each case I tested with a copy of the Slashdot homepage from a few months
ago. It concerns a page of 84240 bytes, which compresses to 19684 bytes (when
using zlib with the default compression level, as is done in the crcsync-proxy
code). The benchmark simulates splitting the page into 40 complete blocks and
one tail block.
Here are the timing results for the slowest of the machines (the XO). All
times are in milliseconds:
Calculate the 64-bit CRC blocks: 4.9 ms
Decompression: 6.95 ms
Copy blocks: 0.4 ms
Calculate SHA256: 8.85 ms
Compression: 60.9 ms
Validate page with exact match: 55.4 ms
Validate page, everything mismatches: 145.15 ms
So it means that if the client receives a full match (only block numbers), the
client will spend 4.9+0.4+8.85 = 14.15 ms of CPU time.
Likewise, if it receives a full mismatch (only compressed literal data), the
CPU time will be 4.9+6.95+8.85 = 20.7 ms.
And in the case of a 50% match, the time will be
4.9+0.5*6.95+0.5*0.4+8.85 = 17.4 ms.
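These client-side sums can be reproduced directly from the XO measurements
above (all values in milliseconds; only the arithmetic from the mail, no new
measurements):

```python
# Per-step timings measured on the XO, in milliseconds.
crc64  = 4.90   # calculate the 64-bit CRC blocks
decomp = 6.95   # decompress literal data
copy   = 0.40   # copy matching blocks from the base page
sha256 = 8.85   # SHA-256 over the reconstructed page

full_match    = crc64 + copy + sha256                        # 14.15 ms
full_mismatch = crc64 + decomp + sha256                      # 20.70 ms
half_match    = crc64 + 0.5 * decomp + 0.5 * copy + sha256   # ~17.4 ms

print(round(full_match, 2), round(full_mismatch, 2), round(half_match, 2))
```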
Now, let's look at the time required to transport the response body, over a
56.6 kbps dial-up line:
full match: 41*2*10/56600=14 milliseconds
Note that I multiply with 10 to go from bytes to bits. I realize that there
are only 8 bits in a byte, but by taking a factor 10, I compensate for TCP/IP
and PPP protocol overhead. This worked very well to estimate download times
when I still had a dial-up line myself...
full mismatch: 19685*10/56600=3.4 seconds
50% match: (20*2+0.5*19685)*10/56600=1.7 seconds
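The same transfer-time estimates, reproduced in Python (the full-mismatch
figure comes out to about 3.48 s before rounding; the factor 10 bits/byte is
the mail's rule of thumb for TCP/IP and PPP overhead):

```python
# Dial-up transfer-time estimates from the mail's rule of thumb.
line_bps = 56600      # 56.6 kbps dial-up line
bits_per_byte = 10    # 8 data bits plus ~25% protocol overhead

compressed = 19685    # compressed full response, bytes
block_refs = 41       # block references in a full match, 2 bytes each

full_match    = block_refs * 2 * bits_per_byte / line_bps               # ~14.5 ms
full_mismatch = compressed * bits_per_byte / line_bps                   # ~3.48 s
half_match    = (20 * 2 + 0.5 * compressed) * bits_per_byte / line_bps  # ~1.75 s

print(round(full_match, 3), round(full_mismatch, 2), round(half_match, 2))
```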
So if we take a low-end machine like the XO, being able to return a 50% match
reduces the transfer time by 1.7 seconds. And that at the cost of spending
4.9 milliseconds to calculate the 64-bit CRC in software instead of ?
milliseconds with a hardware-based 32-bit CRC.
When it comes to the server, the numbers game obviously becomes different, as
the server will probably have to handle requests from many clients in
parallel, so the CPU time spent on the calculations is more important. But as
the numbers show, the time the server spends calculating the CRCs is very
small compared to the time spent on the other parts, like compressing the
non-matched data and validating the CRC numbers against the list of numbers
received from the client. So the performance benefit of calculating CRC
numbers in hardware would be minimal. The server could benefit more from
implementing the CRC validations in hardware, for example by storing the list
of numbers in an associative memory which can validate all numbers in one
clock cycle. And I'm pretty sure that a hardware manufacturer who can
implement that can also add a 64-bit rolling CRC calculator to the hardware.
After all, CRC was invented in 1961 exactly because it was (and still is) so
easy to implement in hardware.
Given these numbers, I strongly believe that we should settle for a 64-bit CRC
after all, but indeed for a standard one (like the example mentioned above)
instead of a non-standard 60-bit CRC.
Please note that the spreadsheet with the full results can be found in the GIT
repository, in the documentation subfolder. I have also checked in the
benchmark program.
Also note that on the 'normal' laptops, the absolute performance of the
various steps is significantly better than on the XO, although the relative
times seem to be more or less the same (e.g. CRC calculation time versus
compression time versus validation time).
Cheers,
Alex