[Http-crcsync] General comments on crcsync document

Toby Collett toby.collett at gmail.com
Wed Oct 28 15:41:53 EDT 2009


The current version in git now implements the standard document completely,
as far as I am aware (the document is available from git:
http://repo.or.cz/w/httpd-crcsyncproxy.git?a=tree;f=crccache/doc;h=37d90acd37bb0199a37e6d6a779c37c4f37da29b;hb=HEAD)

So now we need some testing. I'm not sure of the best way to do this;
Martin, do you want to set up access to a server?

Rusty: There was an assertion in the crc code that tail_size be less than
block_size. The latest version has tail_size = block_size + remainder. It
seems to work when that assertion is removed, and I couldn't see any reason
why the tail can't be larger in the current implementation. Could you
confirm?
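
For reference, here is a minimal sketch of the tail-size relationship as I
understand it (illustrative names only, not the actual crcsync API):

#include <stdio.h>

/* Split a page of page_len bytes into nblocks full blocks plus one tail,
 * with tail_size = block_size + remainder.  The tail can then approach
 * twice block_size, so an assertion tail_size < block_size would fire
 * even though reconstruction still works. */
int main(void)
{
    size_t page_len = 84240;   /* e.g. the slashdot test page */
    size_t nblocks  = 40;      /* number of complete blocks   */
    size_t block_size = page_len / (nblocks + 1);
    size_t remainder  = page_len % (nblocks + 1);
    size_t tail_size  = block_size + remainder;

    printf("block_size=%zu tail_size=%zu\n", block_size, tail_size);
    return 0;
}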

Toby

2009/10/14 Toby Collett <toby.collett at gmail.com>

> I have spent some time reading over the mailing list threads and have
> updated the spec document to reflect the discussions there. Attached is
> the latest version for review; most of the changes are in the protocol
> details section at the end of the document.
>
> These changes are not reflected in the implementation yet; that will
> happen over the next couple of weeks.
>
> Regards,
> Toby
>
> 2009/7/30 Alex Wulms <alex.wulms at scarlet.be>
>
>> On Saturday 18 July 2009, Patrick McManus wrote:
>> > On Sat, 2009-07-18 at 10:51 +0200, Toby Collett wrote:
>> > > The problem here is that a non-buffering implementation of a chunked
>> > > transfer doesn't know it has only got one hash match before it
>> > > starts sending the data...
>> >
>> > What I was thinking was more along the lines of "I haven't seen a hash
>> > match even after calculating 4(?) full windows' worth of data - I'll
>> > stop looking, because even if this is a legitimate delta it is going
>> > to be a badly performing one".
>> >
>> > That bounds the number of hashes, so it should really help with the
>> > false positive problem, and it doesn't pose a problem for streaming
>> > the response.
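>> >
>> > A minimal sketch of that bail-out heuristic (illustrative names and a
>> > stand-in callback for the rolling-hash lookup, not the actual crcsync
>> > code):
>> >
>> > #include <stdbool.h>
>> > #include <stddef.h>
>> > #include <stdint.h>
>> >
>> > /* Give up on delta encoding once max_windows * block_size bytes have
>> >  * gone by without a single block match; the caller then streams a
>> >  * full response instead. */
>> > bool worth_delta_encoding(const uint8_t *buf, size_t len,
>> >                           size_t block_size, int max_windows,
>> >                           bool (*hash_matches)(const uint8_t *, size_t))
>> > {
>> >     size_t unmatched = 0;  /* bytes seen since the last match */
>> >     for (size_t i = 0; i + block_size <= len; i++) {
>> >         if (hash_matches(buf + i, block_size)) {
>> >             unmatched = 0;
>> >             i += block_size - 1;   /* skip past the matched block */
>> >         } else if (++unmatched >= (size_t)max_windows * block_size) {
>> >             return false;          /* e.g. max_windows == 4 */
>> >         }
>> >     }
>> >     return true;
>> > }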
>> I have been thinking about this and, to be honest, I don't like this
>> approach very much, for two reasons.
>>
>> The first reason is that it will make the server implementation
>> needlessly complex. It would have to deal, for example, with the
>> situation where the first few blocks match (e.g. because of common page
>> headers, menus, etc.) and then the next X blocks mismatch. In such a
>> case, when X is high enough (4 for example), we would have to decide to
>> send a full response after all. This can of course be dealt with by
>> buffering stuff in the server (block references are only small), but it
>> would still make the code more fragile and more susceptible to subtle
>> bugs.
>>
>> The second, and maybe even more important, reason is that we would have
>> to put the threshold X pretty low if we want to keep the chance of
>> false positives low enough that the protocol won't get the perception
>> of being unreliable.
>>
>> E.g. if a 100 kB page were 50% the same, we would already have a chance
>> of roughly 1/2000 of a false positive, which is in my opinion
>> unacceptably high. So we would indeed have to put the threshold at 4 or
>> 5 blocks or so, meaning that we would often be sending full pages
>> instead of deltas where a delta could have been sent when using a
>> stronger checksum.
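>>
>> For what it's worth, here is a back-of-the-envelope check of that
>> 1/2000 figure (my assumptions: ~50 kB of non-matching data gives
>> ~50000 rolling-window positions, each compared against ~40 block
>> hashes):
>>
>> #include <stdio.h>
>>
>> /* Expected number of CRC collisions: window positions times block
>>  * hashes, divided by the size of the hash space. */
>> int main(void)
>> {
>>     double positions = 50000.0;  /* rolling windows over ~50 kB  */
>>     double hashes    = 40.0;     /* block hashes from the client */
>>     double p32 = positions * hashes / 4294967296.0;            /* 2^32 */
>>     double p64 = positions * hashes / 18446744073709551616.0;  /* 2^64 */
>>     printf("32-bit CRC: %g (about 1 in %.0f)\n", p32, 1.0 / p32);
>>     printf("64-bit CRC: %g\n", p64);
>>     return 0;
>> }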
>>
>> By choosing a larger CRC (e.g. CRC-64-ISO, which is used for HDLC /
>> High-Level Data Link Control), we avoid all these issues.
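>>
>> For illustration, a minimal bit-at-a-time sketch of that CRC (generator
>> x^64 + x^4 + x^3 + x + 1, i.e. 0x1B in normal form; the init and final
>> XOR conventions here are assumptions, and a real implementation would
>> use a table-driven or rolling variant):
>>
>> #include <stddef.h>
>> #include <stdint.h>
>>
>> #define CRC64_ISO_POLY 0x000000000000001BULL
>>
>> uint64_t crc64_iso(const uint8_t *buf, size_t len)
>> {
>>     uint64_t crc = 0;   /* init value is an assumption here */
>>     for (size_t i = 0; i < len; i++) {
>>         crc ^= (uint64_t)buf[i] << 56;   /* feed next byte, MSB first */
>>         for (int bit = 0; bit < 8; bit++)
>>             crc = (crc & 0x8000000000000000ULL)
>>                 ? (crc << 1) ^ CRC64_ISO_POLY
>>                 : (crc << 1);
>>     }
>>     return crc;
>> }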
>>
>> So we must balance the extra cost of a software-based 64-bit CRC,
>> instead of a hardware-based 32-bit CRC, against the cost of
>> transmitting full responses instead of delta responses in cases where
>> the non-delta part would be larger than, say, 20% or so. And this
>> taking into consideration that one of the main drivers for this project
>> is the needs of the OLPC machines, many of which are deployed in rural
>> areas with dial-up connections, which I assume means that the bandwidth
>> is 64 kbit/s at most, or more likely 56.6 kbps.
>>
>> We also have to look at how much time the client and the server spend
>> on the CRC calculation versus the other parts of the protocol handling:
>> - For the client, it means decompressing literal data and copying
>> matching blocks from the base page to the reconstructed page (as
>> sketched below), and obviously calculating the SHA256 over the
>> reconstructed page to detect corruption.
>> - For the server, it means matching the rolling CRC against the list of
>> CRC numbers, compressing the non-matched literal data and calculating
>> the SHA256 over the original response.
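>>
>> As a rough sketch of that client-side reconstruction loop (illustrative
>> types, ignoring the longer tail block, and buffering rather than
>> streaming):
>>
>> #include <stddef.h>
>> #include <stdint.h>
>> #include <string.h>
>>
>> /* A delta is a sequence of instructions: either "copy block n from
>>  * the base page" or "append this (already inflated) literal data".
>>  * The caller afterwards checks the SHA256 over out[0..result). */
>> struct delta_op {
>>     int block;               /* >= 0: base block number; -1: literal */
>>     const uint8_t *literal;  /* decompressed literal data            */
>>     size_t literal_len;
>> };
>>
>> size_t apply_delta(const struct delta_op *ops, size_t nops,
>>                    const uint8_t *base, size_t block_size, uint8_t *out)
>> {
>>     size_t pos = 0;
>>     for (size_t i = 0; i < nops; i++) {
>>         if (ops[i].block >= 0) {   /* copy a matched block */
>>             memcpy(out + pos, base + (size_t)ops[i].block * block_size,
>>                    block_size);
>>             pos += block_size;
>>         } else {                   /* append literal data  */
>>             memcpy(out + pos, ops[i].literal, ops[i].literal_len);
>>             pos += ops[i].literal_len;
>>         }
>>     }
>>     return pos;
>> }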
>>
>> So in order to balance this, I have made a benchmark program that
>> measures the timing required for the various parts, and executed it on
>> different machines:
>> - A recent laptop with a 64-bit Intel CPU
>> - An old laptop with a 32-bit AMD Athlon XP
>> - An XO, with the 32-bit AMD Geode processor
>>
>> Each time I have tested with a copy of the slashdot homepage from a few
>> months ago. It concerns a page of 84240 bytes, which compresses to
>> 19684 bytes (when using zlib with the default compression level, as is
>> done in the crcsync-proxy code).
>>
>> The benchmark has been done by simulating that the page is split into
>> 40 complete blocks and one tail block.
>>
>> Here are the timing results for the slowest of the machines (the XO).
>> All times are in milliseconds:
>> Calculate the 64-bit CRC blocks: 4.9 ms
>> Decompression: 6.95 ms
>> Copy blocks: 0.4 ms
>> Calculate SHA256: 8.85 ms
>> Compression: 60.9 ms
>> Validate page with exact match: 55.4 ms
>> Validate page, everything mismatches: 145.15 ms
>>
>> So it means that if the client receives a full match (only block
>> numbers), the client will spend 4.9+0.4+8.85 = 14.15 ms of CPU time.
>> Likewise, if it receives a full mismatch (only compressed literal
>> data), the CPU time will be 4.9+6.95+8.85 = 20.7 ms. And in the case of
>> a 50% match, the time will be 4.9+0.5*6.95+0.5*0.4+8.85 = 17.4 ms.
>>
>> Now, let's look at the time required to transport the response body
>> over a 56.6 kbps dial-up line:
>> full match: 41*2*10/56600 = 14 milliseconds
>> full mismatch: 19685*10/56600 = 3.4 seconds
>> 50% match: (20*2+0.5*19685)*10/56600 = 1.7 seconds
>> Note that I multiply by 10 to go from bytes to bits. I realize that
>> there are only 8 bits in a byte, but by taking a factor of 10 I
>> compensate for TCP/IP and PPP protocol overhead. This worked very well
>> for estimating download times when I still had a dial-up line myself...
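>>
>> The same estimates in code form, in case someone wants to play with the
>> numbers (56600 bps link, factor 10 bits per byte as explained above):
>>
>> #include <stdio.h>
>>
>> static double xfer_seconds(double bytes)
>> {
>>     return bytes * 10.0 / 56600.0;   /* 10 bits/byte covers overhead */
>> }
>>
>> int main(void)
>> {
>>     printf("full match:    %.3f s\n", xfer_seconds(41 * 2));
>>     printf("full mismatch: %.3f s\n", xfer_seconds(19685));
>>     printf("50%% match:    %.3f s\n", xfer_seconds(20 * 2 + 0.5 * 19685));
>>     return 0;
>> }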
>>
>> So if we take a low-end machine like the XO, being able to return a 50%
>> match reduces the transfer time by 1.7 seconds, and that at the cost of
>> spending 4.9 milliseconds to calculate the 64-bit CRC in software
>> instead of ? milliseconds with a hardware-based 32-bit CRC.
>>
>> When it comes to the server, the numbers game obviously becomes
>> different, as the server will probably have to handle requests from
>> many clients in parallel, so the CPU time spent on the calculations is
>> more important. But as the numbers show, the time to calculate the CRCs
>> is, for the server, very small compared to the time spent on the other
>> parts, like compressing the non-matched data and validating the CRC
>> numbers against the list of numbers received from the client. So the
>> performance impact of calculating CRC numbers in hardware is minimal.
>> The server can benefit more from implementing the CRC validations in
>> hardware, for example by storing the list of numbers in an associative
>> memory which can validate all numbers in one clock cycle. And I'm
>> pretty sure that a hardware manufacturer who can implement that can
>> also add a 64-bit rolling CRC calculator to the hardware. After all,
>> CRC was invented in 1961 exactly because it was (and still is) so easy
>> to implement in hardware.
>>
>> Given these numbers, I strongly believe that we should settle for a
>> 64-bit CRC after all, but indeed for a standard one (like the example
>> mentioned above) instead of a non-standard 60-bit CRC.
>>
>> Please note that the spreadsheet with the full results can be found in
>> the GIT repository, in the documentation subfolder. I have also
>> checked in the benchmark program.
>>
>> Also note that on the 'normal' laptops the absolute performance of the
>> various steps is significantly better than on the XO, although the
>> relative times seem to be more or less the same (e.g. CRC calculation
>> time versus compression time versus validation time).
>>
>> Cheers,
>> Alex
>>
>
>