[Http-crcsync] General comments on crcsync document

Alex Wulms alex.wulms at scarlet.be
Wed Jul 15 17:11:08 EDT 2009


Hi,

How about making crc32c the default, which must be supported by every
crcsync server and client, while allowing in the protocol for other
algorithms to be added in future? That covers the case where pilot testing
by early adopters shows that crc32c gives too many collisions in practice,
or where crc32c works fine for most people but turns out to be too weak in
a specific deployment with a specific usage scenario.

E.g. the client indicates the algorithm in the If-Block header, and the
server may indicate all supported algorithms, in order of preference and
with the preferred blocksize multiple, in the capabilities header. If the
server only indicates that it is capable of crcsync, the client must assume
that the server only supports crc32c and that the server does not care
about the exact blocksize. However, if the server prefers a certain
blocksize multiple, it indicates that together with the algorithm, and the
client should then respect that.
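
Something like the following rough Go sketch could implement that selection
on the client side (purely illustrative; the alg/m parameter names and the
chooseParams helper are made up here, nothing is specified yet):

// Sketch: pick algorithm and block size from a Capability header of the
// form "crcsync, alg=crc32c; m=4, alg=md-mumble; m=4" (see the example
// further down). Defaults apply when the server only says "crcsync".
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// chooseParams returns the algorithm to use and a block size rounded up
// to the server's preferred multiple, if the server stated one.
func chooseParams(capability string, wantedBlockSize int) (string, int) {
	alg, blockSize := "crc32c", wantedBlockSize // defaults for a bare "crcsync"
	for _, part := range strings.Split(capability, ",") {
		part = strings.TrimSpace(part)
		if !strings.HasPrefix(part, "alg=") {
			continue // skip the bare "crcsync" token
		}
		fields := strings.Split(part, ";")
		alg = strings.TrimPrefix(strings.TrimSpace(fields[0]), "alg=")
		for _, f := range fields[1:] {
			f = strings.TrimSpace(f)
			if strings.HasPrefix(f, "m=") {
				if m, err := strconv.Atoi(strings.TrimPrefix(f, "m=")); err == nil && m > 0 {
					// round the wanted block size up to the preferred multiple
					blockSize = ((wantedBlockSize + m - 1) / m) * m
				}
			}
		}
		break // algorithms are listed in order of preference; take the first
	}
	return alg, blockSize
}

func main() {
	alg, bs := chooseParams("crcsync, alg=crc32c; m=4, alg=md-mumble; m=4", 18)
	fmt.Println(alg, bs) // crc32c 20
}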

Regarding the encoding of the checksums:
We could store them in network byte order in memory (4 bytes per checksum
in the case of crc32c) and then convert that array of bytes into a
base64-encoded string.
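
A rough Go sketch of that encoding, assuming 4-byte CRC-32C checksums (the
block contents and the encodeChecksumList helper are made up for
illustration):

// Sketch: compute one CRC-32C per block, store each checksum in network
// byte order (big-endian, 4 bytes) and base64-encode the whole array.
package main

import (
	"encoding/base64"
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

var castagnoli = crc32.MakeTable(crc32.Castagnoli) // CRC-32C polynomial

func encodeChecksumList(blocks [][]byte) string {
	buf := make([]byte, 0, 4*len(blocks))
	for _, block := range blocks {
		sum := crc32.Checksum(block, castagnoli)
		var b [4]byte
		binary.BigEndian.PutUint32(b[:], sum) // network byte order
		buf = append(buf, b[:]...)
	}
	return base64.StdEncoding.EncodeToString(buf)
}

func main() {
	// Made-up 45-byte body split into 20-byte blocks (last one partial).
	blocks := [][]byte{
		[]byte("first block of data "),
		[]byte("second block of data"),
		[]byte("tail "),
	}
	// Three 4-byte checksums -> 12 bytes -> 16 base64 characters.
	fmt.Println(encodeChecksumList(blocks))
}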

Putting it all together:

Example of a request header from the client (with 3 blocks; csl stands for
checksum list):

If-Block: alg=crc32c, fs=45, bs=20, csl=aaaabbbbccccdddd

Example of a capability header from the server, supporting crc32c and
md-mumble and preferring a blocksize multiple of 4 for both algorithms:

Capability: crcsync, alg=crc32c; m=4, alg=md-mumble; m=4
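
And, just to make the round trip concrete, a rough sketch of how the client
could compose the If-Block header from the negotiated values and the
encoded checksum list (buildIfBlock is a hypothetical helper; the values
simply reproduce the example request above):

package main

import "fmt"

// buildIfBlock composes the request header from the negotiated algorithm,
// the file size, the block size and the base64-encoded checksum list
// (values as produced by the two sketches above).
func buildIfBlock(alg string, fileSize, blockSize int, csl string) string {
	return fmt.Sprintf("If-Block: alg=%s, fs=%d, bs=%d, csl=%s",
		alg, fileSize, blockSize, csl)
}

func main() {
	fmt.Println(buildIfBlock("crc32c", 45, 20, "aaaabbbbccccdddd"))
	// -> If-Block: alg=crc32c, fs=45, bs=20, csl=aaaabbbbccccdddd
}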


Or am I now overcomplicating things?

Cheers,
Alex


Op maandag 13 juli 2009, schreef Patrick McManus:
> On Sun, 2009-07-12 at 16:37 +0200, Alex Wulms wrote:
> >  Should we make such hardware dependent optimizations
> > part of the specification?
>
> Well, you absolutely do want to take into account the realities and
> trends of machine organization, sure. That's not the same thing as
> optimizing for one specific implementation.
>
> crc-32c is ubiquitous in both software and hardware in part because it
> is well suited to a broad range of architectures. It is even good enough
> for iSCSI, and largely due to the fairly wide adoption of that protocol
> you are seeing crc32c implemented in commodity hardware (SSE4 on
> Nehalem). I suggest you ride that curve instead of pioneering a new
> path.
>
> I would argue that if you think 32 bits isn't enough, you should embrace
> a different standardized hash instead of something like crc-60, even
> though they are more expensive: md-mumble or sha-mumble. Those
> algorithms also have widely available, well-optimized software
> implementations and are also often implemented in hardware for the
> network and security processors generally used to build network
> appliances, and that is one kind of platform on which I would really
> expect to see crcsync widely deployed on the server side.
>
> In any event, 32 bits doesn't worry me a bit (ha ha! :)), primarily
> because crcsync does not rely on it for correctness - it has that
> overarching sha-256 to (more or less) guarantee correctness. It's also
> possible that the fact that the false positive is not independent data
> (i.e. it's not a random byte stream, it's another revision of the same
> URI which likely looks a lot like the revision you want) is going to
> reduce the number of false positives. CRCs don't give uniformly random
> distributions quite on purpose (i.e. they make lousy hash table
> functions).
>
> So yes, if you use 32 bits and repeat that process for 40 blocks in a
> transaction, and then have several dozen million transactions, the odds
> are good that one person will have the sha-256 invalidate their
> transaction and have to repeat what was a stateless, non-reliable
> transaction (i.e. HTTP) anyhow. That sounds ok to me, all things
> considered.



