[Http-crcsync] General comments on crcsync document

Toby Collett toby.collett at gmail.com
Fri Jul 17 11:53:49 EDT 2009


Okay, so here is my analysis of the numbers, completely out of context; as
has been stated, we need to compare them with existing error rates.

I am using the following assumptions:
1) the hash is evenly distributed, which is wrong for CRC, but should give
us a ballpark to start from
2) the file sizes locally and remotely are the same
3) the worst case, where only a single hash might match, i.e. the source
and remote data are both random

Variables used:
N = number of hashes transmitted
S = file size
B = block size (which should be ~S/N)
H = hash size (in bits)
M = number of hashes calculated on the server side, which is S-B, or S(1-1/N)

so the number of hash comparisons X is
X = N*S(1-1/N)
which has only two variables: the file size and the number of hashes.

Assuming our hashes are effectively random bits, the chance of a collision is
1 in (2^H)/X
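As a sanity check on the formula, here is the arithmetic for a single
parameter set; the values below are just one row of the results table
further down, not anything measured from crcsync itself:

```python
# Sanity check of the collision formula for one parameter set.
# N=40, S=102400, H=32 are picked to match one row of the results
# table; nothing here comes from the crcsync code itself.
N = 40        # hashes transmitted
S = 102400    # file size in bytes (roughly a Slashdot front page)
H = 32        # hash width in bits
X = N * S * (1.0 - 1.0 / N)   # hash comparisons: 3,993,600
R = (2 ** H) / X              # collision odds: 1 in R
print("1 in %.2f" % R)        # 1 in ~1075
```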

so I wrote a bit of Python code to generate the numbers; a file size of 100k
is roughly the Slashdot front page, for example:

#!/usr/bin/python

# Collision odds for each combination of hash count, file size and hash width.
for N in (20, 40):                           # number of hashes transmitted
    for S in (1024, 102400, 1024000):        # file size in bytes
        X = N * S * (1.0 - 1.0 / N)          # total hash comparisons
        for H in (16, 24, 30, 32, 48, 60, 64):   # hash size in bits
            R = (2 ** H) / X                 # collision odds: 1 in R
            print("N=%d, S=%d, H=%d, 1 in %f" % (N, S, H, R))

This gives the following results:
python hashes.py
N=20, S=1024, H=16, 1 in 3.368421
N=20, S=1024, H=24, 1 in 862.315789
N=20, S=1024, H=30, 1 in 55188.210526
N=20, S=1024, H=32, 1 in 220752.842105
N=20, S=1024, H=48, 1 in 14467258260.210526
N=20, S=1024, H=60, 1 in 59257889833822.312500
N=20, S=1024, H=64, 1 in 948126237341157.000000
N=20, S=102400, H=16, 1 in 0.033684
N=20, S=102400, H=24, 1 in 8.623158
N=20, S=102400, H=30, 1 in 551.882105
N=20, S=102400, H=32, 1 in 2207.528421
N=20, S=102400, H=48, 1 in 144672582.602105
N=20, S=102400, H=60, 1 in 592578898338.223145
N=20, S=102400, H=64, 1 in 9481262373411.570312
N=20, S=1024000, H=16, 1 in 0.003368
N=20, S=1024000, H=24, 1 in 0.862316
N=20, S=1024000, H=30, 1 in 55.188211
N=20, S=1024000, H=32, 1 in 220.752842
N=20, S=1024000, H=48, 1 in 14467258.260211
N=20, S=1024000, H=60, 1 in 59257889833.822319
N=20, S=1024000, H=64, 1 in 948126237341.157104
N=40, S=1024, H=16, 1 in 1.641026
N=40, S=1024, H=24, 1 in 420.102564
N=40, S=1024, H=30, 1 in 26886.564103
N=40, S=1024, H=32, 1 in 107546.256410
N=40, S=1024, H=48, 1 in 7048151460.102564
N=40, S=1024, H=60, 1 in 28869228380580.101562
N=40, S=1024, H=64, 1 in 461907654089281.625000
N=40, S=102400, H=16, 1 in 0.016410
N=40, S=102400, H=24, 1 in 4.201026
N=40, S=102400, H=30, 1 in 268.865641
N=40, S=102400, H=32, 1 in 1075.462564
N=40, S=102400, H=48, 1 in 70481514.601026
N=40, S=102400, H=60, 1 in 288692283805.801025
N=40, S=102400, H=64, 1 in 4619076540892.816406
N=40, S=1024000, H=16, 1 in 0.001641
N=40, S=1024000, H=24, 1 in 0.420103
N=40, S=1024000, H=30, 1 in 26.886564
N=40, S=1024000, H=32, 1 in 107.546256
N=40, S=1024000, H=48, 1 in 7048151.460103
N=40, S=1024000, H=60, 1 in 28869228380.580101
N=40, S=1024000, H=64, 1 in 461907654089.281616

This indicates that with 40 hashes, a 32-bit hash, and a page about the size
of the Slashdot front page, we have roughly a 1-in-1000 error rate, assuming
completely random data. I am not sure how much impact the non-uniform
distribution of CRC outputs has on this: are we talking 10% better, or
factors of 10?

Now, I don't have heaps of experience with this sort of analysis, which is
why I included my full working, so people can point out where I went wrong.
But at the moment the error rate with 32 bits is potentially pretty high,
especially when you consider that many web pages require several requests to
display a single page.
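To illustrate that last point, here is a rough sketch of how the
per-transfer collision odds compound over a page made of several requests.
The 10-request figure is an assumption for illustration only, not a
measured average:

```python
# Per-transfer collision probability for N=40, S=102400, H=32,
# compounded over several independent transfers making up one page.
# The request count below is an assumed example, not measured data.
N, S, H = 40, 102400, 32
p = N * S * (1.0 - 1.0 / N) / (2 ** H)    # per-transfer collision chance
requests = 10                             # assumed requests per page
page_error = 1.0 - (1.0 - p) ** requests  # chance of >= 1 collision
print("per transfer: %.6f, per page: %.6f" % (p, page_error))
```

so ten transfers at roughly 1-in-1000 each push the per-page collision
chance close to 1%.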

Anyhow, hope this gives us something more concrete to continue our
discussions around.

Toby

