[Http-crcsync] crccache ready for some testing I think

Sat Apr 4 18:03:06 EDT 2009

Hi Rusty,

Many thanks!!!

I'll adapt crccache code this sunday to support the improved algorithm and 
submit all to the git repo.

Hi Toby,

I'll change the crccache-client to send the original file size in stead  of 
the blocksize in the header. In that way, the crccache-server can calculate 
the blocksize and the tailsize by itself. It is more efficient then using two 
headers (one with blocksize, one with tailsize). I'll also adapt the client 
to put the crc of all 21 blocks (the 20 full blocks + the 1 tail block) in 
the header. I must still figure out what to do in case that the tail has a 
zero size (e.g. if filesize is exact multiple of 20) but I think that while 
writing the code, the answer will crystalize by itself.

Thanks and brs,
Alex

Op zaterdag 4 april 2009, schreef Rusty Russell:
> On Thursday 02 April 2009 09:30:38 Alex Wulms wrote:
> > Hi Rusty,
> >
> > I have come across another subtle bug in the crccache module that Toby
> > and myself are making, which I have meanwhile traced down to a
> > not-yet-complete understanding on how to use your library. In attachment
> > you can find another version of my test program, that I have used to
> > better my understanding of the library.
> >
> > Here is the output of the program as it would be if you would run it:
> > ./test
> > CRCs data1: 33dd97b6 0122a0a2 301f8cc8
> > CRCs data2: 33dd97b6 0122a0a2 3c28960e
> > remaining: 14, offset: 0
> > ndigested: 5, result: -1, searched-in: <ABCDEabcde123g>
> > remaining: 9, offset: 5
> > ndigested: 5, result: -2, searched-in: <abcde123g>
> > remaining: 4, offset: 10
> > ndigested: 4, result: 0, searched-in: <123g>
> > flush result: -3
> > flush result: 1
> > flush result: 0
> >
> > Basically, I work with a blocksize of 5, my 'original' string is a 13
> > character string (ABCDEabcde123) and my 'modified' string is a 14
> > character string (the original + 'g')
> >
> > What I notice is that the crc_block_read function first returns the
> > check-sums for the first two 5-byte blocks, which is as expected.
> >
> > However, the last invokation of crc_read_block returns a 'result=0' and
> > ndigested=4. So you keep the state about it in your internal buffer.
> > When I then invoke the flush function, I first get a reference to block
> > 3, which is only a 3 byte-block and not a five-byte-block and then you
> > return as last result that there is still 1 remaining non-matched
> > character (the g).
> >
> > So I guess that if I want to use your API correctly, that I should
> > specify the length of the tail-block (of the cached data) in the request,
> > so that I would be able to calculate the position of the 'g' character in
> > the above case. Is that correct?
>
> Yes, or we could drop this heuristic from the code, and only do whole-block
> matches.  I try *somewhat* to cover the append-case, but it won't happen in
>
> general as you see below:
> > I have also done another test, in which I have made the second string 5
> > characters longer then the 'original' string  (e.g. the original +
> > 'ggggg'). In such case I get following output:
> > ./test
> > CRCs data1: 33dd97b6 0122a0a2 301f8cc8
> > CRCs data2: 33dd97b6 0122a0a2 19886038 1d2c046e
> > remaining: 18, offset: 0
> > ndigested: 5, result: -1, searched-in: <ABCDEabcde123ggggg>
> > remaining: 13, offset: 5
> > ndigested: 5, result: -2, searched-in: <abcde123ggggg>
> > remaining: 8, offset: 10
> > ndigested: 8, result: 3, searched-in: <123ggggg>
> > flush result: 5
> > flush result: 0
> >
> > So in that case, the crclib does not detect that '123' matches the
> > tail-block of the original request. It returns it as a 'literal/mismatch'
> > block and then the remaining 5 characters are remembered in the internal
> > state and informed to me once I perform the flush. Is this also the
> > intended behaviour?
>
> Yes.  Currently the matching algo only looks for that truncated block match
> when it hits the end of input.  This is suboptimal:
>
> Block size 4:
> 	Original: AAAA A
> 	Modified: AAAA AB
>
> This works, since we get to the end of input and look harder for that final
> block.  This doesnt:
>
> 	Original: AAAA A
> 	Modified: AAAA ABBB B
>
> Now, we can do better; it might be worth looking for the last block
> explicitly after the second-to-last block matches.  This would be more
> efficient if we knew what length to match, which as you point out we
> probably want to send with the request anyway.
>
> Hmmm... ok, I've done that.  Interface breakage, but here's the patch:
>
> === modified file 'ccan/crcsync/crcsync.c'
> --- ccan/crcsync/crcsync.c	2009-03-31 05:18:59 +0000
> +++ ccan/crcsync/crcsync.c	2009-04-04 11:30:36 +0000
> @@ -2,6 +2,7 @@
>  #include <ccan/crc/crc.h>
>  #include <string.h>
>  #include <assert.h>
> +#include <stdbool.h>
>
>  void crc_of_blocks(const void *data, size_t len, unsigned int block_size,
>  		   unsigned int crcbits, uint32_t crc[])
> @@ -33,9 +34,14 @@
>  	size_t total_bytes;
>  	int have_match;
>
> +	/* Final block is special (if a different size) */
> +	size_t final_size;
> +	uint32_t final_crc;
> +
>  	/* Uncrc tab. */
>  	uint32_t uncrc_tab[256];
>
> +	/* This doesn't count the last CRC. */
>  	unsigned int num_crcs;
>  	uint32_t crc[];
>  };
> @@ -59,13 +65,27 @@
>  }
>
>  struct crc_context *crc_context_new(size_t block_size, unsigned crcbits,
> -				    const uint32_t crc[], unsigned num_crcs)
> +				    const uint32_t crc[], unsigned num_crcs,
> +				    size_t final_size)
>  {
>  	struct crc_context *ctx;
>
> +	assert(num_crcs > 0);
> +	assert(block_size > 0);
> +	assert(final_size > 0);
> +	assert(final_size <= block_size);
> +
>  	ctx = malloc(sizeof(*ctx) + sizeof(crc[0])*num_crcs);
>  	if (ctx) {
>  		ctx->block_size = block_size;
> +		if (final_size != block_size) {
> +			ctx->final_size = final_size;
> +			ctx->final_crc = crc[--num_crcs];
> +		} else {
> +			/* If this is 0, we never compare against it. */
> +			ctx->final_size = 0;
> +		}
> +
>  		/* Technically, 1 << 32 is undefined. */
>  		if (crcbits >= 32)
>  			ctx->crcmask = 0xFFFFFFFF;
> @@ -103,6 +123,14 @@
>  	return -1;
>  }
>
> +static bool final_matches(const struct crc_context *ctx)
> +{
> +	if (ctx->literal_bytes != ctx->final_size)
> +		return false;
> +
> +	return (ctx->running_crc & ctx->crcmask) == ctx->final_crc;
> +}
> +
>  static uint32_t crc_add_byte(uint32_t crc, uint8_t newbyte)
>  {
>  	return crc32c(crc, &newbyte, 1);
> @@ -144,10 +172,16 @@
>  	else
>  		old = NULL;
>
> -	while (!old || (crcmatch = crc_matches(ctx)) < 0) {
> +	while (ctx->literal_bytes < ctx->block_size
> +	       || (crcmatch = crc_matches(ctx)) < 0) {
>  		if (consumed == buflen)
>  			break;
>
> +		/* Increment these now in case we hit goto: below. */
> +		ctx->literal_bytes++;
> +		ctx->total_bytes++;
> +		consumed++;
> +
>  		/* Update crc. */
>  		if (old) {
>  			ctx->running_crc = crc_roll(ctx->running_crc,
> @@ -161,11 +195,13 @@
>  			ctx->running_crc = crc_add_byte(ctx->running_crc, *p);
>  			if (p == (uint8_t *)buf + ctx->block_size - 1)
>  				old = buf;
> +			/* We don't roll this csum, we only look for it after
> +			 * a block match.  It's simpler and faster. */
> +			if (final_matches(ctx)) {
> +				crcmatch = ctx->num_crcs;
> +				goto have_match;
> +			}
>  		}
> -
> -		ctx->literal_bytes++;
> -		ctx->total_bytes++;
> -		consumed++;
>  		p++;
>  	}
>
> @@ -180,8 +216,11 @@
>  		} else {
>  		have_match:
>  			*result = -crcmatch-1;
> -			ctx->literal_bytes -= ctx->block_size;
> -			assert(ctx->literal_bytes == 0);
> +			if (crcmatch == ctx->num_crcs)
> +				assert(ctx->literal_bytes == ctx->final_size);
> +			else
> +				assert(ctx->literal_bytes == ctx->block_size);
> +			ctx->literal_bytes = 0;
>  			ctx->have_match = -1;
>  			ctx->running_crc = 0;
>  			/* Nothing more in the buffer. */
> @@ -219,37 +258,9 @@
>  	return consumed;
>  }
>
> -/* We could try many techniques to match the final block.  We can
> - * simply try to checksum whatever's left at the end and see if it
> - * matches the final block checksum.  This works for the exact-match
> - * case.
> - *
> - * We can do slightly better than this: if we try to match the checksum
> - * on every block (starting with block_size 1) from where we end to EOF,
> - * we can capture the "data appended" case as well.
> - */
> -static size_t final_block_match(struct crc_context *ctx)
> -{
> -	size_t size;
> -	uint32_t crc;
> -
> -	if (ctx->num_crcs == 0)
> -		return 0;
> -
> -	crc = 0;
> -	for (size = 0; size < buffer_size(ctx); size++) {
> -		const uint8_t *p = ctx->buffer;
> -		crc = crc_add_byte(crc, p[ctx->buffer_start+size]);
> -		if ((crc & ctx->crcmask) == ctx->crc[ctx->num_crcs-1])
> -			return size+1;
> -	}
> -	return 0;
> -}
> -
>  long crc_read_flush(struct crc_context *ctx)
>  {
>  	long ret;
> -	size_t final;
>
>  	/* We might have ended right on a matched block. */
>  	if (ctx->have_match != -1) {
> @@ -263,20 +274,11 @@
>  		return ret;
>  	}
>
> -	/* Look for truncated final block. */
> -	final = final_block_match(ctx);
> -	if (!final) {
> -		/* Nope?  Just a literal. */
> -		ret = buffer_size(ctx);
> -		ctx->buffer_start += ret;
> -		ctx->literal_bytes -= ret;
> -		return ret;
> -	}
> -
> -	/* We matched (some of) what's left. */
> -	ret = -((int)ctx->num_crcs-1)-1;
> -	ctx->buffer_start += final;
> -	ctx->literal_bytes -= final;
> +	/* The rest is just a literal. */
> +	ret = buffer_size(ctx);
> +	assert(ctx->literal_bytes == ret);
> +	ctx->buffer_start = ctx->buffer_end = 0;
> +	ctx->literal_bytes = 0;
>  	return ret;
>  }
>
>
> === modified file 'ccan/crcsync/crcsync.h'
> --- ccan/crcsync/crcsync.h	2009-02-17 08:31:19 +0000
> +++ ccan/crcsync/crcsync.h	2009-04-04 11:24:03 +0000
> @@ -23,12 +23,14 @@
>   * @crcbits: the bits valid in the CRCs (<= 32)
>   * @crc: array of block crcs
>   * @num_crcs: number of block crcs
> + * @final_size: the final block size (<= blocksize).
>   *
>   * Returns an allocated pointer to the structure for crc_find_block,
>   * or NULL.  Makes a copy of @crc and @num_crcs.
>   */
>  struct crc_context *crc_context_new(size_t blocksize, unsigned crcbits,
> -				    const uint32_t crc[], unsigned num_crcs);
> +				    const uint32_t crc[], unsigned num_crcs,
> +				    size_t final_size);
>
>  /**
>   * crc_read_block - search for block matches in the buffer.
>
> > Please note that I run regularly into your tail-cases, due to the moving
> > window that I use to collect the chunks of data from the origin server
> > and feed them to your library, so it is important that I fully understand
> > how to use it properly.
> >
> > Many thanks if you can shed some further light on this so that I can
> > properly fix the crccache code.
> >
> > Thanks and kind regards,
> > Alex
> >
> > Op dinsdag 31 maart 2009, schreef Alex Wulms:
> > > I have fixed a few more bugs. Everything is checked-in to the
> > > repository. Please feel free to test it further.
> > >
> > > Op dinsdag 31 maart 2009, schreef Martin Langhoff:
> > > > On Tue, Mar 31, 2009 at 1:05 AM, Alex Wulms <alex.wulms at scarlet.be> 
wrote:
> > > > > With the new code, the compressed size is 5%.
> > > >
> > > > niiice.
> > > >
> > > > > Ps: slashdot sets 'cache private' headers so normally cache module
> > > > > does not want to cache them. I have overruled it via the
> > > > > configuration parameters
> > > >
> > > > For our use case, my thinking was that we would...
> > > >
> > > >  - use the standard mod_cache to cache "cacheable" content as per
> > > > HTTP headers
> > > >
> > > >  - use a separate storage to cache "uncacheable" content that is a
> > > > good candidate for crcsync...
> > > >
> > > > is that what you are thinking of as well?
> > > >
> > > > cheers,
> > > >
> > > >
> > > >
> > > > m
> > >
> > > _______________________________________________
> > > Http-crcsync mailing list
> > > Http-crcsync at lists.laptop.org
> > > http://lists.laptop.org/listinfo/http-crcsync