[Http-crcsync] Progress on implementing CRCSync client logic to base encoding on 'similar' pages in the cache

Alex Wulms alex.wulms at scarlet.be
Sat Jul 31 07:18:24 EDT 2010


Hi,

One of the to-dos for the proxy was the ability to use a 'similar' page
as the basis for the crcsync encoding when no exact match for the
requested URL can be found in the cache. The server side already
supports this: it can send a header containing a regular expression
that is matched against URLs, which tells the client which pages are
similar to each other.
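
To make the idea concrete, here is a minimal Python sketch of such a match.
The pattern below is a hypothetical example of what a server might send;
the real header value and regex dialect used by the module may differ.
The two URLs are taken from the log further down.

```python
import re

# Hypothetical similarity pattern (an assumption, not the module's real
# header value): all news articles on the same site count as similar.
SIMILAR_PATTERN = r"^http://tweakers\.net(:80)?/nieuws/\d+/.*$"


def is_similar(pattern: str, url: str) -> bool:
    """Return True when url matches the similarity regex from the server."""
    return re.search(pattern, url) is not None


cached_url = ("http://tweakers.net:80/nieuws/68893/"
              "hackers-demonstreren-rootkit-software-voor-android.html?")
requested_url = ("http://tweakers.net/nieuws/68891/"
                 "eu-gaat-investeren-in-reggefiber.html")

# Both articles match the pattern, so the cached one could serve as the
# basis for encoding the requested one.
print(is_similar(SIMILAR_PATTERN, cached_url))
print(is_similar(SIMILAR_PATTERN, requested_url))
# A page from a different section of the site does not match.
print(is_similar(SIMILAR_PATTERN, "http://tweakers.net/pricewatch/"))
```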

I'm currently working on the client side, to make use of that header.
At the moment, the updated code scans the disk cache during Apache
start-up to find all pages that have such a 'crcsync-similar' header
and stores some meta-info about those pages in memory. Then, when a
request has no exact match in the cache, it evaluates the regular
expressions to find the first matching page. This approach seems to
work reasonably well, as can be seen in the following log snippet from
browsing a news site, which shows that there are indeed matching blocks
between two different pages from the same site:

[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(1083):
Preparing CRCSYNC/delta-http for
*http://tweakers.net/nieuws/68891/eu-gaat-investeren-in-reggefiber.html*
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(1087): *No page
found in cache for requested URL. Trying to find page for similar URLs*
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(294): Comparing
content type text/html; charset=ISO-8859-15 versus accept
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(310): Content
type prefix: text, accept prefix: text
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(351): *Basing
CRCSYNC/delta-http for requested URL on cached page for URL
http://tweakers.net:80/nieuws/68893/hackers-demonstreren-rootkit-software-voor-android.html?*
of size 67218
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(1111): Read
file into bucket
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(1119):
crccache_client: 39 blocks of 1680 bytes, one block of 1698 bytes
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(1187):
Successfully prepared CRCSYNC/delta-http
....
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(1243): CRCSYNC
returned status code (200)
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(1264): Incoming
Vary header: Accept-Encoding,If-Block
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(1522):
CRCSYNC-DECODE inflate rslt 1, consumed 1749, produced 4883
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(1420):
CRCSYNC-DECODE *block section, block 3, size 1680u*
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(1420):
CRCSYNC-DECODE *block section, block 4, size 1680u*
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(1420):
CRCSYNC-DECODE *block section, block 5, size 1680u*
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(1420):
CRCSYNC-DECODE *block section, block 6, size 1680u*
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(1522):
CRCSYNC-DECODE inflate rslt 0, consumed 2, produced 0
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(2075): cache:
Caching url:
http://tweakers.net/nieuws/68891/eu-gaat-investeren-in-reggefiber.html
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(2081): cache:
Removing CACHE_REMOVE_URL filter.
[Sat Jul 31 12:58:07 2010] [debug] cache/cache.c(904): disk_cache:
Stored headers for URL
http://tweakers.net:80/nieuws/68891/eu-gaat-investeren-in-reggefiber.html?
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(1522):
CRCSYNC-DECODE inflate rslt 0, consumed 6275, produced 30000
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(1522):
CRCSYNC-DECODE inflate rslt 0, consumed 1719, produced 7165
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(1522):
CRCSYNC-DECODE inflate rslt 1, consumed 805, produced 2575
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(1420):
CRCSYNC-DECODE *block section, block 30, size 1680u*
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(1420):
CRCSYNC-DECODE *block section, block 31, size 1680u*
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(1522):
CRCSYNC-DECODE inflate rslt 1, consumed 1811, produced 8441
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(1420):
CRCSYNC-DECODE *block section, block 37, size 1680u*
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(1522):
CRCSYNC-DECODE inflate rslt 1, consumed 1268, produced 3456
[Sat Jul 31 12:58:07 2010] [debug] mod_crccache_client.c(1339):
CRCSYNC-DECODE HASH CHECK PASSED for uri
http://tweakers.net/nieuws/68891/eu-gaat-investeren-in-reggefiber.html
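
The numbers in the log above can be checked with a little arithmetic. The
following Python sketch assumes the cached page is split into 40 blocks
with the remainder added to the last one, which is consistent with the
"39 blocks of 1680 bytes, one block of 1698 bytes" line but is not taken
from the module source. The tally only covers the decode lines shown;
part of the log is elided ("...."), so it should not be read as a
complete accounting of the response.

```python
def block_layout(page_size: int, block_count: int = 40):
    """Split page_size into block_count blocks; the last takes the remainder.

    The block count of 40 is an assumption inferred from the log line.
    """
    block_size = page_size // block_count
    tail_size = block_size + page_size % block_count
    return block_size, tail_size


# Size of the cached 'similar' page, copied from the log.
page_size = 67218
block_size, tail_size = block_layout(page_size)

# Block numbers reported in the "block section" lines of the log.
matched_blocks = [3, 4, 5, 6, 30, 31, 37]

# (consumed, produced) pairs from the "inflate" lines, in log order.
inflate_steps = [(1749, 4883), (2, 0), (6275, 30000), (1719, 7165),
                 (805, 2575), (1811, 8441), (1268, 3456)]

# Bytes copied from the cached page instead of being sent as literal data.
reused_bytes = len(matched_blocks) * block_size
# Compressed bytes fed to inflate and literal bytes it produced.
compressed_bytes = sum(consumed for consumed, _ in inflate_steps)
literal_bytes = sum(produced for _, produced in inflate_steps)

print(block_size, tail_size)
print(reused_bytes, compressed_bytes, literal_bytes)
```

With these figures, seven blocks (11760 bytes) come from the cached page
of a different article, and 13629 compressed bytes expand to 56520 bytes
of literal data.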


The next step is to implement the logic that maintains the in-memory
structure for each newly retrieved page (as mentioned above, at the
moment it is only loaded at start-up). Once that is done, I'll check
the changes into the Git repository.
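
As a rough illustration of that structure, here is a minimal Python
sketch of an index that can be filled at start-up and updated whenever a
new page is stored. The class and method names (SimilarPageIndex, add,
find) and the regex are hypothetical; the actual module is written in C
and its data layout may look quite different.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SimilarPage:
    url: str              # URL of the cached page
    pattern: re.Pattern   # compiled regex from its similarity header


class SimilarPageIndex:
    """In-memory list of cached pages that carry a similarity regex."""

    def __init__(self) -> None:
        self._pages: List[SimilarPage] = []

    def add(self, url: str, regex: str) -> None:
        """Register a page, at start-up or when a new page is cached."""
        # Drop any older entry for the same URL before adding the new one.
        self._pages = [p for p in self._pages if p.url != url]
        self._pages.append(SimilarPage(url, re.compile(regex)))

    def find(self, requested_url: str) -> Optional[SimilarPage]:
        """Return the first page whose regex matches the requested URL."""
        for page in self._pages:
            if page.pattern.search(requested_url):
                return page
        return None


index = SimilarPageIndex()
# Hypothetical regex; the cached URL is the one seen in the log.
index.add(
    "http://tweakers.net:80/nieuws/68893/"
    "hackers-demonstreren-rootkit-software-voor-android.html?",
    r"^http://tweakers\.net(:80)?/nieuws/\d+/",
)

hit = index.find(
    "http://tweakers.net/nieuws/68891/eu-gaat-investeren-in-reggefiber.html")
miss = index.find("http://tweakers.net/pricewatch/")
print(hit.url if hit else None)
print(miss)
```

Calling add from the code path that stores a page in the cache would keep
the index current without rescanning the disk cache.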


Cheers,
Alex
