[Http-crcsync] http-sync standard

Alex Wulms alex.wulms at scarlet.be
Tue Apr 28 18:45:16 EDT 2009


Hi,

Please find in attachment my current ideas on the http-crcsync protocol. 

I hope you don't mind but I have drafted the paper in open office and not as a 
plain text file. An open office version makes it easier to visibly divide the 
document in logical sections in order to enhance the readability and the 
structure. Though, I have added a text export for convenience, in case 
someone might not have OO available.

As I mentioned yesterday in this list, the paper is based on the insights that 
have grown while working on the crccache proxy modules and after studying the 
various papers on delta-encoding (the RFC, Toby's draft and the protocol 
proposal on the original rproxy site) and everything related to it.

Like Toby proposed, maybe it would make sense to convert this into a wiki and 
continue the work there. What do you think?

Cheers,
Alex





Op maandag 6 april 2009, schreef tridge at samba.org:
> Hi Toby,
>
>  > This is transmitted in two forms. One is an SHA1 hash of the complete
>  > cached file, and the other is a set of block hashes.
>
> ...
>
>  > Content-Hash:
>  > This will be an sha1 hash of the entire cached body, and will allow the
>  > server to transmit deltas based on its knowledge of past versions of the
>  > page.
>
> How do you imagine that this hash would be used? I don't think it is
> practical to think that servers would keep a record of all the dynamic
> pages they have been served out, and for static pages I think the
> normal cache tag mechanisms already work well.
>
> I think a SHA1 or similar of the server's generated page is really
> worthwhile, but a whole file hash of what the client has in cache
> doesn't gain anything that I can see.
>
> btw, I also wonder if you've seen this:
>
>   http://www.ietf.org/rfc/rfc3229.txt
>
> That is the result of earlier efforts to standardise delta encoding in
> HTTP. I was a little bit involved in that effort, although it
> concentrated on storing old copies on the server, which I wasn't
> interested in.
>
> Still, it should provide some very useful background to the current
> effort.
>
> Cheers, Tridge
> _______________________________________________
> Http-crcsync mailing list
> Http-crcsync at lists.laptop.org
> http://lists.laptop.org/listinfo/http-crcsync


-------------- next part --------------
A non-text attachment was scrubbed...
Name: http_crcsync_protocol.odt
Type: application/vnd.oasis.opendocument.text
Size: 36855 bytes
Desc: not available
Url : http://lists.laptop.org/pipermail/http-crcsync/attachments/20090429/d284afa5/attachment-0001.odt 
-------------- next part --------------
Introduction
The purpose of this paper is to discuss and eventually describe the http-crcsync protocol. The http-crcsync protocol should work as an extension to the HTTP 1.1 protocol, described in RFC 2616 [5]. Its intended purpose is to minimize the amount of data transferred over slow links when fetching updates to web pages or other resources, by applying delta-encoding between the old resource instance known to the client and the updated resource instance generated by the origin server.
The http-crcsync protocol is based on ideas introduced in the rproxy project [1], its associated HTTP Rsync Protocol [2] description and RFC 3229 - Delta encoding in HTTP [3], and further elaborates on the first draft of the http-sync standard proposal [6].
Components
The following picture shows the various components that play a role in the discussion:
+------+    +-------+    +-------+    +-------+    +-------+    +-------+
|HTTP  |    |Classic|    |Crcsync|    |Crcsync|    |Classic|    |Content|
|client+<-->+cache  +<-->+cache  +<-->+cache  +<-->+cache  +<-->+server |
|      |    |client |    |client |    |server |    |server |    |       |
+------+    +-------+    +-------+    +-------+    +-------+    +-------+ 
Deployment and implementation scenarios
Please be aware that the picture does not depict at which location each component will be implemented and deployed. This section introduces three possible deployment and implementation scenarios (scen1, scen2 and scen3).
SCEN1: One possibility could be that the HTTP client and the classic cache client are implemented in a web browser, the crcsync cache client and server in two web proxy servers and the classic cache server and content server in a web site that acts as the origin server. This deployment scenario was originally foreseen in the rproxy project. It will very likely also be used during a (probably lengthy) transition phase, until all web browsers and web sites understand the http-crcsync protocol.
SCEN2: Another possibility could be that all client components are implemented in some web browsers, the crcsync cache server in a web proxy server and the classic cache server and content server again in the web site that acts as the origin server. This scenario may arise fairly soon; the Mozilla Foundation is planning to support the http-crcsync protocol in the Firefox web browser. The implementation will be started as part of the 2009 Google Summer of Code initiative.
SCEN3: The last possibility could be that all client components (HTTP client, classic cache client and crcsync cache client) would be implemented in a web browser while all server components would be implemented in the web site that acts as the origin server. It will probably take a while before this vision is realized. The most likely set-up would actually be that the crcsync cache server functionality would be added to the reverse proxy server that is typically deployed by a web site owner in front of the internal server farm. Such a reverse proxy server acts as an origin server from the perspective of an HTTP client, shields the internal server farm infrastructure from the internet and usually performs central tasks like load-balancing requests and handling authentication and authorization for protected parts of a web site.
Component roles
HTTP client: This is the client (e.g. a browser) that asks for a new resource. The HTTP client submits the request to the classic cache client, which could be embedded in the same application or could be running elsewhere, e.g. in a proxy server.
Classic cache client and server: This is a cache implementing the HTTP 1.0 or 1.1 cache protocol. The classic cache client checks if it has a fresh instance of the requested resource in local storage. If so, it returns it immediately to the HTTP client. Otherwise, it asks the classic cache server (which is usually part of the origin server) for a new instance of the resource. The request can be made conditional by using headers like 'if-modified-since' or 'if-match' if the client has a stale instance in local storage. The classic cache server can respond with a "304 - not modified" response (for a conditional request on a resource that turns out to be up to date) or with a "200 - OK" response containing a new instance (for non-conditional requests and for conditional requests whose cached instance is out of date). The server response may contain headers that indicate the expected lifetime of the response (e.g. an expires header). It may also contain headers that indicate that the response should not be cached at all, or that it may be cached in a private cache (e.g. in the browser) but not in a shared cache (e.g. a web proxy server with a cache module).
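The server side of this classic validation step can be sketched as follows. This is only a minimal illustration of the 304-versus-200 decision described above, assuming a last-modified based validator and dict-shaped headers; the function name and signature are hypothetical.

```python
from email.utils import parsedate_to_datetime

def classic_validate(request_headers, resource_last_modified, resource_body):
    """Answer a conditional request with 304 when the client's copy is
    still current; otherwise send the full instance with 200.
    resource_last_modified must be a timezone-aware datetime."""
    ims = request_headers.get("If-Modified-Since")
    if ims is not None and parsedate_to_datetime(ims) >= resource_last_modified:
        return 304, b""          # client copy is up to date, send no body
    return 200, resource_body    # send a complete new instance
```

A real server would additionally honour 'if-match'/etag validators and attach expiration headers; the all-or-nothing nature of the decision is the point here.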
This whole classic cache protocol is based on an all or nothing approach; a resource instance is entirely up to date or a resource instance is entirely obsolete and must be completely replaced.
This works reasonably well for sites serving static documents, which almost never change. However, it does not work well for dynamic sites. So in practice, dynamic sites use cache control headers that prevent caching altogether or that force a revalidation on each request (by means of a conditional request).
In the above depicted flow, the requests between the classic cache client and server will flow through the crcsync cache client and server, to alleviate this problem.
Crcsync cache client and server: This cache will implement the new http-crcsync protocol. The crcsync cache client checks if it has an instance of the requested resource, or of a similar resource, in local storage. A similar resource could for example be another page from the same web site that may have the same menus, headers, footers, side-bars, etc. as the newly requested page. It is still unclear how the crcsync cache could identify a similar resource for a newly requested page. One option could be that a configuration file contains (regular expression) patterns to indicate which resources are similar, or that a crcsync-aware server could pass such information in a response header.
If the crcsync cache does not have an appropriate instance, it will forward the request unmodified to the crcsync cache server, which will forward it unmodified to the classic cache server. Likewise, the response will be passed unmodified along the chain.
On the other hand, if an instance is present in the local storage, the client will split this instance into several blocks, calculate a CRC checksum per block and add this list of checksums into a new header in the request before forwarding it to the crcsync cache server. 
The current implementation splits the file into 40 equally sized blocks, plus a trailer block if the file size is not a multiple of 40. It is still an open question whether 40 blocks is the optimum; for small files, it may have a significant impact on the request header size, compared to the savings that can be gained on the response. For a large file, however, it might not be fine-grained enough; when changes are scattered over several places in a page, they may impact a significant number of the blocks, leading to a large delta. So maybe the client should use some heuristic to determine the number of blocks, based on the instance size, and specify this number of blocks in a request header.
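The block-splitting step above can be sketched as follows. This is only an illustration, assuming Python's zlib.crc32 as a stand-in for whatever CRC the crcsync library actually uses; the header name in the comment is hypothetical.

```python
import zlib

NUM_BLOCKS = 40  # fixed block count used by the current implementation

def block_checksums(body: bytes, num_blocks: int = NUM_BLOCKS):
    """Split a cached instance into equally sized blocks plus an optional
    trailer block, and compute a weak CRC checksum per block."""
    block_size = max(1, len(body) // num_blocks)
    checksums = []
    for i in range(num_blocks):
        block = body[i * block_size:(i + 1) * block_size]
        checksums.append(zlib.crc32(block) & 0xFFFFFFFF)
    trailer = body[num_blocks * block_size:]  # non-empty iff size % 40 != 0
    if trailer:
        checksums.append(zlib.crc32(trailer) & 0xFFFFFFFF)
    return block_size, checksums

# The client would then serialize block_size and the checksum list into a
# new request header, e.g. (hypothetical name and format):
# Crcsync-Blocks: <block_size>;<crc1>,<crc2>,...
```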
Once the crcsync cache server has received the request with the crcsync header(s), it will ask the classic cache server for the resource. If the classic cache server responds with a "304 - not modified" response, it can pass this response unmodified down the chain to the classic cache client. On the other hand, if the classic cache server responds with a "200 - OK" response, the crcsync cache server will try to match the content of the response against the list of CRC checksums received from the client. The response will contain the block number for each block that matches and the literal content for blocks that cannot be matched. So in effect, the crcsync cache server returns a delta between the instance that the crcsync cache client has in local storage and the instance that the origin server has generated. The crcsync cache client can then reconstruct this new instance, update its local storage and return the reconstructed new instance to the classic cache client. In order to detect corruptions in the reconstructed instance that could arise from a CRC collision in the crcsync cache server, the crcsync cache server should also calculate a strong hash over the new instance (a CRC is a weak hash, so the risk of collisions is relatively high) and return that to the crcsync cache client, which should validate it. In case of a mismatch, the crcsync cache client should return an error condition to the classical cache client and discard the original instance from its local store, to prevent the same error from recurring when the user retries the request. The response containing the delta information, provided by the crcsync cache server to the crcsync cache client, will be referred to as the delta response in the remainder of this paper.
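Both halves of this exchange can be sketched as below, again with zlib.crc32 standing in for the real weak checksum and SHA-1 as the strong hash. Note that a real crcsync implementation uses a rolling checksum so matches can be found at arbitrary offsets; this simplified sketch only checks block-aligned positions, and all names are illustrative.

```python
import hashlib
import zlib

def encode_delta(new_body, block_size, checksums):
    """Server side: match the fresh instance against the client's weak
    checksums; emit block references for matches, literals otherwise.
    Also returns the strong hash the client must validate against."""
    index = {c: i for i, c in enumerate(checksums)}
    delta, pos = [], 0
    while pos < len(new_body):
        chunk = new_body[pos:pos + block_size]
        i = index.get(zlib.crc32(chunk) & 0xFFFFFFFF)
        delta.append(("block", i) if i is not None else ("literal", chunk))
        pos += len(chunk)
    return delta, hashlib.sha1(new_body).hexdigest()

def decode_delta(delta, strong_hash, old_body, block_size):
    """Client side: reconstruct the new instance from the cached one and
    validate the strong hash to detect a CRC collision."""
    parts = []
    for kind, val in delta:
        if kind == "block":
            parts.append(old_body[val * block_size:(val + 1) * block_size])
        else:
            parts.append(val)
    new_body = b"".join(parts)
    if hashlib.sha1(new_body).hexdigest() != strong_hash:
        raise ValueError("strong hash mismatch: discard cached instance")
    return new_body
```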
Applicability of crcsync cache and compression issues
The crcsync cache can be used for two applications:
1)Accelerate access to unmodified resources that are served by an origin server that does not properly implement the classic cache protocol but does serve semi-static content. This sometimes happens with naively implemented web applications that don't set any cache headers at all and always respond with a 200 OK response. In such a case, for a non-modified instance, the delta response will simply consist of the list of block numbers for all blocks. This works fine for any resource type (text, html, gif, jpg, mpg, zip, etc.).
2)Accelerate access to modified resource instances, by returning a real delta in the delta response.
The second point works fine for resources in which changes are local and don't have a knock-on effect on the entire byte stream. 
This means that it will work fine, for example, for a non-compressed text or html file. However, it won't work well for a zip file or any other resource compressed with a dictionary-based compression algorithm like gzip, compress or deflate. The reason is that if a byte changes somewhere early in the original (non-compressed) source, that change impacts how the internal dictionary is built by the compression algorithm and, as a result, has a knock-on effect on a significant part of the remaining compressed data stream. So the delta response would consist of almost the entire original response and the crcsync cache would not provide much acceleration at all.
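This knock-on effect is easy to demonstrate. The sketch below (using Python's zlib as a representative dictionary-based compressor) compresses two bodies that are identical except for their very first byte: uncompressed, everything after byte 0 matches block-for-block, while the compressed streams diverge almost immediately.

```python
import zlib

# Two bodies that differ only in their first byte.
base = b"<html><body>" + b"<p>lorem ipsum dolor sit amet</p>" * 100 + b"</body></html>"
body_a = b"A" + base
body_b = b"B" + base

def common_prefix_len(x: bytes, y: bytes) -> int:
    """Number of leading bytes the two streams have in common."""
    n = 0
    while n < min(len(x), len(y)) and x[n] == y[n]:
        n += 1
    return n

# Uncompressed, a block-wise delta would be tiny (only the first block
# differs). Compressed, the streams diverge within the first few dozen
# bytes, so nearly every block of the delta would be a literal.
comp_a = zlib.compress(body_a)
comp_b = zlib.compress(body_b)
```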
Given the fact that gzip, compress and deflate can be negotiated as a content-encoding between the http-client and the origin server, a crcsync cache sitting in between has to take special precautions, especially in deployment scenario [scen1]. In that case, the crcsync cache should remove the compression before applying the crcsync delta calculations: the crcsync cache client should calculate the CRC hashes on the non-compressed representation of the resource, and the crcsync cache server should do the comparison based on the non-compressed representation as well. Note that in order to minimize the size of the delta response on the network, the literal blocks in the delta response should be compressed by the crcsync cache server and can then be uncompressed by the crcsync cache client.
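These two precautions can be sketched as follows: normalize the body to the identity encoding before any checksumming or matching, and compress only the literal chunks for transport. Function names are illustrative, not part of any proposed API.

```python
import gzip
import zlib

def body_for_matching(response_body: bytes, content_encoding: str) -> bytes:
    """Both crcsync endpoints must agree on the representation the CRCs
    describe; normalize to the identity encoding before checksumming."""
    if content_encoding == "gzip":
        return gzip.decompress(response_body)
    if content_encoding in ("identity", ""):
        return response_body
    raise ValueError("unsupported content-encoding: %s" % content_encoding)

def pack_literal(chunk: bytes) -> bytes:
    """Server side: deflate an individual literal chunk of the delta
    response, so unmatched data still travels compressed."""
    return zlib.compress(chunk)

def unpack_literal(payload: bytes) -> bytes:
    """Client side: inflate a literal chunk before splicing it into the
    reconstructed instance."""
    return zlib.decompress(payload)
```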
An open question is whether the crcsync cache should hide from the http-client that it is uncompressing the stream. Implementation-wise it is easiest for the crcsync cache client to store the instance non-compressed in its local storage and to return a reconstructed instance in non-compressed format to the http-client. In such an implementation, a gzip content-encoding would for example be transformed into an identity content-encoding. On the other hand, if the crccache should re-apply the original content-encoding, then the crccache server would have to pass this information on to the crccache client in the delta response. How else could the crccache client know which content-encoding to apply?
Note that if the crcsync cache transforms the content-encoding, it should also update or drop, for example, an etag header and an md5 header, because these header types are supposed to be associated with the encoded content; an origin server may use a different etag for a response with identity content-encoding than for one with gzip content-encoding. And the md5 sum is different for obvious reasons. The etag transformation also impacts the classical cache controls, because the etag is used by 'if-match' and related conditional request headers.
From that perspective it would make sense for the crcsync cache client to re-apply the content-encoding of the origin server before handing the reconstructed instance over to the classical cache client, so that the classical cache mechanisms don't get broken. After all, a '304 - not modified' response is less resource-intensive (both network-wise and CPU-wise on the server) than a delta response. On the other hand, this is only an optimization that applies to sites serving semi-static content and properly applying etag-based caching, while it demands needless extra processing power on the crcsync cache client when dealing with dynamic sites for which the classical cache mechanisms are useless.
So this has to be carefully balanced.
Communication between crcsync cache client and server
The rproxy project assumed that the decoding rproxy and the encoding rproxy (terminology used in [1]) would be sitting close to each other, with no intermediate hops in between. Therefore, the rproxy project proposed in [2] to use the hop-by-hop TE/Transfer-Encoding headers to negotiate the usage of the http-rsync protocol.
We could do the same for the http-crcsync protocol if our ambition were limited to deployment scenarios [scen1] and [scen2]. In both cases, the crcsync cache client will be configured explicitly to send all requests to a crcsync cache proxy server, which could indeed be deployed close enough to the client to effectively be the next hop.
However, deployment scenario [scen3] foresees that the origin server/site is http-crcsync aware and that the http-crcsync request and response will be passed all along the chain between the requesting client and the origin server. In such a scenario we must assume that there may be intermediate classical cache proxies in the chain.
This prohibits the usage of a transfer-encoding so we must use a new content-encoding instead, to indicate that the response is a delta response.
Furthermore, such intermediate caches may not be http-crcsync aware, so we have to deal with the issues around caching of delta responses that are further elaborated in [3] (RFC 3229 - Delta encoding in HTTP) and [4] (RFC 3143 - Known HTTP Proxy/Caching Problems).
Neither paper [3] nor paper [4] provides a satisfactory solution to prevent delta responses from being cached by intermediate proxies. Paper [3] proposes to use a new response code (226), assuming that caches will not cache responses with unknown response codes, but paper [4] explains that this is not sufficient; according to the HTTP specification, a cache may cache an unknown response, provided that the response contains cache headers that say it is allowed. Paper [4] advises to update the HTTP specification to indicate that unknown response codes should not be cached, but the HTTP specification has not been updated accordingly, and even if it gets updated, that won't magically fix the existing intermediate caches in the field.
In order to work around this issue, a solution must be found. One possibility is that the delta response contains cache headers that inform classical caches that the response may not be cached at all (e.g. an expiration date in the past, pragma 'no-cache' for HTTP 1.0 caches, cache-control 'no-store' or 'no-cache' for HTTP 1.1 caches), just as is done by dynamic websites that don't want their pages to be cached. This is a proven approach that is compliant with both the HTTP 1.0 and 1.1 specifications.
However, doing so would also remove information about the cacheability of the to-be-reconstructed instance data. So the http-crcsync aware server should put this information in newly defined headers (e.g. x-org-cache-control) in the delta response. The crcsync cache client can then pass this information on to the classical cache client. In order to make life easier for the crccache client, the crccache proxy server used in [scen1] and [scen2] should also use this approach to pass the origin cache headers back to the end client. In [scen1], the crccache client could convert the x-org-cache-control header back into a standard cache-control header when reconstructing the instance from the delta response, so that the classical cache client knows how to cache the instance. In [scen2], assuming that the crcsync and classical cache logic will be tightly integrated in the browser, the information could be used directly by the classical cache logic to determine freshness of the page and to build the conditional requests, as per the usual classical cache protocol.
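The header rewriting described above can be sketched as a pair of transformations. The X-Org-* prefix follows the x-org-cache-control example in the text; the exact header names and which origin headers get shielded are open design choices, not settled protocol.

```python
def shield_delta_response(origin_headers: dict) -> dict:
    """Server side: move the origin's cache headers aside under X-Org-*
    and mark the delta response itself as uncacheable, so intermediate
    classical caches never store the delta."""
    out = {}
    for name, value in origin_headers.items():
        if name.lower() in ("cache-control", "expires", "etag"):
            out["X-Org-" + name] = value
        else:
            out[name] = value
    out["Cache-Control"] = "no-store, no-cache"       # HTTP 1.1 caches
    out["Pragma"] = "no-cache"                        # HTTP 1.0 caches
    out["Expires"] = "Thu, 01 Jan 1970 00:00:00 GMT"  # date in the past
    return out

def restore_origin_headers(delta_headers: dict) -> dict:
    """Client side: after reconstructing the instance, drop the no-cache
    shielding and restore the origin's cache headers for the classical
    cache client."""
    out = {}
    for name, value in delta_headers.items():
        if name.lower() in ("cache-control", "pragma", "expires"):
            continue
        if name.lower().startswith("x-org-"):
            out[name[len("X-Org-"):]] = value
        else:
            out[name] = value
    return out
```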
Summary of open issues
1)How can the crcsync cache client identify a similar resource for a newly requested page (configuration patterns, server-provided hints)?
2)Is a fixed count of 40 blocks optimal, or should the client choose the number of blocks based on the instance size?
3)Should the crcsync cache hide from the http-client that it is uncompressing the stream, or should it re-apply the original content-encoding (and the associated etag/md5 headers) before handing the instance to the classical cache client?
4)How can delta responses be prevented from being cached by intermediate classical caches that are not http-crcsync aware?
http-crcsync protocol proposal
New request headers to request crcsync encoded response
Crcsync Delta Response headers
Crcsync Delta Response body

References
[1] Rproxy project (http://rproxy.samba.org)
[2] HTTP Rsync Protocol (http://rproxy.samba.org/doc/protocol/protocol.html)
[3] RFC 3229 - Delta encoding in HTTP (http://tools.ietf.org/html/rfc3229)
[4] RFC 3143 - Known HTTP Proxy/Caching Problems (http://tools.ietf.org/html/rfc3143)
[5] RFC 2616 - HTTP 1.1 specification (http://tools.ietf.org/html/rfc2616)
[6] Sync standard first draft (http://lists.laptop.org/pipermail/http-crcsync/2009-April/000037.html)

