Another proxy idea - web proxy logs from school servers?

Sat Dec 8 06:32:04 EST 2007

Ian,

 > A somewhat relevant idea.
 > 
 > http://wiki.openmoko.org/wiki/Server:WebProxy
 > 
 > You can get really quite large benefits from a local cache.

The problem with this approach is this line in the description:

 "then informing the proxy of which version of the page it has"

That means the proxy needs to keep all old versions of every page for
any client that might need them in order to take the diff. The key
trick with rproxy is that it avoids the need to keep all the old
versions.

The way rproxy works is this:

 - the laptop talks to the school server proxy

 - the school server proxy talks to a well connected upstream proxy

 - the well connected upstream proxy talks to the world

The school server proxy keeps one old version for any URL. When a
request comes in for a new copy of that page, it sends the request to
the upstream proxy, tagged with approximately 100 bytes of rsync-style
rolling block hashes of the old page it has. The upstream proxy then
fetches the current copy of the page, and can use just the new page
plus the 100 bytes to calculate a binary diff between the old and new
page (the diff algorithm doesn't need access to the old page).

The diff is sent to the school server, which applies it to the old
page to generate the new page. A strong checksum is also calculated
and checked.
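
As a rough sketch of that exchange (this is not rproxy's actual code or
wire format; the block size, hash widths, SHA-256 and all function
names are assumptions made up for illustration), the three steps could
look like this in Python:

```python
import hashlib

BLOCK = 64  # hypothetical block size, chosen only for this sketch


def weak(block: bytes) -> int:
    """Cheap rsync-style checksum of one block (two 16-bit sums)."""
    a = sum(block) & 0xFFFF
    b = sum((len(block) - i) * x for i, x in enumerate(block)) & 0xFFFF
    return (b << 16) | a


def strong(block: bytes) -> bytes:
    """Truncated strong hash used to confirm a weak-checksum hit."""
    return hashlib.sha256(block).digest()[:8]


def signature(old: bytes) -> dict:
    """School server: hash each full block of the cached old page."""
    sig = {}
    for off in range(0, len(old) - BLOCK + 1, BLOCK):
        blk = old[off:off + BLOCK]
        sig.setdefault(weak(blk), []).append((strong(blk), off // BLOCK))
    return sig


def delta(sig: dict, new: bytes):
    """Upstream proxy: build a diff from the signature and the new page.

    The old page itself is never consulted here.
    """
    ops, lit, i = [], bytearray(), 0
    while i + BLOCK <= len(new):
        blk = new[i:i + BLOCK]
        # For brevity the weak checksum is recomputed at every offset;
        # a real implementation would roll it forward one byte at a time.
        candidates = sig.get(weak(blk), [])
        match = next((n for s, n in candidates if s == strong(blk)), None)
        if match is None:
            lit.append(new[i])
            i += 1
            continue
        if lit:
            ops.append(("lit", bytes(lit)))
            lit = bytearray()
        ops.append(("copy", match))
        i += BLOCK
    lit.extend(new[i:])
    if lit:
        ops.append(("lit", bytes(lit)))
    return ops, hashlib.sha256(new).hexdigest()


def patch(old: bytes, ops: list, checksum: str) -> bytes:
    """School server: rebuild the new page and verify the strong checksum."""
    out = bytearray()
    for kind, arg in ops:
        out += old[arg * BLOCK:(arg + 1) * BLOCK] if kind == "copy" else arg
    if hashlib.sha256(out).hexdigest() != checksum:
        raise ValueError("checksum mismatch, refetch the full page")
    return bytes(out)
```

The school server would call signature() on its cached copy and send
the result upstream, the upstream proxy would call delta() on the
freshly fetched page, and the school server would finish with patch().
Only the "lit" bytes and the small "copy" references cross the slow
link.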

The result is:

 - no extra storage requirements on upstream server

 - store one copy of the page (with the usual LRU cache stuff) on the
   school server

 - no changes on the laptop

 - reduced bandwidth usage over the link from the school to the
   upstream server

Another neat feature is that it also works for different URLs on the
same server. If the school server doesn't find the exact URL in its
cache, it can send the block hashes of the page cached for the best
matching URL it has. So for example if someone has previously visited:

  http://some.site/foo?user=fred

then someone else visits:

  http://some.site/foo?user=mary

then there is often a lot of common data between the two pages. That
common data won't go over the school's internet link. If there isn't
any common data then you pay the 100 byte price, but nothing more. You
don't even pay the price of calculating the rolling hashes, as that
can be done once while reading the page originally, in parallel with
the memcpy (and at extremely low cost).
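
How "best matching" is decided isn't spelled out above. One simple
possibility, purely as an assumption for illustration, is to pick the
cached URL on the same host that shares the longest leading string
with the requested one. The best_match() helper below is hypothetical:

```python
from os.path import commonprefix
from urllib.parse import urlsplit


def best_match(url, cached_urls):
    """Return the cached URL on the same host with the longest shared prefix.

    Returns None when nothing from that host is cached. This heuristic is
    an assumption for the sketch, not rproxy's actual selection rule.
    """
    target = urlsplit(url)
    best, best_len = None, -1
    for candidate in cached_urls:
        parts = urlsplit(candidate)
        if (parts.scheme, parts.netloc) != (target.scheme, target.netloc):
            continue  # only consider pages from the same server
        shared = len(commonprefix([candidate, url]))
        if shared > best_len:
            best, best_len = candidate, shared
    return best
```

With http://some.site/foo?user=fred in the cache, a request for
http://some.site/foo?user=mary would select fred's page as the basis,
and its block hashes would go upstream with the request.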

The diffs aren't as small as you get with a real local diff, but the
savings on disk usage make it worthwhile.

Cheers, Tridge