How cachable is google (part 2) – Youtube content

November 17, 2007

Youtube is one of the banes of small-upstream network administrators. The flash files are megabytes in size, and a popular video can be downloaded by half the people in an office or student residential college in one afternoon.

It is, at the present time, very difficult to cache. Let’s see why.

There are actually two different methods that I’ve seen employed to serve the flash media files. The first method involves fetching from youtube.com servers; the second involves fetching from IP addresses in Google IP space.

The first method is very simple. The URL form is:

http://XXX-YYY.XXX.youtube.com/get_video?video_id=VIDEO_ID

XXX is the POP name; YYY is, I’m guessing, either a server or a cluster name.

This is pretty standard stuff, and If-Modified-Since requests seem to be handled badly too! The query string “?” in the URL makes it uncachable to Squid by default, even though it’s a flash video. It’s probably not going to change very often.

The second method involves a bit more work. First the video is requested from a Google server. This server then issues an HTTP 302 reply pointing the client at a changing IP address. The redirected request looks somewhat like this:

http://74.125.15.83/get_video?video_id=HrLFb47QHi0&origin=dal-v37.dal.youtube.com

Again, the “?” query string. Again, the origin, but this time it’s encoded in the URL. Finally, not only are If-Modified-Since requests not handled correctly, the replies include ETags, and requests with an If-None-Match revalidation still return the whole object! Aiee!

So how to cache it?

Firstly, you have to try and cache replies for URLs containing a “?”. It would be nice if they handled If-Modified-Since and If-None-Match requests correctly when the object hasn’t been modified: revalidation is cheap and it’s basically free bandwidth. They could set the revalidation to be, say, after even 30 minutes. They’re already handling all the full requests for all the content, so the request rate would stay the same but the bandwidth requirements should drop.

The URLs also have to be rewritten, much like they do to cache Google Maps content. The “canonical” form URL will then reference a “video” regardless of which server the client is asking.
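
To make the rewriting idea concrete, here is a minimal Python sketch that maps both URL forms shown above onto one canonical key. It is only an illustration, not the Squid code: the host name video.youtube.com.SQUIDINTERNAL and the function name are made up, and I’m assuming that the video_id parameter alone identifies the video.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical internal host, used only as a cache key (never fetched).
CANONICAL_HOST = "video.youtube.com.SQUIDINTERNAL"


def canonical_video_url(url):
    """Return a server-independent key for a get_video URL, else the URL unchanged."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    video_ids = query.get("video_id")
    if parts.path != "/get_video" or not video_ids:
        return url
    host = parts.hostname or ""
    origin = query.get("origin", [""])[0]
    # Method one names a youtube.com host directly; method two uses a bare IP
    # address and carries the youtube.com origin in the query string.
    if not (host.endswith(".youtube.com") or origin.endswith(".youtube.com")):
        return url
    # Assumption: video_id alone identifies the content.
    return "http://%s/get_video?video_id=%s" % (CANONICAL_HOST, video_ids[0])


if __name__ == "__main__":
    print(canonical_video_url(
        "http://74.125.15.83/get_video?video_id=HrLFb47QHi0"
        "&origin=dal-v37.dal.youtube.com"))
```

The content would still be fetched from whichever server the client asked for; only the key used for storage and lookup changes.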

Now, how do you do this in Squid? I’ve got some beta code to do this and it’s in the Squid-2 development tree. Take a look here for some background information. It works around the multiple-URL-referencing-same-file problem but unfortunately it won’t work around their broken HTTP/1.1 validation code. If they fixed that then Youtube might become something which network administrators stop asking to filter.

(ObNote: the second method uses lighttpd as the serving software, and it replies with an HTTP/1.1 reply regardless of whether the request was HTTP/1.0 or HTTP/1.1. Grr!)

How cachable is google (part 1): Google Maps

November 16, 2007

I’m looking at how cachable Google content is with an eye to making Squid cache some of it better. Contrary to popular belief, a lot of the Google content (that I’ve seen!) is dynamically generated “static” content (images, videos) which could be cached but unfortunately isn’t.

Google Maps works by breaking up the “map” into multiple square tiled images. The various compositing that occurs (eg maps on top of a satellite image) is done by the browser and not dynamically generated by Google.

We’ll take one image URL as an example:

http://kh3.google.com.au/kh?n=404&v=23&t=trtqtt

A few things to notice:

  1. The first part of the hostname – kh3 – can and does change (I’ve seen kh0 through kh3.) All the tiles, as far as I can tell, can be fetched from each of these servers. This is done to increase concurrency in the browser: the Javascript will select one of four servers for each tile so the concurrency limit is reached for multiple servers (ie, N times the concurrency limit) rather than just for one server.
  2. The query string is a 1:1 mapping between query and tile, regardless of which keyhole server they’re coming from.
  3. The use of a query string negates all possible caching, even though…
  4. .. the CGI returns Expires and Last-Modified headers!

Now, the reply headers (via a local Squid):

HTTP/1.0 200 OK
Content-Type: image/jpeg
Expires: Sat, 15 Nov 2008 02:44:29 GMT
Last-Modified: Fri, 17 Dec 2004 04:58:08 GMT
Server: Keyhole Server 2.4
Content-Length: 15040
Date: Fri, 16 Nov 2007 02:44:29 GMT
Age: 531
X-Cache: HIT from violet.local
Via: 1.0 violet.local:3128 (squid/2.HEAD-CVS)
Proxy-Connection: close

The server returns a Last-Modified header and an Expires header, but as it has a query identifier in the URL (ie, the “?”), plenty of caches and, I’m guessing, some browsers will not cache the response, regardless of the actual cachability of the content. See RFC2068 13.9 and RFC2616 13.9. It’s unfortunate, but it’s what we have to deal with.
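
To put a number on how cachable that reply claims to be, here is a small Python sketch that works out the freshness lifetime from the Date, Expires and Age headers shown above. It is a simplification: it treats the Age header as the object’s current age and ignores the rest of the HTTP age calculation.

```python
from email.utils import parsedate_to_datetime

# Header values copied from the reply above.
headers = {
    "Date": "Fri, 16 Nov 2007 02:44:29 GMT",
    "Expires": "Sat, 15 Nov 2008 02:44:29 GMT",
    "Age": "531",
}


def freshness(headers):
    """Return (freshness lifetime, remaining freshness), both in seconds."""
    date = parsedate_to_datetime(headers["Date"])
    expires = parsedate_to_datetime(headers["Expires"])
    lifetime = int((expires - date).total_seconds())
    # Simplification: take the Age header as the object's current age.
    age = int(headers.get("Age", "0"))
    return lifetime, lifetime - age


lifetime, remaining = freshness(headers)
print(lifetime // 86400, "days fresh;", remaining, "seconds left")
```

For these headers that comes out as a freshness lifetime of 365 days, so the tile is marked as good for a full year even though the “?” in the URL stops most caches from keeping it.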

Finally, assuming the content is cached, it will need to be periodically revalidated via an If-Modified-Since request. Unfortunately the keyhole server doesn’t handle IMSes correctly, always returning a 200 OK with the entire object body. This means that revalidation will always fail and the entire object will be fetched in the reply.
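
For comparison, this is roughly the decision a server that handled IMS correctly would make. It is a minimal Python sketch of the logic only, with a made-up function name; it says nothing about how the keyhole server is actually built.

```python
from email.utils import parsedate_to_datetime


def conditional_status(last_modified, if_modified_since):
    """Return the HTTP status a well-behaved origin would send for a GET."""
    if if_modified_since is None:
        return 200  # unconditional request: send the whole object
    try:
        modified = parsedate_to_datetime(last_modified)
        since = parsedate_to_datetime(if_modified_since)
        # Not modified since the client's copy: headers only, no body.
        return 304 if modified <= since else 200
    except (TypeError, ValueError):
        return 200  # unparseable or incomparable dates: ignore the condition


if __name__ == "__main__":
    print(conditional_status("Fri, 17 Dec 2004 04:58:08 GMT",
                             "Fri, 17 Dec 2004 04:58:08 GMT"))
```

A 304 reply carries no body, which is where the bandwidth saving comes from.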

So how to fix it?

Well, by default (and for historical reasons!) Squid will not cache anything with “cgi-bin” or “?” in the path. That’s for a couple of reasons: firstly, replies from HTTP/1.0 servers with no expiry information shouldn’t be cached if they may come from a CGI (and URLs with a “?” generally do); and secondly, intermediate proxies in the path may “hide” the version of the origin server, so you never quite know whether it was HTTP/1.0 or not.

Secondly, since the same content can come from one of four servers:

  • You’ve got a 1 in 4 chance that you’ll get the same google host for the given tile; and
  • You’ll end up caching the same tile data four times.

I’m working on Squid to work around these shortcomings. Ideally Google could fix the first one by not using query-strings but instead using URL paths with correct cachability information and handling IMS, eg:

http://kh3.google.com.au/kh?n=404&v=23&t=trtqtt

might become:

http://kh3.google.com.au/kh/n=404/v=23/t=trtqtt

That response would be cachable (assuming that they didn’t vary the order of the query parameters!) and browsers/caches would be able to handle that without modification.

I’ve got a refresh pattern to cache that content but it’s still a work in progress. Here’s an example:

refresh_pattern    ^ftp:       1440  20%  10080
refresh_pattern    ^gopher:    1440   0%   1440
refresh_pattern    cgi-bin        0   0%      0
refresh_pattern    \?             0   0%   4320
refresh_pattern    .              0  20%   4320

I then remove the “cache deny QUERY” line and simply use a cache allow all; then I use refresh_pattern lines to pick out which URLs shouldn’t be cached if no expiry information is given (ie, if a URL with cgi-bin or ? in the path returns expiry information then Squid will cache it.)
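
Here is a rough Python model of how I read those rules: the first pattern that matches the URL supplies the min, percent and max values, and those only matter when the reply has no explicit expiry. It is a simplified sketch and not Squid’s actual refresh code; the function names and the example.com URL are made up for illustration.

```python
import re

# (regex, min, percent, max) in config order; min and max are in minutes.
REFRESH_PATTERNS = [
    (re.compile(r"^ftp:"), 1440, 20, 10080),
    (re.compile(r"^gopher:"), 1440, 0, 1440),
    (re.compile(r"cgi-bin"), 0, 0, 0),
    (re.compile(r"\?"), 0, 0, 4320),
    (re.compile(r"."), 0, 20, 4320),
]


def match_pattern(url):
    """Return (min, percent, max) from the first pattern that matches the URL."""
    for regex, min_m, percent, max_m in REFRESH_PATTERNS:
        if regex.search(url):
            return min_m, percent, max_m
    return None


def is_fresh(url, age, lm_age=None, expires_in=None):
    """Simplified freshness test; every time value is in minutes.

    age        -- time since the object was fetched
    lm_age     -- Date minus Last-Modified at fetch time, if known
    expires_in -- time left until an explicit Expires, if the reply had one
    """
    if expires_in is not None:
        return expires_in > 0  # explicit expiry wins over the patterns
    rule = match_pattern(url)
    if rule is None:
        return False
    min_m, percent, max_m = rule
    if age > max_m:
        return False
    if lm_age is not None and lm_age > 0:
        # Heuristic freshness: a fraction of the time since last modification.
        return age / lm_age < percent / 100
    return age < min_m


if __name__ == "__main__":
    tile = "http://kh3.google.com.au/kh?n=404&v=23&t=trtqtt"
    print(is_fresh(tile, age=9, expires_in=525600))        # reply had Expires
    print(is_fresh(tile, age=9, lm_age=1000))              # no expiry info
    print(is_fresh("http://example.com/logo.png", age=9, lm_age=1000))
```

With a percent and min of zero, a “?” URL that returns no expiry information is never considered fresh in this model, while the same URL with an Expires header is.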

[UPDATE: We have now merged the results of Adrian’s work here into Squid-2.7 and 3.1+. The refresh_pattern lines now required are:

refresh_pattern    ^ftp:                1440  20%  10080
refresh_pattern    ^gopher:             1440   0%   1440
refresh_pattern    -i (/cgi-bin/|\?)       0   0%      0
refresh_pattern    .                       0  20%   4320

hierarchy_stoplist cgi-bin ?

]

It’d then be nice if Google’s keyhole server handled IMS requests correctly!

Secondly, Squid needs to be taught that certain URLs are “equivalent” for the purposes of cache storage and retrieval. I’m working on a patch which will take a URL like this:

http://kh3.google.com.au/kh?n=404&v=23&t=trtqtt

match on the URL via a regular expression, eg:

m/^http:\/\/kh(.*?)\.google\.com(.*?)\/(.*?)$/

and map that to a fixed URL regardless of the keyhole server number, eg:

http://keyhole-server.google.com.au.SQUIDINTERNAL/kh?n=404&v=23&t=trtqtt

The idea, of course, is that there won’t ever be a valid URL normally fetched whose host part ends in .SQUIDINTERNAL and thus we can use it as an “internal identifier” for local storage lookups.

This way we can then request the tile from any kh server under any country domain, so the following URLs would be equivalent from the point of view of caching:

http://kh3.google.com.au/kh?n=404&v=23&t=trtqtt
http://kh2.google.com.au/kh?n=404&v=23&t=trtqtt
http://kh0.google.com/kh?n=404&v=23&t=trtqtt

It’s important to note here that the content is still fetched from the requested host; it’s just stored in the cache under a different URL.
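
The mapping can be sketched in a few lines of Python using the same regular expression. This is an illustration rather than the patch itself, and it makes one assumption of its own: it drops the country suffix from the stored host name so that all three URLs above end up under a single key.

```python
import re

# The regular expression from above, in Python syntax.
KEYHOLE_RE = re.compile(r"^http://kh(.*?)\.google\.com(.*?)/(.*?)$")

# Assumption: the country suffix is dropped so every country shares one key.
STORE_HOST = "keyhole-server.google.com.SQUIDINTERNAL"


def store_url(url):
    """Return the URL to store the object under; non-keyhole URLs pass through."""
    match = KEYHOLE_RE.match(url)
    if match is None:
        return url
    # Group 1 is the server number and group 2 the country suffix; both are
    # discarded. Group 3 is the path and query string, which names the tile.
    return "http://%s/%s" % (STORE_HOST, match.group(3))


if __name__ == "__main__":
    for url in ("http://kh3.google.com.au/kh?n=404&v=23&t=trtqtt",
                "http://kh2.google.com.au/kh?n=404&v=23&t=trtqtt",
                "http://kh0.google.com/kh?n=404&v=23&t=trtqtt"):
        print(store_url(url))
```

Only the storage key changes here; the request itself would still go to whichever kh host the browser picked.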

I’ll next talk about caching Google Images and finally how to cache Youtube.