Archive for the ‘Web’ Category

How cachable is Google (part 1): Google Maps

November 16, 2007

I’m looking at how cachable Google content is, with an eye to making Squid cache some of it better. Contrary to popular belief, a lot of the Google content (that I’ve seen!) is dynamically generated “static” content – images, videos – which could be cached but unfortunately isn’t.

Google Maps works by breaking up the “map” into multiple square tiled images. Any compositing that occurs (eg maps on top of a satellite image) is done by the browser and not dynamically generated by Google.

We’ll take one image URL as an example:

http://kh3.google.com.au/kh?n=404&v=23&t=trtqtt

A few things to notice:

  1. The first part of the hostname – kh3 – can and does change (I’ve seen kh0 through kh3.) As far as I can tell, every tile can be fetched from any of these servers. This is done to increase concurrency in the browser: the Javascript selects one of the four servers for each tile, so the per-host concurrency limit applies to each server (ie, N times the overall concurrency) rather than to just one server. (See the sketch just after this list.)
  2. The query string is a 1:1 mapping between query and tile, regardless of which keyhole server they’re coming from.
  3. The use of a query string negates all possible caching, even though…
  4. .. the CGI returns Expires and Last-Modified headers!
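
As a toy illustration (this is a guess at the mechanism, not Google’s actual code), the client-side server selection boils down to something like the sketch below. If the pick is effectively random per page view, a shared cache ends up seeing the same tile under up to four different hostnames:

# Hypothetical sketch of per-tile keyhole server selection (NOT Google's code).
# A random pick per page view means the same tile can arrive via kh0..kh3,
# which is exactly what hurts a shared cache.
import random

KH_HOSTS = ["kh0", "kh1", "kh2", "kh3"]

def tile_url(tile_id, n=404, v=23):
    host = random.choice(KH_HOSTS)                  # cache-hostile: varies per view
    # host = KH_HOSTS[sum(map(ord, tile_id)) % 4]   # a deterministic pick would at
    #                                               # least pin a tile to one host
    return "http://%s.google.com.au/kh?n=%d&v=%d&t=%s" % (host, n, v, tile_id)

print(tile_url("trtqtt"))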

Now, the reply headers (via a local Squid):

HTTP/1.0 200 OK
Content-Type: image/jpeg
Expires: Sat, 15 Nov 2008 02:44:29 GMT
Last-Modified: Fri, 17 Dec 2004 04:58:08 GMT
Server: Keyhole Server 2.4
Content-Length: 15040
Date: Fri, 16 Nov 2007 02:44:29 GMT
Age: 531
X-Cache: HIT from violet.local
Via: 1.0 violet.local:3128 (squid/2.HEAD-CVS)
Proxy-Connection: close

The server returns a Last-Modified header and an Expires header; but because the URL contains a query identifier (ie, the “?”), plenty of caches – and, I’m guessing, some browsers – will not cache the response, regardless of the actual cachability of the content. See RFC2068 13.9 and RFC2616 13.9. It’s unfortunate, but it’s what we have to deal with.

Finally, assuming the content is cached, it will need to be periodically revalidated via an If-Modified-Since request. Unfortunately the keyhole server doesn’t handle IMS requests correctly, always returning a 200 OK with the entire object body. This means revalidation will always “fail” and the entire object will be transferred again each time.
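
For comparison, here’s roughly what a well-behaved revalidation exchange would look like, reusing the header values from above (illustrative only). The keyhole server instead answers this conditional request with a full 200 OK and the complete body:

GET /kh?n=404&v=23&t=trtqtt HTTP/1.0
Host: kh3.google.com.au
If-Modified-Since: Fri, 17 Dec 2004 04:58:08 GMT

HTTP/1.0 304 Not Modified
Date: Fri, 16 Nov 2007 02:44:29 GMT
Expires: Sat, 15 Nov 2008 02:44:29 GMT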

So how to fix it?

Well, by default (and for historical reasons!) Squid will not cache anything with “cgi-bin” or “?” in the path. That’s for a couple of reasons: firstly, replies from HTTP/1.0 servers with no expiry information shouldn’t be cached if they may have come from a CGI (and URLs with “?” in them generally have); and secondly, intermediate proxies in the path may “hide” the version of the origin server, so you never quite know whether it was HTTP/1.0 or not.

Secondly, since the same content can come from one of four servers:

  • You’ve got a 1 in 4 chance that you’ll get the same Google host for a given tile; and
  • You’ll end up caching the same tile data four times.

I’m working on changes to Squid to work around these shortcomings. Ideally Google could fix the query-string problem themselves by not using query strings at all, instead using URL paths with correct cachability information and proper IMS handling, eg:

http://kh3.google.com.au/kh?n=404&v=23&t=trtqtt

might become:

http://kh3.google.com.au/kh/n=404/v=23/t=trtqtt

That response would be cachable (assuming they didn’t vary the order of the path components!) and browsers/caches would be able to handle it without modification.

I’ve got a refresh pattern to cache that content, but it’s still a work in progress. Here’s an example:

refresh_pattern    ^ftp:        1440    20%    10080
refresh_pattern    ^gopher:     1440     0%     1440
refresh_pattern    cgi-bin         0     0%        0
refresh_pattern    \?              0     0%     4320
refresh_pattern    .               0    20%     4320

I then remove the “cache deny QUERY” line and simply use “cache allow all”; then I use refresh_pattern rules to control which patterns shouldn’t be cachable if no expiry information is given (ie, if a URL with cgi-bin or ? in the path returns expiry information then Squid will cache it.)

[UPDATE: We have now merged the results of Adrian’s work here into Squid-2.7 and 3.1+. The new required refresh_pattern rules are:

refresh_pattern    ^ftp:                 1440    20%    10080
refresh_pattern    ^gopher:              1440     0%     1440
refresh_pattern    -i (/cgi-bin/|\?)        0     0%        0
refresh_pattern    .                        0    20%     4320

hierarchy_stoplist cgi-bin ?

]

It’d also be nice if the keyhole server handled IMS requests correctly!

Secondly, Squid needs to be taught that certain URLs are “equivalent” for the purposes of cache storage and retrieval. I’m working on a patch which will take a URL like this:

http://kh3.google.com.au/kh?n=404&v=23&t=trtqtt

match it against a regular expression, eg:

m/^http:\/\/kh(.*?)\.google\.com(.*?)\/(.*?)$/

and map it to a fixed URL regardless of the keyhole server number, eg:

http://keyhole-server.google.com.au.SQUIDINTERNAL/kh?n=404&v=23&t=trtqtt

The idea, of course, is that no valid URL will ever normally be fetched whose host part ends in .SQUIDINTERNAL, so we can use it as an “internal identifier” for local storage lookups.

This way we can request the tile from any kh server under any country domain, so the following URLs would all be equivalent from the point of view of caching:

http://kh3.google.com.au/kh?n=404&v=23&t=trtqtt

http://kh2.google.com.au/kh?n=404&v=23&t=trtqtt

http://kh0.google.com/kh?n=404&v=23&t=trtqtt

It’s important to note here that the content is still fetched from the requested host; it’s just stored in the cache under a different URL.
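
For the curious, the helper side of that idea might look something like the sketch below. This is only an illustration of the mapping, not the actual patch: it assumes a rewriter-style helper which reads one request per line on stdin (URL in the first field) and writes the URL to store the object under on stdout.

#!/usr/bin/env python
# Illustrative store-URL mapping sketch (not the real patch).
# Reads requests on stdin, URL first; prints the URL the object should be
# stored (and looked up) under, collapsing all kh servers to one name.
import re
import sys

KH_RE = re.compile(r'^http://kh\d+\.google\.com(?:\.[a-z]{2})?/(.*)$')

for line in sys.stdin:
    fields = line.split()
    if not fields:
        continue
    url = fields[0]
    m = KH_RE.match(url)
    if m:
        # Canonical internal URL: identical for every kh host and country domain.
        print("http://keyhole-server.google.com.au.SQUIDINTERNAL/" + m.group(1))
    else:
        print(url)          # anything else: store under the original URL
    sys.stdout.flush()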

I’ll next talk about caching Google Images and finally how to cache Youtube.

Why even bother making cachable content?

September 8, 2007

I see so many sites pop up in Squid logs which seem to go out of their way to defeat any attempt at caching. I’m not sure why, but I’m going to try and cover a few points here.

  1. I want to know exactly how many bits I’m shipping! This is especially prevalent in the American internet scene. Everyone’s about shipping bits. The more bits you ship the “better” you are. (There’s some talk about the “number of prefixes you advertise” also being linked to how “big” your network is; or maybe people are just lazy about aggregating their BGP announcements. I digress.) Sure, if you graph your outbound links this is true. But you can do HTTP tricks to know exactly how many requests you’re handling without shipping the whole object out. Just mark the objects “must revalidate” rather than immediately expired, so the web cache always revalidates via an If-Modified-Since request (there’s an example set of headers just after this list). You’ll get the IMS and can send back a “not modified” reply; you can then synthesise a graph based on what you -would- be serving. Voila, free bits. This can be quite substantial if you have lots and lots of images on your site.
  2. I want to know how many people are accessing my site! This is definitely a left-over from the 90s and even then the problem was solved. If you absolutely positively need to know about page impressions then just embed a non-cachable 1×1 transparent gif somewhere where it won’t slow the page rendering down. Leave the rest of the site cachable. Really though, these days people should just use javascript and cookies (a la the Google “urchin”) if they want accurate “people” and “impression” counts. Trying to do it based on page accesses and unique IPs just isn’t going to cut it.
  3. I don’t want people to cache the data; they have to login first! You can tell proxy caches that they must first revalidate the authentication information from the origin server before serving out content. You can have your cake and eat it too.
  4. Making my content cachable is too damned hard! How do I know which headers to send, when and where? It’s not all that difficult. Mark Nottingham’s Caching Tutorial covers a lot of useful information about building cachable websites. You can keep control of your authenticated content and push out more content than you’re actually buying transit for.
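
To make the first point concrete, the response headers for a “counted but cachable” object might look something like this (values are illustrative). The object goes stale immediately, so the cache revalidates on every use, the origin sees every request as an If-Modified-Since, and answers with a tiny 304 instead of the full body:

HTTP/1.1 200 OK
Content-Type: image/jpeg
Last-Modified: Sat, 01 Sep 2007 00:00:00 GMT
Cache-Control: public, max-age=0, must-revalidate

The third point works much the same way: “must-revalidate” (or “proxy-revalidate”) on authenticated responses forces the cache back to your server – where the login check lives – before the object can be served out again.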

Just remember a few simple rules:

  • Don’t hide static content behind query URLs (ie, stuff with a ‘?’ in them). Caches won’t cache them (unless, of course, they’re built by me. But then, I am pretty evil.) I see plenty of websites which hide all of their images and flash videos behind a CGI script with a ? in the path – caches just won’t bother trying to cache it, even though the content itself is entirely static. Just imagine what it’d be like to be able to push five or ten times the amount of content to clients behind proxy caches.
  • Don’t be afraid to ask for help optimising your site for forward caching. Heck, even asking on the squid-users mailing list will probably get you sorted out without too much trouble.
  • There are people behind proxy caches – the developing world for one, but there are plenty of caches to be found in schools, offices, wired buildings, wireless mesh networks and the like. Bandwidth isn’t free and never will be. You might be able to buy a 40gbit pipe to your favourite transit provider in North America, but that won’t help people in South Africa or Australia, where international bandwidth is still expensive and will remain so for the foreseeable future. And yes, we like watching Youtube as much as the next person.

Squid-2.6.STABLE16 is out!

September 6, 2007

Henrik has released Squid-2.6.STABLE16. This resolves a number of bugs, including a crash bug introduced in Squid-2.6.STABLE15.

The changeset list explains what’s changed; the release page includes downloads and other useful stuff. Don’t forget to read the release notes if you’re updating from 2.5 to 2.6!

And don’t forget the Squid-2.6 Configuration Manager!

Reverse Proxying with Squid

September 3, 2007

A Squid user posted about their little “CDN” installation to speed up their content delivery to the clients of a particular ISP.

You can read more about it here.

Blocking Ads in Squid

August 29, 2007

One of the more bandwidth-intensive “features” of the Web is the proliferation of ad images and flash media, which have a nasty habit of wasting bandwidth and increasing page load times.

Squid has been able to filter ads and other unwanted media for a number of years. Various articles have been written covering exactly how it’s done, so I won’t repeat the how-to here.

The original method involved the “redirector”. A redirector was simply an external program which would read in URLs on STDIN and spit out “alternate” URLs on STDOUT. This could be used for a number of things – the initial use being to rewrite URLs when using Squid as a web server accelerator – but people quickly realised they could rewrite “ad” URLs to filter them out.
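
As a rough sketch (the exact input format varies between Squid versions, so treat this as illustrative only), a minimal ad-stripping redirector might look like this: anything matching an ad pattern gets rewritten to a local blank image, everything else is passed through untouched.

#!/usr/bin/env python
# Minimal redirector sketch: one request per line on stdin (URL in the first
# field), the URL to actually fetch written back on stdout.
import re
import sys

AD_RE = re.compile(r'(^http://ads?\.|/banners?/|doubleclick\.net)', re.IGNORECASE)
BLANK = "http://local.example/blank.gif"    # hypothetical local 1x1 gif

for line in sys.stdin:
    fields = line.split()
    if not fields:
        continue
    url = fields[0]
    print(BLANK if AD_RE.search(url) else url)
    sys.stdout.flush()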

Another method is to build a text file of identified ad content URLs and hostnames and simply deny the traffic, as shown below. This is simple, but it can scale poorly if you end up filtering thousands of URLs via regular expression matches.
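
In squid.conf terms that’s just an ACL and a deny rule, something along these lines (the file names are only examples):

# one ad server hostname per line
acl ads dstdomain "/etc/squid/ad-hosts.txt"
# regex lists work too, but scale worse than plain hostname lists
acl ad_urls url_regex -i "/etc/squid/ad-urls.regex"
http_access deny ads
http_access deny ad_urls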

Finally, another method involves the more recent “external ACL” helper. This is an external program which can be passed a variety of information about a request (URL, client IP, authenticated username, arbitrary HTTP headers and ident, to name a few – it’s very customizable!) and which answers YES or NO, with an optional message. Content can then be filtered by simply denying access to it, though it currently doesn’t let you return modified content. One of the most popular uses of the external ACL helper is actually to implement ACL groups from sources like LDAP/Windows Active Directory.
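
A skeleton configuration for the external ACL approach looks something like this (the helper name and path are examples only, and the exact format codes vary a little between Squid versions). The helper is handed whatever the format specifies and answers OK or ERR for each lookup:

external_acl_type ad_check %URI /usr/local/bin/ad-check-helper
acl ads external ad_check
http_access deny ads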

How you do it is up to you. Here are a few links explaining what’s involved.

Proxying with Squid: A User’s Perspective

July 17, 2007

Someone pointed me over to sial.org where the author wrote up a quick Howto for various Squid tasks – basic refresh_patterns for controlling cacheability of files, filetypes and web URLs; remote refreshing; performance review; and an example reverse accelerator setup.

I think it’s a nice high-level introduction to using Squid as a website accelerator.

New website is up!

May 15, 2007

The new website is up at http://www.squid-cache.org/ . Please report issues via the Squid Bugzilla. (Obviously, feel free to email us or comment here if the website is so broken you can’t use the Bugzilla..)

Request for some help: CSS template magic!

May 10, 2007

I’m not really a “web developer” and although I can drive style sheets, I’m not a “CSS” hacker. So here’s my first request: could someone please give me a hand adapting the CSS from the new Squid site into a WordPress-happy theme? I’d love to have the Squid blog(s) themed similarly to the website.

Oh, and if someone could give Kinkie a hand adapting the new CSS to the wiki (http://wiki.squid-cache.org/) we’d be forever grateful.

The New Squid Website is Almost Done..

May 10, 2007

The new Squid website (http://new.squid-cache.org/) is almost done. The list of things left to implement is now relatively short:

  • Fix up some of the site grammar (thanks to Chris Nighswonger and Martin Brooks)
  • Some link fixes – /Advisories/ needs to be re-created and populated; the visolve link needs updating
  • The dynamic pages need to return sane expiry and validator information so they can be cached. Having non-cached pages on a website about a proxy product is a bit hypocritical!
  • Please feel free to add any further comments about the site to this post; I’ll aim to update the site during the weekend.
