How cachable is Google (part 1): Google Maps

November 16, 2007 by Adrian Chadd

I’m looking at how cachable Google content is with an eye to making Squid cache some of it better. Contrary to popular belief, a lot of the Google content (that I’ve seen!) is dynamically generated “static” content - images, videos - which could be cached but unfortunately isn’t.

Google Maps works by breaking up the “map” into multiple square tiled images. Any compositing that occurs (eg maps on top of a satellite image) is done by the browser rather than dynamically generated by Google.

We’ll take one image URL as an example:

http://kh3.google.com.au/kh?n=404&v=23&t=trtqtt

A few things to notice:

  1. The first part of the hostname - kh3 - can and does change (I’ve seen kh0 through kh3.) As far as I can tell, all the tiles can be fetched from each of these servers. This is done to increase concurrency in the browser: the Javascript will select one of four servers for each tile so the per-server concurrency limit applies to each of them (ie, N times the concurrency limit) rather than to just one server.
  2. The query string is a 1:1 mapping between query and tile, regardless of which keyhole server they’re coming from.
  3. The use of a query string negates all possible caching, even though…
  4. .. the CGI returns Expires and Last-Modified headers!
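
To make points 1 and 2 concrete, here is a rough Python sketch. The real selection rule in the Google Maps Javascript isn't known to me; the round-robin choice and the function name tile_url are assumptions for illustration only.

```python
from itertools import cycle

# kh0..kh3 are the hostnames seen above; round-robin selection is an assumption.
KEYHOLE_HOSTS = [f"kh{i}.google.com.au" for i in range(4)]
_next_host = cycle(KEYHOLE_HOSTS)


def tile_url(tile: str, n: int = 404, v: int = 23) -> str:
    """Build a tile URL on the next host; the query string alone names the tile."""
    return f"http://{next(_next_host)}/kh?n={n}&v={v}&t={tile}"


urls = [tile_url("trtqtt") for _ in range(4)]
hosts = {u.split("/")[2] for u in urls}        # the hostnames that were used
queries = {u.split("?", 1)[1] for u in urls}   # the query strings that were used
```

Four requests for the same tile land on four different hostnames while carrying one identical query string, which is exactly what confuses a cache that keys on the full URL.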

Now, the reply headers (via a local Squid):

HTTP/1.0 200 OK
Content-Type: image/jpeg
Expires: Sat, 15 Nov 2008 02:44:29 GMT
Last-Modified: Fri, 17 Dec 2004 04:58:08 GMT
Server: Keyhole Server 2.4
Content-Length: 15040
Date: Fri, 16 Nov 2007 02:44:29 GMT
Age: 531
X-Cache: HIT from violet.local
Via: 1.0 violet.local:3128 (squid/2.HEAD-CVS)
Proxy-Connection: close

The server returns Last-Modified and Expires headers, but because the URL contains a query identifier (ie, the “?”) plenty of caches, and I’m guessing some browsers, will not cache the response regardless of the actual cachability of the content. See RFC2068 13.9 and RFC2616 13.9. It’s unfortunate, but it’s what we have to deal with.
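
As a worked example of how much freshness those headers actually grant, here is a small Python sketch. It is a simplification (freshness lifetime taken as Expires minus Date, less the reported Age) and not the full age calculation from the RFC; the function name freshness is made up for illustration.

```python
from email.utils import parsedate_to_datetime


def freshness(headers: dict) -> tuple:
    """Return (freshness lifetime, time left) in seconds from reply headers."""
    date = parsedate_to_datetime(headers["Date"])
    expires = parsedate_to_datetime(headers["Expires"])
    lifetime = int((expires - date).total_seconds())
    age = int(headers.get("Age", "0"))
    return lifetime, lifetime - age


reply = {
    "Date": "Fri, 16 Nov 2007 02:44:29 GMT",
    "Expires": "Sat, 15 Nov 2008 02:44:29 GMT",
    "Age": "531",
}
lifetime, remaining = freshness(reply)
# lifetime is 31536000 seconds, ie 365 days
```

So the tile is declared fresh for a full year, which makes the refusal to cache it all the more wasteful.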

Finally, assuming the content is cached, it will need to be periodically revalidated via an If-Modified-Since request. Unfortunately the keyhole server doesn’t handle IMSes correctly, always returning a 200 OK with the entire object body. This means that revalidation will always fail and the entire object will be fetched in the reply.
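
For reference, this is roughly what correct If-Modified-Since handling looks like on the origin side. It is a minimal Python sketch with a hypothetical function name, not the keyhole server's code: if the object hasn't changed since the date the cache supplied, answer 304 with no body.

```python
from email.utils import parsedate_to_datetime


def revalidate(last_modified: str, if_modified_since=None) -> int:
    """Return the status an origin would ideally send for a (conditional) GET."""
    if if_modified_since is None:
        return 200                      # plain GET: send the whole object
    try:
        ims = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200                      # unparseable date: ignore the condition
    if parsedate_to_datetime(last_modified) <= ims:
        return 304                      # unchanged: headers only, no body
    return 200


LAST_MODIFIED = "Fri, 17 Dec 2004 04:58:08 GMT"
status = revalidate(LAST_MODIFIED, "Fri, 17 Dec 2004 04:58:08 GMT")
```

With the Last-Modified value from the reply above, a cache replaying that same date would get a 304 instead of another 15040 byte image.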

So how to fix it?

Well, by default (and for historical reasons!) Squid will not cache anything with “cgi-bin” or “?” in the path. That’s for a couple of reasons: replies from HTTP/1.0 servers with no expiry information shouldn’t be cached if they may come from a CGI (and “?”’s generally do); and intermediate proxies in the path may “hide” the version of the origin server, so you never quite know whether it was HTTP/1.0 or not.

Secondly, since the same content can come from one of four servers:

  • You’ve got a 1 in 4 chance that you’ll get the same google host for the given tile; and
  • You’ll end up caching the same tile data four times.

I’m working on Squid to work around these shortcomings. Ideally Google could fix the first one by not using query strings but instead using URL paths with correct cachability information and handling IMS, eg:

http://kh3.google.com.au/kh?n=404&v=23&t=trtqtt

might become:

http://kh3.google.com.au/kh/n=404/v=23/t=trtqtt

That response would be cachable (assuming that they didn’t vary the order of the query parameters!) and browsers/caches would be able to handle that without modification.
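
The mapping is mechanical. A short Python sketch, assuming a hypothetical helper named query_to_path; the optional sorting shows one way to sidestep the parameter-order problem mentioned above.

```python
from urllib.parse import parse_qsl, urlsplit, urlunsplit


def query_to_path(url: str, sort_params: bool = False) -> str:
    """Fold each query parameter into the path as a /key=value segment."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query)
    if sort_params:
        params.sort()                   # one canonical order for any input order
    path = parts.path.rstrip("/") + "".join(f"/{k}={v}" for k, v in params)
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


as_path = query_to_path("http://kh3.google.com.au/kh?n=404&v=23&t=trtqtt")
a = query_to_path("http://kh3.google.com.au/kh?n=404&v=23&t=trtqtt", sort_params=True)
b = query_to_path("http://kh3.google.com.au/kh?t=trtqtt&v=23&n=404", sort_params=True)
```

Without sorting the result matches the path form shown above; with sorting, two requests that differ only in parameter order collapse onto the same URL.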

I’ve got a refresh pattern to cache that content but it’s still a work in progress. Here’s an example:

refresh_pattern    ^ftp:      1440    20%    10080
refresh_pattern    ^gopher:   1440    0%     1440
refresh_pattern    cgi-bin    0       0%     0
refresh_pattern    \?         0       0%     4320
refresh_pattern    .          0       20%    4320

I then remove the “cache deny QUERY” line and simply use “cache allow all”; the refresh_pattern lines then decide which URLs shouldn’t be cached when no expiry information is given (ie - if a URL with cgi-bin or ? in the path returns expiry information then Squid will cache it.)

It’d then be nice if the Google keyhole server handled IMS requests correctly!
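
The effect of those rules can be sketched in a few lines of Python. This is a deliberately simplified model, not Squid's actual refresh code: it assumes the first matching pattern is used, that explicit expiry information overrides the rule, and that min/percent/max only matter otherwise. The function name and arguments are made up for illustration.

```python
import re

# (pattern, min minutes, percent, max minutes), mirroring the lines above.
RULES = [
    (re.compile(r"^ftp:"), 1440, 0.20, 10080),
    (re.compile(r"^gopher:"), 1440, 0.0, 1440),
    (re.compile(r"cgi-bin"), 0, 0.0, 0),
    (re.compile(r"\?"), 0, 0.0, 4320),
    (re.compile(r"."), 0, 0.20, 4320),
]


def is_fresh(url, age_s, lifetime_s=None, lm_age_s=None):
    """Simplified freshness check; all times are in seconds."""
    if lifetime_s is not None:          # explicit expiry information wins
        return age_s < lifetime_s
    for pattern, min_m, pct, max_m in RULES:
        if pattern.search(url):         # first matching rule is used
            break
    else:
        return False
    if age_s > max_m * 60:
        return False
    if lm_age_s is not None and age_s < pct * lm_age_s:
        return True                     # young relative to its Last-Modified age
    return age_s < min_m * 60


TILE = "http://kh3.google.com.au/kh?n=404&v=23&t=trtqtt"
with_expiry = is_fresh(TILE, age_s=531, lifetime_s=31536000)
without_expiry = is_fresh(TILE, age_s=531)
```

A “?” URL that carries expiry information is treated as fresh, while the same URL with no expiry information falls through to the 0 0% rule and is never served from cache without going back to the origin.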

Secondly, Squid needs to be taught that certain URLs are “equivalent” for the purposes of cache storage and retrieval. I’m working on a patch which will take a URL like this:

http://kh3.google.com.au/kh?n=404&v=23&t=trtqtt

Match on the URL via a regular expression, eg:

m/^http:\/\/kh(.*?)\.google\.com(.*?)\/(.*?)$/

And map that to a fixed URL regardless of the keyhole server number, eg:

http://keyhole-server.google.com.au.SQUIDINTERNAL/kh?n=404&v=23&t=trtqtt

The idea, of course, is that there won’t ever be a valid URL normally fetched whose host part ends in .SQUIDINTERNAL and thus we can use it as an “internal identifier” for local storage lookups.

This way we can then request the tile from any kh server under any country domain, so the following URLs would be equivalent from the point of view of caching:

http://kh3.google.com.au/kh?n=404&v=23&t=trtqtt
http://kh2.google.com.au/kh?n=404&v=23&t=trtqtt
http://kh0.google.com/kh?n=404&v=23&t=trtqtt

It’s important to note here that the content is still fetched from the requested host, it’s just stored in the cache under a different URL.
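
The rewrite itself is easy to model. Here is a Python sketch using the regular expression from above; the function name store_key is hypothetical and this is not the Squid patch, just the mapping it is meant to perform.

```python
import re

# Same expression as above, minus the Perl delimiters.
KH_URL = re.compile(r"^http://kh(.*?)\.google\.com(.*?)/(.*?)$")


def store_key(url: str) -> str:
    """Return the URL to store the object under; other URLs pass through."""
    m = KH_URL.match(url)
    if m is None:
        return url
    # Server number (group 1) and country suffix (group 2) are thrown away.
    return "http://keyhole-server.google.com.au.SQUIDINTERNAL/" + m.group(3)


keys = {
    store_key("http://kh3.google.com.au/kh?n=404&v=23&t=trtqtt"),
    store_key("http://kh2.google.com.au/kh?n=404&v=23&t=trtqtt"),
    store_key("http://kh0.google.com/kh?n=404&v=23&t=trtqtt"),
}
```

All three URLs collapse to a single storage key, so the tile is stored once and any of the keyhole hosts can produce a hit.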

I’ll next talk about caching Google Images and finally how to cache Youtube.

Squid-2.6 IPv6

September 30, 2007 by Adrian Chadd

In case you didn’t know, there’s a work in progress for IPv6 support in Squid-2.6. You’ll find a patch here which, reportedly, is being used in production at a few sites.

If you’d like to see IPv6 in a future Squid-2 release - it’s a very large change to introduce in the squid-2.6 release so it would appear in a 2.7 or 2.8 release - then please join the squid-users mailing list and let us know.

(I hear a lot of people complaining about how Squid doesn’t “support IPv6” and yet won’t try Squid-3+IPv6 or even try googling for alternatives. The truth is that there have been unofficial patches to Squid-2 to support IPv6 in some fashion for a number of years now - heck, there was an IPv6 patch to Squid-1! - but no one volunteered to stand up, tidy it up and get it in shape for inclusion into the main tree. If IPv6 is important to you then please say so; please test the stuff that’s out there and don’t hesitate to donate to the Squid project with a note saying “for IPv6!”.)

Squid-2 performance work

September 30, 2007 by Adrian Chadd

My main focus at the moment is to tidy up areas of Squid-2 with an eye towards both nicer code internals and better performance. Over the last year I’ve committed work to Squid-2 which has eliminated a large part of the multiple data copying and multiple request/reply parsing which went on in Squid-2.6 and earlier.

Unfortunately I don’t run any busy Squids, and so I’m not always confident my changes are correct. Squid has a lot of baggage and even with 10 years’ experience I still hit the occasional “wha? I didn’t know it did that” case.

My recent work is focusing on eliminating all the extra memory copying that goes on. Part of this will involve changing the internal dataflow and some inclusion of new buffer and string management code. Yes, C++ would help here (I do like the idea of the compiler enforcing my reference counting semantics to be correct!) but Squid-3 is still in beta and squid-2 is still C. I’d rather not break Squid-3 when it’s not yet released, and this work is a “small” jump for people running Squid-2.HEAD or Squid-2.6.

The “store_copy” branch in Sourceforge will focus on converting one of the heaviest users of memcpy() - the storeClientCopy() API which allows the client-side to fetch data from the memory store - to use reference counted read-only buffers rather than copying the data into a supplied buffer. Reference counting in C is “tricky” even at the best of times. It’s eliminated almost 5% of CPU use due to memcpy() but this only applies on my workbench (memory-only workload, small objects, local client/server.) It may work for you, it may not. It’s part of a bigger goal - to avoid copying data where possible inside Squid - which will result in a leaner, faster proxy cache for everyone.

Its main noticeable savings should be RAM use - temporary 4k buffers aren’t used anymore to store data as it’s being written to the client. This may be more noticeable than the CPU savings. Regardless, I’d like to know how it runs for you!
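
The idea of handing out a counted, read-only reference instead of a copy can be illustrated in Python. This is only an analogy: Squid-2 is C, and the class and method names below are invented for illustration rather than taken from the store_copy branch.

```python
class RefBuffer:
    """A read-only buffer shared by reference rather than copied."""

    def __init__(self, data: bytes):
        self._data = bytes(data)
        self.refcount = 1               # the creator holds the first reference

    def ref(self) -> "RefBuffer":
        self.refcount += 1
        return self

    def unref(self) -> None:
        self.refcount -= 1
        if self.refcount == 0:
            self._data = None           # last holder gone: release the storage

    def view(self, offset: int, length: int) -> memoryview:
        # Slicing a memoryview does not copy the underlying bytes.
        return memoryview(self._data)[offset:offset + length]


store_buf = RefBuffer(b"HTTP/1.0 200 OK\r\n\r\nbody")
client_buf = store_buf.ref()            # the client side takes a reference
chunk = client_buf.view(0, 15)          # a window onto the shared data, no copy
```

The client reads through the view while the store keeps its own reference; the storage is only released once both sides have called unref().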

To help:

  • Grab a Squid-2.HEAD tree, something reasonably recent;
  • Grab the patch from the store_copy branch at Sourceforge and patch Squid-2.HEAD;
  • Compile and run it!
  • Let me know how it runs - is it running smoothly? Is it leaking memory? Crashing? Serving incorrect data?

It “works for me” on my test bench at home. I’d love to know whether this is stable enough to commit to Squid-2.HEAD so I can move on to the next piece of work.

Squid-2 updates: Logfiles and buffers

September 23, 2007 by Adrian Chadd

I’ve made three changes to the Squid-2.HEAD codebase this weekend.

First up - I’ve modified the memory allocator to not zero every sort of buffer. This can be quite expensive for large buffers, especially on older machines or very busy Squid servers. Squid-2.HEAD now has the “zero_buffers” option which currently defaults to “on”. To disable zeroing buffers please add “zero_buffers off” to your squid.conf file. I’ve seen up to 10% CPU savings on my testbed at home but this may vary wildly depending upon work load.

Secondly - the ‘logtype’ configuration option has been removed and replaced with the ability to define logging types per logfile. You can now prefix your log line with “daemon:”, “stdio:”, “udp:” or “syslog:”. “syslog:” works the same as before; “stdio:” and “daemon:” just take a path, and “udp:” takes an IP:port URL.

To log to a UDP socket, try:

access_log udp://192.168.1.101:1234

Please note though that the default UDP payload size (defined in src/logging_mod_udp.c) is 1400 bytes and any application you decide to use to dump the logfile entries must be able to receive UDP packets that big. There’s a system-wide UDP packet limit in some operating systems (for example, sysctl net.inet.udp.maxdgram under FreeBSD) to also consider. If in doubt, do a tcpdump on both sides and make sure you’re seeing the packets of the right size getting there.
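
A receiver for those packets can be very small. The Python sketch below assumes each datagram carries one or more newline-terminated log lines (an assumption about the payload layout, not something stated above); the function names are made up, and port 0 simply asks the OS for a free port so the example is self-contained.

```python
import socket

MAX_PAYLOAD = 1400  # default UDP payload size mentioned above


def open_log_socket(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Bind a UDP socket to receive log datagrams on."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def read_log_lines(sock: socket.socket, bufsize: int = 65535) -> list:
    """Receive one datagram and split it into log lines."""
    # bufsize must be at least MAX_PAYLOAD or the datagram will be truncated.
    data, _peer = sock.recvfrom(bufsize)
    return data.decode("utf-8", errors="replace").splitlines()
```

For the access_log line above you would bind to 192.168.1.101 port 1234 and call read_log_lines() in a loop, appending the lines to a file.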

Note too you can’t use these options for the cache_log - it must always be a normal file path.

Logfile improvements in Squid-2-HEAD

September 19, 2007 by Adrian Chadd

I’ve committed my logfile handling improvements to Squid-2-HEAD. Essentially, it lets people write self-contained code modules to implement different logging methods. The three supported methods now are:

  • STDIO, which is how Squid currently does its logging;
  • Syslog, which is compiled in if you enable it; and
  • Daemon, which uses a simple external helper to write logfiles to disk.

Those of you who have run Squid may have noticed that it couldn’t support writing more than a hundred or so requests a second to disk before performance suffered. There’s no reason it shouldn’t handle this - a hundred requests a second is only 16 kilobytes a second to write - but the use of STDIO routines to do this had a negative impact on performance.

The logfile daemon allows the blocking disk IO to occur outside of the main Squid process, which basically means Squid can continue doing what it’s doing well (all the other stuff) while any blocking disk activity occurs in a separate process.

To use? Compile and install Squid-2-HEAD, then include the following line into your configuration:

logtype daemon

In reality, Squid with the logging daemon can now handle writing -thousands of requests a second- to disk without any performance impact. Furthermore, if the logging daemon can’t write to disk fast enough Squid will log an error message stating it’s falling behind and drop logging entries.
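
The “never block, drop when behind” behaviour can be modelled with a bounded queue. This Python sketch is an in-process analogy only: the real thing talks to an external helper process, and the class name, method names and queue size here are invented for illustration.

```python
import queue


class LogQueue:
    """Bounded queue between a non-blocking producer and a slow log writer."""

    def __init__(self, max_pending: int = 4):
        self._q = queue.Queue(maxsize=max_pending)
        self.dropped = 0

    def log(self, line: str) -> bool:
        """Called from the main loop; never blocks."""
        try:
            self._q.put_nowait(line)
            return True
        except queue.Full:
            self.dropped += 1           # writer is falling behind: drop the entry
            return False

    def drain(self, write) -> int:
        """Called by the writer; hands every pending line to write()."""
        count = 0
        while True:
            try:
                line = self._q.get_nowait()
            except queue.Empty:
                return count
            write(line)
            count += 1


logq = LogQueue(max_pending=4)
accepted = [logq.log(f"request {i}") for i in range(6)]
written = []
drained = logq.drain(written.append)
```

Six entries arrive while the writer is stalled; four are queued, two are counted as dropped, and the main loop never waits on the disk.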

I’ve tested this up to three thousand requests a second over the course of a few hours (to a dedicated logging disk however) and it handles it without a problem.

If enterprising souls wished, they could write a UDP logging helper, or a MySQL external logging helper, without needing to modify the Squid codebase.

This code will eventually also appear in Squid-3 after 3.0 is released.

Squid Sighting: Advproxy!

September 15, 2007 by Adrian Chadd

Another Squid sighting: the IPCop AdvProxy add-on is really just Squid-2.6 in disguise!

Chalk up another one for Squid.

It has a rather interesting addon - the “updates cache” which caches Windows and Symantec updates through a clever use of redirectors. Cute!

Why even bother making cachable content?

September 8, 2007 by Adrian Chadd

I see so many sites pop up in some Squid logs which seem to try and avoid any attempt at caching. I’m not sure why, but I’m going to try and cover a few points here.

  1. I want to know exactly how many bits I’m shipping! This is especially prevalent in the American internet scene. Everyone’s about shipping bits. The more bits you ship the “better” you are. (There’s some talk about the “number of prefixes you advertise” also being linked to how “big” your network is; or maybe people are just lazy at trying to aggregate their BGP announcements. I digress..) Sure, if you graph your outbound links this is true. But you can do HTTP tricks to know exactly how many requests you’re handling without shifting the whole object out. Just set the objects to “must revalidate” rather than being immediately expired; let the web cache always revalidate the request via an If-Modified-Since request. You’ll get the IMS and can send back a “not-modified” reply; you can then synthesise a graph based on what you -would- be serving. Voila, free bits. This can be quite substantial if you have lots and lots of images on your site.
  2. I want to know how many people are accessing my site! This is definitely a left-over from the 90s and even then the problem was solved. If you absolutely positively need to know about page impressions then just embed a non-cachable 1×1 transparent gif somewhere where it won’t slow the page rendering down. Leave the rest of the site cachable. Really though, these days people should just use javascript and cookies (a la the Google “urchin”) if they want accurate “people” and “impression” counts. Trying to do it based on page accesses and unique IPs just isn’t going to cut it.
  3. I don’t want people to cache the data; they have to login first! You can tell proxy caches that they must first revalidate the authentication information from the origin server before serving out content. You can have your cake and eat it too.
  4. Making my content cachable is too damned hard! How do I know which headers to send, when and where? It’s not all that difficult. Mark Nottingham’s Caching Tutorial covers a lot of useful information about building cachable websites. You can keep control of your authenticated content and push out more content than you’re actually buying transit for.
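
The accounting trick from point 1 can be sketched in Python. The header value is one common way to ask caches to revalidate on every use; the function name and the (status, size) log format are assumptions for illustration.

```python
# Cached copies may be stored but must be revalidated before each reuse.
REVALIDATE_HEADERS = {"Cache-Control": "max-age=0, must-revalidate"}


def shipped_vs_virtual(requests):
    """requests: iterable of (status, object_size) pairs from your access log.

    Returns (bytes actually shipped, bytes you would have shipped uncached).
    """
    shipped = 0
    virtual = 0
    for status, size in requests:
        if status == 200:
            shipped += size             # full object went out
            virtual += size
        elif status == 304:
            virtual += size             # a cache served it on your behalf
    return shipped, virtual


# One full fetch of a 15040 byte image, then two successful revalidations.
totals = shipped_vs_virtual([(200, 15040), (304, 15040), (304, 15040)])
```

Every request is still seen and counted, but only the first one costs the full 15040 bytes of outbound bandwidth.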

Just remember a few simple rules:

  • Don’t hide static content behind query URLs (ie, stuff with a ‘?’ in them). Caches won’t cache them (unless, of course, they’re built by me. But then, I am pretty evil.) I see plenty of websites which hide all of their images and flash videos behind a CGI script with a ? in the path - caches just won’t bother trying to cache it. Amusingly, most of those sites hide static content behind CGI scripts! Just imagine what it’d be like to be able to push five or ten times the amount of content to clients behind proxy caches.
  • Don’t be afraid to ask for help in how to optimise your site for forward caching. Heck, even asking on the squid-users mailing list will probably get you sorted out without too much trouble.
  • There are people behind proxy caches - the developing world for one, but there’s plenty of caches to be found in schools, offices, wired buildings, wireless mesh networks and the like. Bandwidth isn’t free and never will be. You might be able to buy a 40gbit pipe to your favourite transit provider in North America but that won’t help people in South Africa or Australia where international bandwidth is still expensive and will remain so for the foreseeable future. And yes, we like watching Youtube as much as the next person.

LDAP improvements in Squid-3!

September 6, 2007 by Adrian Chadd

One of the OpenLDAP Core Developers - Pierangelo Masarati - has offered to help out by submitting Squid-3 LDAP authentication and session improvements.

The first patch improves the code organisation and (if I read his email right) begins to support SASL bind (external ldapi://).

Thanks!

Squid-2.6.STABLE16 is out!

September 6, 2007 by Adrian Chadd

Henrik has released Squid-2.6.STABLE16. This resolves a number of bugs, including a crash bug introduced in Squid-2.6.STABLE15.

The changeset list explains what’s changed; the release page includes downloads and other useful stuff. Don’t forget to read the release notes if you’re updating from 2.5 to 2.6!

And don’t forget the Squid-2.6 Configuration Manager!

Squid-3.0.PRE7 is out (silently..)

September 3, 2007 by Adrian Chadd

Duane/Alex and the gang have released Squid-3.0.PRE7. It’s shaping up to be a pretty good release. If you’re running Squid-3.0 or you’re interested in helping out testing the release then please visit here to download.

I’m not sure what exactly has been improved since PRE6 but they’ve been busy trying to track down and fix all the bugs they can. Alex has been committing chunks to improve the ICAP support quite significantly. Amos has also been doing more IPv6 work in his branch, which will hopefully be merged post Squid-3.0.STABLE1.