Archive for the ‘Uncategorized’ Category

How cachable is google (part 2) – Youtube content

November 17, 2007

YouTube is one of the banes of small-upstream network administrators. The flash files are megabytes in size, and a popular video can be downloaded by half the people in an office or student residential college in one afternoon.

It is, at the present time, very difficult to cache. Let’s see why.

There are actually two different methods that I’ve seen employed to serve the flash media files. The first involves fetching from youtube.com servers; the second involves fetching from IP addresses in Google IP space.

The first method is very simple: the URL form is:

http://XXX-YYY.XXX.youtube.com/get_video?video_id=VIDEO_ID

XXX is the POP name; YYY is, I’m guessing, either a server or a cluster name.

This is pretty standard stuff, and If-Modified-Since requests seem to be handled badly too! The “?” query string in the URL makes it uncachable by Squid by default, even though it’s a flash video that is probably not going to change very often.
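To make the “?” rule concrete, here is a minimal Python sketch of that kind of check. It assumes the stock squid.conf of this era, which, to the best of my knowledge, shipped with an ACL along the lines of `acl QUERY urlpath_regex cgi-bin \?` paired with a cache-deny rule. The function name `denied_by_query_rule` is made up for illustration, and this is an approximation of the idea, not Squid’s actual matching code.

```python
import re
from urllib.parse import urlsplit

# Assumed default pattern: anything under cgi-bin, or any URL with a "?".
QUERY_PATTERN = re.compile(r"cgi-bin|\?")


def denied_by_query_rule(url):
    """Return True if a default-style QUERY rule would refuse to cache url."""
    parts = urlsplit(url)
    # Rebuild the path-plus-query portion, which is what the pattern is
    # matched against in this sketch.
    url_path = parts.path + ("?" + parts.query if parts.query else "")
    return QUERY_PATTERN.search(url_path) is not None


# The first-method URL form from above is refused purely because of the "?".
print(denied_by_query_rule(
    "http://dal-v37.dal.youtube.com/get_video?video_id=HrLFb47QHi0"))
# A plain static path is not.
print(denied_by_query_rule("http://example.com/videos/clip.flv"))
```

The point is that the decision is made on the shape of the URL alone; nothing about the reply (content type, size, validators) is consulted.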

The second method involves a bit more work. First the video is requested from a Google server. This server then issues an HTTP 302 reply pointing the client at a changing IP address. The redirected request looks somewhat like this:

http://74.125.15.83/get_video?video_id=HrLFb47QHi0&origin=dal-v37.dal.youtube.com

Again, the “?” query string. Again, the origin server name, but this time it’s encoded in the URL. Finally, not only are If-Modified-Since requests not handled correctly, the replies include ETags, yet requests revalidating with If-None-Match still return the whole object! Aiee!

So how to cache it?

Firstly, you have to try to cache replies for URLs containing a “?”. It would be nice if they handled If-Modified-Since and If-None-Match requests correctly when the object hasn’t been modified: revalidation is cheap and it’s basically free bandwidth. They could set the revalidation interval to be, say, even 30 minutes; they’re already handling full requests for all the content, so the request rate would stay the same but the bandwidth requirements should drop.
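For reference, this is roughly what “handling revalidation correctly” means on the server side. The sketch below is a simplified Python model of standard HTTP conditional-request behaviour, not anything taken from YouTube or Squid. The function name `revalidation_status` is hypothetical, header names are assumed to be given in exactly this capitalisation, and ETag lists are split naively on commas.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _opaque(tag):
    # Weak comparison: ignore a leading W/ marker on the tag.
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def revalidation_status(request_headers, etag, last_modified):
    """Return 304 if the client's copy is still good, else 200 (full body)."""
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # If-None-Match takes precedence over If-Modified-Since.
        candidates = [_opaque(t) for t in if_none_match.split(",")]
        if "*" in candidates or _opaque(etag) in candidates:
            return 304
        return 200
    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200  # unparseable date: ignore the condition
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        if last_modified <= since:
            return 304
    return 200


# Illustrative values only: an object last changed on 1 November 2007.
changed = datetime(2007, 11, 1, tzinfo=timezone.utc)
print(revalidation_status({"If-None-Match": '"abc"'}, '"abc"', changed))
print(revalidation_status(
    {"If-Modified-Since": "Sat, 17 Nov 2007 00:00:00 GMT"}, '"abc"', changed))
```

A 304 carries no body, which is where the bandwidth saving comes from: the request still arrives, but the megabytes of video do not get sent again.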

The URLs also have to be rewritten, much as is done to cache Google Maps content. The “canonical” form of the URL will then reference a given video regardless of which server the client is asking.
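As an illustration of the rewriting idea, here is a small Python sketch that maps both URL forms shown earlier onto one canonical key. It is not the Squid code: the function name `canonical_store_url` and the host `video.youtube.invalid` are invented for the example, and it only recognises the two `get_video` forms described in this post.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

# Hypothetical placeholder host used only as a cache key, never fetched.
CANONICAL_HOST = "video.youtube.invalid"


def canonical_store_url(url):
    """Map any recognised get_video URL to one key per video_id."""
    parts = urlsplit(url)
    if parts.path != "/get_video":
        return url
    query = parse_qs(parts.query)
    video_ids = query.get("video_id")
    if not video_ids:
        return url
    host = parts.hostname or ""
    origin = query.get("origin", [""])[0]
    # Method one: a youtube.com server name. Method two: a bare IP address
    # with the youtube.com origin carried in the query string.
    if not (host.endswith(".youtube.com") or origin.endswith(".youtube.com")):
        return url
    key = urlencode({"video_id": video_ids[0]})
    return f"http://{CANONICAL_HOST}/get_video?{key}"


first = canonical_store_url(
    "http://dal-v37.dal.youtube.com/get_video?video_id=HrLFb47QHi0")
second = canonical_store_url(
    "http://74.125.15.83/get_video?video_id=HrLFb47QHi0"
    "&origin=dal-v37.dal.youtube.com")
print(first == second)
```

The cache stores and looks up the object under the canonical key, while the actual fetch still goes to whatever server the client was pointed at.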

Now, how do you do this in Squid? I’ve got some beta code to do this and it’s in the Squid-2 development tree. Take a look here for some background information. It works around the multiple-URL-referencing-same-file problem but unfortunately it won’t work around their broken HTTP/1.1 validation code. If they fixed that then YouTube might become something which network administrators stop asking to filter.

(ObNote: the second method uses lighttpd as the serving software, and it replies with an HTTP/1.1 reply regardless of whether the request was HTTP/1.0 or HTTP/1.1. Grr!)

Web Cache Whitepapers/Articles

August 24, 2007

Why bother with Squid purely as a proxy server? Isn’t most of the content on the Internet today dynamic?

Perhaps; perhaps not. A few years ago “media caching” required licenced software to handle WMA and RealMedia streams; today the heavy bandwidth users are flash videos from popular sites such as YouTube. The HTML may not be cachable but all those thumbnail images, all those previews and all those large flash video files are very cachable. The problem isn’t that the Internet is “dynamic”; the problem is that website designers view caching as “evil” – they’re suddenly not 100% in control of their content – and try as hard as possible to dodge caching.

Squid has a few knobs which can be set to cache this so-called “dynamic” content. By default Squid has to treat everything which may be dynamic as uncacheable, with the telltale “?” in the URL identifying the output as coming from a script, when in fact the content often isn’t all that dynamic. More on that will be covered in a future article.

ISPs who run Squid with a well-tuned configuration have shown web traffic savings of around 30%. That’s 30% of their traffic by volume, not just 30% of requests being hits. And that’s without any attempt at caching the “dynamic” content which can actually be cached; YouTube and Windows Update are two big offenders here.
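The distinction between traffic saved and hits is the difference between a byte hit ratio and a request hit ratio. A minimal Python sketch of the two calculations follows; the function name `hit_ratios` and the sample figures are made up purely to show how far apart the two numbers can be, and are not measurements from any real cache.

```python
def hit_ratios(records):
    """Compute (request hit ratio, byte hit ratio).

    records is an iterable of (was_hit, size_in_bytes) pairs.
    """
    requests = hits = total_bytes = hit_bytes = 0
    for was_hit, size in records:
        requests += 1
        total_bytes += size
        if was_hit:
            hits += 1
            hit_bytes += size
    request_ratio = hits / requests if requests else 0.0
    byte_ratio = hit_bytes / total_bytes if total_bytes else 0.0
    return request_ratio, byte_ratio


# Invented sample: three small cached images and one large uncached video.
sample = [(True, 2_000), (True, 3_000), (True, 5_000), (False, 90_000)]
print(hit_ratios(sample))  # (0.75, 0.1)
```

Lots of small hits make the request ratio look healthy while the large misses dominate the link, which is why the big video files matter so much to the byte figure.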

So Squid isn’t that useless at all!

A couple of articles which give an overview of caching follow. They’re dated, since the technology isn’t new after all, but just as applicable today.

Further Info on IPv6 – Where the official site actually is…

July 3, 2007

Since people seem to be redirected here in preference to the official pages on the Squid IPv6 branch, I think it’s about time I made some quick references back there so all of you trying to use this wonderful branch can find the actual code and know how to do so.

The IPv6 work in Squid is all currently documented at http://devel.squid-cache.org/squid3-ipv6/ and related pages. My contact details, or those of any developer kept on to maintain it, should be referenced from there.

How-tos, configuration, patches, etc., ‘all the guff’ as they say, will be available there shortly as well.