Archive for the ‘Youtube’ Category

How cachable is google (part 2) – Youtube content

November 17, 2007

Youtube is (one of) the bane of small-upstream network administrators. The flash files are megabytes in size, and a popular video can be downloaded by half the people in the office or student residential college in one afternoon.

It is, at the present time, very difficult to cache. Lets see why.

There’s actually two different methods employed to serve the actual flash media files that I’ve seen. The first method involves fetching from youtube.com servers; the second involves fetching from IP addresses in Google IP space.

The first method is very simple: the URL form is:

http://XXX-YYY.XXX.youtube.com/get_video?video_id=VIDEO_ID

XXX is the pop name; YYY is I’m guessing either a server or a cluster name.

This is pretty standard stuff – and If-Modified-Since requests seem to also be handled badly too! The query string “?” in the URL makes it uncachable to Squid by default, even though its a flash video. Its probably not going to change very often.

The second method involves a bit more work. First the video is requested from a google server. This server then issues a HTTP 302 reply pointing the content at a changing IP address. This request looks somewhat like this:

http://74.125.15.83/get_video?video_id=HrLFb47QHi0&origin=dal-v37.dal.youtube.com

Again, the “?” query string. Again, the origin, but its encoded in the URL. Finally, not only are If-Modified-Since requests not handled correctly, the replies include ETags and requests with an If-None-Match revalidation still return the whole object! Aiee!

So how to cache it?

Firstly, you have to try and cache replies with a “?” reply. It would be nice if they handled If-Modified-Since and If-None-Match requests correctly when the object hasn’t been modified – revalidation is cheap and its basically free bandwidth. They could set the revalidation to be, say, after even 30 minutes – they’re already handling all the full requests for all the content, so the request rate would stay the same but the bandwidth requirements should drop.

The URLs also have to rewritten, much like they do to cache google maps content. The “canonical” form URL will then reference a “video” regardless of which server the client is asking.

Now, how do you do this in Squid? I’ve got some beta code to do this and its in the Squid-2 development tree. Take a look here for some background information. It works around the multiple-URL-referencing-same-file problem but it won’t unfortunately work around their broken HTTP/1.1 validation code. If they fixed that then Youtube may become something which network administrators stop asking to filter.

(ObNote: the second method uses lighttpd as the serving software; and it replies with a HTTP/1.1 reply regardless of whether the request was HTTP/1.0 or HTTP/1.1. Grr!)