How cachable is google (part 2) – Youtube content

Youtube is one of the banes of network administrators with small upstream links. The flash files are megabytes in size, and a popular video can be downloaded by half the people in an office or student residential college in one afternoon.

It is, at the present time, very difficult to cache. Let’s see why.

I’ve seen two different methods employed to serve the actual flash media files. The first method involves fetching from youtube.com servers; the second involves fetching from IP addresses in Google IP space.

The first method is very simple. The URL form is:

http://XXX-YYY.XXX.youtube.com/get_video?video_id=VIDEO_ID

XXX is the POP name; YYY is, I’m guessing, either a server or a cluster name.

This is pretty standard stuff, and If-Modified-Since requests seem to be handled badly too! The “?” query string in the URL makes it uncachable to Squid by default, even though it’s a flash video. It’s probably not going to change very often.
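To make the “?” problem concrete, here is a minimal Python sketch of the kind of check involved. It assumes the default rule is a regular expression along the lines of `cgi-bin|\?`, applied to the URL path and query. The pattern and the function name `denied_by_query_rule` are illustrative assumptions, not Squid's actual code.

```python
import re
from urllib.parse import urlsplit

# Assumed pattern: treat anything that looks like a CGI script or carries
# a query string as dynamic content that should not be cached.
QUERY_PATTERN = re.compile(r"cgi-bin|\?")


def denied_by_query_rule(url):
    """Return True if the URL path plus query matches the assumed rule."""
    parts = urlsplit(url)
    path_and_query = parts.path
    if parts.query:
        path_and_query += "?" + parts.query
    return QUERY_PATTERN.search(path_and_query) is not None
```

Under this assumption the get_video URL above is refused purely because of its query string, while a plain `.flv` path with no “?” would not be.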

The second method involves a bit more work. First the video is requested from a Google server. This server then issues an HTTP 302 reply redirecting the client to a changing IP address. The redirected request looks somewhat like this:

http://74.125.15.83/get_video?video_id=HrLFb47QHi0&origin=dal-v37.dal.youtube.com

Again, there’s the “?” query string. Again, there’s the origin, but this time it’s encoded in the URL. Finally, not only are If-Modified-Since requests not handled correctly, but the replies include ETags, and requests revalidating with If-None-Match still return the whole object! Aiee!
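For readers less familiar with revalidation, the following Python sketch shows what a conditional request looks like and how the reply can be judged. The helper names `conditional_request` and `revalidation_outcome` are made up for illustration; the header names and the meaning of a 304 reply come from HTTP itself.

```python
import urllib.request


def conditional_request(url, etag=None, last_modified=None):
    """Build a GET request asking the server to revalidate a cached copy."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


def revalidation_outcome(status, body_length):
    """Classify a reply to a conditional request."""
    if status == 304:
        # Not Modified: the cached copy is still good and no body was sent.
        return "revalidated"
    if status == 200 and body_length > 0:
        # The whole object came back again, as described above.
        return "refetched"
    return "other"
```

A server that handles validation properly answers an unchanged object with 304 and no body; the behaviour described above corresponds to the "refetched" case, where megabytes are sent again for every revalidation.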

So how to cache it?

Firstly, you have to try to cache replies for URLs containing a “?”. It would be nice if they handled If-Modified-Since and If-None-Match requests correctly when the object hasn’t been modified: revalidation is cheap and it’s basically free bandwidth. They could make objects require revalidation after, say, even 30 minutes. They’re already handling the full requests for all the content, so the request rate would stay the same but the bandwidth requirements should drop.

The URLs also have to be rewritten, much as they are to cache Google Maps content. The “canonical” form of the URL will then reference a “video” regardless of which server the client is asking.
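A rough sketch of that canonicalisation in Python, under stated assumptions: it only recognises the two get_video URL forms shown above, and the `CANONICAL_HOST` name and the function `canonical_video_url` are hypothetical, chosen for illustration. It is not the code in the Squid tree.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical placeholder host used only as a cache key, never fetched.
CANONICAL_HOST = "youtube-video.invalid"


def canonical_video_url(url):
    """Map both get_video URL forms to a single URL per video_id."""
    parts = urlsplit(url)
    if parts.path != "/get_video":
        return url  # not a video request: leave untouched
    query = parse_qs(parts.query)
    video_id = query.get("video_id", [""])[0]
    origin = query.get("origin", [""])[0]
    host = parts.hostname or ""
    # First form: a youtube.com server. Second form: a bare IP address
    # with the youtube.com origin carried in the query string.
    from_youtube = host.endswith(".youtube.com") or origin.endswith(".youtube.com")
    if not video_id or not from_youtube:
        return url
    return "http://%s/get_video?video_id=%s" % (CANONICAL_HOST, video_id)
```

With this mapping, the redirected IP-address URL and the youtube.com URL for the same video_id collapse to the same key, which is the property the cache needs.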

Now, how do you do this in Squid? I’ve got some beta code to do this and it’s in the Squid-2 development tree. Take a look here for some background information. It works around the multiple-URL-referencing-same-file problem, but unfortunately it won’t work around their broken HTTP/1.1 validation code. If they fixed that, then Youtube might become something which network administrators stop asking to filter.
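As an illustration of how an external rewriter can be plugged in, here is a minimal Python sketch of a line-oriented helper. It assumes the same simple protocol the Perl redirector in the comments relies on: one request per line on standard input with the URL as the first field, and one URL per line written back. The names `answer_for` and `serve` are hypothetical, and the exact helper protocol of the Squid-2 code may differ.

```python
import sys


def answer_for(line, rewrite):
    """Return the reply for one request line (URL is the first field)."""
    fields = line.split()
    if not fields:
        return ""  # nothing to rewrite on an empty line
    return rewrite(fields[0])


def serve(stdin=sys.stdin, stdout=sys.stdout, rewrite=lambda url: url):
    """Answer requests until the input is closed."""
    for line in stdin:
        stdout.write(answer_for(line, rewrite) + "\n")
        # Flush after every reply so the proxy is not left waiting.
        stdout.flush()
```

Calling `serve(rewrite=some_function)` from a script would run the loop; the default `rewrite` passes every URL through unchanged, so a real helper would supply its own canonicalisation function.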

(ObNote: the second method uses lighttpd as the serving software, and it replies with an HTTP/1.1 reply regardless of whether the request was HTTP/1.0 or HTTP/1.1. Grr!)

8 Responses to “How cachable is google (part 2) – Youtube content”

  1. phreaki Says:

    Thanks for working on this very important task!

    Small and third pipe operators need this type of caching to offset P2P, so I’ll be trying this method out soon.

    I ponder however: Google Earth? I don’t know if it uses SSL, but others have gotten into trouble for making ‘map packs’. I hope the DMCA could protect like it’s intended for those that try to make everything uncacheable.

  2. Adrian Chadd Says:

    The legality isn’t really my concern. This stuff isn’t prefetching Google Earth, nor is it distributing “map packs” to make things faster. It’s simply caching the content like any other content. Google isn’t specifically trying to make the map tile images uncachable.

  3. chudycebu Says:

    I’ve been looking for a Squid that has this storeurl_rewrite feature and found out that it is still in development. I’m currently using squid-2.6.STABLE18 (10 Jan 2008) on Windows XP.

    I just want to cache the Youtube videos (because it’s the #1 bandwidth sucker, and #2 is imeem). A refresh pattern on get_video worked fine for 3 years until last Thursday. Youtube has changed its video file URLs.

    They added signature, ip, ipbits, expire, and key to the video file URLs, so they can no longer be cached because the signature always changes. So I’m trying to use a URL rewriter temporarily until the storeurlrewrite feature is up.

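    $| = 1;    # unbuffered output, so Squid sees each reply immediately

    # Squid sends one request per line: the URL first, then other fields.
    while (<STDIN>) {
        chomp;
        # print STDERR $_ . "\n";
        if (m/^http:\/\/([A-Za-z]*?)-(.*?)\.(.*)\.youtube\.com\/get_video\?video_id=(.*)\&signature=(.*)\&ip=(.*)\&ipbits=(.*)\&expire=(.*)\&key=(.*) /) {
            # youtube.com form: drop everything after video_id
            print "http://" . $1 . "-" . $2 . "." . $3 . ".youtube.com/get_video?video_id=" . $4 . "\n";
        } elsif (m/^http:\/\/(.*?)\/get_video\?video_id=(.*)\&origin=(.*)\&signature=(.*)\&ip=(.*)\&ipbits=(.*)\&expire=(.*)\&key=(.*) /) {
            # IP-address form: go back to the origin server named in the URL
            print "http://" . $3 . "/get_video?video_id=" . $2 . "\n";
        } else {
            # anything else: echo the line back unchanged
            print $_ . "\n";
        }
    }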

    I just bypass the CDN network and use the origin URL from Youtube.
    Once the storeurlrewrite feature is available, I’d really love to cache imeem (which is the #2 most annoying bandwidth sucker).

  4. whoodd Says:

    I rewrite the URL without the “\?” to get a friendly name:

    if (m/^http:\/\/([A-Za-z]*?)-(.*?)\.(.*)\.youtube\.com\/get_video\?video_id=(.*)\&signature.*\&ip.*/)
    {print "http://video.youtube.SQUIDCACHE/get_video/".$4.".flv\n";}
    elsif (m/^http:\/\/74(.*?)\/get_video\?video_id=(.*)\&origin=.*/)
    {print "http://video.youtube.SQUIDCACHE/get_video/".$2.".flv\n"; }

    # for dailymotion
    elsif (m/^http:\/\/proxy.*\.dailymotion.com\/.*\/flv\/(.*)\?.*/)
    {print "http://proxy.dailymotion.com.SQUIDCACHE/".$1."\n";}
    elsif (m/^http:\/\/.*cdn\.dailymotion.com\/(.*)\?.*/)
    {print "http://proxy.dailymotion.com.SQUIDCACHE/".$1.".flv\n";}

    # for deezer.com
    elsif (m/^http:\/\/.*\.deezer\.com\/getStream2\.php\?ID=(.*)\&KEY.*/)
    { print "http://music.deezer.com.SQUIDCACHE/sound-".$1.".flv\n"; }

    (But there is a problem with the deezer web server: httpReadReply: Excess data from GET …)

  5. Youtube caching using squid | Fedora India Says:

    […] tried a lot of squid hacks listed all around the web. But none of them seems to work for me. And even if they work, the weird load balancing system of […]

  6. vigneshbabugj Says:

    Hi, this is Vignesh.

    I’ve configured a Squid server with Tproxy on my system.
    Along with that, I have installed the Squirm redirector tool.
    When I start the Squirm process, the Squid service starts on different ports; it doesn’t start on port 3128.
    I don’t know how to solve this bug. Can anyone help me solve it?

    Please help me, it’s urgent.

  7. voku1987 Says:

    http://cachevideos.com/installation

    You need:

    1.) Squid >= 2.6

    2.) Python >= 2.4

    3.) Python >= 2.4

    4.) Apache (or another web server)

    My HowTo (German only)
    -> http://voku-online.de/comment.php?comment.news.111

  8. Youtube Caching Using Squid Says:

    […] tried a lot of squid hacks listed all around the web. But none of them seems to work for me. And even if they work, the weird load balancing system of […]
