Caching for AOL Journals

We're continuing to work on improving the scalability of the AOL Journals servers.  Our major problem recently has been traffic spikes to particular blog pages or entries caused by links on the AOL Welcome page.  Most of this peak traffic is from AOL clients using AOL connections, meaning they're using AOL caches, known as Traffic Servers.  Unfortunately, there was a small compatibility issue that prevented the Traffic Servers from caching pages served with only ETags.  That was resolved last week in a staged Traffic Server rollout.  Everything we've tested looks good so far; the Traffic Servers are correctly caching pages that can be cached, and not the ones that can't.  We're continuing to monitor things, and we'll see what happens during the next real traffic spike.

The good thing about this type of caching is that our servers are still notified about every request through validation requests, so we'll be able to track things fairly closely, and we're able to add caching without major changes to the pages.
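As a rough illustration of what a validation request looks like from the origin's side, here is a minimal Python sketch.  It is not the Journals code: the function names, the SHA-1-based ETag scheme, and the `Cache-Control: no-cache` header are all assumptions made for the example.  The cache sends the ETag it holds in `If-None-Match`, and the origin answers with a bodyless 304 if the page hasn't changed.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    """Derive a quoted ETag from the page content (hypothetical scheme)."""
    return '"' + hashlib.sha1(body).hexdigest() + '"'


def respond(body: bytes,
            if_none_match: Optional[str] = None) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload) for a GET, honoring If-None-Match."""
    etag = make_etag(body)
    # Assumption: "no-cache" lets a cache store the page but forces it to
    # revalidate with the origin before each reuse.
    headers = {"ETag": etag, "Cache-Control": "no-cache"}
    if if_none_match is not None:
        # The header may carry several comma-separated tags; a "W/" prefix
        # marks a weak validator and is ignored for this comparison.
        candidates = [t.strip() for t in if_none_match.split(",")]
        candidates = [t[2:] if t.startswith("W/") else t for t in candidates]
        if "*" in candidates or etag in candidates:
            # Page unchanged: the cache can serve its stored copy.
            return 304, headers, b""
    return 200, headers, body
```

The origin still sees every request, which is what makes the tracking possible, but an unchanged page costs only a short 304 response instead of the full body.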

The downside, of course, is that our servers can still get hammered if the validation requests themselves climb high enough.  This is true even for time-based caching if you get enough traffic; you still have to scale up your origin server farm to accommodate peak traffic, because caches don't guarantee that they won't pass traffic peaks through.
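To make the trade-off concrete, here is a back-of-the-envelope Python sketch of origin load for a single hot page under the two approaches.  The model and every number in it are hypothetical: it assumes each cache refetches the page at most once per expiry interval, which is the best case and, as noted, not something caches guarantee.

```python
def origin_rps_validation(client_rps: float) -> float:
    """ETag revalidation: every client request reaches the origin
    as a (cheap) conditional GET."""
    return client_rps


def origin_rps_ttl(client_rps: float, num_caches: int, ttl_seconds: float) -> float:
    """Time-based caching, idealized: each cache refetches the page at most
    once per TTL, and the origin never sees more than the client rate."""
    return min(client_rps, num_caches / ttl_seconds)


# Hypothetical spike: 1000 requests/sec through 50 caches, 60-second TTL.
print(origin_rps_validation(1000.0))   # origin handles all 1000 validations/sec
print(origin_rps_ttl(1000.0, 50, 60))  # under 1 full fetch/sec in the best case
```

The gap between the two figures is the ceiling on what time-based caching could save; how much of it a real cache delivers during a spike is exactly what needs monitoring.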

We're continuing to add more caching to the system carefully, monitoring as we go.  We're currently evaluating Squid as a caching reverse proxy.  It's open source and has been used in other situations -- and it would give us a lot of flexibility in our caching strategy.  We're also modifying our databases, both to distribute load and to take advantage of in-memory caching.  But we can scale much higher and more cheaply by pushing caching out to the front end and avoiding any work at all in the 80% case where it's not really needed.

1 comment:

  1. How come that when an article hits the news, and it happens to be on a Journals page (not even that *major* of news), the entire Journals server goes dead? I find that to be not too reliable.

    Is it maybe due to all the numerous server side components running and being served on a given journal that it can't possibly handle anything in excess of 100 hits/hr?

    I find the AOL Journals architecture lacking and members deserve more (and those servers deserve less of a load) ... why not implement compressed output to clients, more importantly, since a majority of users are probably AOL users (with topspeed or whatever) utilize other web data-caching servers on your service?

    All I want is just more information on what you (AOL Journals Team) is doing to deliver a better experience for both readers and bloggers alike.

    ~ Teh 1337 Blog

