We're continuing to work on improving the scalability of the AOL Journals servers. Our major problem recently has been traffic spikes to particular blog pages or entries caused by links on the AOL Welcome page. Most of this peak traffic is from AOL clients using AOL connections, meaning they're using AOL caches, known as Traffic Servers. Unfortunately, there was a small compatibility issue that prevented the Traffic Servers from caching pages served with only ETags. That was resolved last week in a staged Traffic Server rollout. Everything we've tested looks good so far; the Traffic Servers are correctly caching pages that can be cached, and not the ones that can't. We're continuing to monitor things, and we'll see what happens during the next real traffic spike.
The good thing about this type of caching is that our servers are still notified about every request through validation requests, so we can track things fairly closely, and we can add caching without major changes to the pages.
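To make the validation flow concrete, here's a minimal sketch of what an origin server does with an ETag-based conditional GET. The function name and entry content are hypothetical, not our actual code; the point is just that a matching `If-None-Match` lets us answer 304 with no body, so the cache re-serves its stored copy while we still see (and can count) the request.

```python
import hashlib

def handle_request(body, if_none_match):
    """Return (status, headers, payload) for a conditional GET.

    The origin derives an ETag from the entry content; if the cache's
    If-None-Match header matches, we answer 304 Not Modified with no
    body, and the cache serves its stored copy.
    """
    etag = '"%s"' % hashlib.md5(body).hexdigest()  # any stable hash works
    if if_none_match == etag:
        return 304, {"ETag": etag}, b""   # revalidated: no body sent
    return 200, {"ETag": etag}, body      # changed, or first fetch

entry = b"<html>blog entry</html>"
status, headers, _ = handle_request(entry, None)               # first fetch
status2, _, payload = handle_request(entry, headers["ETag"])   # revalidation
```

The win is in the payload: a 304 response is a few hundred bytes of headers instead of a full page render, even though the origin is still touched once per validation.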
The downside, of course, is that our servers can still get hammered if the validation requests themselves spike high enough. This is actually true even for time-based caching at sufficient traffic levels: you still have to scale the origin server farm for peak traffic, because caches don't really guarantee they won't pass peaks through.
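For contrast, time-based caching looks something like the sketch below (the function name and TTL are hypothetical). With a `Cache-Control: max-age` header the cache serves the page without contacting the origin at all until the TTL expires, but every expiry and every cold cache still produces an origin hit, which is why peaks leak through either way.

```python
from email.utils import formatdate

def time_based_headers(max_age_seconds=300):
    # While max_age_seconds is unexpired, the cache answers on its own
    # and the origin sees nothing -- but each expiry or cold cache is
    # still a full origin request, so traffic peaks still pass through.
    return {
        "Cache-Control": "max-age=%d" % max_age_seconds,
        "Date": formatdate(usegmt=True),  # caches age entries from Date
    }
```

The trade-off versus validation caching is visibility: time-based caching sheds more origin load, but we lose the per-request signal that lets us track traffic closely.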
We're continuing to carefully add more caching to the system while monitoring things. We're currently evaluating a Squid reverse caching proxy: it's open source, it's been proven in other deployments, and it would give us a lot of flexibility in our caching strategy. We're also making modifications to our databases to both distribute load and take advantage of in-memory caching, of course. But we can scale much higher and more cheaply by pushing caching out to the front end and avoiding any work at all in the 80% case where it's not really needed.
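For readers unfamiliar with Squid as a reverse proxy (an "accelerator" in Squid terms), the shape of the configuration is roughly the fragment below. This is a generic sketch using Squid's accelerator-mode directives, not our actual config, and the hostnames are made up:

```
# Listen on port 80 in accelerator (reverse-proxy) mode
http_port 80 accel defaultsite=journals.example.com

# Forward cache misses to the origin server farm; "originserver"
# tells Squid this peer is a web server, not another proxy
cache_peer origin.internal.example.com parent 8080 0 no-query originserver

# Only accept requests for the site we're accelerating
acl our_site dstdomain journals.example.com
http_access allow our_site
http_access deny all
```

Squid honors the same ETag/validation and expiry headers the Traffic Servers do, which is part of the appeal: the same cache-control work on our pages pays off at both tiers.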
Tags: caching, etags, squid, scalability, traffic, performance, aol journals, blogs