LiveBlog: Fixing Twitter
John Adams, Twitter Ops
Ops
- Small team
- SW perf
- availability is their primary focus
All on managed services with NTT:
- No cloud: latency is too high
- NTT runs the NOC
- Frees them to deal with real computer-science problems
752% growth in 2008; the trend takes off around November 2008 and keeps climbing
Growth pain: fear of what's going to happen
Mantra:
- Find the weakest point (metrics + logging + analysis)
- Take corrective action (process)
- Repeat
Find weak points
- Collect metrics and graphs (individual metrics are irrelevant)
- Logs
- SCIENCE!
- Instrument everything! More info is better
Monitoring
- Keep critical metrics as close to realtime as possible
- Using RRD, Ganglia + gmetric, MRTG
- Mostly on 10s interval, some 5s, some 60s
- Everyone in company has access to dashboard
- “Criticals” view
- Use Google Analytics on the Fail Whale and other error pages
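Not shown in the talk, but to make the Ganglia piece concrete: a minimal sketch of pushing a custom metric every 10 seconds by shelling out to the gmetric CLI. The metric name and the count_failwhales() data source are made up for illustration.

```python
# Hypothetical sketch: push a custom metric into Ganglia every ~10 seconds
# by shelling out to the gmetric CLI. Metric name and data source are made up.
import subprocess
import time

def count_failwhales():
    # Placeholder: in reality this would read a counter from the web tier.
    return 0

while True:
    value = count_failwhales()
    subprocess.run([
        "gmetric",
        "--name", "whales_per_sec",   # metric name shown on the dashboard
        "--value", str(value),
        "--type", "uint32",
        "--units", "whales/sec",
    ], check=True)
    time.sleep(10)                    # ~10s interval, matching the talk
```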
Analyze
- Turn data into info
- Are things better or worse post-deploy?
- Create an environment of capacity planning, not firefighting: no more cowboys in the wild west
Deploys
- Ganglia shows final deploy info for twitter, summize, and search
Whale-watcher
- Simple script, massive win
- A 503 is a whale, a 500 is a robot
- When whales-per-second exceeds the whale threshold: "There's whales!" (see the sketch after this list)
- Darkmode: selectively disable portions of the site, with automatic notification to the product and eng teams so they know
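The Whale-Watcher script itself wasn't shown; here is a hedged guess at its shape: follow the access log, count 503s (whales) and 500s (robots) over a window, and yell when whales-per-second crosses a threshold. The log path and threshold are assumptions.

```python
# Hedged sketch of a Whale-Watcher-style check (not Twitter's actual script).
# "whale" = HTTP 503, "robot" = HTTP 500.
import re
import subprocess
import time

LOG_PATH = "/var/log/nginx/access.log"   # assumption: path varies per site
WHALE_THRESHOLD = 1.0                     # assumption: whales/sec that triggers an alert
WINDOW_SECONDS = 60
STATUS_RE = re.compile(r'" (\d{3}) ')     # status code in combined log format

# Follow the log with `tail -F` and count 503s/500s over one window.
tail = subprocess.Popen(["tail", "-F", "-n", "0", LOG_PATH],
                        stdout=subprocess.PIPE, text=True)
whales = robots = 0
deadline = time.time() + WINDOW_SECONDS
for line in tail.stdout:                  # blocks until a log line arrives
    m = STATUS_RE.search(line)
    if m:
        if m.group(1) == "503":
            whales += 1
        elif m.group(1) == "500":
            robots += 1
    if time.time() >= deadline:
        break

whales_per_sec = whales / WINDOW_SECONDS
if whales_per_sec > WHALE_THRESHOLD:
    print("There's whales! %.2f/sec (robots in window: %d)" % (whales_per_sec, robots))
```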
Config Mgmt
- You need an automated config management system NOW, or you won't scale
- It introduces complexity: multiple admins, unknown interactions
- Peer review solves most of this. They use Review Board, with an SVN pre-commit hook that requires a "reviewed by" note in the commit message and a post-commit hook that notifies people about what changed (sketch below)
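A minimal sketch of the pre-commit side, assuming a standard SVN hook layout; the exact "reviewed by" wording they require is an assumption.

```python
#!/usr/bin/env python
# Hedged sketch of an SVN pre-commit hook that rejects commits whose log
# message lacks a "reviewed by" note. Adjust the regex to your convention.
import re
import subprocess
import sys

repos, txn = sys.argv[1], sys.argv[2]          # SVN passes REPOS and TXN-NAME

# Read the pending commit's log message.
log_msg = subprocess.run(
    ["svnlook", "log", "-t", txn, repos],
    capture_output=True, text=True, check=True,
).stdout

if not re.search(r"reviewed\s+by[:\s]", log_msg, re.IGNORECASE):
    sys.stderr.write("Commit rejected: log message must include 'Reviewed by: <name>'.\n")
    sys.exit(1)                                 # non-zero exit blocks the commit

sys.exit(0)
```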
High communication
- They use chat (Campfire) alongside docs, graphs, logs, etc.
- Pasting Skitch screenshots into Campfire is a frequent way of working
Subsystems
- Many limiting factors in request pipeline
- Oversubscribe mongrel 2:1 vs. cores
- Attack plan per subsystem, e.g.:
  - Bandwidth: bottleneck is the network, vector is HTTP latency, solution is more servers
  - Timeline: bottleneck is the database, vector is update delay, solution is a better algorithm
  - Search: bottleneck is the database, vector is delays, solution is more databases plus code changes
CPUs:
- Switched to Xeons: ~30% gain
- Replaced 2- and 4-core machines with 8-core: ~40% gain
Rails:
- Stop blaming rails
- Analysis showed the real issues: caching and cache invalidation, ActiveRecord generating bad queries, queue latency, memcached/page corruption, replication lag
- Not so much about Rails
Disk is the new Tape
- Social networking workloads are very O(n^y) oriented
- Disk is too slow
- Need lots of RAM
Lots of caching is possible. Moving to the native C libmemcached gem was a big help.
Nick's cache-money ActiveRecord plugin: read-through/write-through caching with memcached!
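cache-money itself is Ruby/ActiveRecord; purely to illustrate the read-through/write-through pattern it applies, here's a conceptual sketch in Python against python-memcached. The function names and load_user_from_db are made up.

```python
# Conceptual sketch of read-through/write-through caching with memcached.
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def load_user_from_db(user_id):
    # Placeholder for the real database query.
    return {"id": user_id, "name": "example"}

def get_user(user_id):
    """Read-through: serve from cache, fall back to the DB and populate."""
    key = "user:%d" % user_id
    user = mc.get(key)
    if user is None:
        user = load_user_from_db(user_id)
        mc.set(key, user)
    return user

def save_user(user):
    """Write-through: update the DB and the cache together, so reads stay warm."""
    # ... write `user` to the database here ...
    mc.set("user:%d" % user["id"], user)
```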
Caching everything is not smart either:
- Cache evictions
- Cold cache after a host failure or new-host spin-up
Cache smarter: get rid of cache-busting behaviors, Varnish with failover, etc.
RDBMS vs. message queues
- Not everything needs ACID
- message queues help
- Most MQs suck at high load
- They wrote Kestrel for this; it looks like memcache (it speaks the memcache protocol)
- Starling was the earlier version
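Since Kestrel speaks the memcache protocol, a plain memcache client can act as producer and consumer. A hedged sketch assuming a Kestrel instance on its default port 22133 and the python-memcached client; the queue name and payload are made up.

```python
# Sketch: enqueue (set) and dequeue (get) against Kestrel via a memcache client.
import json
import memcache

kestrel = memcache.Client(["127.0.0.1:22133"])

# Producer: a SET on a queue name appends an item to that queue.
kestrel.set("timeline_updates", json.dumps({"user_id": 42, "tweet_id": 12345}))

# Consumer: a GET on the same name pops the next item (None when empty).
raw = kestrel.get("timeline_updates")
if raw is not None:
    job = json.loads(raw)
    print("processing", job)
```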
Asynch == Good
- They lean on mongrel heavily (they know it well)
- Keep external service requests out of the request pipeline via daemons that process message queues
- Size worker daemons appropriately, and have them kill themselves off rather than run forever (sketch below)
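A minimal sketch of the "kill themselves off" idea: a worker that processes a bounded number of jobs and then exits so the process supervisor restarts it fresh. The queue client, queue name, and job cap are assumptions.

```python
# Hedged sketch of a self-recycling queue worker.
import json
import time
import memcache

MAX_JOBS = 1000                                # assumption: recycle after this many jobs
queue = memcache.Client(["127.0.0.1:22133"])   # e.g. a Kestrel queue

def handle_job(job):
    # Placeholder for the real work (calling an external service, etc.).
    print("handled", job)

processed = 0
while processed < MAX_JOBS:
    raw = queue.get("timeline_updates")
    if raw is None:
        time.sleep(0.1)          # queue empty; back off briefly
        continue
    handle_job(json.loads(raw))
    processed += 1

# Exiting here "kills ourselves off"; the supervisor (daemontools, monit, etc.)
# restarts a fresh process, so leaks never accumulate in a long-running worker.
```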
DB replication
- Multiple functional read/write masters
- Never read from the master; it slows it down too much
- Watch your slow queries
- Use mkill to kill long-running queries before they kill you (sketch of the idea below)
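A hedged sketch of what a tool like mkill does (not their actual script): walk the processlist and kill queries older than a threshold. Assumes PyMySQL, local credentials, and a 30-second cutoff.

```python
# Kill MySQL queries that have been running longer than MAX_SECONDS.
import pymysql

MAX_SECONDS = 30                               # assumption: cutoff for "long-running"

conn = pymysql.connect(host="127.0.0.1", user="ops", password="secret")
with conn.cursor() as cur:
    cur.execute("SHOW FULL PROCESSLIST")
    for row in cur.fetchall():
        # Columns: Id, User, Host, db, Command, Time, State, Info
        thread_id, user, host, db, command, seconds, state, info = row[:8]
        if command == "Query" and seconds is not None and seconds > MAX_SECONDS:
            print("killing %s (%ss): %s" % (thread_id, seconds, info))
            cur.execute("KILL %d" % thread_id)
conn.close()
```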
Put up a status blog hosted on some other service; transparency stops armchair engineering.