LiveBlog: Fixing Twitter
John Adams, Twitter Ops
Ops
- Small team
- SW perf
- availability is their primary focus
All on managed services with NTT:
- No cloud: latency is too high
- NTT runs the NOC
- Frees them to deal with real computer-science problems
752% growth in 2008; the trend takes off around November 2008 and keeps climbing
Growth pain: fear of what's going to happen
Mantra:
- Find the weakest point (metrics + logging + analysis)
- Take corrective action (process)
- Repeat
Find weak points
- Collect metrics and graphs (individual metrics are irrelevant)
- Logs
- SCIENCE!
- Instrument everything! More info is better
Monitoring
- Keep critical metrics as close to realtime as possible
- Using RRD, Ganglia + gmetric, MRTG
- Mostly on 10s interval, some 5s, some 60s
- Everyone in company has access to dashboard
- “Criticals” view
- Use Google Analytics on the Fail Whale and other error pages
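Not shown in the talk, but to make the Ganglia piece concrete: a minimal sketch of pushing a custom metric every 10 seconds by shelling out to the gmetric CLI. The metric name and the count_failwhales() data source are made up for illustration.

```python
# Hypothetical sketch: push a custom metric into Ganglia every ~10 seconds
# by shelling out to the gmetric CLI. Metric name and data source are made up.
import subprocess
import time

def count_failwhales():
    # Placeholder: in reality this would read a counter from the web tier.
    return 0

while True:
    value = count_failwhales()
    subprocess.run([
        "gmetric",
        "--name", "whales_per_sec",   # metric name shown on the dashboard
        "--value", str(value),
        "--type", "uint32",
        "--units", "whales/sec",
    ], check=True)
    time.sleep(10)                    # ~10s interval, matching the talk
```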
Analyze
- Turn data into info
- Are things better or worse post-deploy?
- Create an environment of capacity planning, not firefighting: no more cowboys in the wild west
Deploys
- Ganglia shows final deploy info for twitter, summize, and search
Whale-watcher
- Simple script, massive win
- A 503 is a whale, a 500 is a robot
- When whales-per-second exceeds the whale threshold: "There's whales!" (see the sketch after this list)
- Darkmode: selectively disable portions of the site, with automatic notification to the product and eng teams so they know
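The Whale-Watcher script itself wasn't shown; here is a hedged guess at its shape: follow the access log, count 503s (whales) and 500s (robots) over a window, and yell when whales-per-second crosses a threshold. The log path and threshold are assumptions.

```python
# Hedged sketch of a Whale-Watcher-style check (not Twitter's actual script).
# "whale" = HTTP 503, "robot" = HTTP 500.
import re
import subprocess
import time

LOG_PATH = "/var/log/nginx/access.log"   # assumption: path varies per site
WHALE_THRESHOLD = 1.0                     # assumption: whales/sec that triggers an alert
WINDOW_SECONDS = 60
STATUS_RE = re.compile(r'" (\d{3}) ')     # status code in combined log format

# Follow the log with `tail -F` and count 503s/500s over one window.
tail = subprocess.Popen(["tail", "-F", "-n", "0", LOG_PATH],
                        stdout=subprocess.PIPE, text=True)
whales = robots = 0
deadline = time.time() + WINDOW_SECONDS
for line in tail.stdout:                  # blocks until a log line arrives
    m = STATUS_RE.search(line)
    if m:
        if m.group(1) == "503":
            whales += 1
        elif m.group(1) == "500":
            robots += 1
    if time.time() >= deadline:
        break

whales_per_sec = whales / WINDOW_SECONDS
if whales_per_sec > WHALE_THRESHOLD:
    print("There's whales! %.2f/sec (robots in window: %d)" % (whales_per_sec, robots))
```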
Config Mgmt
- You need an automated config management system NOW, or you won't scale
- It introduces complexity: multiple admins, unknown interactions
- Peer review solves most of this. They use Review Board, with an SVN pre-commit hook that requires a "reviewed by" note in the commit message and a post-commit hook that notifies people about what changed (sketch below)
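A minimal sketch of the pre-commit side, assuming a standard SVN hook layout; the exact "reviewed by" wording they require is an assumption.

```python
#!/usr/bin/env python
# Hedged sketch of an SVN pre-commit hook that rejects commits whose log
# message lacks a "reviewed by" note. Adjust the regex to your convention.
import re
import subprocess
import sys

repos, txn = sys.argv[1], sys.argv[2]          # SVN passes REPOS and TXN-NAME

# Read the pending commit's log message.
log_msg = subprocess.run(
    ["svnlook", "log", "-t", txn, repos],
    capture_output=True, text=True, check=True,
).stdout

if not re.search(r"reviewed\s+by[:\s]", log_msg, re.IGNORECASE):
    sys.stderr.write("Commit rejected: log message must include 'Reviewed by: <name>'.\n")
    sys.exit(1)                                 # non-zero exit blocks the commit

sys.exit(0)
```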
High communication
- They use chat (Campfire) alongside docs, graphs, logs, etc.
- Pasting Skitch screenshots into Campfire is a frequent way of working
Subsystems
- Many limiting factors in request pipeline
- Oversubscribe mongrel 2:1 vs. cores
- Attack plan per subsystem, e.g.:
  - Bandwidth: bottleneck is the network, vector is HTTP latency, solution is more servers
  - Timeline: bottleneck is the database, vector is update delay, solution is a better algorithm
  - Search: bottleneck is the database, vector is delays, solution is more databases plus code changes
CPUs:
- Switched to Xeons: ~30% gain
- Replaced 2- and 4-core machines with 8-core: ~40% gain
Rails:
- Stop blaming rails
- Analysis showed the real issues: caching and cache invalidation, ActiveRecord generating bad queries, queue latency, memcached/page corruption, replication lag
- Not so much about Rails
Disk is the new Tape
- Social networking workloads are very O(n^y) oriented
- Disk is too slow
- Need lots of RAM
Lots of caching is possible. Moving to the native C libmemcached gem was a big help.
Nick's cache-money ActiveRecord plugin: read-through/write-through caching with memcached!
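cache-money itself is Ruby/ActiveRecord; purely to illustrate the read-through/write-through pattern it applies, here's a conceptual sketch in Python against python-memcached. The function names and load_user_from_db are made up.

```python
# Conceptual sketch of read-through/write-through caching with memcached.
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def load_user_from_db(user_id):
    # Placeholder for the real database query.
    return {"id": user_id, "name": "example"}

def get_user(user_id):
    """Read-through: serve from cache, fall back to the DB and populate."""
    key = "user:%d" % user_id
    user = mc.get(key)
    if user is None:
        user = load_user_from_db(user_id)
        mc.set(key, user)
    return user

def save_user(user):
    """Write-through: update the DB and the cache together, so reads stay warm."""
    # ... write `user` to the database here ...
    mc.set("user:%d" % user["id"], user)
```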
Caching everything is not smart either:
- Cache evictions
- Cold cache after a host failure or new-host spin-up
Cache smarter: get rid of cache-busting behaviors, Varnish with failover, etc.
RDBMS vs. message queues
- Not everything needs ACID
- message queues help
- Most MQs suck at high load
- They wrote Kestrel for this; it looks like memcache (it speaks the memcache protocol)
- Starling was the earlier version
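Since Kestrel speaks the memcache protocol, a plain memcache client can act as producer and consumer. A hedged sketch assuming a Kestrel instance on its default port 22133 and the python-memcached client; the queue name and payload are made up.

```python
# Sketch: enqueue (set) and dequeue (get) against Kestrel via a memcache client.
import json
import memcache

kestrel = memcache.Client(["127.0.0.1:22133"])

# Producer: a SET on a queue name appends an item to that queue.
kestrel.set("timeline_updates", json.dumps({"user_id": 42, "tweet_id": 12345}))

# Consumer: a GET on the same name pops the next item (None when empty).
raw = kestrel.get("timeline_updates")
if raw is not None:
    job = json.loads(raw)
    print("processing", job)
```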
Asynch == Good
- They lean on mongrel heavily (they know it well)
- Keep external service requests out of the request pipeline via daemons that process message queues
- Size worker daemons appropriately, and have them kill themselves off rather than run forever (sketch below)
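A minimal sketch of the "kill themselves off" idea: a worker that processes a bounded number of jobs and then exits so the process supervisor restarts it fresh. The queue client, queue name, and job cap are assumptions.

```python
# Hedged sketch of a self-recycling queue worker.
import json
import time
import memcache

MAX_JOBS = 1000                                # assumption: recycle after this many jobs
queue = memcache.Client(["127.0.0.1:22133"])   # e.g. a Kestrel queue

def handle_job(job):
    # Placeholder for the real work (calling an external service, etc.).
    print("handled", job)

processed = 0
while processed < MAX_JOBS:
    raw = queue.get("timeline_updates")
    if raw is None:
        time.sleep(0.1)          # queue empty; back off briefly
        continue
    handle_job(json.loads(raw))
    processed += 1

# Exiting here "kills ourselves off"; the supervisor (daemontools, monit, etc.)
# restarts a fresh process, so leaks never accumulate in a long-running worker.
```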
DB replication
- Multiple functional read/write masters
- Never read from the master; it slows it down too much
- Watch your slow queries
- Use mkill to kill long-running queries before they kill you (sketch of the idea below)
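A hedged sketch of what a tool like mkill does (not their actual script): walk the processlist and kill queries older than a threshold. Assumes PyMySQL, local credentials, and a 30-second cutoff.

```python
# Kill MySQL queries that have been running longer than MAX_SECONDS.
import pymysql

MAX_SECONDS = 30                               # assumption: cutoff for "long-running"

conn = pymysql.connect(host="127.0.0.1", user="ops", password="secret")
with conn.cursor() as cur:
    cur.execute("SHOW FULL PROCESSLIST")
    for row in cur.fetchall():
        # Columns: Id, User, Host, db, Command, Time, State, Info
        thread_id, user, host, db, command, seconds, state, info = row[:8]
        if command == "Query" and seconds is not None and seconds > MAX_SECONDS:
            print("killing %s (%ss): %s" % (thread_id, seconds, info))
            cur.execute("KILL %d" % thread_id)
conn.close()
```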
Put up a status blog hosted on some other service; transparency stops armchair engineering.