John Adams, Twitter Ops

Ops

  • Small team
  • Software performance
  • Availability is their primary focus

  • All on managed services with NTT
  • No cloud: latency is too high
  • NTT runs the NOC
  • This frees the team to work on real computer-science problems

752% growth in 2008; the trend takes off around November 2008 and keeps climbing

Growth pain: fear of what’s going to happen

Mantra:

  • Find the weakest point (metrics + logging + analysis)
  • Take corrective action (process)
  • Repeat

Find weak points

  • Collect metrics and graphs (individual metrics are irrelevant)
  • Logs
  • SCIENCE!
  • Instrument everything! More info is better

Monitoring

  • Keep critical metrics as close to realtime as possible
  • Using RRD, Ganglia + gMetrics, and MRTG (see the gmetric sketch after this list)
  • Mostly on 10s interval, some 5s, some 60s
  • Everyone in company has access to dashboard
  • “Criticals” view
  • Use Google Analytics for the fail whale and other error pages
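
A minimal sketch of feeding a custom metric into Ganglia on a 10-second interval using the standard gmetric CLI; the metric name, units, and sampling function are illustrative assumptions, not Twitter's actual setup.

    #!/usr/bin/env python
    # Sketch: push a custom application metric into Ganglia via the
    # gmetric CLI every 10 seconds.
    import subprocess
    import time

    def sample_requests_per_sec():
        # Placeholder: read the real value from your app's stats endpoint.
        return 42.0

    while True:
        subprocess.call([
            "gmetric",
            "--name", "requests_per_sec",
            "--value", str(sample_requests_per_sec()),
            "--type", "float",
            "--units", "req/s",
        ])
        time.sleep(10)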

Analyze

  • Turn data into info
  • Are things better or worse after a deploy? (comparison sketch below)
  • Create an environment of capacity planning, not firefighting: no more cowboys in the wild west
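
To make "better or worse after a deploy" concrete, here is a hedged sketch comparing the HTTP 5xx error rate in a window before and after a deploy timestamp; the log path and format are assumptions.

    # Sketch: compare the 5xx error rate before vs. after a deploy time.
    # Assumes a log with "<unix_ts> <status>" per line; adapt to your format.
    import sys

    def error_rate(lines, start, end):
        total = errors = 0
        for line in lines:
            ts, status = line.split()[:2]
            if start <= float(ts) < end:
                total += 1
                if status.startswith("5"):
                    errors += 1
        return errors / total if total else 0.0

    deploy_ts = float(sys.argv[1])      # unix time of the deploy
    window = 15 * 60                    # 15-minute window on each side
    lines = open("requests.log").readlines()

    before = error_rate(lines, deploy_ts - window, deploy_ts)
    after = error_rate(lines, deploy_ts, deploy_ts + window)
    print("error rate before=%.4f after=%.4f" % (before, after))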

Deploys

  • Ganglia shows final deploy info for Twitter, Summize, and search
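
One simple way to make deploys visible next to the graphs is to have the deploy script record the deployed revision as a Ganglia string metric; this is a sketch of that idea (the metric name is made up), not necessarily how Twitter annotates its graphs.

    # Sketch: record the just-deployed revision in Ganglia so it shows up
    # alongside the performance graphs.
    import subprocess
    import sys

    revision = sys.argv[1]   # e.g. the revision or tag just deployed
    subprocess.check_call([
        "gmetric",
        "--name", "last_deploy",
        "--value", revision,
        "--type", "string",
    ])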

Whale-watcher

  • simple script with massive win
  • 503 is a whale, 500 is a robot
  • When whales per second exceeds the whale threshold: “There’s whales!” (sketch below)
  • Darkmode: selectively disable portions of the site, with automatic notification to the product and engineering teams
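
The notes describe whale-watcher as a simple script: count 503s per second in the access log and alert when the rate crosses a threshold. A hedged Python sketch of that idea follows; the log path, field layout, and threshold are assumptions, not the real script.

    # Sketch of a whale-watcher: count 503s ("whales") per second from a
    # tailed access log and shout when the rate crosses a threshold.
    import subprocess
    import time

    WHALE_THRESHOLD = 10   # whales per second; made-up number
    window_start = int(time.time())
    whales = 0

    tail = subprocess.Popen(["tail", "-F", "/var/log/access.log"],
                            stdout=subprocess.PIPE, universal_newlines=True)
    for line in tail.stdout:
        now = int(time.time())
        if now != window_start:
            if whales > WHALE_THRESHOLD:
                print("There's whales! %d in the last second" % whales)
            window_start, whales = now, 0
        if "503" in line.split():   # a whale; a 500 would be the robot page
            whales += 1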

Config Mgmt

  • You need an automated configuration management system NOW, or you won’t scale
  • It introduces complexity: multiple admins and unknown interactions
  • Peer review solves most of this; they use Review Board with an svn pre-commit hook that requires a “reviewed by” note in the commit message, and a post-commit hook that notifies people about what changed
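
A sketch of the pre-commit half of that workflow: an svn hook that rejects any commit whose log message lacks a “reviewed by” note, using svnlook. The exact wording required and the notification side are assumptions.

    #!/usr/bin/env python
    # Sketch of an svn pre-commit hook (hooks/pre-commit) that rejects
    # commits without a "reviewed by" note in the log message.
    import subprocess
    import sys

    repo, txn = sys.argv[1], sys.argv[2]   # svn passes REPOS and TXN
    log = subprocess.check_output(["svnlook", "log", "-t", txn, repo],
                                  universal_newlines=True)

    if "reviewed by" not in log.lower():
        sys.stderr.write("Commit rejected: add a 'Reviewed by: <name>' "
                         "note to the log message.\n")
        sys.exit(1)

    sys.exit(0)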

High communication

  • They use chat (Campfire) alongside docs, graphs, logs, etc.
  • Pasting Skitch screenshots into Campfire is a frequent way of working

Subsystems

  • Many limiting factors in request pipeline
  • Oversubscribe Mongrel 2:1 vs. cores (sizing sketch after this list)
  • Attack plan per system (e.g. bandwidth: bottleneck is the network, vector is HTTP latency, solution is more servers; timeline: the database, update delay, a better algorithm; search: the database, delays, more databases plus code changes; etc.)
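
The 2:1 Mongrel oversubscription is just a sizing rule; a trivial sketch of turning it into numbers (the per-process memory figure is a made-up assumption used to sanity-check RAM):

    # Sketch: size a Mongrel pool at 2 processes per core and sanity-check RAM.
    cores = 8
    mongrels = cores * 2                  # the 2:1 oversubscription rule
    ram_per_mongrel_mb = 120              # made-up figure; measure your own
    print("run %d mongrels, roughly %d MB of RSS total"
          % (mongrels, mongrels * ram_per_mongrel_mb))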

CPUs:

  • Switched to Xeon: +30% gain
  • Replaced 2- and 4-core machines with 8-core: +40%

Rails:

  • Stop blaming rails
  • Analysis: caching and cache invalidation, ActiveRecord generating bad queries, queue latency, memcached/page corruption, replication lag
  • Not so much about Rails

Disk is the new Tape

  • Social networks are very O(n^y) oriented
  • Disk is too slow
  • Need lots of RAM

Lots of caching is possible. Moving to the native C libmemcached gem was a big help.

Nick’s CacheMoney ActiveRecord plugin: read-through/write-through caching with memcached!
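
The core of read-through/write-through caching: reads fill the cache on a miss, writes update the database and the cache together. A hedged sketch of the pattern in Python with pylibmc (a libmemcached-backed client); the model, keys, and database calls are made up, and this is not CacheMoney’s actual code.

    # Sketch of read-through/write-through caching in front of a database.
    import pylibmc

    mc = pylibmc.Client(["127.0.0.1:11211"])

    def db_select_user(user_id):
        # Placeholder for the real SQL SELECT.
        return {"id": user_id, "name": "example"}

    def db_update_user(user):
        # Placeholder for the real SQL UPDATE.
        pass

    def get_user(user_id):                    # read-through
        key = "user:%d" % user_id
        user = mc.get(key)
        if user is None:
            user = db_select_user(user_id)
            mc.set(key, user)
        return user

    def save_user(user):                      # write-through
        db_update_user(user)
        mc.set("user:%d" % user["id"], user)  # keep cache in step with the DB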

Caching everything isn’t smart, either

  • Cache evictions
  • Cold cache after host failure/new host spinup
  • Cache smarter: get rid of cache-busting behaviors, Varnish with failover, etc.

RDBMS vs. message queues

  • Not everything needs ACID
  • message queues help
  • Most MQs suck at high load
  • They wrote Kestrel for this; it looks like memcache (it speaks the memcached protocol; sketch after this list)
  • Starling was the earlier version
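
“Looks like memcache” means Kestrel speaks the memcached text protocol: a set enqueues onto a queue and a get dequeues from it. A raw-socket sketch against a local Kestrel on its default port (the queue name and payload are made up):

    # Sketch: enqueue and dequeue via the memcached text protocol.
    import socket

    def kestrel(command):
        s = socket.create_connection(("127.0.0.1", 22133))  # Kestrel's default port
        s.sendall(command)
        reply = s.recv(4096)
        s.close()
        return reply

    payload = b"deliver_tweet:12345"
    # set <queue> <flags> <expiry> <bytes> ... -> enqueue one item
    print(kestrel(b"set jobs 0 0 %d\r\n%s\r\n" % (len(payload), payload)))
    # get <queue> -> dequeue one item (empty response if the queue is empty)
    print(kestrel(b"get jobs\r\n"))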

Asynch == Good

  • They lean on Mongrel heavily (they know it well)
  • Keep external service requests out of the request pipeline via daemons that process message queues
  • Size worker daemons appropriately, and have them kill themselves off rather than run forever (sketch after this list)
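
A hedged sketch of the “kill themselves off” pattern: a worker that drains jobs from a queue, handles a bounded number, and exits so a supervisor respawns a fresh process. Queue access and job handling are placeholders.

    # Sketch: a queue worker that recycles itself instead of running forever.
    import sys
    import time

    MAX_JOBS = 500   # made-up recycle threshold

    def dequeue_job():
        # Placeholder: pull the next job off the message queue (e.g. Kestrel).
        time.sleep(0.1)
        return {"kind": "send_email"}

    def handle(job):
        # Placeholder: call the external service outside the web request path.
        pass

    for _ in range(MAX_JOBS):
        job = dequeue_job()
        if job is not None:
            handle(job)

    sys.exit(0)   # die on purpose; the process supervisor starts a replacement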

DB replication

  • Multiple functional read/write masters
  • never read from the master—slows it down too much
  • watch your slow queries
  • use mkill to kill long-running queries before they kill you.
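
In the spirit of mkill, a hedged sketch that scans MySQL’s processlist and kills SELECTs running past a threshold; connection details, the threshold, and the decision rule are assumptions.

    # Sketch: kill long-running SELECTs before they drag the database down.
    import pymysql

    MAX_SECONDS = 30   # made-up threshold

    conn = pymysql.connect(host="127.0.0.1", user="ops", password="secret")
    cur = conn.cursor()
    cur.execute("SHOW FULL PROCESSLIST")
    for row in cur.fetchall():
        pid, command, seconds, query = row[0], row[4], row[5], row[7]
        if command == "Query" and seconds > MAX_SECONDS \
                and query and query.lstrip().upper().startswith("SELECT"):
            print("killing %d: %s" % (pid, query[:80]))
            cur.execute("KILL %d" % pid)
    conn.close()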

Put up a status blog on some other service—transparency stops armchair engineering


