LiveBlog: 10+ Deploys A Day: Dev and Ops at Flickr
Update 1: Slides are now on SlideShare.
Update 2: Video now available on blip.tv
John Allspaw (Ops) & Paul Hammond (Eng), Flickr
“Actually work together and aren’t huge assholes to each other.”
(omitted: photo stats … that’s a lot of kittens)
Dev vs. Ops
- It’s not my {machines,code} it’s your {code,machines}
Spock vs. Scotty analogy
Ops as grumpy old man, says no all the time, cycle of “no all the time because no one tells them anything because they say no all the time”
Conventional wisdom: dev's job is to add features, ops' job is to keep the site stable and fast
Flickr: Ops job is to enable the business (Dev’s, too)
Business requires change, otherwise you’ll be overtaken by the new guy … but change is the root cause of most outages.
Discourage change vs. Allow it to happen as often as it needs to (via tools and culture)
Lowering the risk of change via tools and culture.
- Increase confidence in change goodness
- Increase ability to react to those changes
You need {ops,devs} who think like {dev,ops}
- Role and Config Mgmt
- Shared Version Control: everyone looks in the same place for everything
- Code and config in same place
- Everyone has access—transparency
- Everyone knows how to use it
3a. One-step build
- Everything you need to do to convert svn co’d code into what goes to the site—one command
- They have “Perform Staging” button at bottom of a page with stats on latest commit
3b. One-step deploy
- Top of page is deploy log with notes: who, when, what (link to changes)
- Bottom has “I’m feeling lucky!” button to deploy
- Continuous deployment
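The deploy log described above (who, when, what) can be sketched roughly as follows. This is a minimal illustration, not Flickr's tooling: `DeployRecord`, `deploy`, and the in-memory `log` list are hypothetical names, and the build and push steps are passed in as plain callables.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class DeployRecord:
    who: str        # person who pressed the button
    when: datetime  # time the deploy finished (UTC)
    what: str       # link or reference to the changes being shipped


def deploy(who: str, what: str, steps: List[Callable[[], None]],
           log: List[DeployRecord]) -> DeployRecord:
    """Run every step in order, then append a record to the log.

    If any step raises, nothing is logged and the exception propagates.
    """
    for step in steps:
        step()
    record = DeployRecord(who=who, when=datetime.now(timezone.utc), what=what)
    log.append(record)
    return record
```

The point of the single entry point is that "deploy" is one call (or one button), and the log entry is a side effect of deploying rather than something people have to remember to write.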
You can’t pretend to deploy 10 times a day if you go down 10 times a day. That’s not being agile, that’s being retarded.
They use Hudson to generate packages which can be deployed by ops
Small frequent changes make it easier to see what went wrong and recover when needed
- [missed that tag]
- Branching is all about managing bugfixes
- Always ship trunk
- Branch in code instead: use conditionals to block out pre-release features and configure them off or invisible; this provides an operational lever for adjustment
- Makes these available for private beta in production, on production hardware, etc.
- Allows dark launches, which allows you to size appropriately, fix major oversights, take the fear out of major new launches
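A rough sketch of "branching in code" with conditionals, assuming a hypothetical flag table and helper names (`FLAGS`, `is_enabled`, `render_upload_page`) that are not from the talk:

```python
from typing import Dict

# Hypothetical flag table; in practice this would live in config
# kept in the same repository as the code.
FLAGS: Dict[str, dict] = {
    "new_uploader": {"enabled": False, "beta_users": {"alice"}},
    "photo_search": {"enabled": True, "beta_users": set()},
}


def is_enabled(flag: str, user: str, flags: Dict[str, dict] = FLAGS) -> bool:
    """Return True if the feature is on for everyone or the user is in its beta."""
    entry = flags.get(flag)
    if entry is None:
        return False  # unknown flags default to off
    return entry["enabled"] or user in entry["beta_users"]


def render_upload_page(user: str) -> str:
    # The pre-release code path ships in trunk but stays hidden behind the flag.
    if is_enabled("new_uploader", user):
        return "new uploader"
    return "old uploader"
```

With this shape, a private beta is a change to `beta_users` and a full launch is flipping `enabled`; neither needs a separate branch or a new deploy of code.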
They have a couple hundred “free contingency switches” to turn things off/throttle things down. Gives broad operational control to minimize effects on the site.
Tend to fail forward using these and fix the problem.
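One way a switch that can turn things off or throttle them down might look, as a sketch under assumptions: the `Throttle` class and its percent-based bucketing are illustrative choices, not a description of Flickr's switches.

```python
import zlib


class Throttle:
    """A switch that can be fully on (100), fully off (0), or partially throttled."""

    def __init__(self, percent: int = 100) -> None:
        self.set(percent)

    def set(self, percent: int) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.percent = percent

    def allows(self, key: str) -> bool:
        """Deterministically admit the keys whose bucket falls below `percent`."""
        bucket = zlib.crc32(key.encode("utf-8")) % 100
        return bucket < self.percent
```

Hashing the key (for example a user or request id) keeps the decision stable: a key admitted at a lower percentage stays admitted when the percentage is raised, so dialing a feature back up does not shuffle who sees it.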
- Shared metrics
- You can see mine, I can see yours
- Use ganglia as console
- Devs know where dashboard is, and watch as obsessively as ops
- Includes app-level metrics (which exposes them to Ops)
- (helps drive accountability in both directions in the org—both can see and feel ownership)
- This begins to create opportunity to gracefully collaborate to back off an oversubscribed resource, degrade, or throttle as needed
- Show last site deploy info on every page/graph, so you can correlate a change in the graph with a deploy
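Correlating a graph change with a deploy can be sketched as comparing a metric just before and just after the deploy timestamp. The function name, the sample format, and the five-minute default window below are assumptions for illustration only.

```python
from statistics import mean
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (unix timestamp, metric value)


def change_around_deploy(samples: List[Sample], deploy_ts: float,
                         window: float = 300.0) -> Optional[float]:
    """Return mean(after) - mean(before) within `window` seconds of a deploy.

    Returns None when either side of the deploy has no samples.
    """
    before = [v for t, v in samples if deploy_ts - window <= t < deploy_ts]
    after = [v for t, v in samples if deploy_ts <= t <= deploy_ts + window]
    if not before or not after:
        return None
    return mean(after) - mean(before)
```

A dashboard that draws a vertical line at each deploy does the same comparison visually; the sketch just makes the before/after idea explicit.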
- IRC and IM bots
- Heavy IRC users for ongoing dialog between dev/ops, remote/local
- Last.fm wrote a tool to inject events into IRC (monitoring, events, deployments, builds)
- Log it all and put it in a search engine
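Injecting events into a chat channel can be as simple as sending one line of text to a relay process. The sketch below assumes a hypothetical relay that accepts newline-terminated lines over TCP and forwards them to a channel; the host, port, and message format are placeholders, not the interface of any particular tool.

```python
import socket


def format_event(kind: str, who: str, detail: str) -> str:
    """Build a one-line message; runs of whitespace (including newlines) become single spaces."""
    clean = " ".join(detail.split())
    return f"[{kind}] {who}: {clean}"


def send_event(line: str, host: str, port: int, timeout: float = 2.0) -> None:
    """Send one newline-terminated line to the (assumed) relay."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(line.encode("utf-8") + b"\n")
```

A deploy script, build server, or monitoring check would call `send_event(format_event("deploy", "alice", "r1234 to production"), host, port)` so that the same channel devs and ops already talk in also carries the event stream, and the channel log doubles as a searchable timeline.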
Culture
All the tools in the world won’t help if you have a contentious culture
- Respect
- No stereotypes: not all devs are lazy/cowboys, not all ops are obstructive
- respect their expertise and opinions and responsibilities (they influence priorities)
- Don’t just say “no”—it’s like saying “I don’t care about your problem/perspective”
- Best solutions are collaborative. Memcache is an excellent example; written to solve the problem of DB overheat, which impacted both
- Don’t hide things: share your solution even if you think they’ll say no; you deny their expertise and input
Talk about the impact of your code push
- What metrics change
- what are the risks
- what are the symptoms of something going wrong
- what are the contingencies (rollback)
- Trust
- When you walk up with all the above on hand, you demonstrate that you care enough about them and the business
- “I don’t want to tell X …” == you’re a cowboy, and “cowboys are losers”
- Have shared playbooks and contingency plans so all understand.
- Provide as many knobs and levers as you can so Ops can tweak to match the env
- Ops: be transparent, give devs insight and access to the systems. Playing telephone around shell commands is dumb.
- It’s hard to help if you can’t directly see
- Have a healthy attitude around failure—it’s going to happen.
- Think about how you’ll respond more than you think about how you’ll prevent it
- Would you rather be treated by a GP who deals with heart attacks infrequently or an EMT who handles them weekly?
- Fire drills: when ops and sr engineers are fixing a problem, have others diagnose live in parallel (but make no changes!)
- Avoiding blame
- they have a rule of no-finger-pointing; it doesn’t need enforcement, folks step up
- fixing blame wastes a ton of time, why not skip it? Feel guilty afterwards if you must.
- They’ve got a bit of a potlatch culture as people try to assert responsibility in order to fix things.
- Remember that when your code breaks, someone’s going to have to wake up to fix it. Own it and apologize, at least. Otherwise, you’re back to not respecting each other (“Screw you … aren’t you getting paid to do that?”)
Ops should provide constructive, continual feedback on how it’s going. Point out interesting things before they’re critical