LiveBlog: 10+ Deploys A Day: Dev and Ops at Flickr

Update 2: Video now available on blip.tv

John Allspaw (Ops) & Paul Hammond (Eng), Twitter

“Actually work together and aren’t huge assholes to each other.”

(omitted: photo stats … that’s a lot of kittens)

Dev vs. Ops

It’s not my {machines,code} it’s your {code,machines}

Spock v. Scotty analogy

Ops as grumpy old man, says no all the time, cycle of “no all the time because no one tells them anything because they say no all the time”

CW: dev job to add features, ops job to keep site stable and fast

Flickr: Ops job is to enable the business (Dev’s, too)

Business requires change, otherwise you’ll be overtaken by the new guy … but change is the root cause of most outages.

Discourage change vs. Allow it to happen as often as it needs to (via tools and culture)

Lowering the risk of change via tools and culture.

Increase confidence in change goodness
Increase ability to react to those changes

You need {ops,devs} who think like {dev,ops}

Role and Config Mgmt
Shared Version Control: everyone looks in the same place for everything

Code and config in same place
Everyone has access—transparency
Everyone knows how to use it

3a. One-step build

Everything you need to do to convert svn co’d code into what goes to the site—one command
They have “Perform Staging” button at bottom of a page with stats on latest commit

3b. One-step deploy

Top of page is deploy log with notes: who, when, what (link to changes)
Bottom has “I’m feeling lucky!” button to deploy
Continuous deployment
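A minimal sketch of the kind of record a deploy log like that could hold (who, when, what, link to changes). The field names, the `DeployEntry` class, and the example URL are assumptions for illustration, not Flickr's actual tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DeployEntry:
    who: str          # person who pushed the button
    when: datetime    # time of the deploy (UTC assumed here)
    what: str         # short human-readable note
    changes_url: str  # hypothetical link to the diff being deployed


def record(log, entry):
    """Put the newest deploy at the top of the log, as on the page described."""
    log.insert(0, entry)
    return log


def format_entry(entry):
    """Render one log line: when, who, what, and the link to the changes."""
    return (f"{entry.when:%Y-%m-%d %H:%M} UTC  {entry.who}  "
            f"{entry.what}  ({entry.changes_url})")


if __name__ == "__main__":
    log = []
    record(log, DeployEntry("alice", datetime(2009, 6, 23, 14, 5, tzinfo=timezone.utc),
                            "fix upload retry", "https://example.invalid/changes/123"))
    print(format_entry(log[0]))
```

Keeping the link to the changes in each entry is what makes "what went out?" answerable in one click.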

You can’t pretend to deploy 10 times a day if you go down 10 times a day. That’s not being agile, that’s being retarded.

They use Hudson to generate packages which can be deployed by ops

Small frequent changes make it easier to see what went wrong and recover when needed

[missed that tag]

Branching is all about managing bugfixes
Always ship trunk
Branch in code instead: use conditionals to block out pre-release features and configure off/invis—provides an operational lever for adjustment
Makes these open for private beta in production on production hdwe, etc.
Allows dark launches, which allows you to size appropriately, fix major oversights, take the fear out of major new launches
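A rough sketch of "branch in code": conditionals around pre-release features, a private beta list, and a dark launch that exercises a new code path without showing its output. The flag names, the `FLAGS` layout, and the helper functions are all hypothetical; the talk did not show this code.

```python
# Hypothetical flag configuration; in practice this would live in config
# that can be changed without a code push.
FLAGS = {
    "new_photo_page": {"enabled": False, "beta_users": {"alice"}},
    "dark_search_backend": {"enabled": False, "dark": True},
}


def is_visible(flag, user):
    """Feature is shown if it is on for everyone or the user is in the beta."""
    cfg = FLAGS.get(flag, {})
    return cfg.get("enabled", False) or user in cfg.get("beta_users", set())


def is_dark(flag):
    """Dark features run in production but their results are never shown."""
    return FLAGS.get(flag, {}).get("dark", False)


def render_photo_page(user):
    # Both code paths ship from trunk; the flag picks one at request time.
    if is_visible("new_photo_page", user):
        return "new page"
    return "old page"


def search(query, old_backend, new_backend):
    result = old_backend(query)
    if is_dark("dark_search_backend"):
        try:
            new_backend(query)  # generate real load on the new path, discard output
        except Exception:
            pass  # a dark launch must never break the user-facing path
    return result
```

Flipping `enabled` to `True` is then the launch; flipping `dark` to `False` is the lever for shedding the extra load.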

They have a couple hundred “free contingency switches” to turn things off/throttle things down. Gives broad operational control to minimize effects on the site.

Tend to fail forward using these and fix the problem.
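A sketch of what a contingency switch might look like: one hard off switch and one throttle that serves only a fraction of requests. The switch names and the `SWITCHES` dictionary are assumptions made for illustration.

```python
import random

# Hypothetical operational switches; ops would flip these at runtime.
SWITCHES = {
    "disable_related_photos": False,  # hard off switch
    "search_sample_rate": 1.0,        # 1.0 = serve everyone, 0.0 = serve no one
}


def allowed(rate_switch, rng=random.random):
    """Throttle: let through roughly SWITCHES[rate_switch] of all requests."""
    return rng() < SWITCHES[rate_switch]


def related_photos(photo_id, fetch):
    """Degrade gracefully: an empty list instead of an expensive lookup."""
    if SWITCHES["disable_related_photos"]:
        return []
    return fetch(photo_id)


def handle_search(query, backend):
    if not allowed("search_sample_rate"):
        return "search is temporarily limited"
    return backend(query)
```

Because `random.random()` returns a value in [0, 1), a rate of 1.0 always passes and a rate of 0.0 never does, so the same knob covers "on", "off", and everything in between.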

Shared metrics

You can see mine, I can see yours
Use ganglia as console
Devs know where dashboard is, and watch as obsessively as ops
Includes app-level metrics (which exposes them to Ops)
(helps drive accountability in both directions in the org—both can see and feel ownership)
This begins to create opportunity to gracefully collaborate to back off an oversub’d resource/degrade/throttle as needed
Show last site deploy info on every page/graph; you can correlate a change in the graph with a deploy
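One way to line up deploys with a graph, as a sketch: for each metric sample, look up the most recent deploy at or before it. The function names and the use of plain numeric timestamps are assumptions; this is not tied to ganglia's API.

```python
from bisect import bisect_right


def last_deploy_before(deploy_times, t):
    """Return the most recent deploy time <= t, or None if there is none.

    deploy_times must be sorted in ascending order.
    """
    i = bisect_right(deploy_times, t)
    return deploy_times[i - 1] if i else None


def annotate(points, deploy_times):
    """Attach the last deploy time to each (timestamp, value) sample."""
    return [(t, v, last_deploy_before(deploy_times, t)) for t, v in points]


if __name__ == "__main__":
    deploys = [100, 200, 300]
    samples = [(90, 0.2), (210, 0.9), (305, 0.3)]
    for row in annotate(samples, deploys):
        print(row)
```

When a value jumps, the third column says which deploy to look at first.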

IRC and IM bots

Heavy IRC users for ongoing dialog between dev/ops, remote/local
Last.fm wrote a tool to inject events into IRC (monitoring, events, deployments, builds)
Log it all and put it in a search engine
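A small sketch of the idea, not the Last.fm tool itself: turn an event into a raw IRC `PRIVMSG` line and keep a copy in a log that can be searched. The function names, the event sources, and the in-memory list standing in for a search engine are all assumptions.

```python
def irc_privmsg(channel, source, text):
    """Build the raw IRC line announcing an event in a channel.

    IRC messages are single lines ending in CRLF, so any whitespace in the
    event text (including newlines) is collapsed to single spaces.
    """
    clean = " ".join(text.split())
    return f"PRIVMSG {channel} :[{source}] {clean}\r\n".encode("utf-8")


def log_event(log, when, source, text):
    """Keep every event; a real setup would feed these to a search engine."""
    log.append({"when": when, "source": source, "text": text})


def search(log, term):
    """Case-insensitive substring search over the logged event text."""
    term = term.lower()
    return [e for e in log if term in e["text"].lower()]


if __name__ == "__main__":
    events = []
    log_event(events, "2009-06-23T14:05Z", "deploy", "alice deployed r123")
    log_event(events, "2009-06-23T14:07Z", "monitoring", "db3 load high")
    print(irc_privmsg("#ops", "deploy", "alice deployed r123"))
    print(search(events, "DB3"))
```

The bytes from `irc_privmsg` would be written to an already-registered IRC connection; connection handling and the protocol's 512-byte line limit are left out of this sketch.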

Culture

All the tools in the world won’t help if you have a contentious culture

Respect

No stereotypes: not all devs are lazy/cowboys, not all ops are obstructive
Respect their expertise, opinions, and responsibilities (they influence priorities)
Don’t just say “no”—it’s like saying “I don’t care about your problem/perspective”
Best solutions are collaborative. Memcache is an excellent example; written to solve the problem of DB overheat, which impacted both
Don’t hide things: share your solution even if you think they’ll say no; otherwise you deny yourself their expertise and input

Talk about the impact of your code push

What metrics will change
What are the risks
What are the symptoms of something going wrong
What are the contingencies (rollback)

Trust

When you walk up with all the above on hand, you demonstrate that you care enough about them and the business
“I don’t want to tell X …” == you’re a cowboy, and “cowboys are losers”
Have shared playbooks and contingency plans so all understand.
Provide as many knobs and levers as you can so Ops can tweak to match the env
Ops: be transparent, give devs insight and access to the systems. Playing telephone around shell commands is dumb.
It’s hard to help if you can’t directly see

Have a healthy attitude around failure—it’s going to happen.

Think about how you’ll respond more than you think about how you’ll prevent it
Would you rather be treated by a GP who deals with heart attacks infrequently or an EMT who handles them weekly?
Fire drills: when ops and sr engineers are fixing a problem, have others diagnose live in parallel (but make no changes!)

Avoiding blame

They have a no-finger-pointing rule; it doesn’t need enforcement because folks step up
Fixing blame wastes a ton of time, so why not skip it? Feel guilty afterwards if you must.
They’ve got a bit of a potlatch culture as people try to assert responsibility in order to fix things.
Remember that when your code breaks, someone’s going to have to wake up to fix it. Own it and apologize, at least. Otherwise, you’re back to not respecting each other (“Screw you … aren’t you getting paid to do that?”)

Ops should provide constructive, continual feedback on how it’s going. Point out interesting things before they’re critical


Published

23 June 2009
