Story by Yahoo!, picture by Wordle.net:
2009.06.25 | Permalink | Comments (1) | TrackBack (0)
Update 1: Slides are now on SlideShare.
Update 2: Video now available on blip.tv
John Allspaw (Ops) & Paul Hammond (Eng), Flickr
“Actually work together and aren’t huge assholes to each other.”
(omitted: photo stats … that’s a lot of kittens)
Dev vs. Ops
Spock v. Scotty analogy
Ops as grumpy old man, says no all the time, cycle of “no all the time because no one tells them anything because they say no all the time”
CW: dev job to add features, ops job to keep site stable and fast
Flickr: Ops job is to enable the business (Dev’s, too)
Business requires change, otherwise you’ll be overtaken by the new guy … but change is the root cause of most outages.
Discourage change vs. Allow it to happen as often as it needs to (via tools and culture)
Lowering the risk of change via tools and culture.
You need {ops,devs} who think like {dev,ops}
1. Role and Config Mgmt
2. Shared Version Control: everyone looks in the same place for everything
3a. One-step build
3b. One-step deploy
You can’t pretend to deploy 10 times a day if you go down 10 times a day. That’s not being agile, that’s being retarded.
They use Hudson to generate packages which can be deployed by ops
Small frequent changes make it easier to see what went wrong and recover when needed
4. [missed that tag]
They have a couple hundred “free contingency switches” to turn things off/throttle things down. Gives broad operational control to minimize effects on the site.
Tend to fail forward using these and fix the problem.
5. Shared metrics
6. IRC and IM bots
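The contingency switches under item 4 can be pictured roughly like this. A minimal sketch, not Flickr's implementation: the `Switchboard` class, the `related_photos` feature name, and the idea of storing a pass-through fraction per switch are all assumptions made for illustration. Each switch can be fully on, fully off, or throttled to a fraction of requests.

```python
import random


class Switchboard:
    """Hypothetical in-memory registry of contingency switches."""

    def __init__(self):
        # name -> fraction of requests allowed through (0.0 = off, 1.0 = on)
        self._levels = {}

    def set_level(self, name, fraction):
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("fraction must be between 0.0 and 1.0")
        self._levels[name] = fraction

    def allows(self, name, rng=random.random):
        # Unknown switches default to fully on, so adding a switch in code
        # never disables a feature by accident.
        level = self._levels.get(name, 1.0)
        return rng() < level


switches = Switchboard()


def render_photo_page(photo_id):
    # "related_photos" is a made-up feature name for illustration.
    parts = [f"photo {photo_id}"]
    if switches.allows("related_photos"):
        parts.append("related photos")
    return " + ".join(parts)
```

With something like this, ops could call `switches.set_level("related_photos", 0.25)` to shed most of an expensive feature's load while the rest of the page keeps working.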
Culture
All the tools in the world won’t help if you have a contentious culture
1. Respect
Talk about the impact of your code push
2. Trust
3. Have a healthy attitude around failure—it’s going to happen.
4. Avoiding blame
Ops should provide constructive, continual feedback on how it’s going. Point out interesting things before they’re critical
2009.06.23 | Permalink | Comments (3) | TrackBack (0)
A short pre-lunch session to absorb for a few moments:
John Adams (Twitter) Q&A
Q: How do you log all the info from your APIs?
A: syslog, looking at scribe, generally summarize and toss
Q: How do you control abusive clients?
A: Rate limiting, apply feature limits to abusers, etc.
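The answer did not say how the rate limiting works. One common approach is a token bucket per client; the sketch below assumes that technique, and the class names, rates, and client keys are hypothetical rather than anything Twitter described.

```python
import time


class TokenBucket:
    """Allows `rate` requests per second on average, bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class ClientLimiter:
    """One bucket per client key (for example an API key or IP address)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, client):
        bucket = self.buckets.get(client)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[client] = bucket
        return bucket.allow()
```

Applying "feature limits to abusers" could then be as simple as handing a known-abusive client a limiter with a much lower rate.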
Q: What would you do differently?
A: Implement change controls much sooner. Process is much better now, with more control and predictability
Q: How does your on-call team work?
A: More people reduces length in rotation. Nagios with alerts and aggregation of alerts. Make alerts actionable (db fails? see one page for db down, not 500 webservers). Also prevents burnout
Q: Carry a real pager?
A: Some, mostly SMS. There are escalations if you don’t answer. Always someone from Ops and Eng on the pager chain.
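The "one page for db down, not 500 webservers" idea from the on-call answer can be sketched as below. This is not Nagios configuration; the `DEPENDS_ON` map and host names are invented for illustration, and the logic simply folds an alert under its upstream dependency when that dependency is also failing.

```python
from collections import defaultdict

# Hypothetical dependency map: which upstream service each host relies on.
DEPENDS_ON = {
    "web001": "db-main",
    "web002": "db-main",
    "web003": "db-main",
}


def pages_to_send(failing):
    """Collapse raw check failures into the pages a human should receive."""
    failing = set(failing)
    suppressed = defaultdict(list)
    roots = []
    for host in sorted(failing):
        parent = DEPENDS_ON.get(host)
        if parent in failing:
            # The likely root cause is already failing; fold this alert under it.
            suppressed[parent].append(host)
        else:
            roots.append(host)
    pages = []
    for host in roots:
        message = f"{host} DOWN"
        if suppressed[host]:
            count = len(suppressed[host])
            message += f" ({count} dependent alerts suppressed)"
        pages.append(message)
    return pages
```

A web server failing on its own still pages; it is only silenced when the thing it depends on is the actionable problem.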
2009.06.23 | Permalink | Comments (0) | TrackBack (0)
Bryan McQuade (Google), Richard Rabbat (Google, Inc.)
What’s Page Speed
Prioritizes according to importance/savings, gives easy way to see detailed info about any given rule, what’s violating it, docs on why and how of each rule
Defer loading JS as much as possible
Inefficient CSS Selectors: based on David Hyatt’s post on inefficient selectors
Activity Panel
BBC “has lots of room for optimization”; gmail waterfall is hugely vertical!
Just released FF3.5 compatible version
2009.06.23 | Permalink | Comments (0) | TrackBack (0)
Justin Huff, Picnik (online photo editor)
Used AWS for 2 years, 1.5 in production
Hybrid app: small cage in Seattle + EC2/S3 for some parts of infra
Gives flexibility
Picnik has a spiky profile based on usage; EC2 allows them to cover that
At one point had nearly 1B objects in S3
1. Move old files to S3
2. Put some new files to S3
3. Put a lot more out there (had a knob to adjust, eventually reached S3)
4. Profit? Not so much ..
Most S3 objs short lived, needed fast deletion, and mostly didn’t have it
Mostly ignored this problem in favor of other more important problems (db sharding, scaling web frontends, expanding). Spend money on it.
They have 1.5 ops people.
“At some point we started getting free airline tickets from FF mileage on AWS CC”
Non-cloud apps have predictable, controllable latency, etc. Not so much in the cloud.
Be ready for fail
Mostly, though, clouds help you ignore problems … until you can’t.
2009.06.23 | Permalink | Comments (0) | TrackBack (0)
752% growth in 2008, trend happens ~11/2008 and keeps climbing
Growth Pain Fear of what’s gonna happen
Mantra:
Lots of caching is possible. Moving libmemcached to native C gem was bigtime helpful.
Nick’s CacheMoney AR plugin: readthru/writethru caching with memcached!
Caching everything not smart, either
Put up a status blog on some other service; transparency stops armchair engineering
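For readers unfamiliar with the read-through/write-through pattern mentioned for the CacheMoney plugin, here is a minimal sketch of the general idea. It is not the plugin's code or API: the `WriteThroughCache` class is hypothetical, and a plain dict stands in for memcached.

```python
class WriteThroughCache:
    """Read-through / write-through wrapper around a slower store."""

    def __init__(self, store, cache=None):
        self.store = store  # authoritative data, e.g. a database table
        self.cache = {} if cache is None else cache  # stand-in for memcached

    def get(self, key):
        # Read-through: serve from the cache; on a miss, load from the
        # store and remember the result for next time.
        if key in self.cache:
            return self.cache[key]
        value = self.store[key]
        self.cache[key] = value
        return value

    def set(self, key, value):
        # Write-through: update the store first, then the cache, so the
        # cache never holds a value the store has not accepted.
        self.store[key] = value
        self.cache[key] = value
```

Because writes refresh the cache immediately, a read that follows a write does not have to go back to the store.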
2009.06.23 | Permalink | Comments (0) | TrackBack (0)
Velocity 2009, the conference about performance, is very high-performance about getting people on and off stage (and high density around content)
Umang Gupta, Vik Chaudhary (Keynote)
(omitted: Keynote history, 15 years of continuous improvement, etc.)
Debuting Transaction Perspective 9.0 (TXP9)
(demo: reservations site for The Broadmoor in Colorado Springs, very flash-integrated with lots of client-side action. “Challenge of screen sensing what’s going on on the screen is non-trivial”. Also http://espn.go.com/video/ and Mini-Cooper flash site)
Using KITE platform/desktop environment to record what you’re doing. You click around, type, etc. and it records a script.
(This is somewhat like what they do at [DeviceAnywhere http://deviceanywhere.com/] for mobile device testing. They don’t focus on UX or perf; they’re more on QA testing side)
Script runs and collects UX and Network times. UX time is net time + client-side execution + rendering. Also shows augmented waterfall inclusive of client-side computation, etc.
2009.06.23 | Permalink | Comments (0) | TrackBack (0)
Eric Schurman (Microsoft/bing), Jake Brutlag (Google)
Experiments
They have platforms for experimentation which allow fractional experiments
Server Delays
Results
Google Search Delay Experiment
Results:
Page weight experiments
Progressive rendering experiment
HCI may state that 100-200ms isn’t perceptible; it still has effect.
Getting something to your user quickly is more important than when they receive their last byte
Experimentation platforms make all this research and hard numbers possible.
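Neither speaker showed how their platform assigns users to a fractional experiment. One common way to do it is deterministic hashing, sketched below under that assumption; the function names, the `server-delay-200` experiment name, and the 1% fraction are all made up for illustration.

```python
import hashlib


def in_experiment(user_id, experiment, fraction):
    """Deterministically place roughly `fraction` of users in `experiment`."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


def injected_delay_ms(user_id):
    # Hypothetical server-delay experiment: 1% of users get 200 ms added.
    return 200 if in_experiment(user_id, "server-delay-200", 0.01) else 0
```

Hashing the experiment name together with the user ID keeps a user in the same group across requests while giving different experiments independent groups.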
2009.06.23 | Permalink | Comments (1) | TrackBack (0)
Jonathan Heiliger, VP Tech Ops FB
FB Mission: give people the power to share and make the world more open and connected.
2004: launch in MZ’s dorm room
2004-5: new apps launched (events, photos, mobile)
2006: news feed and open reg
2007: platform launch
2008: crowdsourced translations; reached 30 langs quickly (Spanish 2 wks, French ~24 hrs)
[nice map viz for growth: colorize market penetration]
Radio took ~150 yrs (?) to reach 150M
TV: 13 yrs
Computers: 4 years
FB: 3 yrs
How FB deals
Classic battle of Ops v. Eng
Tuning the Operating Pipeline (Eng -> QA -> Ops aka Dev -> Test -> Deploy) (this isn’t how they did it)
Engineering is responsible for the efficacy and reliability of their code, writing their own tests, and full lifecycle of code including pushing it live.
Ops provides guard rails to keep eng safe from itself, prevent site downtime. Feature can go down, but rest of site is safe.
Complaints back in the day: Ops: Eng is way too unstructured, lobbing crap over the wall. Eng: Ops is not nimble
Make the problem joint; Eng owns the problem
Continuous build, code review, peer review, perf testing has kept things moving fast while moving to 200+ eng org.
Put engineers in operations
Software launch has warroom with PM, Eng, NetEng, SRE, Perf Eng, Site Integrity staff around. Always the right person on hand, physically present.
Getting it done
Distribute accountability
Test with users
“The only place success comes before work is in the dictionary” – Vince Lombardi
Expects org to look different in a year—evolution is the key.
2009.06.23 | Permalink | Comments (0) | TrackBack (0)
More from Velocity 2009:
Jeremy Bingham, DailyKos.com
Before the flood
MySQL issues
IA Caucus first big night
Caching
Hardware
Traffic more than doubled at election peak over normal monthly, almost 3x
People liked to talk about Sarah Palin … a lot. How nice that she provided things to talk about.
Changes were in place by April 2008/Pennsylvania primary
2009.06.23 | Permalink | Comments (0) | TrackBack (0)