Gmail no longer beta? True dat.

I guess it's no longer under construction, or so sayeth the Gmail blog, which also shows those who can't live without that ubiquitous beta tag how to bring back their old friend:

Back2beta

Michael Jackson's Obit via Wordle

Story by Yahoo!, picture by Wordle.net:

MJ Obit Wordle

LiveBlog: 10+ Deploys A Day: Dev and Ops at Flickr


Update 1: Slides are now on SlideShare.

Update 2: Video now available on blip.tv

John Allspaw (Ops) & Paul Hammond (Eng), Flickr

“Actually work together and aren’t huge assholes to each other.”

(omitted: photo stats … that’s a lot of kittens)

Dev vs. Ops

  • It’s not my {machines,code} it’s your {code,machines}

Spock v. Scotty analogy

Ops as grumpy old man, says no all the time, cycle of “no all the time because no one tells them anything because they say no all the time”

CW: dev job to add features, ops job to keep site stable and fast

Flickr: Ops job is to enable the business (Dev’s, too)

Business requires change, otherwise you’ll be overtaken by the new guy … but change is the root cause of most outages.

Discourage change vs. Allow it to happen as often as it needs to (via tools and culture)

Lowering the risk of change via tools and culture.

  • Increase confidence in change goodness
  • Increase ability to react to those changes

You need {ops,devs} who think like {dev,ops}

1. Role and Config Mgmt

2. Shared Version Control: everyone looks in the same place for everything

  • Code and config in same place
  • Everyone has access—transparency
  • Everyone knows how to use it

3a. One-step build

  • Everything you need to do to convert svn co’d code into what goes to the site—one command
  • They have “Perform Staging” button at bottom of a page with stats on latest commit

3b. One-step deploy

  • Top of page is deploy log with notes: who, when, what (link to changes)
  • Bottom has “I’m feeling lucky!” button to deploy
  • Continuous deployment

You can’t pretend to deploy 10 times a day if you go down 10 times a day. That’s not being agile, that’s being retarded.

They use Hudson to generate packages which can be deployed by ops

Small frequent changes make it easier to see what went wrong and recover when needed

4. [missed that tag]

  • Branching is all about managing bugfixes
  • Always ship trunk
  • Branch in code instead: use conditionals to block out pre-release features and configure off/invis—provides an operational lever for adjustment
  • Makes these open for private beta in production on production hdwe, etc.
  • Allows dark launches, which allows you to size appropriately, fix major oversights, take the fear out of major new launches

They have a couple hundred “free contingency switches” to turn things off/throttle things down. Gives broad operational control to minimize effects on the site.

Tend to fail forward using these and fix the problem.
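
The "branch in code" idea and the contingency switches can be sketched roughly as below. This is a minimal illustration, not Flickr's code: the `Flag` class, the `FLAGS` table, the flag name, and the user names are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Flag:
    enabled: bool = True      # operational kill switch: ops can flip this off
    public: bool = False      # False = pre-release, invisible to most users
    beta_users: set = field(default_factory=set)  # private beta in production


# Hypothetical flag table; in practice this would live in shared config.
FLAGS = {
    "new_photo_page": Flag(enabled=True, public=False, beta_users={"alice"}),
}


def is_enabled(name: str, user: str) -> bool:
    """Return True if `user` should get the code path guarded by `name`."""
    flag = FLAGS.get(name)
    if flag is None or not flag.enabled:
        return False  # unknown or switched-off features stay dark
    return flag.public or user in flag.beta_users


def render_photo_page(user: str) -> str:
    # Trunk always ships; the conditional decides who sees the new feature.
    if is_enabled("new_photo_page", user):
        return "new page"
    return "old page"
```

Flipping `enabled` to `False` turns the feature off for everyone without a deploy, which is the operational lever described above.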

5. Shared metrics

  • You can see mine, I can see yours
  • Use ganglia as console
  • Devs know where dashboard is, and watch as obsessively as ops
  • Includes app-level metrics (which exposes them to Ops)
  • (helps drive accountability in both directions in the org—both can see and feel ownership)
  • This begins to create opportunity to gracefully collaborate to back off an oversub’d resource/degrade/throttle as needed
  • Show last site deploy info on every page/graph; you can correlate a change in the graph with a deploy
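
A rough sketch of that last point done by hand: given a deploy log sorted by time, look up the most recent deploy at or before the moment a graph changed. The `(timestamp, note)` tuple layout and the function name are assumptions for illustration, not Flickr's tooling.

```python
from bisect import bisect_right


def last_deploy_before(deploys, when):
    """Return the latest (timestamp, note) entry with timestamp <= when.

    `deploys` must be sorted by timestamp. Returns None if nothing had
    been deployed yet at `when`.
    """
    times = [t for t, _ in deploys]
    i = bisect_right(times, when)
    return deploys[i - 1] if i else None
```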

6. IRC and IM bots

  • Heavy IRC users for ongoing dialog between dev/ops, remote/local
  • Last.fm wrote a tool to inject events into IRC (monitoring, events, deployments, builds)
  • Log it all and put it in a search engine

Culture

All the tools in the world won’t help if you have a contentious culture

1. Respect

  • No stereotypes: not all devs are lazy/cowboys, not all ops are obstructive
  • respect their expertise and opinions and responsibilities (they influence priorities)
  • Don’t just say “no”—it’s like saying “I don’t care about your problem/perspective”
  • Best solutions are collaborative. Memcache is an excellent example; written to solve the problem of DB overheat, which impacted both
  • Don’t hide things: share your solution even if you think they’ll say no; you deny their expertise and input

Talk about the impact of your code push

  • What metrics change
  • what are the risks
  • what are the symptoms of something going wrong
  • what are the contingencies (rollback)

2. Trust

  • When you walk up with all the above on hand, you demonstrate that you care enough about them and the business
  • “I don’t want to tell X …” == you’re a cowboy, and “cowboys are losers”
  • Have shared playbooks and contingency plans so all understand.
  • Provide as many knobs and levers as you can so Ops can tweak to match the env
  • Ops: be transparent, give devs insight and access to the systems. Playing telephone around shell commands is dumb.
  • It’s hard to help if you can’t directly see

3. Have a healthy attitude around failure—it’s going to happen.

  • Think about how you’ll respond more than you think about how you’ll prevent it
  • Would you rather be treated by a GP who deals with heart attacks infrequently or an EMT who handles them weekly?
  • Fire drills: when ops and sr engineers are fixing a problem, have others diagnose live in parallel (but make no changes!)

4. Avoiding blame

  • they have a rule of no-finger-pointing; it doesn’t need enforcement, folks step up
  • fixing blame wastes a ton of time, why not skip it? Feel guilty afterwards if you must.
  • They’ve got a bit of a potlatch culture as people try to assert responsibility in order to fix things.
  • Remember that when your code breaks, someone’s going to have to wake up to fix it. Own it and apologize, at least. Otherwise, you’re back to not respecting each other (“Screw you … aren’t you getting paid to do that?”)

Ops should provide constructive, continual feedback on how it’s going. Point out interesting things before they’re critical

LiveBlog: Q&A with Twitter's John Adams

A short pre-lunch session to absorb a few moments:

John Adams (Twitter) Q&A

Q: How do you log all the info from your APIs?

A: syslog, looking at scribe, generally summarize and toss

Q: How do you control abusive clients?

A: Rate limiting, apply feature limits to abusers, etc.
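
The answer did not say how the rate limiting works. One common approach is a token bucket, sketched here as an assumption rather than as Twitter's implementation. The class name and parameters are hypothetical, and the clock is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, never above capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # client is over its limit; reject or degrade
```

Keeping one bucket per client (keyed by API key or IP, say) lets abusers be throttled without affecting everyone else.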

Q: What would you do differently?

A: Would have implemented change controls much sooner. Process is much better now, with more control and predictability

Q: How does your on-call team work?

A: More people reduces length in rotation. Nagios with alerts and aggregation of alerts. Make alerts actionable (db fails? see one page for db down, not 500 webservers). Also prevents burnout
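
The "one page for db down, not 500 webservers" idea can be sketched as collapsing failing checks onto their failing upstream dependency. This is a toy model, not Nagios configuration; the check names and the `depends_on` mapping are invented for the example.

```python
def pages_to_send(failing, depends_on):
    """Collapse failing checks onto their failing root cause.

    failing:    set of check names currently failing
    depends_on: dict mapping a check to the upstream check it relies on
    """
    pages = set()
    for check in failing:
        root = check
        seen = {root}
        # Walk upstream while the dependency is itself failing.
        while depends_on.get(root) in failing:
            root = depends_on[root]
            if root in seen:  # guard against dependency cycles
                break
            seen.add(root)
        pages.add(root)
    return sorted(pages)
```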

Q: Carry a real pager?

A: Some, mostly SMS. There are escalations if you don’t answer. Always someone from Ops and Eng on the pager chain.

LiveBlog: PageSpeed

Bryan McQuade (Google), Richard Rabbat (Google, Inc.)

What’s Page Speed
  • FF/Firebug addon
  • http://code.google.com/p/page-speed
  • Optimizes images, minifies JS, tells you what you should defer
  • 100K downloads in 10 days, 1000s of tweets, 100s of blog postings
How’d it start
  • wanted to help stop relearning lessons in new apps
  • keep you from unintended consequences (e.g. vary header, squid won’t cache anything with params, etc.) Google map tiles used to have ? in URL; removing it gave huge boost in perf, huge reduction in requests
  • Sourced from many smart people on the web

Prioritizes according to importance/savings, gives easy way to see detailed info about any given rule, what’s violating it, docs on why and how of each rule

Defer loading JS as much as possible
  • Rule looks to see what’s not been invoked before OnLoad completes
  • Not enabled by default, has perf hit
  • Load multiple times in multiple scenarios; some JS is triggered in different

Inefficient CSS Selectors: based on David Hyatt’s post on inefficient selectors

Activity Panel
  • Show where most time is spent, where you should focus optimization
  • Shows DNS, network, connection, latency, data available for use, JS parse/exec, cache hits
  • Coming: paint events, screen snapshotting

BBC “has lots of room for optimization”; Gmail waterfall is hugely vertical!

Just released FF3.5 compatible version

LiveBlog: 2 Years Later, Loving and Hating the Cloud

More from Velocity 2009

Justin Huff, Picnik (online photo editor)

Used AWS for 2 years, 1.5 in production

Hybrid app: small cage in Seattle + EC2/S3 for some parts of infra

Gives flexibility

Picnik has a spiky profile based on usage; EC2 allows to cover that
  • They use a lot of
Capacity management (not planning)
  • easily repurpose between webserver and asynch jobs
  • Can buy hardware in batches, grow logically, get better deals

At one point had nearly 1B objects in S3

1. Move old files to S3
2. Put some new files to S3
3. Put a lot more out there (had a knob to adjust, eventually reached S3)
4. Profit? Not so much ..

Most S3 objs short lived, needed fast deletion, and mostly didn’t have it

Mostly ignored this problem in favor of other more important problems (db sharding, scaling web frontends, expanding). Spend money on it.

They have 1.5 ops people.

“At some point we started getting free airline tickets from FF mileage on AWS CC”

Non-cloud apps have predictable, controllable latency, etc. Not so much in the cloud.

Be ready for fail
  • What if EC2 goes down? Have a knob for how much to go offline/reduce services
  • Be ready for hard debugging: lots of visibility/instrumentation

Mostly, though, clouds help you ignore problems … until you can’t.

LiveBlog: Fixing Twitter

John Adams, Twitter Ops

Ops
  • Small team
  • SW perf
  • availability is their primary focus
All on managed services with NTT
  • No clouds—too high latency
  • NTT runs the NOC
  • Frees them to deal with real thinking compsci probs

752% growth in 2008, trend happens ~11/2008 and keeps climbing

Growth, pain, fear of what’s gonna happen

Mantra:
  • Find the weakest point (metrics + logging + analysis)
  • Take corrective action (process)
  • Repeat
Find weak points
  • Collect metrics and graphs (individual metrics are irrelevant)
  • Logs
  • SCIENCE!
  • Instrument everything! More info is better
Monitoring
  • Keep critical metrics as close to realtime as possible
  • Using RRD, Ganglia + gMetrics, MRTG
  • Mostly on 10s interval, some 5s, some 60s
  • Everyone in company has access to dashboard
  • “Criticals” view
  • Use google analytics for failwhale and other err pages
Analyze
  • Turn data into info
  • Are things better/worse post-deploy
  • Create env of capacity planning, not firefighting—no more cowboys in the wild west
Deploys
  • Ganglia shows final deploy info for twitter, summize, and search
Whale-watcher
  • simple script with massive win
  • 503 is a whale, 500 is a robot
  • Whales per second exceeds whale threshold then “There’s whales!”
  • Darkmode: selectively disable portions of site with automatic notification to product and eng teams to let them know
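
A guess at the shape of such a script, as a minimal sketch: count 503s (whales) in a window of observed status codes and compare the per-second rate with a threshold. The function names, input format, and threshold are assumptions; the real script was not shown.

```python
def whales_per_second(statuses, window_seconds):
    """Rate of 503 responses (whales) over the observation window."""
    whales = sum(1 for s in statuses if s == 503)
    return whales / window_seconds


def check(statuses, window_seconds, threshold):
    """Return the alert text if the whale rate exceeds the threshold."""
    rate = whales_per_second(statuses, window_seconds)
    return "There's whales!" if rate > threshold else "ok"
```
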
Config Mgmt
  • You need an automated cfg mgmt system NOW. Else you won’t scale
  • It intros complexity, with multiple admins, unknown interactions
  • Peer review solves most of this; they use reviewboard with svn precommit hook requiring “reviewed by” note in comment and postcommit hook sends note about what changed to people
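
The precommit check might look something like the following sketch. The exact note format Twitter required is not in these notes, so the pattern here is an assumption. In a real Subversion pre-commit hook the message would typically be read with `svnlook log -t TXN REPOS` and a non-zero exit would reject the commit; that wiring is left out.

```python
import re

# Assumed format: a line starting with "Reviewed by" followed by a name.
REVIEWED_BY = re.compile(
    r"^\s*reviewed by\s*:?\s*\S+", re.IGNORECASE | re.MULTILINE
)


def commit_allowed(message: str) -> bool:
    """Accept the commit only if the log message names a reviewer."""
    return bool(REVIEWED_BY.search(message))
```
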
High communication
  • They use chat (campfire) with docs, graphs, logs, etc.
  • skitch into campfire is a frequent working methodology
Subsystems
  • Many limiting factors in request pipeline
  • Oversubscribe mongrel 2:1 vs. cores
  • Attack plan per system (e.g. bandwidth? bottleneck: network, vector: http latency, solution: servers+; timeline? db, update delay, better algo; search? db, delays, dbs+ and code; etc.)
CPUs:
  • switched to Xeon +30% gain
  • replace 2x and 4x core with 8x core +40%
Rails:
  • Stop blaming rails
  • Analysis: caching/cache invalidation, AR makes bad queries, queue latency, memcache/page corruption, rep lag
  • Not so much about Rails
Disk is the new Tape
  • Social networking is very O(n^y) oriented
  • Disk is too slow
  • Need lots of RAM

Lots of caching is possible. Moving libmemcached to native C gem was bigtime helpful.

Nick’s CacheMoney AR plugin: readthru/writethru caching with memcached!
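
For anyone unfamiliar with the pattern, here is a minimal sketch of read-through/write-through caching in general. It is not CacheMoney's code (that is a Ruby plugin); plain dicts stand in for the database and for memcached, and the class name is made up.

```python
class ReadWriteThroughCache:
    def __init__(self, db):
        self.db = db        # dict standing in for the database
        self.cache = {}     # dict standing in for memcached
        self.db_reads = 0   # counter to show how often we fall through

    def get(self, key):
        # Read-through: serve from cache, fall back to the db on a miss
        # and populate the cache on the way out.
        if key in self.cache:
            return self.cache[key]
        self.db_reads += 1
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def put(self, key, value):
        # Write-through: update the db and the cache together so later
        # reads never see a stale entry.
        self.db[key] = value
        self.cache[key] = value
```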

Caching everything not smart, either
  • Cache evictions
  • Cold cache after host failure/new host spinup
  • Cache smarter: get rid of cache busting behaviors, varnish with failover, etc.
RDBMS vs message queues
  • Not everything needs ACID
  • message queues help
  • Most MQs suck at high load
  • They wrote Kestrel for this; looks like memcache
  • Starling was earlier version
Asynch == Good
  • They lean on mongrel heavily (they know it well)
  • Keep external service requests out of the pipeline via daemons which process message queues
  • Size worker daemons appropriately, have them kill themselves off rather than long-run
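
A small sketch of the "kill themselves off" point, using Python's standard `queue` module in place of Kestrel. The function name and `max_jobs` limit are illustrative assumptions. A real daemon would block waiting for work; this one returns when the queue is empty so the example stays short.

```python
import queue


def run_worker(jobs, handle, max_jobs):
    """Process at most `max_jobs` jobs, then return.

    Returning (instead of running forever) lets a supervisor start a
    fresh worker, which limits the damage from leaks in long-lived
    processes. Returns the number of jobs handled.
    """
    done = 0
    while done < max_jobs:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            break  # nothing left to do in this sketch
        handle(job)
        done += 1
    return done
```
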
DB replication
  • Multiple functional read/write masters
  • never read from the master—slows it down too much
  • watch your slow queries
  • use mkill to kill long-running queries before they kill you.

Put up a status blog on some other service—transparency stops armchair engineering

LiveBlog: Next Web Challenges: It's Still All About UX

Velocity 2009, the conference about performance, is very high-performance about getting people on and off stage (and high density around content)

Umang Gupta, Vik Chaudhary (Keynote)

(omitted: Keynote history, 15 years of continuous improvement, etc.)

Debuting Transaction Perspective 9.0 (TXP9)

  • Embeds real IE browser for monitoring
  • Adds “screen sensing” technology
  • Esp. useful for “next web” apps: flash, video, voice, SMS/mobile (composite transactions or flows)

(demo: reservations site for The Broadmoor in Colorado Springs, very flash-integrated with lots of client-side action. “Challenge of screen sensing what’s going on on the screen is non-trivial”. Also http://espn.go.com/video/ and Mini-Cooper flash site)

Using KITE platform/desktop environment to record what you’re doing. You click around, type, etc. and it records a script.

(This is somewhat like what they do at DeviceAnywhere (http://deviceanywhere.com/) for mobile device testing. They don’t focus on UX or perf; they’re more on QA testing side)

Script runs and collects UX and Network times. UX time is net time + client-side execution + rendering. Also shows augmented waterfall inclusive of client-side computation, etc.

LiveBlog: The User and Business Impact of Server Delays, Additional Bytes, and HTTP Chunking in Web Search

Eric Schurman (Microsoft/bing), Jake Brutlag (Google)

Experiments

  • Server delays (MS and Google)
  • Page weight variance
  • Progressive rendering

They have platforms for experimentation which allow fractional experiments

  • Divide users into small buckets
  • use good methodology (control group, experimental group(s))
  • Way better than usability tests
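
Neither speaker showed how their platform assigns buckets. A common approach, sketched here as an assumption, is to hash the user ID together with the experiment name so that assignment is stable per user and independent across experiments. All names and the bucket count are hypothetical.

```python
import hashlib


def bucket_for(user_id: str, experiment: str, buckets: int = 1000) -> int:
    """Deterministically map a user to one of `buckets` small buckets."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets


def assign(user_id, experiment, treatment_buckets, control_buckets):
    """Place a user in treatment, control, or neither (most users)."""
    b = bucket_for(user_id, experiment)
    if b in treatment_buckets:
        return "treatment"
    if b in control_buckets:
        return "control"
    return "unassigned"
```

With 1000 buckets, giving treatment and control ten buckets each exposes about 1% of users to each arm.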

Server Delays

  • Goal [missed all of this due to an IM. Lesson learned]

Results

  • No statistically significant change @ ~50ms delay
  • Observable and fairly linear impact at delays of 200/500/1000/2000ms.
  • Time to first click took ~2x delay—theory: user has opportunity to get distracted

Google Search Delay Experiment

  • Varied type of delay, magnitude, and duration (number of weeks) per user group
  • Pre-header delay: pause server processing upon receipt of req
  • Post-header delay: pause after sending on header, but before sending results
  • Post-ads delay: (ads are structurally first in page, can render before search result) put ads in separate http chunk, delay between ads and search results

Results:

  • Measure average daily searches per user
  • 50ms pre-header delays show no significant impact
  • 100ms pre-header, 200ms post-header, 400ms post-header, 200ms post-ads (and others) showed linear progression in decreased avg daily searches
  • Also saw increase in internally monitored “abandonment rate”
  • Active users are more sensitive
  • drop-off continued to trend down linearly beyond 4 weeks; effect becomes more pronounced over time, and additive—200ms and 400ms groups diverge more strongly
  • Stopped injecting delays at week 7; recovery was significant immediately, but not fully realized at week 12—there was still a drop in activity for these groups

Page weight experiments

  • injected incompressible comments into various places of page
  • varied size of comments from 5% of page to 500% (most of larger loads were below the fold)
  • small payloads weren’t worrisome (tho stat’ly significant)
  • perf suffered slightly, but was US only experiment; global exp planned, will likely show significantly larger drop in perf
  • Click metrics were hurt more than query metrics

Progressive rendering experiment

  • Goal: determine impact of sending visual header before results
  • Build page in phases, send using HTTP 1.1 chunked transfer encoding
  • Results: Large improvement due to parallelization. Time to first click was ~9% faster, more likely to refine query, more clicks, more likely to page thru results
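
For reference, the wire format behind "build page in phases" is simple. The sketch below frames each phase as an HTTP/1.1 chunk: hex length, CRLF, data, CRLF, with a zero-length chunk to finish. The function names and sample payloads are made up, and response headers (including `Transfer-Encoding: chunked`) are left out.

```python
def encode_chunk(data: bytes) -> bytes:
    """Frame one piece of the body as an HTTP/1.1 chunk."""
    return f"{len(data):x}".encode("ascii") + b"\r\n" + data + b"\r\n"


LAST_CHUNK = b"0\r\n\r\n"  # zero-length chunk terminates the body


def chunked_body(parts):
    """Yield each page phase as its own chunk so it can be flushed early."""
    for part in parts:
        if part:  # an empty chunk would end the body prematurely
            yield encode_chunk(part)
    yield LAST_CHUNK
```

The server can flush the header chunk as soon as the request arrives and send the results chunk whenever the backend finishes, which is where the parallelization win comes from.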

HCI may state that 100-200ms isn’t perceptible; it still has effect.

Getting something to your user quickly is more important than when they receive their last byte

Experimentation platforms make all this research and hard numbers possible.

LiveBlog: After the Click

More from Velocity 2009. Going really fast, sorry for all the sloppiness.

Jonathan Heiliger, VP Tech Ops FB

FB Mission: give people the power to share and make the world more open and connected.

  • 2004: launch in MZ’s dorm room
  • 2004-5: new apps launched (events, photos, mobile)
  • 2006: news feed and open reg
  • 2007: platform launch
  • 2008: crowdsourced translations; reached 30 langs quickly (Spanish 2 wks, French ~24 hrs)

[nice map viz for growth: colorize market penetration]

Time to reach 150M:
  • Radio: ~150 yrs (?)
  • TV: 13 yrs
  • Computers: 4 yrs
  • FB: 3 yrs

How FB deals

Classic battle of Ops v. Eng
  • Ops wants no change—stability
  • Eng wants lots of change—driven by users and site
  • Do you really want to fight it out? Teamwork is required
  • Enable individuals to reach goals, chase team success
  • Make it transparent to users and safe for employees to fail
  • Make it a point of pride: you don’t want to be the one who took down the site (but there’s some cachet in that war story)
It’s the people
  • Everyone hires the smartest people
  • It’s about organizing and leading

Tuning the Operating Pipeline (Eng -> QA -> Ops aka Dev -> Test -> Deploy) (this isn’t how they did it)

Engineering is responsible for the efficacy and reliability of their code, writing their own tests, and full lifecycle of code including pushing it live.

Ops provides guard rails to keep eng safe from itself, prevent site downtime. Feature can go down, but rest of site is safe.

Complaints back in the day: Ops: Eng is way too unstructured, lobbing crap over the wall. Eng: Ops is not nimble

Make the problem joint; Eng owns the problem

Continuous build, code review, peer review, perf testing has kept things moving fast while moving to 200+ eng org.

Put engineers in operations
  • Site reliability team: stewards of the site
  • Operations engineering: tooling and glue apps (workflow/pipeline)
Put ops in engineering (consulting engineers)
  • Partners with backend service groups to think about architecture, scaling, reliability
  • Helps mentor into full SDLC responsibility—really understand complete DEV to PROD function of code

Software launch has warroom with PM, Eng, NetEng, SRE, Perf Eng, Site Integrity staff around. Always the right person on hand, physically present.

Getting it done
  • If you can’t work as a team, you’re done
  • Design is awesome, but it needs execution to succeed
Three things they did live expecting to break the site
  • See how the team worked, who would step up, etc.
CNN livefeed
  • Group of 20 some folks came together, marketing, eng, product, ops, etc.
  • Added much capacity, made warroom
  • Written from scratch in ~3 weeks
  • Replicated (and improved) for Oscars, etc.
  • Knew there would be point load much like DoS attack
  • Added throttles to direct features, as well as throttling things like chat, number of thumbnails shown on site, etc.
  • Friends had to be shown on the fly
  • Common content was cached in CDN; didn’t anticipate delay/latency from CDN
  • Didn’t expect users to madly twiddle “Everyone” and “Friends” tabs (they did) – learned “cache everything”
  • During inauguration 2M status updates, 8.5K spike at start
  • Dark launched everything with users exercising the stack without any visible UI to users
  • Also built perf framework to see what real user experience would be like
  • Used data from both to appropriately size
Like
  • Simple “I like this” on wall/status
  • Didn’t expect it to get a lot of traffic at first; totally wrong
    • 4.1M users liked 7.1M times first day
    • 16.3/46.2 1st week
    • 39.6/226.8 1st month
Username allocations
  • Was initially to be auction (codename: hammer)
  • Decided to go first come first served, kept codename—it was going to hammer the site
  • Had to have blocked list of trademarks; didn’t block “asp.net”
  • Dark launch, found issues, delayed initial launch
  • Launched at 9p; huge cache hit within moments, no increase in idle latency (means they got it right, maybe a little overprovisioned)
  • Made pages as light as possible
  • Tiny blip in overall load
Datacenter infra/organization is hugely important
  • Untidiness reflects bad organization
  • DC/infra is 2nd biggest exp after people
  • Invest where appropriate

Distribute accountability

Test with users

“The only place success comes before work is in the dictionary” – Vince Lombardi

Expects org to look different in a year—evolution is the key.