More from Velocity 2009. Going really fast, sorry for all the sloppiness.

Jonathan Heiliger, VP Tech Ops FB

FB Mission: give people the power to share and make the world more open and connected.

  • 2004: launch in MZ’s dorm room
  • 2004-5: new apps launched (events, photos, mobile)
  • 2006: news feed and open reg
  • 2007: platform launch
  • 2008: crowdsourced translations; reached 30 langs quickly (Spanish 2 wks, French ~24 hrs)

[nice map viz for growth: colorize market penetration]

Time to reach 150M:

  • Radio: ~150 yrs (?)
  • TV: 13 yrs
  • Computers: 4 yrs
  • FB: 3 yrs

How FB deals

Classic battle of Ops v. Eng

  • Ops wants no change—stability
  • Eng wants lots of change—driven by users and site
  • Do you really want to fight it out? Teamwork is required
  • Enable individuals to reach goals, chase team success
  • Make it transparent to users and safe for employees to fail
  • Make it a point of pride: you don’t want to be the one who took down the site (but there’s some cachet in that war story)

It’s the people

  • Everyone hires the smartest people
  • It’s about organizing and leading

Tuning the Operating Pipeline (Eng -> QA -> Ops aka Dev -> Test -> Deploy) (this isn’t how they did it)

Engineering is responsible for the efficacy and reliability of their code, writing their own tests, and full lifecycle of code including pushing it live.

Ops provides guard rails to keep eng safe from itself, prevent site downtime. Feature can go down, but rest of site is safe.

Complaints back in the day:

  • Ops: Eng is way too unstructured, lobbing crap over the wall
  • Eng: Ops is not nimble

Make the problem joint; Eng owns the problem

Continuous build, code review, peer review, and perf testing have kept things moving fast while growing to a 200+ engineer org.

Put engineers in operations

  • Site reliability team: stewards of the site
  • Operations engineering: tooling and glue apps (workflow/pipeline)

Put ops in engineering (consulting engineers)

  • Partners with backend service groups to think about architecture, scaling, reliability
  • Helps mentor into full SDLC responsibility—really understand complete DEV to PROD function of code

Software launch has warroom with PM, Eng, NetEng, SRE, Perf Eng, Site Integrity staff around. Always the right person on hand, physically present.

Getting it done

  • If you can’t work as a team, you’re done
  • Design is awesome, but it needs execution to succeed

Three things they did live expecting to break the site

  • See how the team worked, who would step up, etc.

CNN livefeed

  • Group of 20 some folks came together, marketing, eng, product, ops, etc.
  • Added much capacity, made warroom
  • Written from scratch in ~3 weeks
  • Replicated (and improved) for Oscars, etc.
  • Knew there would be point load much like DoS attack
  • Added throttles to direct features, as well as throttling things like chat, number of thumbnails shown on site, etc.
  • Friends had to be shown on the fly
  • Common content was cached in CDN; didn’t anticipate delay/latency from CDN
  • Didn’t expect users to madly twiddle “Everyone” and “Friends” tabs (they did) – learned “cache everything”
  • During inauguration: 2M status updates, 8.5K spike at start
  • Dark launched everything: users exercised the stack without any visible UI
  • Also built perf framework to see what real user experience would be like
  • Used data from both to appropriately size
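
To make the throttle and dark-launch ideas above concrete, here is a minimal sketch. It is not how FB built it; the talk gave no implementation details. `FeatureThrottle`, `render_page`, the `dark_livefeed` feature name, and the hash-bucket scheme are all hypothetical, chosen only to show the shape: a per-feature dial that ops can turn down under load, and a new code path that real users exercise while the visible page stays unchanged.

```python
import hashlib


class FeatureThrottle:
    """Hypothetical percentage dial per feature (an assumption, not FB's tool)."""

    def __init__(self):
        self._percent = {}  # feature name -> percent of users allowed (0-100)

    def set_percent(self, feature, percent):
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._percent[feature] = percent

    def allows(self, feature, user_id):
        # Features with no setting are fully on.
        percent = self._percent.get(feature, 100)
        # Hash feature + user into a stable bucket 0..99, so a given user
        # gets the same answer on every request until the dial moves.
        digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < percent


def render_page(user_id, throttle, new_backend, metrics):
    """Serve the existing page; dark-launch the new backend when allowed."""
    if throttle.allows("dark_livefeed", user_id):
        try:
            new_backend(user_id)  # exercise the new stack, discard the result
            metrics["dark_ok"] = metrics.get("dark_ok", 0) + 1
        except Exception:
            # A failure in the dark path must never reach the user.
            metrics["dark_err"] = metrics.get("dark_err", 0) + 1
    return "existing page"  # visible output never depends on the dark call
```

Turning `dark_livefeed` from 0 up toward 100 ramps real load onto the new stack, and the `metrics` counts are the kind of data that could feed capacity sizing. Turning a dial like `chat` down sheds load from a non-essential feature during a spike.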

Like

  • Simple “I like this” on wall/status
  • Didn’t expect it to get a lot of traffic at first; totally wrong

    • 4.1M users liked 7.1M times on the first day
    • 16.3M users / 46.2M likes in the 1st week
    • 39.6M users / 226.8M likes in the 1st month

Username allocations

  • Was initially to be auction (codename: hammer)
  • Decided to go first come first served, kept codename—it was going to hammer the site
  • Had to have blocked list of trademarks; didn’t block “asp.net”
  • Dark launch, found issues, delayed initial launch
  • Launched at 9p; huge cache hit within moments, no increase in idle latency (means they got it right, maybe a little overprovisioned)
  • Made pages as light as possible
  • Tiny blip in overall load
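
A first-come-first-served claim with a blocked list can be sketched roughly as below. This is only an illustration under assumptions: `UsernameRegistry` and `claim` are made-up names, the in-memory dict stands in for whatever store was actually used, and case-insensitive matching is my guess at sensible behavior, not something stated in the talk.

```python
import threading


class UsernameRegistry:
    """Hypothetical first-come-first-served username store with a blocklist."""

    def __init__(self, blocked):
        # Compare case-insensitively (an assumption) so "Alice" and "alice" collide.
        self._blocked = {name.lower() for name in blocked}
        self._owners = {}  # lowercased username -> user id
        self._lock = threading.Lock()

    def claim(self, username, user_id):
        """Return True if user_id now owns username, False otherwise."""
        key = username.lower()
        if key in self._blocked:
            return False
        # Check and write under one lock so two simultaneous claims
        # for the same name cannot both succeed.
        with self._lock:
            if key in self._owners:
                return False
            self._owners[key] = user_id
            return True

    def owner(self, username):
        return self._owners.get(username.lower())
```

The blocked list is only as good as its entries: any trademark left off it is simply claimable by whoever arrives first.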

Datacenter infra/organization is hugely important

  • Untidiness reflects bad organization
  • DC/infra is 2nd biggest expense after people
  • Invest where appropriate

Distribute accountability

Test with users

“The only place success comes before work is in the dictionary” – Vince Lombardi

Expects org to look different in a year—evolution is the key.




Published

23 June 2009
