As many of you know, one of my three favorite “must attend” conferences each year is the O’Reilly Velocity Conference. This is where you can learn what some of the largest and most exciting properties on the Internet are doing to survive and thrive. In his talk, Facebook’s Jay Parikh described:
- how a culture of continual innovation and improvement has allowed them to “scale to a billion users, one improvement at a time”
- how focusing on their deployment processes has enabled innovation and fast time to market (hundreds of deploys per day)
- how Gatekeeper (“like A/B testing on steroids” or “config flags”) enables constant experimentation (500M+ gatekeeper checks per second)
Talk notes below:
- At the time he presented, 500 million people visited the site each day
- In the next 30 minutes, Facebook will accumulate…
- 10TB more log data into Hadoop
- 105TB data into Hive (a “petabyte scale data warehouse using Hadoop”)
- 6 million photos uploaded
- 5 billion real time mobile messages sent
- 160 million newsfeed items posted
- 108 billion MySQL queries
- 3.8 trillion memcache lookups
Jay says, “Normally, the number of code commits goes down over time: ours has gone up.” They now routinely do hundreds of code pushes per day. Shown below is a graph of the code commits over time:
Jay describes four key principles that he credits with helping Facebook keep everything running smoothly and build systems which can “keep pace with the imagination of the planet.”
- Focus on Impact
- Moving Fast
- Be Bold
- Be Open
Principle 1: Focus on Impact
First, every engineer hired at Facebook, regardless of experience, goes through a program called “bootcamp.” Six weeks long, the program is designed to get new hires making changes and fixes quickly, often shipped out to hundreds of thousands of users on their second day of work.
The program is run by senior engineer mentors, who are required to wear silly hats so they can be easily recognized in the “bootcamp cave,” and it also gives new engineers the chance to learn about opportunities across the company. At the end of their six-week tenure, each bootcamper chooses which team to join. There are no hiring committees or assignments; engineers are free, and encouraged, to choose the team they feel most excited about.
By focusing on impact and designing and implementing the bootcamp program, Facebook was able to centralize mentoring and onboarding responsibilities while dramatically reducing hiring costs. As a result, leadership is developed internally, bonds form among bootcamp classmates who go on to work on different teams across the company, employees choose to work on what excites them, and the business saves money.
Principle 2: Moving Fast
Facebook’s second principle is seemingly simple: move fast. However, moving fast does not mean sacrificing quality. Facebook aims to quickly deliver high quality products while removing friction in the process. Their secret to moving fast? A few internally created programs which test, monitor, and allow Facebook teams to spot upcoming problems before they get out of hand.
Perflab tests every code change committed by engineers. Performing an average of 10,000 tests per week, this program allows engineers to easily spot bugs before the code hits production.
- “Perflab performance-tests every commit against real traffic before exposing it to prod”
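The talk doesn’t describe Perflab’s internals, but the core idea, comparing each commit’s performance against a baseline before it reaches production, can be sketched in a few lines. This is a toy stand-in with hypothetical names, not Facebook’s actual system:

```python
import statistics

def perf_regressed(baseline_samples, commit_samples, threshold=0.05):
    """Flag a commit whose median latency is more than `threshold`
    (5% by default) worse than the baseline's. A toy stand-in for a
    per-commit performance test like Perflab (names are hypothetical)."""
    base = statistics.median(baseline_samples)
    new = statistics.median(commit_samples)
    return (new - base) / base > threshold

# e.g. request latencies in ms gathered by replaying real traffic
baseline = [102, 98, 101, 99, 100]
candidate = [118, 121, 117, 120, 119]
perf_regressed(baseline, candidate)  # ~19% slower: flag before prod
```

Using the median rather than the mean makes the check less sensitive to a single noisy sample, which matters when the gate runs on every commit.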
Gatekeeper is described by Parikh as A/B testing on steroids. This program allows for rapid experimentation, sending new features out to targeted batches of users. After testing, Facebook uses Gatekeeper to phase new features in, rolling them out to gradually increasing percentages of users.
- “Gatekeeper is like A/B testing on steroids, which is how we phase-in features to users: there are 500M+ gatekeeper checks/sec, which enables our fast flow releases”
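The phase-in mechanism behind a gate like this is essentially a deterministic percentage bucket per user. Here is a minimal sketch of that idea; the class and method names are hypothetical, not Gatekeeper’s real API:

```python
import hashlib

class Gatekeeper:
    """Minimal sketch of a percentage-based feature gate
    (hypothetical API, not Facebook's actual Gatekeeper)."""

    def __init__(self):
        self.rollout = {}  # feature name -> percentage of users (0-100)

    def set_rollout(self, feature, percent):
        self.rollout[feature] = percent

    def check(self, feature, user_id):
        """Deterministically bucket a user: hashing (feature, user) means
        a user keeps their answer as the rollout percentage grows."""
        percent = self.rollout.get(feature, 0)
        digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < percent

gk = Gatekeeper()
gk.set_rollout("new_newsfeed", 10)          # expose to ~10% of users
enabled = gk.check("new_newsfeed", user_id=42)
```

Because the bucket is derived from a hash rather than a random draw, raising the percentage from 10 to 50 only adds users; nobody who already had the feature loses it mid-experiment.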
Claspin is a high density heat map viewer for large services. Visualizing a large amount of information in one convenient location, this program allows engineers to spot emerging problems and drill down quickly as necessary.
- “Claspin: provides high density heatmap viewers for large services: quickly pattern match. As opposed to randomly SSH’ing into servers, trying to figure out what’s going on.”
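The value of a high-density view is that one character (or pixel) per server lets a human pattern-match across thousands of machines at a glance. A toy text version of the idea, not Claspin itself, might look like this:

```python
def heatmap(metrics, buckets=" .:*#"):
    """Render a grid of per-server metric values as a dense text heat
    map, one character per server -- a toy version of the idea behind
    Claspin (pattern-match visually instead of SSH'ing into servers)."""
    lo = min(min(row) for row in metrics)
    hi = max(max(row) for row in metrics)
    span = (hi - lo) or 1.0
    lines = []
    for row in metrics:
        cells = ((v - lo) / span for v in row)
        lines.append("".join(
            buckets[min(int(c * len(buckets)), len(buckets) - 1)]
            for c in cells))
    return "\n".join(lines)

# each cell = one server's cache miss rate; a hot pair stands out at a glance
rack = [[0.02, 0.03, 0.02, 0.04],
        [0.03, 0.95, 0.91, 0.03],   # two overloaded servers
        [0.02, 0.02, 0.03, 0.02]]
print(heatmap(rack))
```

The two overloaded servers render as `#` in an otherwise blank grid, which is exactly the kind of pattern that is invisible when inspecting servers one SSH session at a time.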
Though these are just a few examples, it’s clear that Facebook is constantly striving to improve their ability to make changes quickly and catch problems before they become outages.
Parikh says that scaling to a billion users is “not done by a single person or team.” Every engineer contributes to these programs or builds new ones to suit their needs in the future.
- “Here’s an example of a big problem we solved in one hackathon… The problem was that when doing rollouts, caches take days or even weeks to replenish on process restarts”
- “The solution: in 1 hackathon, we figured out a way to move them to shared mem, so cache replenish now takes merely hours instead of days, dramatically speeding up flow”
- “The constant focus on automation makes life more fun: the repetitive work that takes up time is replaced by tools. We recently computed that the automation we’ve created is doing the work equivalent to 350 ops engineers.”
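The shared-memory cache fix described above boils down to keeping warm cache state outside the process, so a restart reattaches instead of rebuilding. This sketch uses a temp file as a stand-in for a shared-memory segment (Facebook’s actual mechanism isn’t detailed in the talk, and all names here are hypothetical):

```python
import json
import os
import tempfile

# stand-in for a shared-memory segment that outlives the process
CACHE_PATH = os.path.join(tempfile.gettempdir(), "warm_cache.json")

def load_cache():
    """On process start, reattach to the previously built cache
    instead of recomputing it from scratch."""
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            return json.load(f)   # warm start: cache survives the restart
    return {}                     # cold start: expensive rebuild required

def save_cache(cache):
    with open(CACHE_PATH, "w") as f:
        json.dump(cache, f)

cache = load_cache()
cache["user:42"] = {"name": "example"}
save_cache(cache)
# after a simulated restart, the entry is still warm:
restarted = load_cache()
```

The point of the hackathon fix is the same regardless of the storage mechanism: decoupling the cache’s lifetime from the process’s lifetime turns every rollout restart from a cold start into a warm one.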
What it all comes down to is, according to Parikh, “People, tools, and way way down the list, process.” Facebook works hard to find great people. They allow and encourage those great people to build the tools to fit their current and future needs. This combination allows them to move fast without sacrificing quality.
Principle 3: Be Bold
Facebook, says Parikh, needs the ability to iterate rapidly for 1 billion users overnight. This requires flexibility and constant improvement.
- “Scaling to a billion users at facebook is not down to a single person, it’s a function of a large number of small tweaks.”
- “People, tools and way way down the list, process.”
Take, for example, Facebook’s new Prineville, OR data center. Built in 12 months, the company deemed it a success. But instead of simply copying the same process for their next data center in Forest City, NC, they decided to continue to make improvements. For Forest City, they ended up changing everything. Servers, network, software: nothing in the newly finished data center is quite the same as in Prineville.
Improvements included moving the hard drive from the back to the front of the webservers and upgrading to two motherboards instead of one, among other things. As a result, Facebook experienced a 40% improvement in throughput on their web tier workload.
Simultaneously, Facebook is now building a third new data center in Sweden. “We can’t rest, can’t do them serially, or one at a time,” Parikh said. Facebook is constantly innovating, a culture which contributes largely to its success.
- “Facebook currently operates 15,000 servers per datacenter technician”
- “Facebook production infrastructure peaks at 5TB/sec”
- “To get a mega-project like the 3rd data center build in Sweden completed requires lots of discipline”
- “Capacity planning & perf engineering is the same thing at Facebook”
- “A challenge that we were sure was going to need 8K more servers by the next week was eventually solved by software, and required only ’16 new servers'”
In another illuminating moment, Parikh said “capacity planning and performance engineering are one and the same at Facebook.” It is therefore essential that the organization be aligned by a common set of values to accomplish such big, cross-functional efforts. “Everybody is focused on a core set of goals if you want to accomplish so many big things that span many different functions across the org,” Parikh stated. In my opinion, he couldn’t be more right.
Principle 4: Be Open
Citing the 2010 outage as an example, Parikh then explained the importance of being open. Facebook strives to learn from their mistakes instead of punishing them. The engineer responsible for releasing all of the secret projects to all of Facebook’s users on that fateful day in 2010 is actually still employed at the company. Parikh considers him one of their best engineers. Instead of punishing his mistake, they learned from it, created actionable follow ups, and continued to move forward.
Facebook’s process includes a weekly sev review meeting. This meeting is made up of various people from all across the infrastructure team. The goal is to talk about outages and problems, reduce recovery time, create actionable follow ups, and track those follow ups to completion. Adopting a motto of “Fix More. Whine Less,” they focus on helping each other succeed instead of hanging one another out to dry. Facebook aims not to add process and penalize employees, but instead to use each outage as a learning experience which will inspire future improvements.
He then described what he called the strangest incident of his career: “Our 2010 outage was the strangest incident in my career: we accidentally launched every secret feature all at one time.”
- “To recover, we pulled DNS to take entire site down; but we realized that we couldn’t turn the site off!”
- RT @xthestreams: “A bad day for change management at Facebook” #velocityconf http://t.co/p3qsZGCI
- @davenolan: FB accidentally launched all their secret features to everyone at once “We tried to turn off the site and we couldn’t” #skynet