September 17, 2012
As many of you know, one of my three favorite “must attend” conferences each year is the O’Reilly Velocity Conference. This is where you can learn what some of the largest and most exciting properties on the Internet are doing to survive and thrive.
This fantastic talk by Jay Parikh, VP of Infrastructure Engineering at Facebook (@jayparikh, LinkedIn), will likely blow your mind for the following reasons:
Talk notes below:
Jay says, “Normally, the number of code commits goes down over time: ours has gone up.” They now routinely do hundreds of code pushes per day. Shown below is a graph of the code commits over time:
Jay describes four key principles that he credits with helping Facebook keep everything running smoothly and build systems which can “keep pace with the imagination of the planet.”
First, every engineer hired at Facebook, regardless of experience, goes through a program called “bootcamp.” Six weeks long, the program is designed to quickly allow new hires to make changes and fixes, often shipped out to hundreds of thousands of users on their second day of work.
The program is run by senior engineer mentors, who are required to wear silly hats so they can be easily recognized in the “bootcamp cave.” It also gives new engineers the chance to learn about opportunities across the company. At the end of their six-week tenure, each bootcamper chooses which team to join. There are no hiring committees or assignments: engineers are free and encouraged to choose the team they feel most excited about.
By focusing on impact and designing and implementing the bootcamp program, Facebook was able to centralize mentoring and onboarding responsibilities while dramatically reducing hiring costs. As a result, leadership is developed internally, bonds are created amongst bootcamp classmates who will work on different teams across the company, employees choose to work on what excites them, and the business saves money.
Facebook’s second principle is seemingly simple: move fast. However, moving fast does not mean sacrificing quality. Facebook aims to deliver high-quality products quickly while removing friction from the process. Their secret to moving fast? A few internally created programs which test, monitor, and allow Facebook teams to spot upcoming problems before they get out of hand.
Perflab: Perflab tests every code change committed by engineers. Performing an average of 10,000 tests per week, this program allows engineers to easily spot bugs before the code hits production.
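The talk notes do not describe how Perflab works internally. As a rough illustration of the general idea of checking each code change before it ships, here is a minimal sketch that assumes the check compares timing samples for a change against a baseline. The function name `is_regression`, the use of medians, and the 5% tolerance are all hypothetical choices for illustration, not Facebook's method.

```python
from statistics import median


def is_regression(baseline_ms, candidate_ms, tolerance=0.05):
    """Return True if the candidate's median latency exceeds the
    baseline's median by more than `tolerance` (a fraction, e.g. 0.05).

    Hypothetical helper: the inputs are lists of latency samples in
    milliseconds, collected by running the same workload against the
    current code (baseline) and the proposed change (candidate).
    """
    if not baseline_ms or not candidate_ms:
        raise ValueError("both sample lists must be non-empty")
    base = median(baseline_ms)
    cand = median(candidate_ms)
    return cand > base * (1 + tolerance)


if __name__ == "__main__":
    baseline = [100.0, 102.0, 98.0]
    unchanged = [101.0, 99.0, 100.0]
    slower = [120.0, 118.0, 125.0]
    print("unchanged flagged:", is_regression(baseline, unchanged))
    print("slower flagged:", is_regression(baseline, slower))
```

Using the median rather than the mean keeps a single noisy sample from triggering a false alarm, which matters when a check like this runs on every commit.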
Gatekeeper: Gatekeeper is described by Parikh as A/B testing on steroids. This program allows for rapid experimentation, sending new features out to targeted batches of users. After testing, Facebook uses Gatekeeper to phase in new features, rolling them out to gradually increasing percentages of users.
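Rolling a feature out to a gradually increasing percentage of users can be sketched with stable hashing. The example below is a generic illustration of that technique, not Gatekeeper's actual design, which the talk notes do not describe. The names `bucket` and `is_enabled` and the feature name `new_feed` are hypothetical.

```python
import hashlib


def bucket(user_id: str, feature: str) -> int:
    """Map a (feature, user) pair to a stable bucket in the range 0..99.

    Hashing the feature name together with the user ID means each
    feature gets its own independent split of the user base.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, feature: str, rollout_percent: int) -> bool:
    """Return True if the feature is on for this user at the given percentage.

    A user's bucket never changes, so anyone enabled at 10% stays
    enabled when the rollout is raised to 50%.
    """
    return bucket(user_id, feature) < rollout_percent


if __name__ == "__main__":
    users = [f"user{i}" for i in range(1000)]
    for percent in (1, 10, 50, 100):
        enabled = sum(is_enabled(u, "new_feed", percent) for u in users)
        print(f"{percent}% rollout: {enabled} of {len(users)} users enabled")
```

Because the assignment is deterministic, no per-user state needs to be stored: raising the percentage only ever adds users, and setting it to 0 turns the feature off for everyone.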
Claspin: Claspin is a high-density heat map viewer for large services. Visualizing a large amount of information in one convenient location, this program allows engineers to spot upcoming problems and drill down quickly as necessary.
Though these are just a few examples, it’s clear that Facebook is constantly striving to improve their ability to make changes quickly and catch problems before they become outages.
Parikh says that scaling to a billion users is “not done by a single person or team.” Every engineer will contribute to these programs or build new ones to suit their needs in the future.
What it all comes down to is, according to Parikh, “People, tools, and way way down the list, process.” Facebook works hard to find great people. They allow and encourage those great people to build the tools to fit their current and future needs. This combination allows them to move fast without sacrificing quality.
Facebook, says Parikh, needs to have the ability to iterate rapidly to 1 billion users overnight. This requires flexibility and constant improvement.
Take, for example, Facebook’s new Prineville, OR data center. Built in 12 months, the company deemed it a success. But instead of simply copying the same process for their next data center in Forest City, NC, they decided to continue to make improvements. For Forest City, they ended up changing everything. Servers, network, software: nothing in the newly finished data center is quite the same as in Prineville. Improvements included moving the hard drive from the back to the front of web servers and upgrading to two motherboards instead of one, among other things. As a result, Facebook experienced a 40% improvement in throughput on their web tier workload.
Simultaneously, Facebook is now building a third new data center in Sweden. “We can’t rest, can’t do them serially, or one at a time,” Parikh said. Facebook is constantly innovating, a culture which contributes largely to its success.
In another illuminating moment, Parikh said “capacity planning and performance engineering are one and the same at Facebook.” It is therefore essential that the organization be aligned by a common set of values to accomplish such big, cross-functional efforts. “Everybody is focused on a core set of goals if you want to accomplish so many big things that span many different functions across the org,” Parikh stated. In my opinion, he couldn’t be more right.
Citing the 2010 outage as an example, Parikh then explained the importance of being open. Facebook strives to learn from mistakes instead of punishing them. The engineer responsible for releasing all of the secret projects to all of Facebook’s users on that fateful day in 2010 is actually still employed at the company. Parikh considers him one of their best engineers. Instead of punishing his mistake, they learned from it, created actionable follow-ups, and continued to move forward.
Facebook’s process includes a weekly sev review meeting, attended by people from all across the infrastructure team. The goal is to talk about outages and problems, reduce recovery time, create actionable follow-ups, and track those follow-ups to completion. Adopting a motto of “Fix More. Whine Less,” they focus on helping each other succeed instead of hanging one another out to dry. Facebook aims not to add process and penalize employees, but instead to use each outage as a learning experience which will inspire future improvements.
He then described the incident in his own words: “Our 2010 outage was the strangest incident in my career: we accidentally launched every secret feature all at one time,” causing one
Gene Kim has been studying high-performing technology organizations since 1999. He was the founder and CTO of Tripwire, Inc., an enterprise security software company, where he served for 13 years. His books have sold over 1 million copies—he is the WSJ bestselling author of Wiring the Winning Organization, The Unicorn Project, and co-author of The Phoenix Project, The DevOps Handbook, and the Shingo Publication Award-winning Accelerate. Since 2014, he has been the organizer of DevOps Enterprise Summit (now Enterprise Technology Leadership Summit), studying the technology transformations of large, complex organizations.