January 25, 2024

How to Lose Millions in a Microsecond – A Case Study using Deming’s System of Profound Knowledge

By John Willis

The following post has been adapted from Deming’s Journey to Profound Knowledge by John Willis.


When someone mentions Wall Street or the New York Stock Exchange (NYSE), we usually picture Gordon Gekko types directing people on the floor of the chaotic exchange, shouting orders to buy and sell stocks. Digital ticker tapes continuously scroll with the latest stock prices. Computer screens abound; papers are strewn everywhere.

While the NYSE still maintains its trading floor, the vast majority of stock trades happen in a quiet building across the Hudson River in Mahwah, New Jersey. This datacenter thirty miles west of Manhattan is the true heart of Wall Street these days.

When the financial firm Knight Capital was founded in 1995, most trades still happened on the floor. If you traded with Merrill Lynch, the firm had a trader physically stationed on the floor who would buy and sell stocks. Toward the late nineties, however, the founders of Knight Capital noticed the trend of more and more stock trades happening electronically based on computer algorithms. This trend accelerated when an antitrust lawsuit against the NYSE and NASDAQ mandated those stock exchanges to accept orders that used integrated electronic communication networks. This led to what’s called high-frequency trading (HFT), where computer programs buy and sell stocks based on specific parameters at lightning-fast speeds.

In 2002, the new CEO of Knight Capital, Thomas Joyce, decided to shift the firm’s focus to high-volume market making enabled by HFTs. “Market making” is where an entity (be it a company or individual) makes money on the difference between the selling price of a stock (the “ask”) and its buying price (the “bid”).

Let’s say the market ask for Tesla is $335.33. That’s the price the people and firms who currently own Musk’s stock are willing to sell their shares for. The people and firms who want to buy Tesla stock are willing to pay $335.36 per share. Knight Capital would buy the stock at $335.33, then turn around and sell it to others for $335.36. They would make $0.03 profit per share. That doesn’t sound like much until you know that at its height, Knight Capital traded 3.97 billion shares a day.
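The spread arithmetic is easy to check for yourself. Here is a minimal sketch in Python; the prices come from the example above, while the one-million-share volume and the `spread_profit` helper are hypothetical and used only for illustration.

```python
from decimal import Decimal


def spread_profit(buy_price: str, sell_price: str, shares: int) -> Decimal:
    """Gross profit from buying `shares` at one price and reselling at another."""
    return (Decimal(sell_price) - Decimal(buy_price)) * shares


# Prices from the Tesla example; the share count is a made-up round number.
profit = spread_profit("335.33", "335.36", 1_000_000)
print(profit)  # 30000.00
```

Decimal is used instead of floats so that a three-cent spread stays exactly three cents.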

Under Joyce, Knight Capital became the stock market equivalent of a wholesaler. It targeted large-volume dealers, such as hedge funds, institutional investors, and electronic discount brokers like E*Trade, TD Ameritrade, and Vanguard. By 2011, it was worth $1.5 billion, handled approximately 17% of trading on the NYSE and NASDAQ each, and employed 1,450 people around the world.

Much of this was done on the NYSE’s “dark trading pools.” These private electronic exchanges operated outside of the normal channels yet were often housed in the NYSE’s datacenter. Their purpose was to make it easier to conduct computer-generated trading.

If this seems a bit shady, read Flash Boys: A Wall Street Revolt by Michael Lewis. It tells the story of Brad Katsuyama, a day trader who managed a Royal Bank of Canada hedge fund. Katsuyama acted as a middleman between institutional investors (like the Royal Bank of Canada) and the public market. Say there are five million shares of Tesla available, but market demand at the moment is for only three million. Katsuyama could buy all five million shares at a good price, immediately sell the three million demanded, then hold the other two million to sell a few minutes or hours later.

In 2007, Katsuyama started noticing a problem: By the time he could hit the buttons on his computer screen in Manhattan to buy some stocks, the price and volume available had already changed. His computer might take five milliseconds to execute a trade. That’s five-thousandths of a second. It’s hard for the human brain to even comprehend that tiny amount of time. The dark trading pools, however, could execute trades in about five microseconds. That’s five millionths of a second.

The problem: Katsuyama’s computer was too far away from Mahwah. A dark trading pool could buy and sell shares, then buy and sell them again before Katsuyama even finished pressing “Enter.”

Another advantage the NYSE’s dark trading pools had over their human competitors: they were allowed to buy and sell shares on margins of just a fraction of a cent. That is, computers could buy Tesla shares at $335.331 and sell them at $335.336, making a half-cent profit. Again, that doesn’t sound like much until you remember we’re talking about billions of trades per day. Knight Capital made money hand over fist for years  .  .  .  until one fateful forty-five minutes on August 1, 2012.

About a year before, the NYSE decided to develop a service called the Retail Liquidity Program (RLP) to make trading fairer for retail investors. In June of 2012, the Securities and Exchange Commission (SEC) approved it. The stock exchange planned to roll it out in August.

Thomas Joyce couldn’t decide if he wanted his firm to participate in the RLP or not. He’d spent the entire previous year on the fence, citing concerns. Once the SEC approved it, he changed his mind, saying not only that his firm would participate but that they’d be ready come August. This gave his tech developers just thirty days to ready their systems to be integrated with the RLP. I’ve been a programmer for over forty years, and I can tell you, thirty days on a project of this magnitude is ambitious, even for an agile organization  .  .  .  and Wall Street firms in 2012 were anything but agile.

RLP went live at 9:30 a.m. on August 1.

Forty-five minutes later, Knight Capital went bust.

In that short amount of time, the firm lost $460 million, more than ten million dollars every minute. The next day, its stock price plummeted 75%. The next week, 70% of the company had to be sold just to stay solvent. The next year, the one-time king of Wall Street merged with an electronic trading company out of Chicago.

What the hell happened? While we’ll probably never know exactly what transpired, what follows is what I’ve been able to piece together.

So, so many things went wrong. For the purposes of this story, let’s start with Knight Capital’s order router. Basically, stock trades would hit the system and the router would route each order to one of its eight software servers. All eight servers had an old piece of testing software called Power Peg. You’ve heard the secret to making money is to buy low and sell high? Power Peg did the opposite of that. But that would have been okay because the software was never meant to be used in real life. It was originally built to test Knight Capital’s proprietary trading algorithms in a controlled environment. Although it hadn’t been used since 2003, Power Peg had been left sitting on the servers with its digital switch in the “off” position.

You can see where this is going.

For some unfathomable reason, Knight Capital’s developers reused Power Peg’s digital “on-off” switch. On top of that, they installed their new RLP-compliant code on only seven of the eight servers. Oops.

When 9:30 a.m. hit, the system began receiving trading orders. The router routed the trades to the eight different servers. The first seven performed as expected. The eighth, however, was running Power Peg. Buy high and sell low? Why, the opportunities for the code to do its job were practically endless!

There were supposed to be internal safeguards for this kind of out-of-control computer. The new RLP code included programs to check for excessive activity. However, those programs overlooked Power Peg’s trades because Power Peg was never supposed to run live in the first place.

There were also supposed to be external safeguards. Two years earlier, in what became known as the Flash Crash, a software glitch caused the Dow Jones to lose about 9% of its value in the space of just thirty-six minutes. To prevent this from happening again, the SEC mandated that electronic stock exchanges install “circuit breakers.” These would “trip” and halt all trading if the market changed more than 10% in five minutes. This is another place where systems thinking could have helped. Did it not cross anyone’s mind that if prices could spike, volumes could too? Nobody thought to also install circuit breakers for violent swings in trading volumes because that was never supposed to happen. So, when the NYSE trading volume suddenly doubled, there was nothing to stop Power Peg from continuing on. (Kind of like killing mosquitoes with DDT in Borneo.)
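To make the idea of a volume circuit breaker concrete, here is a minimal sketch in Python. It is not how any exchange actually works; the `VolumeBreaker` class, the five-interval window, and the "twice the recent average" threshold are all assumptions I picked for illustration.

```python
from collections import deque


class VolumeBreaker:
    """Trips when an interval's volume reaches a multiple of the recent average."""

    def __init__(self, window: int = 5, max_ratio: float = 2.0):
        self.history = deque(maxlen=window)  # recent per-interval volumes
        self.max_ratio = max_ratio           # hypothetical trip threshold
        self.tripped = False

    def observe(self, volume: float) -> bool:
        """Record one interval's volume; return True once the breaker has tripped."""
        if self.history and not self.tripped:
            baseline = sum(self.history) / len(self.history)
            if volume >= baseline * self.max_ratio:
                self.tripped = True
        if not self.tripped:
            # Only normal intervals feed the baseline.
            self.history.append(volume)
        return self.tripped


breaker = VolumeBreaker()
for volume in [100, 110, 95, 105, 210]:  # made-up volumes; the last one doubles
    print(volume, breaker.observe(volume))
```

With these made-up numbers, the first four intervals pass and the fifth, at roughly double the running average, trips the breaker. Once tripped, it stays tripped until someone resets it on purpose.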

A software engineer at the NYSE noticed something was wrong and immediately tried to contact the CEO. Joyce, however, was undergoing knee surgery at the time. The engineer tried person after person before finally landing on the phone with Knight Capital’s CIO. All in all, it took the firm about twenty minutes from the time things went wrong until they attempted to remediate the issue.

The software developers immediately uninstalled the new RLP code on the seven servers. (This is referred to as a “rollback.”) However, they forgot that they had reused Power Peg’s on-off switch. The result: Power Peg was suddenly running on all eight servers, sending thousands of orders representing millions of shares worth billions of dollars. After forty-five minutes, they were finally able to shut their entire trading system off, but it was too late.
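The interaction between the reused switch, the missed server, and the rollback is easier to see in a toy model. The sketch below is my own simplification in Python, not Knight Capital's actual code; the `Server` class, the build labels, and the flag semantics are assumptions based only on the sequence of events described above.

```python
from dataclasses import dataclass

NEW_BUILD = "rlp"     # hypothetical label for the new RLP build
OLD_BUILD = "legacy"  # hypothetical label for the build still containing Power Peg


@dataclass
class Server:
    name: str
    build: str

    def handle_order(self, flag_on: bool) -> str:
        """Return which code path an order takes on this server."""
        if not flag_on:
            return "normal"
        # The same switch means different things to different builds.
        return "rlp" if self.build == NEW_BUILD else "power_peg"


def count_power_peg(servers, flag_on: bool) -> int:
    """Number of servers that would run Power Peg for an incoming order."""
    return sum(s.handle_order(flag_on) == "power_peg" for s in servers)


# Seven servers received the new build; the eighth was missed.
servers = [Server(f"srv{i}", NEW_BUILD) for i in range(1, 8)]
servers.append(Server("srv8", OLD_BUILD))
print(count_power_peg(servers, flag_on=True))   # 1

# Rollback: the new build is removed everywhere, but the switch stays on.
for server in servers:
    server.build = OLD_BUILD
print(count_power_peg(servers, flag_on=True))   # 8

# Turning the switch off first would have kept the old code dormant.
print(count_power_peg(servers, flag_on=False))  # 0
```

In this model the rollback is not the problem by itself. The danger comes from rolling back the code while the switch that the old code also listens to is still on.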

Who’s to blame?

Should we blame the developers for reusing Power Peg’s switch? For forgetting to copy the RLP code to the eighth server? For forgetting to turn the switch to Off before uninstalling the RLP code? Maybe it’s the SEC’s fault for not properly safeguarding the market as a whole. Maybe it’s the CEO’s fault for waiting until the last minute to go ahead with a massive software change.

Deming told managers that only 6% of problems were due to human error; 94% of problems were due to system error. And since the system is the responsibility of management, he meant that 94% of problems are caused by bad management. Let’s look at the Knight Capital software incident through the lens of Profound Knowledge.

(Note: These observations should be considered counterfactuals: not the facts themselves, but my interpretation and surmising, given the second- and third-hand information available. This is strictly a thought exercise and should by no means be construed as a true case study.)

A Theory of Knowledge

In the SEC investigation, Knight Capital couldn’t demonstrate that it had performed adequate software testing. That is, according to the SEC, the developers thought the code would work, but they couldn’t prove they had tested their theory. It wasn’t clear that they could demonstrate to the SEC that they had any evidence to tell them that what they believed was, in fact, correct.

A good software product team should instill a culture in which being wrong is a good thing; that’s how organizations learn. Without learning, an organization dies. One tool for learning is the PDSA (Plan, Do, Study, Act) loop, and it’s an essential part of testing software code these days. Most organizations Plan and Do. However, they miss the Study and Act portions of the cycle. If you never evaluate and refocus your efforts, how can you know that things are headed in the right direction? In The High-Velocity Edge, Steven Spear describes the Toyota environment as a community of scientists. By contrast, an organization that only Plans and Does can never really improve.

A Theory of Variation

It appears no one at Knight Capital truly understood variation. If they did, it certainly wasn’t reflected in their business practices. I’ve spoken to a number of IT professionals at high-frequency traders like Knight Capital. They all say that a glitch like Power Peg could happen. Few think it would have taken forty-five minutes to stop. Why? Because they watch for common- and special-cause variation. As a matter of fact, observing variation in high-scale IT operations is a common practice.

For example, I was once at a seminar where Amazon’s CTO, Werner Vogels, was asked about the company’s monitoring system. He said Amazon monitors many things, but the company cares about only one thing: the order rate. From years of data and records, Amazon’s software could tell common-cause variation apart from special-cause. If the order rate was relatively high on a Saturday afternoon in December but fell within their upper and lower control limits, then everything was fine. However, if the order rate suddenly displayed a pattern—like the example of the steadily rising home temperature readings from my iPhone app—even if it was within the control limits, Amazon’s systems would not only alert them but activate safeguards in place to prevent catastrophe.
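The difference between "outside the control limits" and "inside the limits but showing a pattern" can be sketched in a few lines of Python. This is not Amazon's system, which I have no inside knowledge of. The three-sigma limits, the run length of six, the function names, and all the numbers are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas: float = 3.0):
    """Lower and upper control limits computed from historical readings."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - sigmas * spread, center + sigmas * spread


def outside_limits(value, limits) -> bool:
    """True if a single reading falls outside the control limits."""
    lower, upper = limits
    return value < lower or value > upper


def has_rising_run(values, run_length: int = 6) -> bool:
    """True if `run_length` consecutive points each exceed the one before."""
    streak = 1
    for previous, current in zip(values, values[1:]):
        streak = streak + 1 if current > previous else 1
        if streak >= run_length:
            return True
    return False


baseline = [100, 102, 98, 101, 99, 103, 97, 100]  # made-up historical order rates
limits = control_limits(baseline)                 # (94.0, 106.0)
recent = [99, 100, 101, 102, 103, 104]            # made-up recent readings

print(any(outside_limits(v, limits) for v in recent))  # False
print(has_rising_run(recent))                          # True
```

Every recent reading sits comfortably inside the limits, yet the steady climb is flagged. A threshold-only alarm would miss exactly that kind of signal.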

At the very least, nearly all high-frequency traders have a “kill switch.” Where was Knight Capital’s? The fact that they attempted a rollback—a classic knee-jerk reaction—instead of simply stopping electronic trading leads me to believe they didn’t have a kill switch in the first place.
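A kill switch does not have to be sophisticated. Here is a minimal sketch in Python of the general idea: every outbound order checks one shared flag before it leaves. The `KillSwitch` class, the `send_order` function, and the list standing in for the exchange connection are all hypothetical.

```python
import threading


class KillSwitch:
    """A shared stop flag that every order path must check before sending."""

    def __init__(self):
        self._halted = threading.Event()  # safe to set from any thread
        self.reason = None

    def halt(self, reason: str) -> None:
        self.reason = reason
        self._halted.set()

    @property
    def halted(self) -> bool:
        return self._halted.is_set()


def send_order(order: dict, kill_switch: KillSwitch, outbox: list) -> bool:
    """Send the order unless trading has been halted; return whether it was sent."""
    if kill_switch.halted:
        return False
    outbox.append(order)  # stand-in for the real exchange connection
    return True


switch = KillSwitch()
outbox = []
print(send_order({"id": 1}, switch, outbox))  # True
switch.halt("unexpected order volume")
print(send_order({"id": 2}, switch, outbox))  # False
```

The design choice that matters is that halting requires no deployment, rollback, or code change. Someone flips one flag and every order path stops.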

A Theory of Psychology

As I’ve already noted, the financial world isn’t exactly known for cultivating an empathetic, trusting, transparent, collaborative type of culture. Based on the post-incident analysis of the debacle, Knight Capital doesn’t seem to be any different. For example, a hallmark of cooperation and trust within product and development teams is the peer-review process. In the SEC’s opinion, the firm didn’t seem to have anything of the sort. Review methods might include informal walkthroughs, pair programming, and formal reviews. These types of collaborative code reviews foster trust and improve the overall quality of the product.

I once did a consulting engagement at a major high-frequency trader and was told a story about one code review process they had. Whenever changing the software on one of their trading applications, they would pair a software developer with an actual trader. Sitting side-by-side during one particular review, the developer noticed the trader didn’t use a button on the screen, opting instead to perform the function manually. He asked the trader why she didn’t use the button, as it would save her several keystrokes. She replied the button was hard to see, and she’d gotten used to her workaround anyway. While sitting there, the developer changed the code to move the button to a more convenient place and changed it to a more vivid color. The trader said, “Oh, that’s perfect! I’ll use this from now on.”

Did this type of thing ever happen at Knight Capital?

Another issue I’ll point out is that there probably wasn’t a culture of psychological safety. In a psychologically safe environment, the team encourages negative feedback and pushback, and anyone who has an issue feels safe that they won’t be punished for speaking up. Every team member should have the assurance that they can pull the metaphorical Andon cord at any time without repercussion. Does it sound like the CEO’s rush-order directive for the RLP system enabled that kind of environment?

Systems Thinking

Deming said it is management’s responsibility to create a clear vision for a system. The CEO waffled on his decision to integrate with the RLP until the proverbial eleventh hour. That’s the opposite of a clear vision and clear communication. On the part of the NYSE, once they realized the SEC was taking its time approving their program, they could have moved the rollout date to weeks or even a couple of months later.

You might think the software developers realized how much of a challenge they’d been tasked with. Did management listen to their concerns? That is, did the culture exist for the developers to voice their misgivings to begin with? It would be easy for me to believe that a powerful Wall Street financial firm had a top-down, high-pressure, dog-eat-dog kind of culture. In these environments, people quickly learn not to challenge their managers. They keep their heads down and hope for the best.

One could argue the SEC should have never approved Knight Capital’s participation in the RLP, as the firm had failed its annual CEO certification with the SEC earlier that year. As such, Knight Capital’s risk-management controls and supervisory procedures were not documented. This and other SEC violations should have been red flags for everyone involved.

In an SEC investigation of the incident, the firm failed to provide:

  • an adequate written description of its risk-management controls as part of its books and records
  • technology governance controls and supervisory procedures sufficient to ensure an orderly installation of new code or to prevent the activation of code no longer intended for use
  • controls and supervisory procedures reasonably designed to guide employees’ responses to significant technological and compliance incidents
  • a process in place to adequately review its business activity in connection with its market access to assure the overall effectiveness of its risk management controls and supervisory procedures
  • a second technician to review the code installation (a.k.a. a peer review)
  • a written procedure requiring such a review
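As one illustration of what an "orderly installation of new code" control might look like, here is a minimal sketch in Python of a pre-activation check that refuses to enable a feature until every server reports the expected build. The function names, server names, and version strings are hypothetical; nothing here comes from the SEC order.

```python
def find_stragglers(reported_builds: dict[str, str], expected: str) -> list[str]:
    """Names of servers whose reported build differs from the expected one."""
    return sorted(
        name for name, build in reported_builds.items() if build != expected
    )


def safe_to_enable(reported_builds: dict[str, str], expected: str) -> bool:
    """Only allow the feature to be switched on when no server is out of date."""
    return not find_stragglers(reported_builds, expected)


# Made-up inventory: seven servers updated, one missed.
reported = {f"srv{i}": "rlp-1.0" for i in range(1, 8)}
reported["srv8"] = "legacy"

print(find_stragglers(reported, "rlp-1.0"))  # ['srv8']
print(safe_to_enable(reported, "rlp-1.0"))   # False
```

A check like this is no substitute for a second pair of eyes, but it turns "did we copy the code everywhere?" from something a person has to remember into something a script has to confirm.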

A peer review is basic common sense. We have proofreaders because it’s hard to catch our own writing mistakes. How much more difficult is it to catch our own coding mistakes? Every regulatory compliance framework I’m aware of requires peer review, from processing credit cards (the Payment Card Industry Data Security Standard, or PCI DSS) to storing patients’ medical records (the Health Insurance Portability and Accountability Act of 1996, or HIPAA). Even the US military has to do peer reviews as mandated by the National Institute of Standards and Technology (NIST, Lola Deming’s erstwhile employer).

It appears there may have been a failure in the basic function of leadership. Setting aside the question of why the CEO would schedule knee surgery during the morning of such a significant event: Did Knight Capital have some emergency process in place? Why didn’t he forward his calls to someone in his office? Why did it take so much time for the NYSE to get someone from the firm on the phone in the first place? When the CIO finally learned of the problem, was there a contingency plan in place? It was general mayhem while the company tried to figure out how to fix the problem. When they did react, they uninstalled the software in what’s called a rollback. Rollbacks have to be done in a meticulous way or you wind up with situations like this. Did management consider simply halting trading altogether?

These types of cause-and-effect dominoes are exactly what Deming meant when he admonished managers to grasp systems thinking. You can’t look at problems in isolation; everything is connected to everything else.

We’ll probably never know what the real facts were behind Knight Capital’s operations and the SEC’s cease-and-desist order. However, looking at it from the perspective of Dr. Deming’s System of Profound Knowledge, these are certainly some of the questions he would have posed.

- About The Authors
John Willis

John Willis has worked in the IT management industry for more than 35 years and is a prolific author whose works include "Deming's Journey to Profound Knowledge" and "The DevOps Handbook." He is researching DevOps, DevSecOps, IT risk, modern governance, and audit compliance. Previously he was an Evangelist at Docker Inc., VP of Solutions for Socketplane (sold to Docker) and Enstratius (sold to Dell), and VP of Training & Services at Opscode, where he formalized the training, evangelism, and professional services functions at the firm. Willis also founded Gulf Breeze Software, an award-winning IBM business partner, which specializes in deploying Tivoli technology for the enterprise. Willis has authored six IBM Redbooks on enterprise systems management and was the founder and chief architect at Chain Bridge Systems.
