The following is an excerpt from a presentation by Zane Lackey, founder of Signal Sciences, titled “Lessons Learned Embracing DevOps & Security.”
You can watch the video of the presentation, which was originally delivered at the 2018 DevOps Enterprise Summit in London.
I’ll be sharing with you some of the key lessons that I learned around DevOps security, having gone through the shift really early on. And these are some of the core things that I wish I could have told myself on day one of that shift to save myself a bunch of pain.
I spent my career on the security side of the house, in security consulting and pen testing. If you know NCC Group in the UK, or iSEC Partners in the US, that's where I started out and spent a number of years. Then in 2010/2011, I was given an incredible opportunity to go to Etsy as their first head of security, to build and run the security program from scratch, which was an incredible experience at the time.
Then after a number of years, I really saw, "Hey, the way that we're creating software is changing, the way that we're delivering software is changing; maybe the way that we secure and protect those web applications, APIs, and microservices needs to change as well." We learned a bunch of lessons about that, which led me and my two co-founders to step out from Etsy and found Signal Sciences, where we turned those lessons into a product.
What I’ll be sharing today
This will really be about how it was to be at the forefront of the shift to DevOps and Cloud. At the time it was really only Netflix on the west coast and Etsy on the east coast of the US that were going through this, so there wasn’t much you could Google. I had to make a lot of mistakes along the way and learn about how security was going to change during this shift.
And so, I'm going to spoil the ending right now, which is: security shifts from being a gatekeeper along the way, to focusing on enabling teams to be secure by default.
Now, if you paid me, I don't think I could write a more cliché-sounding sentence than that, but that doesn't make it untrue. This is really the existential shift going on in security right now. It is shifting from being a blocker to actually thinking about how we make the organization able to move even quicker.
So, what has changed?
I think from a security perspective it really boils down to a couple things here.
- Change happens multiple orders of magnitude faster than it used to, and our old models of security carried an implicit assumption that change happened very rarely, so we built our whole security programs and tooling around that. For me, this was made super apparent when I spent my last day as a security consultant wrapping up a project at a big US healthcare company. They made production deployments and changes once every 18 months. I left there on a Friday, started at Etsy on Monday morning, and they sat me down and said, "Right, so we deploy to production 20 times a day. Figure out security, go."
- Decentralized ownership of deployment: The ownership of deployment, and the journey code takes to reach production, really changes. It used to be that the dev teams would write code, you'd throw it over the wall to QA, it would come back to development, you'd throw it over to security, it'd come back, it'd go to the sysops group, go to staging, come back, and eventually make it to production. That journey took 12-18 months. Today you see code being written, checked in, and potentially deployed within minutes or hours, maybe days. It's still multiple orders of magnitude faster.
From the security side, that really changes two things: the culture of how security has to interact with the organization, and the tools and individual techniques we use.
The real takeaway out of all this is that security can no longer be 'outsourced to'
It can no longer be code thrown over the wall to security, and security trying to take a look at everything, finding all the bugs and shipping it back.
The existential shift in security right now is that it's going from that sort of model to focusing on how it can make the other teams inside the business (the development teams, the DevOps teams) security self-sufficient. Ultimately, security is only successful if it can actually bake itself into the development and DevOps process moving forward.
What are the new things that security needs to focus on?
Well, it’s really visibility and feedback. Except these aren’t new concepts.
Security would love to think that it's totally special and needs to go invent entirely new things, but it really doesn't. All of the things that helped make the whole DevOps movement successful, whether performance monitoring, data analytics, A/B testing, etc., are about the same core concepts.
It's all about visibility and feedback in complex systems, and by getting that visibility and feedback we enable ourselves to actually move faster. Security is now in the same spot; those same hard-won lessons are starting to carry over to security.
I think that security doesn't need to reinvent the wheel on this. Personally, I really like the comparison to the performance space, i.e. the APM tools like AppDynamics, New Relic, and Datadog. One of the reasons these have been so successful is that they've allowed you to bring highly specialized capabilities into your general technology teams, and by owning those capabilities themselves those teams could move quickly as a result.
That's the same lesson that is slowly trickling into security, and it's where we really need to go: bring previously highly specialized skill sets into our core technology teams so that they can own them and move faster as a result.
Here is a story about this from the “bad old days”
This is how we used to have security visibility. This was an airline in the US in the late '90s.
In the past, you’d have a marketing website which would say a bunch of great things about your company and your product etc, and everything was fine from the security side.
Then one day you’d come into work…
and suddenly your airline was on fire, and the tagline on the page read, "So we killed a few people, big deal." This was a real defacement that happened (probably one of the most hilarious ones).
But this is the kind of security visibility, or outage visibility, we had in the past. Which is to say it went from, "Hey, everything's great," to, "Wait, why are all the phones ringing and why are customers so angry that the service is down?"
We only had that very binary set of visibility. Either everything was great, or everything was on fire and totally down.
How can we actually improve?
I'll give you an example of this: which of these approaches actually scales?
And this is something we had to learn from the operations side and apply to the security side.
Take, for example, logs: mountains and mountains of data about what's going on with our systems. From the security side, that's the starting point. We have some logs; does that give us any visibility into our system? On its own, it doesn't scale at all.
Trying to get visibility from your raw logs is the right first step, but you run into a bunch of challenges.
Having someone stare at the logs and alert from there may not scale, but what you get to next is: how can I surface some visibility for our groups? (Actual visibility that's consumable by the rest of the organization.) This is really something that scales, because we can bring this visibility to development teams, to DevOps teams, to security teams, etc.
We can start to say, “Wait, why is there a giant spike in the attacks graph or the anomalies graph? What’s going on here? Let’s actually take a look into it.” It’s very hard to look at logs and say that.
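The talk doesn't include code, but the step from "mountains of logs" to "a consumable attacks graph" can be sketched in a few lines. Everything here is hypothetical: the log line format, the signature list, and the threshold are illustrative stand-ins, not a real detection system.

```python
from collections import Counter
from datetime import datetime

# Hypothetical access-log lines: "<ISO timestamp> <status> <path>"
LOG_LINES = [
    "2018-06-25T10:00:01 200 /search?q=shoes",
    "2018-06-25T10:00:02 500 /search?q=' OR 1=1--",
    "2018-06-25T10:00:03 500 /search?q=<script>alert(1)</script>",
    "2018-06-25T10:01:00 200 /home",
]

# Naive attack signatures, purely for illustration
SIGNATURES = ("' OR 1=1", "<script>", "../")

def attacks_per_minute(lines):
    """Bucket suspicious requests into per-minute counts (the 'attacks graph')."""
    counts = Counter()
    for line in lines:
        ts, _status, path = line.split(" ", 2)
        if any(sig in path for sig in SIGNATURES):
            minute = datetime.fromisoformat(ts).strftime("%Y-%m-%dT%H:%M")
            counts[minute] += 1
    return counts

def spikes(counts, threshold=2):
    """Surface minutes whose count crosses a threshold: the
    'why is there a giant spike in the attacks graph?' moment."""
    return {minute: n for minute, n in counts.items() if n >= threshold}

print(spikes(attacks_per_minute(LOG_LINES)))  # → {'2018-06-25T10:00': 2}
```

The point is not the signature matching, which is deliberately crude here; it's that an aggregated, graphable signal is something every team can consume, where raw log lines are not.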
A big one that we had to learn out of this is…
Surfacing security visibility for everyone, not just the security team. A mistake that we made early on was starting to invest in visibility and bringing it on just for the security team.
What we learned was that instead of facing this inward, you face this outward. Sometimes that’s actually in a very physical sense of having displays up on the wall in your engineering area and all that, other times its really focusing on publishing this out to the organization to really focus on, “This is data that is useful for everyone, not just by a security team.”
I’ll give you an actual practical example which was a really fascinating one for us to learn.
So, take a standard HTTP 500 errors graph: "Hey, there are errors happening in the application or the API." If you ask your different teams what this implies, it's really fascinating; you'll get wildly different assumptions.
If you ask your development team, they'll say, "Oh, we just put a whole bunch of new engineers through boot camp; probably one of them did a bad code deploy and we either rolled it back or rolled it forward, and that's what those errors were."
If you ask your DevOps teams, they're like, "Oh yeah, it was probably this engineering team, because they've shipped a bunch of bad code in the last three months and paged us every night for two weeks straight. I'm sure it was that."
If you ask your security team what that is, they’re like, “Oh, hmm, that’s interesting. I wonder if that is somebody actually discovering a vulnerability and trying to figure out a working exploit payload before it actually succeeded against our systems.”
If you're ever fortunate enough to ask your attackers or your security researchers, you'll often hear, "Oh yeah, that was me discovering a real vulnerability and figuring out an exploit payload before it actually succeeded."
The key out of all of this is bringing together context, so that any one of those groups can look at the data and make an informed decision from it.
Then we can say, "Okay, great: it's not just the errors that are going on; let's combine that with the actual attacks happening against our services." So we can say, "Wait, why is the error graph spiking at the same time that the attack graph is spiking? That's not what our assumption was. We should take a look at this." We can page the right people and dig in; it gives us much more actionable context to see what's going on.
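As a minimal sketch of that correlation step: given two per-minute time series, one for HTTP 500 counts and one for detected-attack counts, flag the minutes where both spike together. The numbers, thresholds, and the paging message below are all hypothetical.

```python
# Hypothetical per-minute time series: HTTP 500 counts and detected-attack counts.
errors_per_min = {"10:00": 3, "10:01": 4, "10:02": 40, "10:03": 38}
attacks_per_min = {"10:00": 1, "10:01": 0, "10:02": 35, "10:03": 30}

def correlated_spikes(errors, attacks, err_threshold=20, atk_threshold=10):
    """Minutes where the error graph and the attack graph spike together,
    i.e. the case none of the teams' default assumptions covered."""
    return sorted(
        minute
        for minute, n in errors.items()
        if n >= err_threshold and attacks.get(minute, 0) >= atk_threshold
    )

suspicious = correlated_spikes(errors_per_min, attacks_per_min)
if suspicious:
    # In a real system this would page whoever owns the service.
    print(f"Page on-call: errors and attacks spiking together at {suspicious}")
```

A dev-only error spike (errors high, attacks flat) stays quiet under this rule; it only fires on the combination, which is the extra context the talk is arguing for.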
Going on to the feedback loop side of things
The real thing about feedback in a security context is learning how to take the 'game days' and 'operational exercises' we've learned from the DevOps side and apply them. The first time we deal with a security incident, we don't want it to be because a real incident is happening; we want to have run that exercise as many times as possible ahead of time.
And so, bringing those concepts to security is tremendously useful for all different parts of the organization. Not just your security teams, but for your technology teams, for your legal teams, for your PR teams — involving as many folks in that as possible is actually tremendously useful.
But the way most folks are building feedback loops is through things called bug bounties. A bug bounty is where you put out an SLA and say, "Hey, security researchers, we know you're going to be out there attacking stuff anyway. If you look at these predefined services of ours, follow these rules, act in good faith, and report any issues you discover to us, we will in good faith reward you in some way." That reward might be monetary, but it might just be putting your name up on a hall of fame, sending you a t-shirt, or a thank-you card. It's a way to turn what was previously a lose-lose situation for both sides into a win-win.
And so these have really been gaining steam over the last several years. You name it: large-scale internet companies all the way up through various government agencies have been seeing a lot of traction with them.
Now, pulling this back to the feedback side of things: in the past, all we had from security was pen testing. The problem with pen testing (and I say this as someone who's been a pen tester) is that it's a point-in-time activity. The dirty secret is that we all do it once a year, usually for two weeks, to satisfy some audit or compliance requirement.
With pen testers, we have them come in for a couple of weeks once a year, and we face a choice that is also lose-lose: we can either ask the pen testers to cover a bunch of things and be super shallow, or focus on one particular area where we think we have a lot of risk and have them go super deep on that while ignoring everything else. Either way, we're in a bad spot.
By combining pen tests and bug bounties, you can use your pen test to go deep on one particular area, and use bounties as a much more real-time feedback loop. They each play to the other's strengths and cover the other's weaknesses; the bounty isn't a replacement, it augments the pen test. It starts to become the data source that gives you a feedback loop on the security side.
I want to close with an actual good-news security story
And I think this is where we can actually go; it's also why I think the shift to DevOps makes us more secure, not less secure.
At this time, we had started to invest a bunch in the visibility piece. We had really started to say, "Okay, we need to take those lessons from DevOps and what's been successful there, and apply them to security. Let's think about visibility, let's think about feedback." And we started investing in that.
Once we did, we had this really cool thing happen where at one point we were able to detect an attacker discovering a vulnerability and we were able to ship a fix for that before they even reported the issue to us.
They were sitting there using one of our services, they discovered an issue in it, and they started working on an exploit payload. As they were confirming everything, it suddenly stopped working out from under them.
We got this really cool email from someone saying, "Hey, I can't imagine it's a coincidence that my vulnerability suddenly stopped working out from under me in this obscure part of the site. I just wanted to let you know, I promise I was acting in good faith, here are the details, and, oh, by the way, I was testing from my home IP, so please don't sue me." They ended up doing a whole write-up that they posted to reddit.com/r/netsec, and it turned into this amazing back and forth with them.
And the reason I share this story is that it was a very crystallizing moment for me in recognizing how the shift to DevOps, the shift to cloud, this increase in the velocity of our systems, can actually make us more secure.
What I had to realize is that every development methodology we've ever had is going to have vulnerabilities. Bugs are functionally infinite; we're always going to have them. Systems premised on "let's eliminate all the bugs" are not a reality.
If we recognize that bugs are always going to exist and they’re functionally infinite in that sense, the system that allows us to react the fastest is the one that actually makes us safest.
As we’re starting to embrace DevOps from our side, if we need to make an emergency deploy, that’s just another deploy. If we’re deploying once a week, or once a day, or once an hour and we need to push in a security change as part of that, it’s just another deploy. We never have to say the phrase, “out of band patch” ever again, which for those of us who have lived through it, is an absolute nightmare.
The thing is, it can only make us safer if we actually have something to react to. That's why the shift to DevOps makes us safer: we can react quickly, but only if we've started to think about how we get visibility from a security perspective, and then how we actually test all of those systems with real feedback loops.