The following is an excerpt from a presentation by Simmons Lough, IT Specialist, United States Patent and Trademark Office (USPTO), titled “If We Can Do It, You Can Do It!: DevOps Transformation at the US Patent and Trademark Office.”
You can watch the video of the presentation, which was originally delivered at the 2018 DevOps Enterprise Summit in Las Vegas.
USPTO at a glance
I’m a tech lead on a system at USPTO called FPNG. USPTO stands for the ‘United States Patent and Trademark Office.’ We are the agency that grants patents and registers trademarks, and in doing this, we are meeting a mandate, Article I, Section 8, Clause 8 in the United States Constitution. When I think about the Constitution and our mandate and how we’re helping to drive the entire U.S. economy, and to some extent the world economy, we think this is a pretty big deal.
I want to give a general experience report of how, over the past three years, we’ve started applying some DevOps principles to the agency. I’ll sprinkle in some architecture and a little of how we dealt with the product owner, the business, the executives, etc. One thing that is a little unique is that I’m smack dab in the middle of an organization of 15,000 employees, and you don’t have to be an executive to make change happen.
FPNG at a glance
A little bit about the system that I’m responsible for: FPNG, which stands for Fee Processing Next Generation. It’s the replacement of a legacy system at USPTO called RAM, which had been running since the early ’80s. One thing to note: USPTO doesn’t take money from Congress or the American taxpayer in the classic sense. We charge a fee for the goods and services that we provide. All told, what rolls through FPNG is a little south of three and a half billion dollars per year.
The workflow is like this: if you have an idea for an invention anywhere in the globe, and you want a U.S. patent or a trademark, you submit your application, and then this system calculates how much in fees you owe. Then, of course, that goes into the USPTO bank.
If you plotted a 24-hour time slice of our data on a map, you’d have a large bubble over New York/D.C. That’s where we see a lot of patent and trademark activity happen. Of course, you’d also have Texas. I’m assuming that’s around Austin. Then, of course, there’d be a big bubble over Silicon Valley.
The other thing you’d notice is that patents and trademarks come through, and people make payments, 24/7. The activity moves as the sun moves around the globe. My point is that since this is a real government system, and deals with three and a half billion dollars, it’s got to be up 24/7. During our heavy times, it’s about a million dollars an hour in collection, so outages or downtime are a big deal to us.
But we have a problem
A few years ago, my daughter drew a big dinosaur for me. It may look familiar as a picture of how software has historically been built and delivered in the federal government. Let me explain it to you.
In the belly of the beast, you have Dev and Test. You may have a backlog, you may have a scrum master, you may meet every morning, and you may even have a retrospective. But then after two weeks, everyone looks at each other, and says, “All right. We’re done.”
But they’re not really done. There is still all this important work that has to happen. The way this usually works is you fill out a form in SharePoint. It gets routed to a different group; in this case, it’s the security group.
That gets put in a queue, and a week later, they then say, “Okay, we’ll schedule to run security scans next Tuesday at 2:00.” They run those scans. You get your feedback. Then there’s a negotiation of what you fix and what you don’t fix.
The same thing happens with the coding standards review. This is a third-party group where you fill out a form and ask for a code review. It’s a completely different area in the agency with a completely different set of contractors. They come back with their feedback, and there’s another negotiation. Of course, all this time you think you’ve been done. This goes on, and in a lot of agencies, this could be 15, 20, 30 different groups you’re dealing with.
Finally, you come to something called the production readiness review meeting, or ORR (operational readiness review), and you’re usually in a room with all 20 or 30 of these people from these different groups. They go around and basically vote: you’re thumbs up, and so on and so forth. It’s only at that point that you can actually put your software in production.
These are obviously anti-patterns for DevOps. Guess what? Because there’s so much bureaucracy, and the dinosaur takes so much effort to get through, software releases to production are few and far between.
The building blocks were already there
We did have some good news: there were some building blocks already there. My hunch is that this is the case in other federal agencies and maybe even other large commercial companies. We did have an Agile Dev management tool. There was a place for user stories and tasks. We did have a scrum master.
One of the better things we did was to have what we called a CICM platform. We had a shared repo. This was a huge accomplishment in the federal government just to have source control. We had Jenkins stood up, your typical CICM thing.
We also had an automated infrastructure. My sister organization at NIST, however, has these five rules of what a cloud is, and I don’t think we met any of those rules. That said, there were some good parts to it. If you compiled software and pushed it to the artifact library, there would be a robot that could pick up that artifact and install it. There wasn’t human intervention.
While these were all good things, we certainly couldn’t go from commit to production in any kind of fast manner, largely because that dinosaur bureaucracy tail was getting in our way. We wanted to turn that tail around. Instead of having 20 or 30 third-party groups tell me that my software is good to go, we wanted the app team to be responsible and be able to make that decision. With the help largely of a robot running some automated tests, they could click a button, or we could click a button, and that software would go to production.
What did we do?
We did the oldest sales trick in enterprise software: we created a pilot, with a pretty small scope. We thought the pilot was important because as people started to hear what we planned to do, they were getting pretty upset. There were a lot of people, all across the agencies and large commercial companies, whose job it was to fill out that questionnaire, to go to that meeting, and make it turn green.
In fact, when we first started talking about this, we were all in a meeting with the director in charge of that production readiness release, explaining to her what we wanted to do. I’m still to this day not exactly sure what I said, but I think it was something like, “I don’t want to fill out CRQs any more.” She literally stormed out of the room. It was a contentious situation.
We thought the pilot would at least get our foot in the door. I’ll share with you the pilot: “Deploy, within a 24-hour timeframe, FPNG-approved software fixes for defects found in production.” Let me highlight three things there.
- A 24-hour timeframe: We didn’t really care whether it took one hour, six hours, or 36 hours to go from commit to production, but we wanted to say 24 hours because we needed to send a signal that there was no way we could do that whole dinosaur tail in 24 hours. This wasn’t a consultant coming in and dialing it down for us with 5% or 10% improvements. This had to be a rip-and-replace, so we marketed this a lot.
- FPNG: From a scope perspective, we were only talking about FPNG, so one system in the entire enterprise.
- Defects: Also, for scope, we were only talking about defects. If it was a new feature, that would still have to go through the dinosaur process. These were just defects. In fact, the federal government has these SDLC books that they write, which are about 400 pages long, which detail how you’re allowed to deliver software to production. But there is an asterisk in most of them at the very end that says, “If you have an emergency, you don’t have to do any of this, and you can just put it into production.” Very true. What we were trying to do was exploit that scenario and say, “Yeah, it’s a defect, and it’s in production. Our product owner wants us to fix it. We want to go through this emergency change process, but increase the rigor.” No one was losing anything. We were only gaining something.
Folks around the agency knew what we were trying to do, but at some point, we needed to go and pitch it to the executives, the CIO and CFO. The main reason was we needed a signed document from them. This is very classic in the government: there are memorandums of understanding (MOUs), policy documents, procedure documents, etc. We needed this document for two reasons.
One, there were a lot of folks not swimming the same way we were. We needed to be able to show folks the document and say, “Look, I have the authority to not go through the classic process. I can do this a different way.”
Secondly, we go through half a dozen different types of audits every year, and a large part of what the auditors are doing is reviewing our documentation to see how it matches how we make changes in production.
No deployment outages
During this pilot, a few things came out. We saw that we’re collecting fees around the clock or around the globe, so we came up with this ‘no outage deployment pattern’.
We used blue/green.
The way that worked is, let’s say, we have two instances of FPNG running in our data center. One we call “green”; one we call “blue”. Green is the active side, running version 2.0, and we want to upgrade to version 3.0. We deploy that software to the passive side, blue, where we’re able to kick the tires on it a little bit, and we can do that during the day.
Let’s say, America’s most famous inventor, Thomas Jefferson, is making a payment on the active side. He’s filling out his application. When we’re ready with version 3.0, we can just flip a switch at the load balancer, and traffic is routed to the blue side. Thomas Jefferson has no idea this is happening. We’ve rolled out the code to production during the day, and of course, if there’s an issue, we can simply flip back.
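The switch described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the USPTO implementation: `BlueGreenRouter`, `release`, and the `smoke_test` callback are hypothetical names, and a real load balancer would be driven through its own API rather than an in-memory object.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class BlueGreenRouter:
    """Toy stand-in for a load balancer fronting two instances (hypothetical)."""
    versions: Dict[str, str] = field(
        default_factory=lambda: {"green": "2.0", "blue": "2.0"}
    )
    active: str = "green"  # the side currently serving customers

    @property
    def passive(self) -> str:
        return "blue" if self.active == "green" else "green"

    def deploy_to_passive(self, version: str) -> None:
        # Customers are untouched: only the idle side changes.
        self.versions[self.passive] = version

    def switch(self) -> None:
        # "Flip the switch": the passive side starts taking traffic.
        self.active = self.passive

    def serving_version(self) -> str:
        return self.versions[self.active]


def release(router: BlueGreenRouter, version: str,
            smoke_test: Callable[[str], bool]) -> bool:
    """Deploy to the passive side, verify it, then cut traffic over."""
    router.deploy_to_passive(version)
    if not smoke_test(router.versions[router.passive]):
        return False  # the active side never changed, so there is no outage
    router.switch()
    return True


router = BlueGreenRouter()
ok = release(router, "3.0", smoke_test=lambda v: v == "3.0")
print(ok, router.active, router.serving_version())

router.switch()  # rolling back is just flipping the switch again
print(router.active, router.serving_version())
```

The point of the design is that the risky step (installing and checking the new version) happens on the side that takes no traffic, so both the cutover and the rollback reduce to changing which side is active.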
The product owner
With the product owner, we actually kept him in the dark. Whether you call him the business or the product owner, by this I mean the accountants, or the finance folks in the Patent and Trademark Office.
This was on purpose because I’ve seen so many failed attempts where you say, “Hey, we’re going to do DevOps,” or “We’ve got continuous integration,” to the product owner and the whole thing doesn’t really make sense to them. I think we wanted to do more of a show versus tell.
For me, I was actually presenting and our product owner was in the audience. He came up to me right afterward. I mean, I felt terrible. He was like, “Simmons, what is this blue/green thing you’re working on?” I said, “Okay, now is the time for the meeting.”
So, I wore my best suit, and I sold it. I explained the whole conversation to him, and he was like, “Wait, wait, wait, Simmons. Are you telling me that I can tell you what to do, and then you just do it the next day, and there’s no outage?” I’m like, “Yeah, exactly.”
I concentrated on the continuous delivery piece and the fact that he has control. He tells us what to do, and we’ll do it. So that made a big impact for sure.
It’s also important to keep in mind that with RAM, the legacy system that used to be around, he was used to saying, “Hey, I want this change.” It would go in some backlog or however they did it, and then he wouldn’t see it in production for another six months, and when they did deploy it, it was a 12-hour outage. This was a big change for him.
The pilot lasted nine months
We collected a ton of data, maybe even too much data. I’ll run through some of it here.
- We did 47 software changes. During that time, we had zero deployment downtime.
- We had also amassed 7,000 automated tests: functional tests, not unit tests.
- Likewise, historically we had a group that’s like a glorified NOC or SOC that was responsible for operations of all the systems. We took that in-house. They sat with me and my team with the developers. Because of that, the MTTR that we measured went way down. If there was an issue, we knew about the system. We could fix it.
- Lead time, of course, was reduced. We don’t have to go through the dinosaur tail anymore.
- Change failure rate was zero, believe it or not. That might have been a bad thing: because we were doing a pilot, and I was so paranoid that it had to be perfect, it caused us to go a little too slow and maybe not take as many risks.
We took all that data after those nine months and we went to the top floor and met back up with the CIO and the CFO, and said, “Hey, we want to expand the scope of the pilot. We don’t want to do just defects any more. We want to do everything related to FPNG.”
Two interesting things happened during that meeting.
The first was that the enterprise started to trust us. We weren’t these rogues in the government. I think the enterprise thought that we were bad boys who were trying to cheat the system and not go through the same rigor that’s in that dinosaur tail. In fact, our biggest supporter ended up being the director in charge of that release process, who had stormed out of our meeting. I think it was originally a misunderstanding where she thought I just didn’t care about the rigor, and I thought she was just married to the process. The truth was we were both passionate about rigor; I just wanted to do it in an automated way. When she saw that, she was good with it.
The second interesting thing was the CIO at the time decided that he wanted to expand the scope of the policy, so that it wasn’t just FPNG. We were going to write this policy in such a way that all 100-plus systems in USPTO could do it. If they reached a certain bar, they could opt in and not have to do the dinosaur anymore.
Audit defense and cybersecurity
It’s the federal government, so it will be no surprise that we’re dealing with a lot of compliance, audits, assessments, etc. from NIST, the Federal Information Security Management Act (FISMA), the Inspector General, OMB, you name it. But we actually think doing DevOps and having the automation in this policy strengthens our posture. Some of the terms in these procedures are a little weird with the way the policy is written, because of how the auditors are used to seeing them. We think it’s going to make the audit easier because it’s not going to be about interviewing people and asking, “Can you find the email where you said that you ran the security scans,” etc. It’s just going to be the run logs from the robot.
Of course, this particular financial audit is kind of like the government version of a SOX audit. It very much concentrates on production changes, who’s approving them, and how the approval is done to these production changes. Of course, DevOps isn’t a shortcut or a waiver. In fact, I think it’s harder to some extent, but it’s better.
A little bit about emerging architecture
I would say about a year and a half ago we started noticing that some of our build times were taking longer. The test runs were taking longer, and we needed a way to shrink down that time. We started to barely dip our toes into microservices. Our definition of microservices or the key term we use is ‘independence.’ We ended up having an independent repo, independent build, independent test, independent deployable artifact, and independent database schema.
We tried this just with one microservice. We had a new feature coming out where a customer could request a refund. After about a month I think, the developers started saying, “Why don’t we do this with everything?” At this point, we’re pushing 25 different microservices.
Before we were doing DevOps, in 2015, we were lucky to get a deployment to production once every quarter.
When we started the DevOps pilot policy in ’16, the numbers started increasing. Toward the middle of FY17, we did phase two of the policy, where we could do bug fixes and features. At this same point, we also did microservices. Last quarter we had 33 production deployments. If you knock out weekends, on average we’re doing two or three deployments a week, about every other day, and still moving in a good pattern.
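As a quick check of that arithmetic, here is a small sketch. The 33 deployments come from the talk; the 13-week quarter and the five-day working week with no holidays are assumptions made only for this calculation.

```python
deployments = 33           # production deployments last quarter (from the talk)
weeks_per_quarter = 13     # assumption: 52 weeks / 4
working_days = weeks_per_quarter * 5  # assumption: weekends off, holidays ignored

per_week = deployments / weeks_per_quarter
days_between = working_days / deployments

print(f"{per_week:.1f} deployments per week")                      # 2.5
print(f"one deployment every {days_between:.1f} working days")     # 2.0
```

Under those assumptions the figures line up with "two or three deployments a week, about every other day."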
Challenges and actions taken
Going through a few different challenges:
- Those monolithic applications take more time to do everything. The action we took was to really chunk up the applications into smaller, independently deployable artifacts, aka microservices.
- Another thing that helped us out with microservices is we have a number of teams assigned to each business domain, or each microservice. Some teams wanted to move faster than others. In the past we had issues where Team A wanted to deploy something on Friday, but Team B wasn’t ready until the next Tuesday. We had experimented with feature toggles to help with that, but moving to the microservices pattern easily solved all that for us. It was a big win.
- Documentation is also no fun to do, but since there’s a lot of documentation in the government, the action we took was to focus on self-discovery tools. I’ll give you an example. We installed an APM tool on our software, and it all of a sudden magically built my entire architecture. I had probably spent thousands of hours in Visio diagramming what database and which servers. But just through the instrumentation, this tool just built the diagram for me, and that was then referenced in our security documentation.
- We were Agile in name only. This is a little bit from our dinosaur tail times. We had a scrum master, but we weren’t actually doing scrum or being Agile. Part of the action we took here was to agree upon a definition of “done.” Part of our “done” was that it is in production and customers are using it.
- I work in the CFO’s office, and within that office, we actually did a reorg or realignment. We were historically set up so that one group was in charge of new projects, one group was in charge of operations, and one group was in charge of audits/cybersecurity. They would do that for all of the financial systems: the accounting system, the EDW system, the revenue system. But then we did a reorg around the actual products.
- Background jobs running on blue and green — This was not a fun one, just telling the truth here. I won’t go quite into the details of it, but at one point we had jobs running on both sides in production. You can imagine what these jobs were doing, sending emails to customers, and processing money, and the list goes on and on. We ended up fixing that and then utilized our feature toggle function so that we can move jobs back and forth between blue and green.
- Automated tests were slow and created a lot of false alarms. I’m not talking about unit tests, but actual functional and performance tests. The action we took, and these are fancy words, was to create atomic, idempotent tests. Let me explain that. This is a really high-level view of our architecture for making a payment. When a customer goes to make a payment, they first have to put something in a shopping cart. It’s retrieved from the shopping cart. They put in their credit card number, and then we call a Treasury system to actually pull the money from the credit card. The way we write the tests is that the test is responsible for doing all of this. The test first creates its own data, so it will actually seed data into the shopping cart. The test will then turn on a fake or mock Pay.gov. I’m not interested in testing the Department of the Treasury’s systems. I want to be able to run my test when their system is down. I want to be able to run my test when the network’s down. The test turns on that mock, executes, and makes that payment. It then verifies the response to make sure, “Hey, yeah. The credit card went through.” It then turns off the mock and deletes the data from the database. Automated testing is what really helps continuous delivery.
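The atomic, idempotent test described in the last challenge can be sketched with Python’s standard `unittest` module. Everything here is a hypothetical stand-in rather than FPNG code: `CartStore` plays the shopping cart database, `MockGateway` plays the mock Pay.gov, `make_payment` is the system under test, and the amount and card number are made-up test values.

```python
import unittest
import uuid


class CartStore:
    """Hypothetical in-memory stand-in for the shopping cart database."""
    def __init__(self):
        self.carts = {}

    def seed(self, cart_id, amount_cents):
        self.carts[cart_id] = amount_cents

    def get(self, cart_id):
        return self.carts[cart_id]

    def delete(self, cart_id):
        self.carts.pop(cart_id, None)


class MockGateway:
    """Hypothetical fake of the external payment system; approves every charge."""
    def __init__(self):
        self.enabled = False

    def charge(self, card_number, amount_cents):
        if not self.enabled:
            raise RuntimeError("mock gateway is off")
        return {"status": "APPROVED", "amount_cents": amount_cents}


def make_payment(store, gateway, cart_id, card_number):
    """System under test: read the cart, then charge the card."""
    return gateway.charge(card_number, store.get(cart_id))


STORE = CartStore()  # shared between runs, like a real test database


class PaymentTest(unittest.TestCase):
    def setUp(self):
        # The test creates its own data under a unique key...
        self.cart_id = f"test-{uuid.uuid4()}"
        STORE.seed(self.cart_id, 28000)
        # ...and turns on the mock instead of calling the real system.
        self.gateway = MockGateway()
        self.gateway.enabled = True

    def tearDown(self):
        # Turn off the mock and delete the seeded data, pass or fail.
        self.gateway.enabled = False
        STORE.delete(self.cart_id)

    def test_credit_card_payment_is_approved(self):
        response = make_payment(STORE, self.gateway, self.cart_id,
                                "4111111111111111")
        self.assertEqual(response["status"], "APPROVED")
        self.assertEqual(response["amount_cents"], 28000)


def run_once():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(PaymentTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


# Idempotent: a second run behaves exactly like the first and leaves nothing behind.
first, second = run_once(), run_once()
print(first.wasSuccessful(), second.wasSuccessful(), len(STORE.carts))
```

Because each test seeds and deletes its own uniquely keyed data and never touches the real external system, it can run at any time, in any order, and repeatedly, which is what cuts down on false alarms.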
Automated testing results
In just one of our microservices, we have close to 1,500 tests. They ran in seven minutes and 14 seconds. We can then take those same tests and turn a knob to run them as a performance test.
Our SLA is that the 95th percentile response time has to be under a second, and our results are well under that. All this, from functional to performance testing, runs in under 30 minutes. This historically would have taken us two months. Then we run security scans, and our policy states we’ve got to have zero criticals and zero highs.
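A release gate built on those two rules might look like the sketch below. It is an assumption-laden illustration, not the USPTO pipeline: `percentile` and `release_gate` are hypothetical helpers, the percentile uses the nearest-rank method, and the latency samples and scan counts are invented for the example.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def release_gate(latencies_s: List[float], findings: Dict[str, int],
                 sla_s: float = 1.0) -> List[str]:
    """Return the reasons a build is blocked; an empty list means it may ship."""
    reasons = []
    p95 = percentile(latencies_s, 95)
    if p95 >= sla_s:
        reasons.append(f"p95 latency {p95:.3f}s is not under {sla_s}s")
    for severity in ("critical", "high"):
        count = findings.get(severity, 0)
        if count > 0:
            reasons.append(f"{count} {severity} security finding(s)")
    return reasons


# Hypothetical sample data, not measurements from the talk.
latencies = [0.12] * 90 + [0.40] * 8 + [1.50] * 2
print(percentile(latencies, 95))
print(release_gate(latencies, {"critical": 0, "high": 0, "medium": 3}))
print(release_gate(latencies, {"critical": 0, "high": 1}))
```

Returning the list of blocking reasons, rather than a bare pass or fail, also gives the run log something concrete to show an auditor.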
Lessons learned
- Start small and scale: This is the idea of just one system, just defects, and then scaling per system over time.
- Executive buy-in: Pitch it to the executives. You have to find the right time to do it, but I do think you need their buy-in.
- ITIL is okay: There’s something called a standard change, and that’s what we exploit. Every change that we make to the system is stored in our ITSM or ITIL system, but it’s a pre-approved change. We’re not going to the CAB and waiting for approval. It’s more of a notification and system of record that we did it.
- A-Team on automated testing: That’s so critical for continuous delivery. We’ve got to try to make it cool again.
- You’ve got to have a written, documented policy: You need this for the audit and to be able to show folks.
- Don’t wait until you have the perfect cloud: Maybe this is a little controversial, but I think there’s a lot that you can do before you have a perfect cloud, whether it’s automated testing, blue/green, or security scans. There’s a lot of good stuff. Don’t wait until all of your systems are running on AWS.
- Monitoring tools now are amazing: If you had told me about these APM and log aggregation tools 10 years ago when I was a developer, I would never have believed you, but they’re totally life-changing.
- Concentrate on continuous delivery: The reason I say this is I think sometimes I hear folks across the federal government saying like, “Oh, we’re doing DevOps because we’re working on our culture or because my dev team is talking to the ops team.” I think really the tangible outcome is to be able to push stuff into production frequently.
- Finally, for the product owner, explain it in plain language. Show, don’t tell.