My Experience as an Enterprise DevOps Engineer and as a Startup SRE

By John Fahl, IOD Expert
I’ve seen tremendous changes in the industry during my two decades of experience. And although these changes have become faster and more intense with the merging of development and operations, I’m no longer new to the idea. I’ve done DevOps work at two large enterprises and multiple startups, and both enterprises wanted to go all in on what they considered “DevOps.” One even said, “We want to be like Netflix.” They failed. There is no way to make this transition work without a heavy paradigm shift in culture, process, technology, and talent. Not one or two of those things, but all of them.
DevOps is super hot now, just as it’s been for several years. Talk to any tech recruiter—they are always on the lookout for good DevOps folks. But DevOps is widely misunderstood by organizations and engineers alike. Although DevOps is a methodology, like Agile, people use the term to describe the job of an engineer who just “automates things.” In reality, there are defined levels of DevOps engineers, and recruiters and hiring managers use those roles to qualify and classify positions as they build their DevOps teams.
Enterprises are struggling to make this transition to DevOps, but they are starting to focus on the coveted abilities of the site reliability engineer, a more focused expression of what DevOps means as an actual position. Site Reliability Engineering (SRE) was created by Google to handle its growing cloud-based and containerized operations, and it has become a widely adopted, sought-after, and well-defined skill set. To better understand this distinction, let’s dive into a few of my “DevOps” adoption and execution stories.

DevOps in the Enterprise: Retrofitting Junk Into Automation

My experience with enterprises is that they quickly identify the need to get DevOps talent. They try to recruit real DevOps engineers and convert internal folks to be “DevOps” engineers. But these efforts are useless when the engineers are put in charge of retrofitting automation into existing, aging stacks—which is woefully wasteful and a big mess to manage. Worse yet, the leaders of these efforts (in every case I’ve seen) haven’t been through such a transition themselves. So rather than bringing in experienced talent to lead the transformation for real, they rely on existing leadership to make it happen. Big mistake.
Ever try to create Chef cookbooks to retrofit automation of an Oracle RAC cluster? Or create automation so that those Windows 2003 servers can be kept in configuration management? Sounds horrible, right? I’ve seen it, and it really is.
I worked for an enterprise that wanted to move existing applications, such as Microsoft BizTalk, IBM Enterprise Service Bus, and SQL services, into AWS. They tried to rewrite and customize an existing (not cloud-friendly) commercial application and put it into the cloud. They sprinkled in a few native cloud services, and to make matters worse, wanted to bring in DevOps experts to “automate” all of this.
Rather than scrapping this stack and building new stateless microservices or containers, or at least modern platforms to replace the aging stack, we spent the next two years building and assisting the automation efforts of this amalgamation. We, the DevOps team, wrote tests and Chef configuration cookbooks, but none of the other application teams did. There was no CI/CD for their applications, and the only real tests that occurred at a project level were full end-to-end. That meant that every solution was required to be fully operational to perform a test—the complete opposite of test-driven development.
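To make that contrast concrete, here is a minimal, purely illustrative sketch (Python with pytest, not code from that project) of the kind of fast, isolated test that test-driven development depends on. It exercises one piece of configuration logic with no deployed environment at all, which is exactly what end-to-end-only testing can never give you.

    # Hypothetical example: a small function a config template might consume,
    # plus unit tests that run in milliseconds on a laptop.
    # All names are illustrative, not from the actual project.
    import pytest

    def render_db_config(host, port=1521):
        """Build the connection settings a config template would consume."""
        if not host:
            raise ValueError("database host is required")
        return {"host": host, "port": port, "pool_size": 20}

    def test_defaults():
        cfg = render_db_config("db01.internal")
        assert cfg["port"] == 1521
        assert cfg["pool_size"] == 20

    def test_rejects_empty_host():
        with pytest.raises(ValueError):
            render_db_config("")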
Despite our team making countless efforts to change our focus from automating the workload to creating a service or a self-service platform, the executive team and project drivers wanted no part in the change. They wanted obedience. Thus, we were just automation engineers, and weren’t creating any real value.
During this project, whenever there was an outage, it followed the same pattern: a scramble to fix it, a root-cause analysis, and a discussion of what the DevOps team was going to do to prevent the next one. There was never any reflection at the project level. No one asked, “Are we designing this the right way?” Instead, they brought in AWS to consult. AWS analyzed the project and told the executives they were doing it wrong. AWS told them to listen to our team.
They still didn’t listen.
Instead, three years and $150 million later, the project was scrapped. All executives and architects were fired or “found other opportunities,” and our previous efforts were jettisoned into the proverbial bit bucket—right where they belong.
This story is more common than you think.

“Let’s Be like Netflix,” They Said

Back to the organization that wanted to be like Netflix. They converted some of their talent into “DevOps” engineers, who could sort of script, but didn’t understand automation, CI/CD, TDD, cloud-native architecture, 12-factor application design, or any of those other constructs. This company wanted these “DevOps” engineers assisting us in building scripts to move existing workloads to the cloud and manage them there.
On one hand, the company wanted to move already end-of-life Windows Server 2003 and RHEL 5 servers to the cloud. On the other, they wanted to dive into Chef, containers, and AWS services. There was no focus and no clear transformation in mind. The idea was: if we build the cool technology, the DevOps transformation will just happen.
Of course, I pushed back and advised them to leave the old workloads on premises, or to move them and plan for their cloud-native replacements. I advised them to start rethinking solutions like Oracle RAC and to start adopting databases like Amazon Aurora MySQL. They needed to get out of the crazy patch cycles and start building ephemeral workloads on gold images.
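To show what “ephemeral workloads on gold images” looks like in practice, here is a minimal boto3 sketch. The instance ID, image name, and region are placeholders, not details from that engagement: bake a hardened AMI once, then launch short-lived instances from it and replace them when the next image ships, instead of patching long-lived servers in place.

    # Hypothetical sketch: bake a "gold image" (AMI) from a hardened build
    # instance, then launch ephemeral instances from it. All IDs and names
    # below are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # 1. Capture the hardened build instance as a versioned gold image.
    image = ec2.create_image(
        InstanceId="i-0123456789abcdef0",   # placeholder build instance
        Name="web-gold-2018-06-01",
        Description="Patched, hardened base image for the web tier",
        NoReboot=False,                      # reboot for a consistent snapshot
    )

    # 2. Wait for the AMI to become available before using it.
    ec2.get_waiter("image_available").wait(ImageIds=[image["ImageId"]])

    # 3. Launch short-lived instances from the image; when the next gold
    #    image ships, these get replaced rather than patched in place.
    ec2.run_instances(
        ImageId=image["ImageId"],
        InstanceType="t3.medium",
        MinCount=1,
        MaxCount=2,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "lifecycle", "Value": "ephemeral"}],
        }],
    )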
In the end, that wasn’t what the business wanted. They wanted their old workloads moved; we moved them. They wanted the old workloads retro-automated. They wanted to do it their way.
I wanted no part in that. I think most DevOps engineers would make the same decision.

SRE in the Cloud-Native Startup

In addition to enterprises, I’ve worked in several startups. A few have referred to their “ops” folks as DevOps engineers, but the positions were always more similar to SRE.
The scope of SRE can vary from shop to shop, but the mission is the same—delivering scalable, highly available services in a repeatable way.
What makes this extremely different from the enterprise-style adoption of a “DevOps engineer” is that SRE has clear goals and ownership of the role:

  • Build services that are highly available, scalable, and redundant.
  • Build as stateless as possible and move data to very controllable and auditable points (databases, caches, logging stacks).
  • Use cloud services (when the use case is right) because they reduce management overhead, risk, and the cost to automate.
  • Be part of all sections of the business (i.e., a stakeholder in the solution they are going to automate and maintain).
  • Be responsible for the uptime of the service/platform.
  • Be responsible for the health and maintenance of delivery pipelines.

This list is not exhaustive, but there is a clear difference between the enterprise’s “just automating things” approach and SRE.


I worked at a startup as an SRE manager. Everything was deployed as a microservice. Many were containers, while some used services such as AWS Elastic Beanstalk. All deployments went through CI/CD pipelines into blue/green environments, and all workloads were brought into consistent management through Puppet.
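As an illustration of the blue/green piece, here is a rough boto3 sketch of the cutover pattern with Elastic Beanstalk. The environment names and version label are placeholders, not our actual pipeline code: deploy the new version to the idle environment, wait until it reports healthy, then swap CNAMEs so traffic shifts in one step.

    # Hypothetical blue/green cutover with AWS Elastic Beanstalk.
    # Environment names and the version label are placeholders.
    import time
    import boto3

    eb = boto3.client("elasticbeanstalk", region_name="us-east-1")

    # Deploy the new application version to the idle "green" environment.
    eb.update_environment(
        EnvironmentName="web-green",
        VersionLabel="v1234",              # placeholder version label
    )

    # Poll until green is Ready and healthy. A real pipeline would also run
    # smoke tests against green's own URL before cutting over.
    while True:
        env = eb.describe_environments(
            EnvironmentNames=["web-green"]
        )["Environments"][0]
        if env["Status"] == "Ready" and env["Health"] == "Green":
            break
        time.sleep(15)

    # Shift live traffic by swapping the environment CNAMEs.
    eb.swap_environment_cnames(
        SourceEnvironmentName="web-blue",
        DestinationEnvironmentName="web-green",
    )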
We invested time in both unit and integration tests, written for each microservice and for the infrastructure code. We also wrote API tests. And our monitoring was valuable. In other words, we didn’t “monitor everything” and then start ignoring the thousands of worthless AWS CloudWatch alarms we would have received each day. Our alerting wasn’t perfect, but we tried to reduce the noise as much as possible through additional automation and by tuning alerts into actionable notifications.
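For a sense of what “actionable” meant, here is a hedged boto3 example of the kind of alarm we favored: one symptom-level alarm on sustained 5xx errors routed to a pager, rather than hundreds of per-host CPU alarms nobody reads. The names, dimensions, thresholds, and SNS topic ARN are all placeholders.

    # Hypothetical symptom-level CloudWatch alarm. All names, dimensions,
    # thresholds, and the SNS topic ARN are placeholders.
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="web-5xx-sustained",
        Namespace="AWS/ApplicationELB",
        MetricName="HTTPCode_Target_5XX_Count",
        Dimensions=[{"Name": "LoadBalancer", "Value": "app/web/0123456789abcdef"}],
        Statistic="Sum",
        Period=60,
        EvaluationPeriods=5,            # must breach five minutes in a row
        Threshold=50,
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:pager"],
    )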
The SRE team was evaluated on successful deployments, site uptime, and the level of test coverage. Also, our evaluations were realistic; people make mistakes, and sometimes, we took risks. Here’s an example:
During the famous Amazon S3 outage in 2017, 20% of our customer base was down because their web apps were backed by Amazon S3. Some on our team (including our interim VP of Engineering at the time) wanted us to rewrite our Amazon S3 calls to use MongoDB instead. But I’d been burnt before by creating disparate sets of data and then trying to reconcile them later. After some convincing, I won the argument, and we took the outage hit. Amazon S3 came back online, and everything was fine. There was an RCA, but we did it as a company, and there was no messy data cleanup afterward.
After that, we started to evaluate Kubernetes for our environment. We didn’t go that route after initial testing because it wasn’t our silver bullet. There were also efforts to evaluate moving into a different cloud. The SRE team was given a voice in driving business value and testing new solutions to problems, rather than sitting at the bottom of a decision hierarchy and just “automating” what we were told to. The mentality was totally different, and our skills brought value to the business.
Unfortunately, this company went under. It’s a long story, but AWS didn’t pull the plug, and the site remained online for five months with no one managing it. It might still be online. That’s automation done right.

Summary

Sure, this may read like a triumph-and-agony comparison. There are startups that get it all wrong and enterprises that get it right, but I’m willing to bet that’s not the common ratio. SRE teams, and the companies that embrace what they bring to the table, will get it right more often. These companies realize that DevOps is a methodology, and that straight “cloud automators” are not going to modernize their workloads or maximize their adoption through brute-force automation. There is a reason Google is so successful, and a reason that SRE is a clearly defined discipline: SRE teams get sh*t done.
