By Piotr Gaczkowski, IOD Expert
What comes to mind when you think of great design? Is it craftsmanship? A certain elegance in the design itself? Do you admire the designer or the makers?
Whatever your answer, one thing is certain: you need to encounter the design to appreciate it.
Most of the great creations we admire today were designed for maintenance. Consider the European cathedrals built in the Middle Ages, for example. They didn’t crumble after a few hundred years. Rather, they are still standing majestically, welcoming visitors.
But if their architects had focused only on developing new ideas, chances are we wouldn’t be able to visit them today. They’d be long forgotten, buried in the dust.
What does this have to do with software architecture? Quite a lot, really!
What Do You Optimize For?
Let’s stop for a moment to answer the question in this section’s heading: When developing software, what do you optimize for? Is it the development cost? The infrastructure cost? Or is it the maintenance cost?
In my experience, many companies tend to opt for the first two choices. It’s rare that maintenance gets prioritized. What’s the problem with this approach? Well, even if you manage to ship the product on schedule and within budget, chances are it will incur significant costs over its lifetime.
Specifications change; compliance issues arise; and there are bugs, of course—some in your applications, some in the third-party components you use. Outages and crashes will make your life harder as well. Sound drastic? Well, you can prepare for all of these catastrophes! This article will give you some tips.
Nobody Likes Maintenance
Kurt Vonnegut wrote, “Another flaw in the human character is that everybody wants to build and nobody wants to do maintenance.” This is especially true in software development. Making software work—i.e., preparing the architecture and the implementation—is seen as a glorious job. Yet everything that comes after—QA, deployment, maintenance, bug fixing, etc.—doesn’t share the glory. Sometimes these tasks are even seen as “dirty.”
But the less we focus on maintenance, the more difficult and costly changes to the production system will be. And those changes will almost certainly be necessary. There is nothing more permanent in the world than temporary solutions.
I once struggled to maintain a service that was supposed to be up for no more than two months. Well over a year after the original deployment, we had no documentation, no working deployment scripts, and no access to some of the external services whose trial periods had expired. Nobody had cared about any of that, assuming the project would bite the dust in a few weeks. Only it didn’t.
An extreme case of waste? Having to reverse engineer a project your team created because nobody remembers how it was supposed to work. Please, don’t let that happen. It may actually be you doing the maintenance down the line. Make it easier for your future self. How? I’ll show you in a moment.
A Few Weird Tricks to Make Maintenance Easier
Unfortunately, there is no one perfect trick to make maintenance easier. The best you can do is follow some best practices that will reduce future costs.
Perhaps the main thing software engineers struggle with is debugging. To figure out why something isn’t working as expected, you may have to test multiple hypotheses and carefully trace the flow of your code.

That’s easy when you have a debugger attached to your development environment and you control that environment yourself. But in a production system serving thousands of clients each minute, stopping the execution just to watch it is not an option.
Observability
One popular way to make sure things run as planned is observability. If you haven’t heard the term, you may want to read “So What is Observability Anyway” by fellow IOD Expert John Fahl. In short, it’s monitoring, logging, tracing, analyzing, and alerting blended together.
Still, having each one of these is not enough; it’s important that they work together. This way, at any point in time, you can check whether things are going as planned or if there’s a fire starting that you still have time to extinguish.
What’s so great about observability? If you have it, you’re the first to know about potential problems—not the client. Usually, you can fix the issues before they become destructive. Or before anyone notices.
Back to our cathedral analogy: you can fix the leaking roof before the paying tourists notice.
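To make this a bit more concrete, here’s a minimal sketch of what instrumentation can look like inside a Python service. It assumes the prometheus_client library; the metric names and the checkout handler are invented for illustration, and the actual alerting rules would live in your monitoring stack rather than in the application code.

```python
# A minimal observability sketch: counters and a latency histogram for
# dashboards and alerts, plus a searchable log line when things fail.
# Metric names and handler are illustrative, not from a real system.
import logging
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("checkout_requests_total", "Checkout requests handled")
ERRORS = Counter("checkout_errors_total", "Checkout requests that failed")
LATENCY = Histogram("checkout_latency_seconds", "Checkout handler latency")

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("checkout")


def process_order(order_id: str) -> None:
    ...  # placeholder for the real business logic


def handle_checkout(order_id: str) -> None:
    REQUESTS.inc()
    start = time.monotonic()
    try:
        process_order(order_id)
    except Exception:
        ERRORS.inc()
        # A log line you can grep and correlate with the metrics spike.
        logger.exception("checkout failed order_id=%s", order_id)
        raise
    finally:
        LATENCY.observe(time.monotonic() - start)


if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for your monitoring system
    handle_checkout("demo-order")
```

The point is that every request leaves a trail: counters your dashboards and alerts can watch, a latency histogram for spotting slowdowns, and a log line to search when something does go wrong.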
Automated Deployments
Once you figure out what needs to be done, you still have to ship it. Manual deployment processes are risky and error-prone. The more automation your release process features, the easier it will be to push the necessary changes into production or to roll them back in case things go the wrong way. This is where Continuous Integration/Continuous Deployment (CI/CD) comes into play. A good deployment pipeline will save you time during every update, and a very good one will do so without causing downtime.
In our cathedral, this means fixing the leaking roof while the tourists are visiting—without interruptions!
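What does that look like in practice? Below is a rough sketch of the kind of deploy-and-verify step a pipeline might run after the tests pass. The `deploy-tool` command and the health-check URL are placeholders for your own tooling; the shape of the logic (deploy, verify, roll back automatically) is what matters.

```python
# A sketch of an automated deploy step with rollback. "deploy-tool" and
# the /health URL are hypothetical stand-ins for your real tooling.
import subprocess
import sys
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/health"  # hypothetical health endpoint


def run(*cmd: str) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


def healthy(retries: int = 10, delay: float = 3.0) -> bool:
    for _ in range(retries):
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass
        time.sleep(delay)
    return False


def main(new_version: str, previous_version: str) -> int:
    run("deploy-tool", "rollout", new_version)
    if healthy():
        print(f"{new_version} is serving traffic")
        return 0
    # Automated rollback: the part humans tend to fumble at 3 a.m.
    print(f"{new_version} failed health checks, rolling back")
    run("deploy-tool", "rollout", previous_version)
    return 1


if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

Because the rollback is scripted, a bad release in the middle of the night becomes a non-event instead of an all-hands incident.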
Chaos Engineering
This goes further than the previous tips. While observability focuses on early discovery, and CI/CD focuses on uninterrupted deployment, chaos engineering focuses on problem prevention.
Unlike with our previous countermeasures, there is no simple analogy for this in the world of architecture. The closest is a segmented military base: an explosion in one segment can only harm the contents of that particular segment. It doesn’t impact any other segments, and people can carry on with their tasks in the segments that haven’t been hit.
Even better, chaos engineering openly courts the dangers and plays with them. You might have heard of Netflix’s Simian Army. Netflix believes that when its infrastructure is constantly under various attacks, the design gets stronger each time an incident occurs.
Chaos engineering requires not only various levels of redundancy, but also clever tests to make sure responses to possible worst-case scenarios are actually practiced every day. The result? Because Netflix performs this self-destruction regularly, no external threat has managed to do it serious harm so far.
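You don’t need Netflix’s scale to try the idea. Here’s a toy sketch of injecting failures into a single dependency call. Real chaos tooling (Chaos Monkey, Gremlin, and friends) works at the infrastructure level, but the principle is the same: break things on purpose so the fallback path gets exercised every day. The decorator, failure rates, and function names below are made up for illustration.

```python
# A toy fault injector in the spirit of chaos engineering: randomly fail
# or slow down a dependency call so the fallback path gets exercised.
import functools
import random
import time


def chaos(failure_rate: float = 0.05, max_delay: float = 2.0):
    """Randomly fail or delay the wrapped call (illustrative only)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise ConnectionError("chaos: injected dependency failure")
            time.sleep(random.uniform(0, max_delay))  # injected latency
            return func(*args, **kwargs)
        return wrapper
    return decorator


@chaos(failure_rate=0.1)
def fetch_recommendations(user_id: str) -> list[str]:
    # Placeholder for a call to a downstream service.
    return ["title-1", "title-2"]


def homepage(user_id: str) -> list[str]:
    # The degraded-but-usable path that chaos experiments keep honest.
    try:
        return fetch_recommendations(user_id)
    except ConnectionError:
        return ["generic-popular-title"]


if __name__ == "__main__":
    print(homepage("demo-user"))
```

If the degraded path stops working, you find out from your own experiment, not from an outage.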
Try What Works for You
In this article, I’ve shown you why it’s important to focus on maintenance from day one. I’ve also proposed some practices that can help you get there. But as I have mentioned, there is no single trick that works every time.
Each service is different, with different requirements and use cases. Some can handle downtime, while for others it means thousands of dollars in lost revenue. When discussing the scope of work with your client (external or internal), make sure to plan appropriate maintenance strategies. After all, developing a proper self-healing cluster may be a bit too much if your application serves 100 requests each month.
Unless you do it for fun (and to learn), of course.