While the main goal of monitoring applications has not changed, the way we approach monitoring has changed drastically because we’ve reinvented the way we build applications. We’ve moved from a monolithic architecture to self-contained distributed services that allow us to build more reliable systems. But this shift comes with a new set of problems, one of which is monitoring. Rather than monitoring one giant system, we now have multiple systems within our purview to monitor.
Having helped teams migrate from a monolith to a microservice architecture, I’ve seen that it takes little time before a team realizes the traditional way of monitoring their system no longer works when services are distributed.
In this post, you’ll learn about the concept of monitoring and why it’s a must for modern enterprises. Moving on, you’ll discover two types of monitoring, the golden signals for monitoring distributed systems, and factors that determine what you should monitor.
Why Monitoring?
Monitoring is about gathering, processing, aggregating, and displaying quantitative data about a system to understand its behavior and track its performance. Monitoring is an invaluable tool for building highly available, reliable systems and providing customer-oriented, responsive service that meets end-users’ demands and requirements.
Monitoring is essential to identify system failures before they lead to actual problems. Even when something goes wrong, or the system eventually starts behaving abnormally, monitoring provides useful information that supports dynamic documentation, performance evaluation, testing, and debugging of distributed systems. Despite various software techniques used in ensuring software quality and preventing defective code from hitting production, bugs and performance issues still creep into production environments. It’s through monitoring that modern enterprises can proactively inspect and react to issues.
We’re in an age where understanding the customer is crucial to beating the competition since it ensures you are building the right product for your end users. Monitoring is not only about detecting issues but also customer understanding. Monitoring business-specific metrics can help provide operations and development teams with all the information needed to understand end-user needs and deliver a product they love to use. The benefits of building the right product include an increase in product adoption, increase in brand loyalty and revenue, and better customer satisfaction.
Typically, monitoring comes in two forms: black box and white box.
Black Box Monitoring
Black box monitoring is a symptom-oriented way of observing and identifying system problems from an end user’s perspective. It lets you monitor an application as though you’re a user: you don’t know how the application works, you lack visibility, and you don’t have control over the application.
Black box monitoring can help you validate whether or not all the services and nodes are working correctly. But, you will lack the understanding of how the system may have gotten to its current state. The only information available is what you can deduce by examining the system’s externally visible behaviors.
Black box monitoring tools rarely offer the meta-information needed to understand why certain things happened in a system. A typical example of black box monitoring would be a disk space check that reactively alerts you whenever your system’s disk utilization exceeds a certain threshold.
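As a rough sketch of the disk space check described above, the Python snippet below compares disk utilization against a threshold. The function names, the 90% default threshold, and the `/` path are illustrative assumptions; only `shutil.disk_usage` comes from the standard library.

```python
import shutil


def utilization(used: int, total: int) -> float:
    """Return disk utilization as a fraction between 0 and 1."""
    if total <= 0:
        raise ValueError("total must be positive")
    return used / total


def disk_alert(used: int, total: int, threshold: float = 0.90) -> bool:
    """Return True when utilization exceeds the threshold (assumed default: 90%)."""
    return utilization(used, total) > threshold


if __name__ == "__main__":
    usage = shutil.disk_usage("/")  # the path is an assumption; use your own mount point
    if disk_alert(usage.used, usage.total):
        print(f"ALERT: disk is {utilization(usage.used, usage.total):.0%} full")
```

Note that the check only reports that the threshold was crossed. It says nothing about how the disk got there.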
White Box Monitoring
White box monitoring, which borrows its name from white box (or transparent) testing, monitors a system whose application structure, code internals, and design are known. With white box monitoring, you can understand what’s happening inside an application. It provides the information (via logs, metrics, and traces) you need to answer questions that might arise about your system.
White box monitoring is suitable when you have control and visibility over an application: you understand how it works and know what to expect. White box monitoring provides product and operational insights that help you build the right thing and build it right.
Unlike the black box monitoring approach, white box monitoring allows you to understand why certain things happened in a system. In the disk space check example, a white box monitoring solution would let you know the rate at which the disk is filling up and the specific service(s) that are causing the increase.
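To make the disk space example concrete, here is a minimal sketch of the two questions a white box approach can answer: how fast is the disk filling up, and which service is responsible? The sample format (timestamp, bytes used) and all function names are assumptions for illustration, not the API of any particular monitoring tool.

```python
from typing import Dict, List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, bytes used)


def fill_rate(samples: List[Sample]) -> float:
    """Average growth in bytes per second between the first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must increase")
    return (used1 - used0) / (t1 - t0)


def seconds_until_full(samples: List[Sample], capacity: float) -> Optional[float]:
    """Estimate the time left before the disk is full; None if usage is not growing."""
    rate = fill_rate(samples)
    if rate <= 0:
        return None
    return (capacity - samples[-1][1]) / rate


def fastest_growing(per_service: Dict[str, List[Sample]]) -> str:
    """Return the name of the service whose disk usage grows the fastest."""
    return max(per_service, key=lambda name: fill_rate(per_service[name]))
```

For example, `fastest_growing({"api": [(0, 0), (10, 10)], "logs": [(0, 0), (10, 50)]})` returns `"logs"`, which points you at the service to investigate.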
“Which of the monitoring techniques should I adopt?” you might ask.
You’ll get more benefits by combining both options. After all, it won’t help to identify system problems using black box monitoring when you can’t understand why the issues occurred. Black box monitoring allows you to figure out what’s broken, while white box monitoring helps you figure out why it’s broken.
By implementing both white box and black box monitoring, operations and development teams can plan capacity upgrades together, build resilient systems, and solve system issues that may arise. For instance, Google combines modest but critical black box monitoring with heavy white box monitoring.
In the next section, you’ll discover the essential metrics organizations need to monitor to keep their systems reliable and fast as they scale.
Metrics for Monitoring Distributed Systems
Whenever a team plans to monitor a distributed system, the first step is to select the best set of metrics to track. But because of the inherently complex nature of distributed systems and the tons of metrics available, choosing the right metric is always tricky.
You can get started with metrics like latency, error, traffic, and saturation. These four metrics (a.k.a. the golden signals) act as the basic building blocks for an effective monitoring strategy.
Latency
Latency tracks the time it takes an application to process or service a request. It’s a metric that allows you to detect performance degradation problems in an application. When measuring latency, define a threshold for the latency of successful requests and track the latency of failed requests separately, because a request that fails fast and a request that succeeds slowly tell you different things. That way, you can quickly identify which services aren’t performing well, detect incidents faster, and respond to incidents on time.
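A minimal sketch of this idea in Python: split request latencies by outcome, then compare a percentile of the successful requests against a threshold. The nearest-rank percentile method, the 95th percentile default, and every name here are assumptions chosen for illustration.

```python
import math
from typing import Iterable, List, Tuple

Request = Tuple[float, bool]  # (latency in milliseconds, succeeded?)


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def split_latencies(requests: Iterable[Request]) -> Tuple[List[float], List[float]]:
    """Separate the latencies of successful requests from those of failed ones."""
    ok: List[float] = []
    failed: List[float] = []
    for latency_ms, succeeded in requests:
        (ok if succeeded else failed).append(latency_ms)
    return ok, failed


def breaches_threshold(requests: Iterable[Request], threshold_ms: float,
                       pct: float = 95.0) -> bool:
    """True when the chosen percentile of successful-request latency exceeds the threshold."""
    ok, _ = split_latencies(requests)
    return bool(ok) and percentile(ok, pct) > threshold_ms
```

Keeping the two groups apart matters because a burst of fast failures would otherwise pull the overall latency down and hide a real problem.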
Error
Error tracks the rate of failed requests in a system. There are several types of errors, but the most common are explicit errors (e.g., HTTP 500), implicit errors (an HTTP 200 success response that doesn’t provide the right content), and policy violations (a request that takes longer than a specified timeout period). Because no system is 100% error-free, you always need to track errors to identify when something is wrong. When tracking errors, it’s essential to report server errors and client errors separately. That way, engineers can properly understand different error types and get to the root of any problem faster.
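The error types above can be sketched as a small classifier. This is only an illustration: the `Response` fields (in particular `content_ok`, which stands in for whatever content validation your service performs), the category names, and the order of the checks are all assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Optional


class Response(NamedTuple):
    status: int         # HTTP status code
    content_ok: bool    # hypothetical result of a content check on the body
    duration_ms: float  # time taken to serve the request


def classify(resp: Response, timeout_ms: float) -> Optional[str]:
    """Return an error category for the response, or None if it counts as healthy."""
    if resp.status >= 500:
        return "server_error"      # explicit error on the server side
    if resp.status >= 400:
        return "client_error"      # explicit error, reported separately
    if resp.duration_ms > timeout_ms:
        return "policy_violation"  # succeeded, but slower than the agreed timeout
    if not resp.content_ok:
        return "implicit_error"    # success status with the wrong content
    return None


def error_rates(responses: Iterable[Response], timeout_ms: float) -> Dict[str, float]:
    """Fraction of all responses that fall into each error category."""
    items = list(responses)
    if not items:
        return {}
    counts = Counter(classify(r, timeout_ms) for r in items)
    counts.pop(None, None)  # drop the healthy responses
    return {kind: n / len(items) for kind, n in counts.items()}
```

Reporting `server_error` and `client_error` as separate rates keeps a spike in bad client input from being mistaken for a failing service.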
Saturation
Saturation tracks the load on a server or network resources. Saturation measures how full your service is, and it’s usually considered an early warning indicator of system slowdowns and failures. System metrics like CPU, disk space, and memory utilization are typical indicators for determining the saturation of production systems. When measuring saturation, choose metrics for the resources that most constrain the performance of your system. For instance, you can use CPU load for processor-intensive applications and memory for memory-intensive applications. You need to set a utilization benchmark, as every service has a limit after which performance degrades.
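One way to picture a utilization benchmark is as a per-resource limit that current readings are compared against. The sketch below assumes utilization is expressed as a fraction between 0 and 1; the function name, resource names, and benchmark values are made up for illustration.

```python
from typing import Dict, List


def over_benchmark(utilization: Dict[str, float],
                   benchmarks: Dict[str, float]) -> List[str]:
    """Return the resources whose utilization has reached their benchmark, sorted by name."""
    return sorted(
        name
        for name, value in utilization.items()
        if name in benchmarks and value >= benchmarks[name]
    )


# Example readings and limits (all values are assumptions).
current = {"cpu": 0.92, "memory": 0.40, "disk": 0.85}
limits = {"cpu": 0.80, "memory": 0.75, "disk": 0.85}
print(over_benchmark(current, limits))
```

In practice you would set each benchmark below the point where your own load tests show performance starting to degrade.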
Traffic
Traffic tracks the number of requests flowing across an application per time unit. Traffic can have several values, depending on the application you’re tracking. For instance, you can use HTTP requests per second, transactions per second, or even bandwidth consumption to measure the traffic for a web application. One way to monitor traffic in a distributed system is by viewing the number of network conversations. By monitoring the traffic in an application, you can observe how the app behaves during a surge and plan for a spike in demand.
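As a small illustration of requests per time unit, the sketch below counts requests inside a sliding window. The class name and the 60-second default window are assumptions, and timestamps are expected to be recorded in non-decreasing order.

```python
from collections import deque
from typing import Deque


class RequestRate:
    """Sliding-window request counter (a sketch, not a production rate meter)."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        if window_seconds <= 0:
            raise ValueError("window_seconds must be positive")
        self.window = window_seconds
        self._timestamps: Deque[float] = deque()

    def record(self, timestamp: float) -> None:
        """Register one request seen at the given time (in seconds)."""
        self._timestamps.append(timestamp)

    def per_second(self, now: float) -> float:
        """Average requests per second over the window that ends at `now`."""
        cutoff = now - self.window
        # Discard requests that have fallen out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) / self.window
```

Comparing this rate during a surge with the rate on a normal day gives you a baseline for capacity planning.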
The above four metrics are referred to as “golden” because they allow you to track things that directly affect the work-producing or end-user parts of production systems. The golden signals also serve as a foundation for actionable alerting and monitoring for IT teams and DevOps.
Now that you understand the various metrics to monitor, how do you select the right metric?
Choosing the Right Metrics to Monitor
When choosing what to monitor, you need to cut through the clutter and select only what matters the most to your organization. You should limit the scope of what your organization monitors based on your system’s objectives, budget, infrastructure, and human resources. It’s always advisable to select what you can conveniently implement and manage. It can be tempting to track all the metrics you can think of, but tracking everything imaginable leads to a noisy monitoring and alerting system that no one pays attention to. You have to understand and clearly see the need to monitor a metric. If you can’t ascertain its purpose and why it needs to be monitored, you probably do not need it, at least not for the time being. Avoid getting caught in the “we might need it in the future” syndrome.
Going Beyond Monitoring
As your organization’s system grows in complexity and usability, you’ll have more components to monitor, more failures to identify and resolve, more data and logs to sift through, and more metrics to track. It’s easy to get lost trying to grasp what’s going on in such a system, and traditional methods no longer provide effective monitoring because they were designed to catch mostly predictable problems in a single monolithic system.
Plus, as systems have become distributed, so have the teams responsible for building them. The year 2020 saw many companies adopt a remote-work culture. This means that the traditional approach of hanging monitoring screens in offices no longer cuts it, and you’ve had to reinvent the way you monitor and alert to proactively react to issues.
Latency, errors, traffic, and saturation are a good place to start with monitoring distributed systems. You also need a modern approach that allows you to stay one step ahead in terms of understanding your system. Examples of such techniques include game days, chaos engineering, and observability.
By leveraging these modern tools and techniques, operations and development teams can take their monitoring to the next level, build more resilient systems, and tactically identify and resolve both known and unknown system issues.