In the past few years, many organizations have adopted distributed systems, aiming to deliver highly available applications with the ability to quickly release new features and services to their customers. However, since distributed systems are a collection of smaller, loosely coupled services that emit a vast amount of telemetry data, it can be challenging to continuously monitor them.
Knowing first what to monitor in your distributed system can help simplify this process. This will give you a better understanding of the health of your application, allowing you to more quickly diagnose, fix, and even prevent problems, in turn ensuring the reliability of your services before any issue affects your end users.
Of course, getting a handle on the general health and reliability of your system is far different than taking action to improve your system’s reliability. In this post, I’ll introduce you to how Site Reliability Engineering can add value to your organization and how important the four golden signals are for deciding which metrics you should be monitoring on your distributed system.
How Organizations Ensure the Reliability of Their Services
Several approaches exist today, each with its own set of practices that, when implemented, help you stay on top of your system’s health and reliability. One such approach that has been growing in popularity for the past few years is Site Reliability Engineering.
What Is Site Reliability Engineering?
Site Reliability Engineering originated at Google in 2003 with Benjamin Treynor, an experienced software engineer who was hired to head up a production team. Treynor decided to implement software engineering principles for managing and running his operations, eventually leading to what is now known as Google’s SRE team. These principles were later written down in a guide detailing how Google applies its SRE strategy with a methodological blueprint for other organizations to follow.
Through their collaboration with other engineers, product owners, and customers, site reliability engineers define targets and measures to implement service level indicators (SLIs) and service level objectives (SLOs). This allows them to easily know when to take action to ensure a system’s reliability.
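To make this concrete, here is a minimal sketch of how an availability SLI might be computed against an SLO, along with the remaining error budget. The request counts and the 99.9% SLO are illustrative assumptions, not values from any real system.

```python
# Sketch: an availability SLI measured against an SLO, plus the remaining
# error budget. All numbers below are illustrative assumptions.

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0
    return successful_requests / total_requests

def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, < 0 = SLO violated)."""
    budget = 1.0 - slo   # allowed failure rate, e.g. 0.001 for a 99.9% SLO
    spent = 1.0 - sli    # observed failure rate
    return (budget - spent) / budget

sli = availability_sli(successful_requests=999_500, total_requests=1_000_000)
remaining = error_budget_remaining(sli, slo=0.999)
print(f"SLI: {sli:.4%}, error budget remaining: {remaining:.0%}")
```

When the remaining budget approaches zero, that is the signal to slow feature releases and prioritize reliability work.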
The Bigger Picture
With a holistic understanding of your operations, systems, and components, as well as the connections between them, SREs discover problems early, whether they occur within your systems or between teams, reducing the cost of failure. By accepting that failures will occur and that 100% availability is unnecessary, site reliability engineers help you measure risk. This lets you balance availability against feature development, giving teams much-needed flexibility to deliver updates and improvements to a system without worrying about setbacks or downtime; removing the fear of failure can also lead to greater innovation.
One key aspect of SRE is the automation that reduces toil; that is, the manual, repetitive production tasks that are devoid of any long-term value. Toil tends to scale linearly as your service grows, so by eliminating it, SREs free up their time to focus on other tasks, such as improving scalability, reliability, and performance across all systems; this, in turn, also saves other teams from having to handle tedious tasks.
Another indispensable component of SRE is monitoring. A comprehensive and up-to-date view of a system’s behavior and health demands a continuous monitoring strategy so that SREs can improve availability, uncover and fix performance issues, and quickly respond to incidents. For each incident, SREs are responsible for writing a blameless postmortem, complete with detailed documentation of the incident, root-cause analysis, how the incident was resolved, and effective preventive actions to avoid recurrence.
However, modern distributed systems emit hundreds of metrics—it’s just not practical to constantly monitor all of them. Deciding what you should monitor is crucial for delivering a highly available and reliable service, and this is where the “golden signals” serve as the perfect place to start.
The Four Golden Signals of Monitoring
The four golden signals of monitoring were introduced by Google in the SRE guide I refer to above. These four metrics are latency, traffic, errors, and saturation, and they are the essential building blocks for implementing an effective monitoring strategy.
Latency is the time taken to serve a request, widely known as response time. It is frequently measured on the server side, where you have the most control, but you should also measure client-side latency, since it is more relevant to your customers. Increased latency is a key indicator of degradation in an application. Define a threshold suitable for your application, and monitor the latency of successful and failed requests separately; this allows you to quickly identify performance issues and respond to incidents faster.
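One way to keep successful and failed latencies separate is to time each request and bucket the result by outcome. The sketch below assumes a hypothetical request handler that raises an exception on failure; the names are illustrative.

```python
# Sketch: recording latency separately for successful and failed requests.
# The `handle` function below is a hypothetical handler for illustration.
import time
from collections import defaultdict

latencies = defaultdict(list)  # "success" / "failure" -> response times in seconds

def timed(handler):
    """Decorator that times a handler and buckets the measurement by outcome."""
    def wrapper(request):
        start = time.monotonic()
        try:
            result = handler(request)
            latencies["success"].append(time.monotonic() - start)
            return result
        except Exception:
            latencies["failure"].append(time.monotonic() - start)
            raise
    return wrapper

@timed
def handle(request):
    # Illustrative handler: fails on a malformed (None) request.
    if request is None:
        raise ValueError("malformed request")
    return "ok"
```

Keeping the two distributions apart matters because a fast-failing error (e.g., an immediate 500) would otherwise drag the average down and mask a real slowdown.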
Traffic is the number of requests flowing across your network. What you consider to be traffic will depend on the characteristics of your application; some examples of traffic include the number of HTTP requests to an API or web server or the number of connections to an application server. Monitoring traffic in your application can help you identify capacity problems due to improper system configurations and plan ahead for future demand.
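A simple way to track traffic is a sliding-window counter that reports requests per second. This is a minimal sketch with injected timestamps so it stays deterministic; a production system would use a metrics library rather than hand-rolled counting.

```python
# Sketch: a sliding-window counter for traffic (requests per second).
# Timestamps are passed in explicitly to keep the example deterministic.
from collections import deque

class TrafficCounter:
    def __init__(self, window_seconds: float = 60.0):
        self.window = window_seconds
        self.timestamps = deque()

    def record(self, now: float) -> None:
        """Record one request at time `now` (seconds)."""
        self.timestamps.append(now)

    def rate(self, now: float) -> float:
        """Requests per second over the trailing window ending at `now`."""
        # Drop requests that have fallen out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        return len(self.timestamps) / self.window
```

Comparing this rate against historical baselines is what lets you spot capacity problems and plan for future demand.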
Errors indicate the rate of requests that fail. Whether errors are explicit, such as failed HTTP requests, or based on manually defined logic, you need to monitor them. You also must define which errors are critical and which are less dangerous, helping you take rapid action to fix those that pose the most risk, as well as the ones that occur most frequently.
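Classifying errors by severity can be as simple as mapping response codes into buckets and computing a rate per bucket. The status-code mapping below is an assumption for illustration; your own definition of “critical” will depend on your application.

```python
# Sketch: classifying errors by severity and computing per-severity rates.
# The mapping of status codes to severities is an illustrative assumption.
from collections import Counter

CRITICAL_STATUSES = {500, 502, 503}  # assumed to pose the most risk

def classify(status_code: int) -> str:
    if status_code in CRITICAL_STATUSES:
        return "critical"
    if status_code >= 400:
        return "minor"
    return "ok"

def error_rates(status_codes: list) -> dict:
    """Fraction of requests in each error severity bucket."""
    counts = Counter(classify(code) for code in status_codes)
    total = len(status_codes)
    return {severity: counts[severity] / total for severity in ("critical", "minor")}

rates = error_rates([200, 200, 404, 500, 200, 503, 201, 200])
```

Alerting on the critical rate with a tighter threshold than the minor rate is one way to act fastest on the errors that pose the most risk.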
Saturation is the overall capacity of your services. It measures the utilization of your service, how full it is, and how much more capacity it has. To measure saturation, you have to choose the utilization metrics for components that cause the greatest constraint on your applications (e.g., CPU for CPU-intensive applications, memory for memory-intensive applications, and disk I/O for databases and streaming applications). Defining a healthy percentage of utilization is essential since most systems usually start to degrade before a metric reaches 100% utilization—you want to be able to adjust capacity before performance degrades. Increased latency is often an early indicator of saturation and can be tracked by measuring your 99th percentile response time over one minute.
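The 99th-percentile check described above can be sketched as follows. The nearest-rank percentile and the 500 ms threshold are illustrative assumptions; choose a threshold that matches your own service's degradation point.

```python
# Sketch: p99 response time over a one-minute sample as an early
# saturation signal. The 0.5 s threshold is an illustrative assumption.

def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile; sufficient for a monitoring sketch."""
    ordered = sorted(samples)
    rank = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[rank]

def is_saturating(latencies_last_minute: list, p99_threshold_s: float = 0.5) -> bool:
    """True when the slowest 1% of requests exceed the assumed threshold."""
    return percentile(latencies_last_minute, 99) > p99_threshold_s

healthy = [0.05] * 100                   # all requests fast
degrading = [0.05] * 98 + [0.9, 1.2]     # tail latency creeping up
```

Because the tail degrades first, this check tends to fire well before average utilization reaches 100%, giving you time to add capacity.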
RED and USE Method
Two supporting methods also worth mentioning are RED and USE. The RED method (Rate, Errors, and Duration) focuses on monitoring your services, leaving their infrastructure aside and giving you an external view of the services themselves—in other words, from the client’s point of view. The USE method (Utilization, Saturation, and Errors) focuses on the utilization of resources to quickly identify common bottlenecks; however, this method only uses request errors as an external indicator of problems and is thus unable to identify latency-based issues that can affect your systems as well.
The golden signals try to get the best from both of these methods, but in the end, they all have the common goal of streamlining your complex distributed system to improve incident response.
Choosing the Right Monitoring Tools
After defining the right metrics to monitor through the golden signals, you then need to choose the monitoring tools that best suit your needs.
Open-source tools are advantageous for companies with a limited tooling budget. They often offer complete customization, allowing you to integrate them into your distributed system; but customizing such tools requires dedicated time and specialized knowledge, plus you are responsible for guaranteeing their availability, security, and updates. A popular combination of open-source tools for monitoring is Prometheus and Grafana.
As for managed monitoring tools, they come at a cost but offer a robustness that open-source tools simply do not. You are no longer responsible for their availability, security, and updates, and you get professional support for integrating them into your distributed systems. Some managed tools to consider are New Relic, Datadog, Thundra, and Epsagon.
Whichever monitoring tools you ultimately use, they typically come with dashboards for default metrics in your systems as well as the capability to define alerts and notifications yourself. Customizing your own dashboards allows you to implement metrics that better suit your application; also, you can set up detailed and well-structured alerts with proper policies and metric thresholds that reduce alert noise. Customization allows you to benefit the most from your monitoring tools—enabling you to quickly resolve issues or even avoid them. Testing your monitoring solution and alert configurations is a good practice to ensure everything is working as it should before implementing it on a production system.
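One common way to reduce alert noise is to require several consecutive threshold breaches before firing, so that a single transient spike does not page anyone. This is a minimal sketch; the 90% threshold and breach count are illustrative assumptions, and real tools express the same idea as alert "duration" or "for" clauses.

```python
# Sketch: a threshold alert that fires only after several consecutive
# breaches, filtering out transient spikes. Values are illustrative.

class ThresholdAlert:
    def __init__(self, threshold: float, required_breaches: int = 3):
        self.threshold = threshold
        self.required = required_breaches
        self.consecutive = 0

    def observe(self, value: float) -> bool:
        """Feed one metric sample; returns True when the alert should fire."""
        if value > self.threshold:
            self.consecutive += 1
        else:
            self.consecutive = 0  # a healthy sample resets the streak
        return self.consecutive >= self.required

cpu_alert = ThresholdAlert(threshold=0.9, required_breaches=3)  # e.g. 90% CPU
```

Testing this kind of logic against recorded metric samples, before it reaches production, is exactly the kind of verification the paragraph above recommends.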
The four golden signals are a good starting point when defining your service level objectives and deciding on a monitoring strategy to ensure your application’s reliability.