Three Phases of Monitoring on the Path to Observability

It’s now widely accepted that monitoring is only a subset of observability. Monitoring shows you when something is wrong with your IT infrastructure and applications, while observability helps you understand why, typically by analyzing logs, metrics, and traces. In today’s environments, determining the “root cause” of performance issues, the holy grail of observability, requires a variety of data streams: availability data, performance metrics, custom metrics, events, logs and traces, and incidents. The observability framework is built from these data sources, and it allows operations teams to explore the data with confidence.

Observability can also determine what prescriptive actions to take, with or without human intervention, to respond to or even prevent critical business disruptions. Reaching advanced levels of observability requires monitoring to evolve from reactive to proactive (or predictive) and finally to prescriptive. Let’s discuss what this evolution includes.

It's not an easy thing

First, a look at the current state of federated IT operations reveals the challenges. Infrastructure and applications are scattered across staging, pre-production, and production environments, both on-premises and in the cloud, and IT operations teams are constantly engaged to ensure these environments are always available and meet business needs. Operations teams must deal with multiple tools, teams, and processes. There is often confusion about how many data flows are required to implement an observability platform and how to align business and IT operations teams within the enterprise to follow a framework that will improve operational optimization over time.

For monitoring efforts to mature beyond indicator dashboards into this observable posture, they typically develop through three phases: reactive, proactive (predictive), and prescriptive. Let’s look at each.

Phase 1: Reactive monitoring.

These are monitoring platforms, tools, or frameworks that set performance baselines (norms), detect when those thresholds are breached, and raise the corresponding alerts. They help determine the optimization configurations required to keep performance within thresholds. Over time, as more hybrid infrastructure is deployed to support a growing number of business services and an expanding enterprise scope, the pre-defined baselines may drift. This can lead to poor performance becoming normalized, failing to trigger alerts, and eventually allowing the system to break down entirely. Enterprises then look to proactive and predictive monitoring to alert them in advance of performance anomalies that may indicate an impending incident.
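The reactive pattern described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation; the metric names and threshold values are invented for the example.

```python
# Reactive monitoring sketch: compare each metric sample against a static,
# pre-defined baseline and raise an alert only when a threshold is breached.
# Metric names and limits below are illustrative assumptions.

STATIC_THRESHOLDS = {
    "cpu_percent": 90.0,
    "memory_percent": 85.0,
    "disk_percent": 80.0,
}

def check_sample(metric: str, value: float):
    """Return an alert message if the pre-defined threshold is breached."""
    limit = STATIC_THRESHOLDS.get(metric)
    if limit is not None and value > limit:
        return f"ALERT: {metric}={value} exceeds static baseline {limit}"
    return None  # within the norm, so stay silent

print(check_sample("cpu_percent", 95.0))
print(check_sample("cpu_percent", 40.0))
```

Note the limitation the article points out: if "normal" drifts upward over time, these static limits never fire, which is exactly why the next phase learns baselines instead of hard-coding them.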

Phase 2: Proactive/predictive monitoring.

Although the two terms sound different, predictive monitoring can be considered a subset of proactive monitoring. Proactive monitoring enables enterprises to look at signals from the environment that may or may not be the cause of a business service disruption. This allows them to prepare remediation plans or standard operating procedures (SOPs) in advance to overcome priority-zero incidents. A common way to implement proactive monitoring is a unified "manager of managers" user interface, where operations teams can access all alerts from multiple monitoring domains to understand both the "normal" behavior and the "performance bottleneck" behavior of their systems. When a pattern of behavior matches an existing machine learning model, indicating a potential problem, the monitoring system triggers an alert.
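A hedged sketch of the "manager of managers" idea: alerts from several monitoring domains are normalized into one stream so operators get a single view of recurring signals. The domain names and alert fields are invented for illustration; real aggregators ingest from actual tool APIs.

```python
# "Manager of managers" sketch: merge alerts from multiple monitoring
# domains into one stream and count recurring (domain, signal) patterns,
# the raw material a learned model would later match against.
from collections import Counter

alerts = [
    {"domain": "network", "resource": "router-1", "signal": "packet_loss"},
    {"domain": "compute", "resource": "vm-7", "signal": "cpu_saturation"},
    {"domain": "network", "resource": "router-2", "signal": "packet_loss"},
]

def unified_view(alert_stream):
    """Count how often each (domain, signal) pair recurs across tools."""
    return Counter((a["domain"], a["signal"]) for a in alert_stream)

for (domain, signal), n in unified_view(alerts).most_common():
    print(f"{domain}/{signal}: {n} alert(s)")
```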

Predictive monitoring uses dynamic thresholds for technologies that are newer to the market, where there is no first-hand experience of how they should perform. These tools learn the behavior of indicators over time and send alerts when they notice deviations from the norm that could result in outages or performance degradation noticeable to end users. Appropriate actions can then be taken to prevent business-impacting incidents from occurring.
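One common way to realize a dynamic threshold is to learn a metric's normal range from a rolling window and flag samples that deviate by more than a few standard deviations. The window size and sensitivity below are illustrative tuning assumptions, not recommendations from the article.

```python
# Dynamic-threshold sketch: learn a metric's baseline from a rolling window
# and flag samples deviating by more than k standard deviations from it.
from collections import deque
from statistics import mean, stdev

class DynamicThreshold:
    def __init__(self, window: int = 30, k: float = 3.0):
        self.history = deque(maxlen=window)  # learned baseline data
        self.k = k                           # sensitivity (in std devs)

    def observe(self, value: float) -> bool:
        """Return True if value is anomalous versus the learned baseline."""
        anomalous = False
        if len(self.history) >= 5:  # need some history before judging
            mu, sigma = mean(self.history), stdev(self.history)
            if sigma > 0 and abs(value - mu) > self.k * sigma:
                anomalous = True
        self.history.append(value)
        return anomalous

detector = DynamicThreshold()
readings = [50, 51, 49, 50, 52, 51, 50, 49, 51, 50, 95]  # spike at the end
flags = [detector.observe(v) for v in readings]
print(flags[-1])  # the 95 spike deviates from the learned norm
```

Unlike the static thresholds of phase 1, nothing here is pre-defined per metric: the "norm" is whatever the detector has recently seen, so the same code adapts to technologies with no historical baseline.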

Phase 3: Prescriptive monitoring.

This is the final stage of the observability framework, where the monitoring system can learn from the events and remediation/automation packages in the environment and understand the following:

  • Which alerts occur most frequently, and which remedial actions from the automation package are executed for them.
  • Whether the triggered resources belong to the same data center, or the same issue appears across multiple data centers, which can point to a wrong configuration baseline.
  • Whether an alert is seasonal, so it can later be ignored without executing unnecessary automation.
  • What remedial actions are performed on new resources introduced as part of vertical or horizontal scaling.

The IT operations team needs appropriate algorithms to correlate and formulate these scenarios. This can be a combination of ITOM and ITSM systems feeding back into the IT operations analytics engine to build a prescriptive model.
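The first point above, correlating recurring alerts with the remediations that resolved them, can be sketched as a frequency analysis over an incident log. The alert names, runbook identifiers, and promotion threshold are all invented for illustration; a real system would draw these from ITSM tickets and automation-package execution records.

```python
# Prescriptive sketch: correlate recurring alerts with the remediation
# runbooks that resolved them, so frequent pairs can be promoted to
# automatic execution. Alert and runbook names are hypothetical.
from collections import Counter

incident_log = [
    ("disk_full", "runbook:expand_volume"),
    ("disk_full", "runbook:expand_volume"),
    ("high_latency", "runbook:restart_pod"),
    ("disk_full", "runbook:expand_volume"),
]

def prescriptions(log, min_occurrences: int = 3):
    """Suggest automating remediations seen at least min_occurrences times."""
    counts = Counter(log)
    return {alert: action
            for (alert, action), n in counts.items()
            if n >= min_occurrences}

print(prescriptions(incident_log))
# disk_full has recurred often enough to be a candidate for automation
```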

Looking ahead

Monitoring is not observability, but it is a key part of it, starting with reactive monitoring that tells you when pre-defined performance thresholds are breached. As you bring more infrastructure and application services online, monitoring needs to move toward proactive and predictive models that analyze larger monitoring data sets and detect anomalies indicating potential problems before service levels and user experience are impacted.

The observability framework then needs to analyze a series of data points to determine the most likely cause of a performance issue or outage scenario within the first few minutes of detecting an anomaly, and then start working to remediate that performance issue before it reaches a war room/situation analysis call. The end result is a better user experience, an always-available system, and improved business operations.
