It’s now widely accepted that monitoring is only a subset of observability. Monitoring shows you when something is wrong with your IT infrastructure and applications, while observability helps you understand why, typically by analyzing logs, metrics, and traces. In today’s environment, a variety of data streams are needed to determine the “root cause” of performance issues, the holy grail of observability: availability data, performance metrics, custom metrics, events, logs and traces, and incidents. The observability framework is built from these data sources and allows operations teams to explore the data with confidence. Observability can also determine what prescriptive actions to take, with or without human intervention, to respond to or even prevent critical business disruptions. Reaching advanced levels of observability requires monitoring to evolve from reactive to proactive (or predictive) and finally to prescriptive. Let’s discuss what this evolution involves.

It's not an easy thing

First, a look at the current state of federated IT operations reveals the challenges. Infrastructure and applications are scattered across staging, pre-production, and production environments, both on-premises and in the cloud, and IT operations teams are constantly engaged to keep these environments available and aligned with business needs. Operations teams must juggle multiple tools, teams, and processes. There is often confusion about how many data flows are required to implement an observability platform, and about how to align business and IT operations teams within the enterprise behind a framework that will improve operational optimization over time. For monitoring efforts to mature beyond indicator dashboards into this observable posture, they typically develop through three phases: reactive, proactive (predictive), and prescriptive. Let’s look at each in turn.
Phase 1: Reactive monitoring

These are monitoring platforms, tools, or frameworks that set performance baselines or norms, detect when those thresholds are breached, and raise the corresponding alerts. They help determine the optimization configurations needed to keep performance thresholds from being reached. Over time, as more hybrid infrastructure is deployed to support a growing number of business services and an expanding enterprise scope, the pre-defined baselines may drift. This can lead to poor performance becoming normalized, failing to trigger alerts until the system breaks down completely. Enterprises then look to proactive and predictive monitoring to warn them in advance of performance anomalies that may indicate an impending incident.

Phase 2: Proactive/predictive monitoring

Although the two words sound different, predictive monitoring can be considered a subset of proactive monitoring. Proactive monitoring enables enterprises to look at signals from the environment that may or may not be the cause of a business service disruption, so they can prepare remediation plans or standard operating procedures (SOPs) to overcome priority-zero incidents. A common way to implement proactive monitoring is a unified "manager of managers" interface, where operations teams can see all alerts from multiple monitoring domains and understand both the "normal" behavior and the "performance bottleneck" behavior of their systems. When a pattern of behavior matches an existing machine learning model, indicating a potential problem, the monitoring system triggers an alert. Predictive monitoring, in turn, uses dynamic thresholds for technologies that are newer to the market and for which there is no first-hand experience of how they should perform.
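The contrast between a fixed Phase 1 baseline and a learned, dynamic Phase 2 threshold can be sketched as follows. This is a minimal illustration, not any product's actual logic; the metric values, window size, and sensitivity factor `k` are all assumptions.

```python
from collections import deque
from statistics import mean, stdev

# Phase 1 style: a fixed baseline that only alerts when breached.
STATIC_CPU_THRESHOLD = 90.0

class DynamicThreshold:
    """Phase 2 style: learn a metric's normal range from a rolling window."""

    def __init__(self, window=30, k=3.0):
        self.samples = deque(maxlen=window)
        self.k = k  # how many standard deviations count as an anomaly

    def observe(self, value):
        """Record a sample; return True if it deviates from the learned norm."""
        anomalous = False
        if len(self.samples) >= 10:  # need some history before judging
            mu, sigma = mean(self.samples), stdev(self.samples)
            anomalous = abs(value - mu) > self.k * max(sigma, 1e-9)
        self.samples.append(value)
        return anomalous

detector = DynamicThreshold()
# A metric hovering around 40% CPU never breaches the static threshold,
# but a sudden jump to 75% is a clear deviation from its learned norm.
for v in [40, 41, 39, 40, 42, 38, 41, 40, 39, 41, 40, 42]:
    detector.observe(v)
print(detector.observe(75))       # True: deviation flagged early
print(75 > STATIC_CPU_THRESHOLD)  # False: the static check stays silent
```

The point of the sketch is the gap between the two checks: the jump to 75% is an anomaly relative to the learned baseline long before any static threshold would fire.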
These tools then learn the behavior of indicators over time and send alerts when they notice deviations from the norm that could result in outages or performance degradation noticeable to end users. Appropriate actions can then be taken to prevent business-impacting incidents from occurring.

Phase 3: Prescriptive monitoring

This is the final stage of the observability framework, where the monitoring system learns from the events and remediation/automation packages in the environment and determines what prescriptive actions to take, with or without human intervention.
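The prescriptive idea above can be sketched as a mapping from detected anomaly patterns to remediation runbooks, some of which run automatically and some of which wait for a human. All pattern names, runbook actions, and the approval policy here are hypothetical.

```python
# Hypothetical remediation actions (Phase 3 sketch, not a real API).
def restart_service(ctx):
    return f"restarted {ctx['service']}"

def scale_out(ctx):
    return f"added capacity for {ctx['service']}"

# Which runbook applies to which anomaly pattern, and whether it may
# run without a human in the loop.
RUNBOOKS = {
    "memory_leak":    {"action": restart_service, "auto": True},
    "cpu_saturation": {"action": scale_out,       "auto": False},
}

def remediate(pattern, ctx, approved=False):
    """Run the matching runbook; hold for approval when auto-run is not allowed."""
    runbook = RUNBOOKS.get(pattern)
    if runbook is None:
        return "escalate: no runbook for this pattern"
    if runbook["auto"] or approved:
        return runbook["action"](ctx)
    return "pending human approval"

print(remediate("memory_leak", {"service": "checkout-api"}))     # auto-remediated
print(remediate("cpu_saturation", {"service": "checkout-api"}))  # gated on a human
```

In a real system the `RUNBOOKS` table would be learned and refined from past incidents and their successful remediations, which is what distinguishes prescriptive monitoring from a hand-maintained SOP catalog.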
Looking ahead

Monitoring is not observability, but it is a key part of it, starting with reactive monitoring that tells you when pre-defined performance thresholds are breached. As you bring more infrastructure and application services online, monitoring needs to move toward proactive and predictive models that analyze larger monitoring data sets and detect anomalies that could indicate potential problems before service levels and user experience are affected. The observability framework then needs to analyze a series of data points to determine the most likely cause of a performance issue or outage within the first few minutes of detecting an anomaly, and start remediating it before it reaches a war room or situation-analysis call. The end result is a better user experience, an always-available system, and improved business operations.