Monitor infrastructure to prevent unexpected downtime


【51CTO.com Quick Translation】Infrastructure monitoring is an integral part of infrastructure management, and it is the first line of defense IT administrators have against unexpected downtime. Serious problems can cause extended outages, sometimes resulting in significant financial losses.

Monitoring systems collect time series data from your infrastructure so that it can be analyzed to predict upcoming problems with the infrastructure and its underlying components. This gives IT administrators or support staff time to prepare and apply solutions before problems occur.
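To make the idea of predicting a problem from time series data concrete, here is a minimal sketch that fits a straight line to recent samples and estimates when a metric will cross an upper limit. The function name, the hourly timestamps, and the 90% limit are assumptions chosen for illustration; real monitoring systems typically use more robust forecasting.

```python
from typing import Optional, Sequence


def hours_until_threshold(
    timestamps_h: Sequence[float],
    values: Sequence[float],
    threshold: float,
) -> Optional[float]:
    """Fit a least-squares line to the samples and estimate how many hours
    after the last sample a growing metric reaches the upper `threshold`.

    Returns None when the fitted trend is flat or decreasing.
    """
    n = len(timestamps_h)
    if n < 2 or n != len(values):
        raise ValueError("need at least two (timestamp, value) pairs")

    mean_t = sum(timestamps_h) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in timestamps_h)
    if var_t == 0:
        raise ValueError("timestamps must not all be identical")

    # Ordinary least-squares slope and intercept.
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(timestamps_h, values))
    slope = cov / var_t
    intercept = mean_v - slope * mean_t

    if slope <= 0:
        return None  # the metric is not growing, so no crossing is predicted

    crossing_time = (threshold - intercept) / slope
    # Clamp to zero if the fitted line is already past the threshold.
    return max(0.0, crossing_time - timestamps_h[-1])


if __name__ == "__main__":
    # Hypothetical disk usage (percent) sampled once per hour.
    hours = [0, 1, 2, 3]
    disk_used_pct = [70, 72, 74, 76]
    print(hours_until_threshold(hours, disk_used_pct, threshold=90))
```

With these made-up samples, usage grows by two percentage points per hour, so the estimate gives the administrator a rough lead time before the disk reaches the limit.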

A good monitoring system has the following functions:

1. Measuring infrastructure performance over the long term

2. Node-level analysis and alerting

3. Network-level analysis and alerting

4. Downtime analysis and alerting

5. Answering the five Ws of incident management and root cause analysis (RCA):

○ What is the actual problem?

○ When did it happen?

○ Why did it happen?

○ What system or component is down?

○ What needs to be done to avoid it in the future?

Build a strong monitoring system

There are many tools available to build a viable and robust monitoring system. The main decision is which tools to use, and the answer depends on what you want to achieve with monitoring and on the financial and business factors you have to consider.

While some monitoring tools are proprietary, many open source tools (unmanaged or community-managed software) can perform as well as or better than closed source tools.

This article will introduce open source tools and how to use them to build a powerful monitoring architecture.

Log collection and analysis

Logs can help a lot. Not only do logs help debug issues, they also provide a wealth of information to help predict upcoming problems. When you encounter a problem with a software component, the first thing you should analyze is the logs.

Both Fluentd and Logstash can be used to collect logs. I chose Fluentd over Logstash because it does not depend on a Java runtime: it is written in C and Ruby, and it is widely supported by container runtime environments such as Docker and orchestration tools such as Kubernetes.

Log analytics refers to analyzing the log data collected over time and generating real-time log metrics. Elasticsearch is a powerful tool for this purpose.
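As a small illustration of what a "log metric" is, the sketch below counts log entries per severity level per minute. The log line layout (ISO timestamp, level, message) and the function name are assumptions made for this example; Elasticsearch performs this kind of aggregation at scale over indexed log data, and this code is not how it does so internally.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Tuple


def count_levels_per_minute(lines: Iterable[str]) -> Dict[Tuple[str, str], int]:
    """Turn raw log lines into a metric: entries per (minute, level).

    Assumed line layout: "<ISO timestamp> <LEVEL> <message>".
    Lines that do not match this layout are skipped.
    """
    counts: Counter = Counter()
    for line in lines:
        parts = line.split(maxsplit=2)
        if len(parts) < 2:
            continue
        ts_raw, level = parts[0], parts[1]
        try:
            ts = datetime.fromisoformat(ts_raw)
        except ValueError:
            continue  # not a parseable timestamp, ignore the line
        minute = ts.strftime("%Y-%m-%dT%H:%M")  # truncate to the minute
        counts[(minute, level.upper())] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = [
        "2024-01-01T10:00:05 ERROR db connection refused",
        "2024-01-01T10:00:40 ERROR db connection refused",
        "2024-01-01T10:00:41 INFO retrying",
        "2024-01-01T10:01:02 ERROR db connection refused",
    ]
    for key, value in sorted(count_levels_per_minute(sample).items()):
        print(key, value)
```

A rising count of ERROR entries per minute is the kind of trend that can point to a problem before it becomes an outage.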

Finally, you need a tool that presents log metrics visually, so that you can follow log trends in easy-to-understand charts and graphs. Kibana is my favorite choice in this regard.

Figure 1. Log workflow

Because logs may contain sensitive information, there are a few security points to keep in mind:

• Always transmit logs over a secure connection.

• Logging/monitoring infrastructure should be implemented within a restricted subnet.

• Access to monitoring user interfaces (such as Kibana and Grafana) should be limited to stakeholders only.

Node-level metrics

Not everything is logged!

Logs cover software and processes, but not every component in your infrastructure.

Operating system disks, externally mounted data disks, Elastic Block Store, CPU, I/O, network packets, inbound and outbound connections, physical memory, virtual memory, buffer space, and queues are some of the major components that rarely appear in logs unless they fail.

So, how do you collect this kind of data?

Prometheus is the answer. You install the appropriate exporter on each VM node and configure Prometheus to collect time series data from these otherwise unobserved components. Grafana then uses the data collected by Prometheus to display the current status of each node in real time.
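To show roughly what an exporter hands to Prometheus, here is a minimal sketch that renders one gauge in the Prometheus text exposition format using only the Python standard library. The metric name `demo_disk_used_ratio` and both helper functions are hypothetical; in practice you would normally install an existing exporter rather than write your own, and a real exporter serves this text over HTTP for Prometheus to scrape.

```python
import shutil
from typing import Dict


def render_gauge(name: str, help_text: str, label: str, samples: Dict[str, float]) -> str:
    """Render one gauge metric in the Prometheus text exposition format.

    `samples` maps a label value (for example a mount point) to a reading.
    Note: this sketch does not escape special characters in label values.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for label_value, value in samples.items():
        lines.append(f'{name}{{{label}="{label_value}"}} {value}')
    return "\n".join(lines) + "\n"


def disk_used_ratio(path: str) -> float:
    """Return the fraction of disk space in use for the filesystem at `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # avoid dividing by zero on unusual filesystems
    return usage.used / usage.total


if __name__ == "__main__":
    # Hypothetical metric name; the reading comes from the current directory's disk.
    print(
        render_gauge(
            "demo_disk_used_ratio",
            "Fraction of disk space in use.",
            "mountpoint",
            {".": disk_used_ratio(".")},
        ),
        end="",
    )
```

Each scrape collects one such reading per time step, which is how the time series that Grafana charts is built up.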

If you’re looking for a simpler solution for collecting time series metrics, consider Metricbeat, Elastic’s open source tool that can be used with Kibana as an alternative to Prometheus and Grafana.

Alerts and Notifications

Without alerts and notifications, you can’t get the most out of monitoring. Unless stakeholders (wherever they are) are notified of issues, they can’t analyze and resolve the problem, prevent customers from being impacted, or avoid the problem in the future.

Prometheus evaluates predefined alert rules and hands firing alerts to its companion Alertmanager, which sends notifications based on the configured routing. Grafana can also define alert rules on the data it displays. Sensu and Nagios are other open source tools that provide alerting and monitoring services.
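The idea behind a predefined alert rule can be sketched in a few lines: a condition must hold continuously for some duration before an alert fires, so that short spikes do not notify anyone. The class and function names below are hypothetical, and this is a simplified illustration of the concept rather than how Prometheus or any other tool implements it.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class ThresholdRule:
    name: str
    threshold: float    # alert when the value is above this limit
    for_seconds: float  # how long the breach must last before firing


def firing_since(
    rule: ThresholdRule,
    samples: Sequence[Tuple[float, float]],
) -> Optional[float]:
    """Return the timestamp at which the rule first fires, or None.

    `samples` are (timestamp_seconds, value) pairs sorted by timestamp.
    """
    breach_start: Optional[float] = None
    for ts, value in samples:
        if value > rule.threshold:
            if breach_start is None:
                breach_start = ts  # the breach begins with this sample
            if ts - breach_start >= rule.for_seconds:
                return ts  # breach has lasted long enough to fire
        else:
            breach_start = None  # value recovered, reset the timer
    return None


if __name__ == "__main__":
    rule = ThresholdRule(name="HighMemoryUsage", threshold=0.9, for_seconds=120)
    sustained = [(0, 0.95), (60, 0.96), (120, 0.97)]
    brief_spike = [(0, 0.95), (60, 0.50), (120, 0.97), (180, 0.98)]
    print(firing_since(rule, sustained))
    print(firing_since(rule, brief_spike))
```

In this example the sustained breach fires once it has lasted two minutes, while the series that recovers in between does not fire within the observed window.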

The main complaint people have about open source alerting tools is that configuration can be time-consuming and laborious, but once set up, these tools can work as well as, or better than, proprietary tools.

However, the greatest advantage of open source tools is that we can control their behavior.

Monitoring workflow and architecture

A good monitoring architecture is the backbone of a strong and stable monitoring system. It might look like this diagram.

Figure 2. DevOps monitoring architecture

Finally, you have to choose the tool based on your needs and infrastructure. Many enterprise organizations use the open source tools discussed in this article to monitor the infrastructure and ensure high uptime.

Infrastructure monitoring: Defense against surprise downtime

[Translated by 51CTO. Please indicate the original translator and source as 51CTO.com when reprinting on partner sites]
