Monitor infrastructure to prevent unexpected downtime


【51CTO.com Quick Translation】Infrastructure monitoring is an integral part of infrastructure management and the IT administrator's first line of defense against unexpected downtime. Serious problems can keep infrastructure down for long periods, sometimes at significant financial cost.

Monitoring systems collect time series data from your infrastructure so that it can be analyzed to predict upcoming problems with the infrastructure and its underlying components. This gives IT administrators or support staff time to prepare and apply solutions before problems occur.
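As a rough illustration of how collected time series data can warn of an upcoming problem, the sketch below fits a straight line to a few disk usage samples and estimates when the trend would reach a limit. The sample values, the 90 percent limit, and the function names are assumptions made for this example; real monitoring systems use their own, more robust forecasting.

```python
from typing import Optional, Sequence, Tuple


def fit_line(points: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * t + intercept over (t, y) samples."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (y - mean_y) for t, y in points)
    slope = cov / var_t
    return slope, mean_y - slope * mean_t


def hours_until(points: Sequence[Tuple[float, float]], limit: float) -> Optional[float]:
    """Hours after the last sample at which the fitted trend reaches `limit`.

    Returns None when the trend is flat or falling, and 0.0 when the
    fitted line is already at or above the limit.
    """
    slope, intercept = fit_line(points)
    if slope <= 0:
        return None
    last_t = points[-1][0]
    t_hit = (limit - intercept) / slope
    return max(0.0, t_hit - last_t)


# Hypothetical samples: (hour, disk usage in percent).
samples = [(0, 70.0), (1, 72.0), (2, 74.0), (3, 76.0)]
print(hours_until(samples, limit=90.0))
```

With these samples usage grows by 2 points per hour, so the estimate is 7 hours after the last sample. That lead time is what lets staff act before the disk actually fills.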

A good monitoring system has the following functions:

1. Measuring infrastructure performance over the long term

2. Node-level analysis and alerting

3. Network-level analysis and alerting

4. Downtime analysis and alerting

5. Answer the Five Ws of Incident Management and Root Cause Analysis (RCA):

○What is the actual problem?

○When did it happen?

○Why did it happen?

○What system or component is down?

○What needs to be done to avoid it in the future?

Build a strong monitoring system

There are many tools available to build a viable and robust monitoring system. The main decision is which tools to use; the answer depends on what you want to achieve with monitoring and on the financial and business factors you have to consider.

While some monitoring tools are proprietary, many open source tools (unmanaged software or community-managed software) can perform as well as or better than closed source tools.

This article will introduce open source tools and how to use them to build a powerful monitoring architecture.

Log collection and analysis

Logs can help a lot. Not only do logs help debug issues, they also provide a wealth of information to help predict upcoming problems. When you encounter a problem with a software component, the first thing you should analyze is the logs.

Both Fluentd and Logstash can be used to collect logs. I chose Fluentd over Logstash because it does not depend on a Java runtime: it is written in C and Ruby, and it is widely supported by container runtimes such as Docker and orchestration tools such as Kubernetes.

Log analytics refers to analyzing the log data collected over time and generating real-time log metrics. Elasticsearch is a powerful tool for this purpose.
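To make the idea of a log metric concrete, here is a minimal sketch that turns raw log lines into a per-minute error count. It is independent of any particular tool and does not show how Elasticsearch works internally; the line layout ("timestamp LEVEL message") and the sample lines are assumptions made for this example.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed line layout: "<ISO timestamp> <LEVEL> <message>".
# The first group keeps the timestamp down to the minute.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\S*\s+([A-Z]+)\s+(.*)$"
)


def count_by_minute(lines: Iterable[str], level: str = "ERROR") -> Counter:
    """Count log lines of the given level, bucketed by minute."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed layout
        minute, line_level, _message = match.groups()
        if line_level == level:
            counts[minute] += 1
    return counts


# Hypothetical log lines for illustration only.
sample_lines = [
    "2024-01-01T10:00:05Z INFO service started",
    "2024-01-01T10:00:41Z ERROR database connection refused",
    "2024-01-01T10:00:59Z ERROR database connection refused",
    "2024-01-01T10:01:12Z ERROR request timed out",
    "not a structured line",
]
for minute, total in sorted(count_by_minute(sample_lines).items()):
    print(minute, total)
```

A sudden rise in such a count is the kind of trend that a chart makes visible long before anyone reads the individual lines.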

Finally, you need a tool that presents log metrics visually, so that log trends are easy to read from charts and graphs. Kibana is my preferred choice here.

Figure 1. Log workflow

Because logs may contain sensitive information, there are a few security points to keep in mind:

•Always transmit logs over a secure connection.

•Logging/monitoring infrastructure should be implemented within a restricted subnet.

•Access to monitoring user interfaces (such as Kibana and Grafana) should be limited to stakeholders only.

Node-level metrics

Not everything is logged!

Logs cover software and processes, but not every component in your infrastructure.

Operating system disks, externally mounted data disks, Elastic Block Store, CPU, I/O, network packets, inbound and outbound connections, physical memory, virtual memory, buffer space, and queues are some of the major components that rarely appear in logs unless they fail.

So, how do you collect this kind of data?

Prometheus is the answer. You install the appropriate exporter on the VM nodes and configure Prometheus to collect time series data from the components that logs do not cover. Grafana then uses the data collected by Prometheus to display the current status of each node in real time.

If you’re looking for a simpler solution for collecting time series metrics, consider Metricbeat, an open source tool from Elastic that can be used with Kibana as an alternative to Prometheus and Grafana.
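An exporter's job is to publish node measurements as plain text that Prometheus can scrape. The sketch below is not a real exporter; it only renders a single gauge in the Prometheus text exposition format using the Python standard library. The metric name `demo_disk_used_ratio` and the helper names are hypothetical, chosen for this example.

```python
import shutil
from typing import Dict


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, help_text: str, value: float, labels: Dict[str, str]) -> str:
    """Render one gauge sample in the Prometheus text exposition format."""
    label_text = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    series = f"{name}{{{label_text}}}" if label_text else name
    return f"# HELP {name} {help_text}\n# TYPE {name} gauge\n{series} {value}\n"


def disk_used_ratio(path: str = "/") -> float:
    """Fraction of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


# Hypothetical metric name; a real exporter defines its own names.
print(
    render_gauge(
        "demo_disk_used_ratio",
        "Fraction of disk space in use.",
        round(disk_used_ratio(), 4),
        {"mountpoint": "/"},
    ),
    end="",
)
```

In practice an exporter serves text like this over HTTP and Prometheus is configured to scrape that endpoint at a regular interval, which is what produces the time series.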

Alerts and notifications

Without alerts and notifications, you can’t get the most out of monitoring. Unless stakeholders (regardless of where they are) are notified of issues, they can’t analyze and resolve the problem, keep customers from being affected, or prevent it from recurring.

Prometheus evaluates predefined alert rules and hands the resulting alerts to its Alertmanager component, which sends the notifications; Grafana can also raise alerts based on configured rules. Sensu and Nagios are other open source tools that provide alerting and monitoring services.

The main complaint people have about open source alerting tools is that configuration can be time-consuming and laborious, but once set up, these tools can work as well as or better than proprietary ones.
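The core of a rule-based alert can be sketched in a few lines: a condition on a metric plus a waiting period, so that a brief spike does not notify anyone. This is an illustration of the idea only, not the implementation used by Prometheus, Grafana, or any other tool; the rule name, threshold, and samples are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class ThresholdRule:
    name: str
    threshold: float
    hold_seconds: float  # the condition must hold this long before firing


def firing_times(rule: ThresholdRule, samples: Sequence[Tuple[float, float]]) -> List[float]:
    """Return the timestamps at which the rule starts firing.

    `samples` is a time-ordered sequence of (timestamp_seconds, value).
    """
    fired_at: List[float] = []
    breach_start: Optional[float] = None
    firing = False
    for ts, value in samples:
        if value > rule.threshold:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach
            if not firing and ts - breach_start >= rule.hold_seconds:
                firing = True
                fired_at.append(ts)
        else:
            # Condition cleared: reset so the next breach starts fresh.
            breach_start = None
            firing = False
    return fired_at


# Hypothetical rule and CPU samples: (seconds, percent).
rule = ThresholdRule(name="HighCpu", threshold=90.0, hold_seconds=120)
cpu = [(0, 95), (60, 50), (120, 93), (180, 96), (240, 97), (300, 40)]
print(firing_times(rule, cpu))
```

Here the spike at second 0 is ignored because it clears within a minute, while the breach that starts at second 120 has lasted two minutes by second 240, so the rule fires once at that point.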

However, the greatest advantage of open source tools is that we can control their behavior.

Monitoring workflow and architecture

A good monitoring architecture is the backbone of a strong and stable monitoring system. It might look like this diagram.

Figure 2. DevOps monitoring architecture

Finally, you have to choose the tool based on your needs and infrastructure. Many enterprise organizations use the open source tools discussed in this article to monitor the infrastructure and ensure high uptime.

Infrastructure monitoring: Defense against surprise downtime

[Translated by 51CTO. Please indicate the original translator and source as 51CTO.com when reprinting on partner sites]
