As containers become more widely used, how should we monitor them?

With the rapid development and adoption of container technology, more and more companies are running their businesses in containers. As one of the mainstream deployment methods, containers separate the concerns of development and operations teams: the development team focuses on application logic and dependencies, while the operations team focuses on deployment and management, without having to worry about application details such as specific software versions and application-specific configuration. This means that both teams can spend less time on debugging and releases and more time delivering new features to end users. Containers also make it easier for companies to improve application portability and operational resilience. According to a CNCF survey report, 73% of respondents are using containers to improve production agility and accelerate innovation.

Why do we need container monitoring?

When containers are used at scale, the highly dynamic containerized environment requires continuous monitoring, and establishing a monitoring system is essential for maintaining a stable operating environment and optimizing resource costs. Each container image may have a large number of running instances, and because new images and new versions are introduced rapidly, failures can easily spread across containers, applications, and architectures. This makes it critical to locate the root cause immediately after a problem occurs in order to prevent anomalies from spreading. After extensive practice, we believe that monitoring the following components is critical when using containers:

Host server;
Container runtime;
Orchestrator control plane;
Middleware dependencies;
Applications running inside containers.
With a complete monitoring system that provides deep insight into metrics, logs, and traces, the team can not only understand what is happening in the cluster, the container runtime, and the application, but also gain data to support business decisions, such as when to scale instances, tasks, or Pods up or down, or when to change instance types. DevOps engineers can also improve troubleshooting and resource-management efficiency with automated alerts and related configuration, for example actively monitoring memory utilization and notifying the operations team to add nodes before available CPU and memory are exhausted, once resource consumption approaches a set threshold (a sketch of such an alert rule follows the list below). The value of this includes:

Detect problems early to avoid system outages;
Analyze container health across cloud environments;
Identify clusters whose resources are over- or under-allocated and tune applications for better performance;
Create intelligent alerts to improve alert accuracy and avoid false alarms;
Use monitoring data to optimize system performance and reduce operating costs.
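
As a concrete illustration of the threshold-based alerting mentioned above, here is a minimal sketch written as a Prometheus alerting rule. It assumes Prometheus (with the Prometheus Operator CRDs) and node_exporter metrics are available in the cluster; the rule name, labels, and the 20% threshold are illustrative placeholders rather than Alibaba Cloud defaults.

```yaml
# Minimal sketch: alert before node memory is exhausted.
# Assumes Prometheus Operator CRDs and node_exporter metrics are available;
# names, labels, and the threshold are illustrative placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-memory-pressure        # hypothetical rule name
  namespace: monitoring
spec:
  groups:
    - name: node-resources
      rules:
        - alert: NodeMemoryPressure
          # Fires when less than 20% of a node's memory has been available
          # for 10 minutes, giving the operations team time to add nodes.
          expr: |
            node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.20
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} is running low on memory"
```
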
However, during actual implementation, the operations team may feel that the values above are fairly generic and that existing operations tools already seem to achieve them. For container-related scenarios, however, if a suitable monitoring system is not built, the team will face the following two difficult problems as the business continues to expand:

1. Troubleshooting takes longer and SLAs cannot be met.

Without container monitoring, it is difficult for development and operations teams to understand what is running and how well it is performing, which makes maintaining applications, meeting SLAs, and troubleshooting extremely difficult.

2. Scalability is hindered and elasticity cannot be achieved.

The ability to quickly scale application or microservice instances on demand is an important requirement for containerized environments, and the monitoring system is the only visible way to measure demand and user experience. Scaling up too late degrades performance and user experience; scaling down too late wastes resources and cost (a minimal autoscaling sketch is shown below).
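
To make the scaling point concrete, the sketch below uses a Kubernetes HorizontalPodAutoscaler driven by monitored CPU utilization to scale a workload up and down. The Deployment name, replica bounds, and 70% target are illustrative placeholders, not values from the original scenario.

```yaml
# Minimal sketch: scale a Deployment on observed CPU utilization.
# The target Deployment name and the numeric bounds are placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-frontend-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend              # hypothetical workload
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # scale out when average CPU exceeds 70%
```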

Therefore, as both the value of container monitoring and the cost of going without it become more apparent, more and more operations teams are starting to build container monitoring systems. In practice, however, they run into a number of unexpected problems.

One example is the difficulty of tracking caused by the short-lived nature of containers. Containers package not only the application code but also all the underlying services the application needs to run, so containerized applications are updated frequently as new deployments go into production and the code and underlying services change, which increases the possibility of errors. Rapid creation and rapid destruction make it extremely difficult to track changes in large, complex systems.

Another is the difficulty of monitoring caused by shared resources. Because the memory and CPU used by containers are shared across one or more hosts, it is hard to attribute resource consumption on the physical host, which also makes it hard to get a good indication of container performance or application health.

Finally, traditional tools cannot meet the needs of container monitoring. Traditional monitoring solutions often lack the metrics, tracing, and logging capabilities required for virtualized environments, especially for container health and performance.

Therefore, considering the values, problems, and difficulties above, we need to consider the following dimensions when designing a container monitoring system:

Non-intrusiveness: whether the SDK or probe integrated into the business code is intrusive and affects business stability;
Completeness: whether the performance of the entire application can be observed across both business and technical platforms;
Multi-source: whether relevant metrics and logs can be collected from different data sources for aggregated display, analysis, and alerting;
Convenience: whether events and logs can be correlated to discover anomalies and support both proactive and reactive troubleshooting to reduce losses, and whether alert policies are easy to configure.
When clarifying business requirements and designing the monitoring system, the operations team has many open source tools to choose from, but it also needs to evaluate possible business and project risks. These include:

Unknown risks that may affect business stability: can the monitoring service run without leaving a footprint, and does the monitoring process itself affect the normal operation of the system?
The human and time investment in open source or self-developed tooling is difficult to predict: the associated components and resources must be configured or built in-house, with little corresponding support and service. As the business keeps changing, will it cost even more manpower and time? And when performance issues arise at large scale, can the open source community or the in-house team respond quickly?

Alibaba Cloud Kubernetes Monitoring: Making container cluster monitoring more intuitive and simpler

Based on the above insights and extensive hands-on experience, Alibaba Cloud launched the Kubernetes Monitoring service. Alibaba Cloud Kubernetes Monitoring is a one-stop observability product developed for Kubernetes clusters. Based on the metrics, application traces, logs, and events in the Kubernetes cluster, it aims to provide IT development and operations personnel with a holistic observability solution. Alibaba Cloud Kubernetes Monitoring has the following six features:

Non-intrusive: bypass technology is used to obtain network performance data without instrumenting business code.
Multi-language support: network protocols are parsed at the kernel layer, so any language and framework is supported.
Low overhead, high performance: based on eBPF, network performance data is collected with extremely low overhead.
Automatic resource topology: network topology and resource topology show how related resources are associated.
Multi-dimensional data presentation: supports multiple types of observability data (metrics, traces, logs, and events).
Closed loop: fully correlates observability data across the architecture layer, application layer, container runtime layer, container management layer, and basic resource layer.
At the same time, compared with open source container monitoring, Alibaba Cloud Kubernetes Monitoring offers differentiated value that is closer to business scenarios:

Unlimited data volume: metrics, traces, logs, and other data are stored independently, using cloud storage to provide low-cost, large-capacity storage.
Efficient resource association and interaction: by monitoring network requests and building a complete network topology, service dependencies are easy to view, improving operations efficiency. In addition to the network topology, the 3D topology feature supports viewing the network topology and resource topology at the same time, which speeds up problem location.
Diverse data combinations: metrics, traces, logs, and other data can be visualized and freely combined to explore optimization opportunities.
A complete monitoring system: together with the other sub-products of the Application Real-Time Monitoring Service, Kubernetes Monitoring forms a complete monitoring system. Application monitoring focuses on the application language runtime, application framework, and business code, while Kubernetes Monitoring focuses on the container runtime, the container management layer, and the system calls of containerized applications. Both serve applications but focus on different layers, so the two products complement each other. Prometheus is the infrastructure for metric collection, storage, and query, and the metric data of both application monitoring and Kubernetes Monitoring relies on Prometheus.
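
Because both products rely on Prometheus for metric collection and query, a simple way to work with container metrics in an open source setup is a recording rule over the standard cAdvisor metrics that Kubernetes exposes. The sketch below pre-aggregates per-pod CPU usage; the rule and record names are illustrative placeholders, not metric names used by Alibaba Cloud Kubernetes Monitoring.

```yaml
# Minimal sketch: pre-aggregate per-pod CPU usage from cAdvisor metrics.
# Assumes Prometheus scrapes the kubelet/cAdvisor endpoints; the record
# and rule names are illustrative placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-cpu-recording           # hypothetical rule name
  namespace: monitoring
spec:
  groups:
    - name: pod-cpu
      rules:
        - record: namespace_pod:cpu_usage_seconds:rate5m
          # CPU cores consumed by each pod, averaged over 5 minutes.
          expr: |
            sum by (namespace, pod) (
              rate(container_cpu_usage_seconds_total{container!=""}[5m])
            )
```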

Based on the above product features and differentiated value, Kubernetes Monitoring can be applied in the following scenarios:

Use the default or customized inspection rules of Kubernetes Monitoring to discover anomalies in nodes, services, and workloads. Kubernetes Monitoring inspects nodes, services, and workloads across the three dimensions of performance, resources, and control, and presents the analysis results intuitively with color-coded states such as normal, warning, and critical, helping operations personnel quickly perceive the running status of nodes, services, and workloads.

Use Kubernetes Monitoring to locate the root cause of failed service and workload responses. By parsing network protocols, Kubernetes Monitoring stores the details of failed requests and correlates those details with failed-request metrics to locate the cause of failure.
Use Kubernetes Monitoring to locate the root cause of slow service and workload responses. Kubernetes Monitoring captures metrics along the critical network path, such as DNS resolution performance, TCP retransmission rate, and network packet RTT, and uses them to locate the cause of slow responses and optimize the related services.
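
Kubernetes Monitoring surfaces these network-path metrics in its console; as a rough open source analogue of the same idea, the sketch below alerts on an elevated TCP retransmission ratio using standard node_exporter netstat counters. The 5% threshold and rule names are illustrative placeholders, not Kubernetes Monitoring defaults.

```yaml
# Minimal sketch: alert when a node's TCP retransmission ratio is high.
# Uses standard node_exporter netstat counters; threshold and names are
# illustrative placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tcp-retransmission          # hypothetical rule name
  namespace: monitoring
spec:
  groups:
    - name: network-path
      rules:
        - alert: HighTcpRetransmissionRate
          # Retransmitted segments as a fraction of sent segments over 5 minutes.
          expr: |
            rate(node_netstat_Tcp_RetransSegs[5m])
              / rate(node_netstat_Tcp_OutSegs[5m]) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High TCP retransmission rate on {{ $labels.instance }}"
```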

Use Kubernetes Monitoring to explore the application architecture and discover unexpected network traffic. Kubernetes Monitoring supports viewing a topology map built from global traffic and configuring static ports to identify specific services. The intuitive, interactive topology map can be used to explore the application architecture and verify whether traffic matches expectations and whether the architecture is reasonable.

Use Kubernetes Monitoring to discover uneven node resource usage, rebalance node resources in advance, and reduce business operation risks.

Kubernetes Monitoring is currently in public beta and is free to use during the beta period. Let Kubernetes Monitoring help you get rid of mechanical, repetitive operations work.
