Pull or Push? How to choose a monitoring system?


1. A variety of monitoring systems

Monitoring has always been a core component of IT systems, responsible for discovering problems and helping to locate them. Traditional operations staff, SREs, DevOps engineers, and developers all rely on the monitoring system and take part in building and optimizing it. Monitoring systems have existed and evolved since the days of mainframe operating systems and basic Linux metrics. Today there are hundreds of monitoring systems to choose from, and they can be classified in many ways, for example:

Monitoring object: general-purpose (a generic monitoring method that suits most monitored objects); special-purpose (built for one specific function, such as Java's JMX, CPU over-temperature protection, hard disk power-failure protection, UPS switching systems, switch monitoring systems, dedicated line monitoring, etc.);
Data acquisition method: Push (CollectD, Zabbix, InfluxDB); Pull (Prometheus, SNMP, JMX);
Deployment mode: coupled (deployed together with the monitored system); stand-alone (single-machine single-instance deployment); distributed (can be horizontally expanded); SaaS (many commercial companies provide SaaS, no deployment required);
Data query method: interface type (data can only be obtained through specific APIs); DSL (supports some computation, such as PromQL, GraphQL); SQL (standard SQL or SQL-like);
Commercial attributes: open source and free (such as Prometheus, the stand-alone version of InfluxDB); open source commercial (such as the cluster version of InfluxDB, Elasticsearch X-Pack); closed source commercial (such as Datadog, Splunk, AWS CloudWatch).

2. Pull or Push

There are many options for building a monitoring platform for internal company use, whether self-built on an open source solution or bought as a commercial SaaS product. Either way, the implementation has to settle how data is delivered to the monitoring platform, or how the platform obtains it. This is the choice of data acquisition method: Pull or Push?

As the names imply, a Pull-based monitoring system actively fetches metrics, so the monitored object must be remotely accessible; a Push-based monitoring system does not fetch data itself, and the monitored object actively pushes metrics to it instead. The two approaches differ in many respects. When building or selecting a monitoring system, understand the advantages and disadvantages of both in advance and choose the one that fits. A choice made blindly can badly hurt the later stability of the monitoring system and drive up deployment and operations costs.
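The difference in who initiates the transfer can be sketched in a few lines. This is a minimal in-process illustration, not the API of any real monitoring system: `PullCollector`, `PushReceiver`, and the target names are hypothetical, and each remote endpoint is modeled as a plain Python callable.

```python
from typing import Callable, Dict

Metrics = Dict[str, float]


class PullCollector:
    """Monitoring side initiates: it must know and be able to reach every target."""

    def __init__(self, targets: Dict[str, Callable[[], Metrics]]) -> None:
        # Each callable stands in for a remote metrics endpoint.
        self.targets = targets
        self.store: Dict[str, Metrics] = {}

    def scrape_once(self) -> None:
        for name, fetch in self.targets.items():
            try:
                self.store[name] = fetch()
            except Exception:
                # A failed scrape is itself a signal that the target is unreachable.
                self.store[name] = {"up": 0.0}


class PushReceiver:
    """Monitored side initiates: the receiver only sees whatever arrives."""

    def __init__(self) -> None:
        self.store: Dict[str, Metrics] = {}

    def receive(self, source: str, metrics: Metrics) -> None:
        self.store[source] = metrics


def healthy_app() -> Metrics:
    return {"up": 1.0, "qps": 120.0}


def crashed_app() -> Metrics:
    raise ConnectionError("connection refused")


collector = PullCollector({"app-1": healthy_app, "app-2": crashed_app})
collector.scrape_once()

receiver = PushReceiver()
receiver.receive("app-1", healthy_app())  # app-2 has crashed, so it never pushes
```

After one round, the Pull collector holds an explicit `up = 0` record for `app-2`, while the Push receiver has no entry for it at all.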

3. Pull vs Push Overview

The sections below compare the two approaches from several angles. To save readers' time, a summary table comes first, and the details are expanded later:

4. Principles and architecture comparison

As shown in the figure above, the core of Pull-model data acquisition is the Pull module, which is usually deployed together with the monitoring backend, as in Prometheus. The core components include:

Service discovery systems, including host service discovery (generally backed by the company's own CMDB), application service discovery (such as Consul), and PaaS service discovery (such as Kubernetes). The Pull module must be able to connect to these systems.
The Pull core module, which besides service discovery generally pulls data remotely over a common protocol. It usually supports configurable pull intervals and timeouts, plus metric filtering, renaming, and simple processing.
The application-side SDK, which listens on a fixed port so that metrics can be pulled.
Exporter agents. Because many middleware products and other systems do not speak the Pull protocol, corresponding exporters have to be developed to collect their metrics and expose a standard Pull interface.
The Push model is relatively simple:

Push Agent: collects metric data from the various monitored objects and pushes it to the server. It can be deployed alongside the monitored system or separately.
ConfigCenter (optional): provides centralized, dynamic configuration, such as monitoring targets, collection intervals, metric filtering, metric processing, and remote targets.
Application-side SDK: sends data to the monitoring backend or to a local Agent (the local Agent usually implements the same set of backend interfaces).
Summary: In terms of deployment complexity, the Pull model is complex and expensive to maintain when monitoring middleware or other third-party systems, and the Push model is more convenient there. For applications, the cost of exposing a metrics port is about the same as the cost of pushing actively.
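To make the exporter's role concrete, here is a minimal sketch of the translation step: native statistics from a system that cannot be pulled directly are turned into samples and rendered as text for a Pull endpoint. The line layout follows a simplified subset of the Prometheus text exposition format (no `# HELP`/`# TYPE` lines and no label escaping). The function names and the queue statistics are hypothetical.

```python
from typing import Dict, List, Tuple

# (metric name, labels, value)
Sample = Tuple[str, Dict[str, str], float]


def export_queue_stats(stats: Dict[str, float], instance: str) -> List[Sample]:
    """Translate a hypothetical message queue's native stats into samples."""
    return [
        (f"queue_{key}", {"instance": instance}, float(value))
        for key, value in sorted(stats.items())
    ]


def render_text_format(samples: List[Sample]) -> str:
    """Render samples as 'name{label="value"} number' lines.

    Simplified: label values are assumed to need no escaping.
    """
    lines = []
    for name, labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


body = render_text_format(export_queue_stats({"depth": 3, "consumers": 2}, "mq-1"))
print(body, end="")
```

In a Pull deployment this text would be served on the exporter's metrics port. A Push Agent performs the same translation but sends the result to the backend instead of waiting to be scraped.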

5. Pull's distributed solution

In terms of scalability, Push-based data collection is naturally distributed and can scale horizontally for as long as the monitoring backend keeps up. Scaling Pull is more cumbersome and requires:

Decoupling the Pull module from the monitoring backend and deploying Pull separately as an Agent.
Coordinating the Pull Agents. The simplest way is usually sharding: get the list of monitored machines from the service discovery system, hash each machine, and take the hash modulo the number of shards to decide which Agent pulls it.
Adding a configuration center (optional) to manage each Pull Agent.
Quick readers will already have noticed that this distributed approach still has some problems:

A single-point bottleneck remains: every Agent has to query the service discovery module.
After the Agents are scaled out, the targets assigned to each Agent change, which can easily lead to duplicated or lost data.
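The hash-and-modulo sharding described above, and the reassignment problem it causes, can be sketched as follows. The target addresses and agent counts are made up for illustration, and MD5 is used only as a stable hash, not for security.

```python
import hashlib
from typing import Dict, List


def shard_of(target: str, num_agents: int) -> int:
    # Use a stable hash: Python's built-in hash() is randomized per process
    # for strings, so different agents would disagree on the result.
    digest = hashlib.md5(target.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_agents


def assign(targets: List[str], num_agents: int) -> Dict[int, List[str]]:
    """Map each agent index to the targets it is responsible for pulling."""
    plan: Dict[int, List[str]] = {i: [] for i in range(num_agents)}
    for target in targets:
        plan[shard_of(target, num_agents)].append(target)
    return plan


def moved_targets(targets: List[str], old_n: int, new_n: int) -> List[str]:
    """Targets whose owning agent changes when the agent count changes."""
    return [t for t in targets if shard_of(t, old_n) != shard_of(t, new_n)]


targets = [f"10.0.0.{i}:9100" for i in range(1, 101)]
before = assign(targets, 3)
moved = moved_targets(targets, 3, 4)
print(f"{len(moved)} of {len(targets)} targets change owner when scaling from 3 to 4 agents")
```

Every moved target is briefly at risk: if the old and new owner both pull it, data is duplicated, and if neither does, data is lost. Techniques such as consistent hashing reduce how many targets move, but the handover still has to be coordinated.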

6. Comparison of monitoring capabilities

1 Monitoring target liveness

Liveness is the first and most basic thing monitoring has to cover. Monitoring target liveness is relatively simple in Pull mode: the Pull center knows directly whether the target's metrics can be requested, and on failure it also learns simple error causes such as a network timeout or a refused connection.

The Push method is more troublesome. If an application stops reporting, it may have crashed, there may be a network problem, or it may have been migrated to another node. The Pull module interacts with service discovery in real time, but the Push receiver does not, so the server can only determine the specific cause of the failure by additionally consulting service discovery.
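A minimal sketch of what that extra step looks like on a Push backend: the server tracks when each source last reported and cross-checks silent sources against the list from service discovery. The function name, status labels, and the `max_silence` threshold are assumptions for illustration.

```python
from typing import Dict, Set


def diagnose_sources(
    last_seen: Dict[str, float],
    registered: Set[str],
    now: float,
    max_silence: float,
) -> Dict[str, str]:
    """Classify sources for a Push backend.

    last_seen:  source -> timestamp of its most recent push
    registered: sources that service discovery currently expects to be running
    """
    report: Dict[str, str] = {}
    for source in sorted(registered | set(last_seen)):
        seen = last_seen.get(source)
        silent = seen is None or now - seen > max_silence
        if source not in registered:
            # Migrated or scaled in: silence is expected, not a failure.
            report[source] = "deregistered"
        elif silent:
            # Expected to run but not reporting: crash or network problem.
            report[source] = "suspect"
        else:
            report[source] = "alive"
    return report


report = diagnose_sources(
    last_seen={"a": 100.0, "b": 40.0, "c": 30.0},
    registered={"a", "b", "d"},
    now=105.0,
    max_silence=30.0,
)
print(report)
```

Even with this cross-check, a "suspect" source cannot be narrowed down to a crash versus a network fault, whereas a Pull attempt would at least report a timeout or a refused connection.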

2 Data completeness calculation

Data completeness is an important concept in large-scale monitoring systems. For example, to monitor the QPS of a trading application with 1,000 replicas, the metric has to be aggregated from 1,000 data points. Suppose an alarm is configured for a 2% drop in QPS and there is no notion of data completeness: if network jitter delays the reports of more than 20 replicas by a few seconds, a false alarm fires. Alarm rules therefore need to take data completeness into account as well.

Calculating data completeness also depends on the service discovery module. The Pull method pulls data in rounds, so the data is complete once a round finishes. Even if some pulls fail, the proportion of missing data is known.

In the Push method, each agent and application pushes on its own. Push intervals and network delays differ between clients, so the server has to estimate completeness from historical behavior, which is relatively costly.
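The 1,000-replica example can be worked through in code. This is a sketch under assumed numbers: each replica serves 100 QPS, the alarm threshold is a 2% drop, and the alarm is held back whenever fewer than 99% of the expected replicas have reported. The function name and the `min_completeness` parameter are hypothetical.

```python
from typing import Dict, List


def evaluate_qps_alarm(
    reported_qps: List[float],
    expected_replicas: int,
    baseline_total_qps: float,
    drop_threshold: float = 0.02,
    min_completeness: float = 0.99,
) -> Dict[str, object]:
    """Decide whether a QPS-drop alarm should fire, taking completeness into account."""
    completeness = len(reported_qps) / expected_replicas
    drop = (baseline_total_qps - sum(reported_qps)) / baseline_total_qps
    if drop < drop_threshold:
        decision = "ok"
    elif completeness < min_completeness:
        # The drop may be caused by missing data rather than lost traffic.
        decision = "hold"
    else:
        decision = "fire"
    return {"completeness": completeness, "drop": drop, "decision": decision}


# 25 of 1,000 replicas report late: the sum is 2.5% low, but the data is incomplete.
late = evaluate_qps_alarm([100.0] * 975, 1000, 100_000.0)

# All replicas report, and traffic really is 3% lower.
real = evaluate_qps_alarm([97.0] * 1000, 1000, 100_000.0)

print(late["decision"], real["decision"])
```

Without the completeness check, the first case would trigger the 2% alarm even though no traffic was lost.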

3 Short Lifecycle/Serverless Application Monitoring

In practice there are many short-lifecycle and serverless applications. Especially in cost-sensitive scenarios, we make heavy use of jobs, elastic instances, and serverless applications. For example, an elastic compute instance is started when a rendering task arrives and is destroyed and released as soon as the task finishes; other examples are machine learning training jobs, event-driven serverless workflows, and scheduled jobs (such as resource cleanup, capacity checks, and security scans). These applications usually have very short lifetimes (possibly seconds or milliseconds), which the periodic Pull model can hardly monitor. They generally require Push, with the application actively pushing its monitoring data.

To handle such short-lived applications, pure Pull systems provide an intermediate layer (such as Prometheus's Pushgateway) that accepts active pushes from applications and then exposes a Pull port to the monitoring system. This adds the management and operations cost of an extra intermediate layer. Because Push is being emulated on top of Pull, reporting latency increases, and the metrics of applications that have already disappeared need to be cleaned up promptly.
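The intermediate layer can be sketched as a small in-memory buffer: short-lived jobs push into it, and the Pull system scrapes it on its usual schedule. `MiniPushGateway` is hypothetical, and its TTL-based cleanup is an assumption of this sketch rather than a description of how Prometheus's Pushgateway behaves.

```python
from typing import Dict, Tuple


class MiniPushGateway:
    """Buffers pushed metrics so that a Pull-based system can scrape them later."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        # job name -> (time of push, metrics)
        self._groups: Dict[str, Tuple[float, Dict[str, float]]] = {}

    def push(self, job: str, metrics: Dict[str, float], now: float) -> None:
        """Called by a short-lived job just before it exits."""
        self._groups[job] = (now, dict(metrics))

    def scrape(self, now: float) -> Dict[str, Dict[str, float]]:
        """Called by the Pull system on its scrape interval."""
        # Drop metrics from jobs that finished long ago, otherwise stale
        # values would be scraped forever.
        self._groups = {
            job: (ts, m) for job, (ts, m) in self._groups.items() if now - ts <= self.ttl
        }
        return {job: dict(m) for job, (ts, m) in self._groups.items()}


gateway = MiniPushGateway(ttl_seconds=60.0)
gateway.push("render-42", {"duration_seconds": 1.5}, now=0.0)

first = gateway.scrape(now=15.0)   # seen only at the next scrape, 15 s after the push
later = gateway.scrape(now=61.0)   # expired and cleaned up
print(first, later)
```

The sketch shows both costs named above: the data becomes visible only at the next scrape, and the gateway has to take over cleanup of metrics whose producers no longer exist.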

4 Flexibility and Coupling

In terms of flexibility, Pull mode has some advantage: the Pull module can be configured with which metrics to collect and can apply simple calculations or secondary processing to them. The advantage is only relative, however. A Push SDK or Agent can be configured the same way, and with a configuration center, managing that configuration is also simple.

In terms of coupling, the Pull model is much less coupled to the backend. The application only needs to expose an interface the backend understands; it does not need to know which backend it connects to or which metrics the backend wants. The division of labor is clear: application developers expose their application's metrics, and SREs (the monitoring system's administrators) collect them. The Push model is more tightly coupled, since the application must be configured with the backend address, authentication information, and so on. With a local Push Agent, though, the application only pushes to a local address, and the cost is relatively low.

7. Operation and maintenance and cost comparison

1 Resource Cost

In terms of overall cost the two methods differ little, but the cost falls on different parties:

In Pull mode the main resource consumption is on the monitoring system side, and the cost on the application side is lower.
In Push mode the main resource consumption is on the pushing application and the Push Agent, and the monitoring system side consumes much less than with Pull.

2 Operation and maintenance costs

From an operations perspective, Pull mode costs more. The components to operate in Pull mode are the various exporters, service discovery, the Pull Agents, and the monitoring backend, while Push mode only requires the Push Agent, the monitoring backend, and a configuration center (optional, and generally deployed together with the monitoring backend).

One point to note: in Pull mode the server initiates requests to the clients, so the network design must account for cross-cluster connectivity and for the network ACLs protecting the application side. With Push, connectivity is simpler: the server only needs to provide a domain name or VIP that every node can reach.

8. How to choose Pull or Push

Among current open source solutions, Pull mode is represented by the Prometheus family (a "family" because a default single-node Prometheus has limited scalability, and the community offers many distributed Prometheus solutions such as Thanos, VictoriaMetrics, and Cortex), and Push mode is represented by the InfluxDB TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor). Both have their own advantages and disadvantages. In the cloud native context, as CNCF and Kubernetes have popularized Prometheus, much open source software has begun to expose Prometheus-style Pull ports. At the same time, many systems were never designed to expose a Pull port, and for those it is more reasonable to monitor with a Push Agent.

Whether an application should use Pull or Push has no single answer; the choice depends on the actual scenarios inside the company. For example: if the company's cluster network is very complex, Push is simpler; if there are many short-lived applications, Push is required; mobile applications can only use Push; and if the system already uses Consul for service discovery, exposing a Pull port is all it takes.

Therefore, all things considered, the best solution for a company's internal monitoring system is to support both Pull and Push:

Push Agent is used for host, process, and middleware monitoring;
Kubernetes and other platforms that directly expose a Pull port use Pull mode;
Applications choose Pull or Push based on the actual scenario.

9. SLS’s strategy on Pull and Push

SLS currently supports unified storage and analysis of logs, time series metrics (Metric), and distributed traces (Trace). Its time series monitoring is compatible with the Prometheus format standard and provides standard PromQL syntax. SLS has hundreds of thousands of users whose scenarios vary widely, and no single Pull or Push model can meet all of their needs. SLS therefore does not commit to one route and supports both models. In addition, toward the open source community and its Agents, SLS's strategy is to be fully compatible with the open source ecosystem rather than to build a closed one of its own:

Pull model: fully compatible with Prometheus's scrape capability. Prometheus can act as the Pull Agent and forward data through Remote Write; VMAgent, which has the same scrape capability as Prometheus, can be used the same way; and SLS's own Agent, Logtail, also implements Prometheus-style scraping.
Push model: Telegraf is the most comprehensive Push Agent ecosystem in the industry. SLS's Logtail has Telegraf built in and supports all of Telegraf's hundreds of monitoring plug-ins.

Compared with Pull agents such as VMAgent and Prometheus, and with native Telegraf, SLS additionally provides the capabilities these most need: an Agent configuration center and Agent monitoring. It manages each Agent's collection configuration on the server side and monitors the Agents' running status, which keeps operations and management costs as low as possible.

Therefore, it is very simple to build a monitoring solution using SLS:

In the SLS console (web page), create a MetricStore to store monitoring data;
Deploy the Logtail Agent (a one-line command);
Configure the collection of monitoring data in the console (either Pull or Push).

10. Conclusion

This article discusses the choice that causes the most indecision in monitoring systems: Pull or Push. Drawing on several years of practical experience and a variety of customer scenarios, the author compares Pull and Push along a number of dimensions. It is offered only as a reference for building your own monitoring system. Comments and discussion are welcome.
