1. A variety of monitoring systems

Monitoring has always been a core component of IT systems, responsible for discovering problems and helping to locate them. Traditional operations staff, SREs, DevOps engineers, and developers all need to pay attention to the monitoring system and take part in building and optimizing it. Monitoring systems first appeared with mainframe operating systems and basic Linux metrics and have evolved ever since. Today a search turns up hundreds of monitoring systems, and they can be classified in many ways, for example by monitoring object: general-purpose systems (generic monitoring methods that suit most monitored objects) and special-purpose systems (customized for one function, such as Java's JMX, CPU over-temperature protection, hard-disk power-failure protection, UPS switching systems, switch monitoring systems, and dedicated-line monitoring).

2. Pull or Push

A company building a monitoring platform for internal use has many options, whether it builds its own on an open source solution or buys a commercial SaaS product. Either way, the actual implementation has to decide how data is provided to the monitoring platform, or how the platform obtains it. This comes down to the choice of data acquisition method: Pull or Push? As the names imply, a Pull-based monitoring system actively fetches metrics, so the monitored object must be remotely accessible; a Push-based monitoring system does not fetch data itself, and the monitored object actively pushes its metrics instead. The two methods differ in many respects.
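The difference between the two acquisition modes described above can be illustrated with a small in-memory sketch. It is not taken from any real monitoring system: `Backend`, `pull_round`, `push_report`, and the target names are hypothetical, and plain function calls stand in for network requests.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Sample = Dict[str, float]


@dataclass
class Backend:
    """Hypothetical in-memory monitoring backend; stores samples per source."""
    store: Dict[str, List[Sample]] = field(default_factory=dict)

    def write(self, source: str, sample: Sample) -> None:
        self.store.setdefault(source, []).append(sample)


def pull_round(backend: Backend,
               targets: Dict[str, Callable[[], Sample]]) -> Dict[str, str]:
    """Pull: the collector visits every known target and reads its metrics.

    Returns a mapping of failed target name -> error class name, so the
    collector knows which targets were unreachable in this round.
    """
    failed: Dict[str, str] = {}
    for name, read_metrics in targets.items():
        try:
            backend.write(name, read_metrics())
        except OSError as exc:  # covers timeouts and refused connections
            failed[name] = type(exc).__name__
    return failed


def push_report(backend: Backend, name: str, sample: Sample) -> None:
    """Push: the monitored object itself initiates the write."""
    backend.write(name, sample)


# Stand-ins for two remotely accessible metric endpoints (assumed names).
def healthy_target() -> Sample:
    return {"qps": 120.0}


def unreachable_target() -> Sample:
    raise ConnectionRefusedError("connection refused")


backend = Backend()
failed = pull_round(backend, {"app-1": healthy_target,
                              "app-2": unreachable_target})
push_report(backend, "job-7", {"duration_s": 0.4})
```

In the Pull half, the collector needs the full target list and a reachable endpoint on each target; in the Push half, the backend only learns about `job-7` because the job reported itself.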
When building or selecting a monitoring system, it is necessary to understand the advantages and disadvantages of these two methods in advance and choose the appropriate one. A blind choice can be disastrous later for the stability of the monitoring system and for its deployment and operations costs.

3. Pull vs Push Overview

The comparison below covers several aspects. To save readers' time, here is a table that summarizes them; the details are expanded later.

4. Principles and architecture comparison

As shown in the figure above, the core of Pull-model data acquisition is the Pull module, which is usually deployed together with the monitoring backend, as in Prometheus. The core components include service discovery systems: host service discovery (generally relying on the company's own CMDB), application service discovery (such as Consul), and PaaS service discovery (such as Kubernetes). The Pull module needs to be able to connect to these service discovery systems.

In the Push model, the Push Agent pulls metric data from the various monitored objects and pushes it to the server. It can be deployed alongside the monitored system or separately.

5. Pull's distributed solution

In terms of scalability, Push-based data collection is naturally distributed and can scale out horizontally as far as the monitoring backend can keep up. Pull-based scaling is more cumbersome: the Pull module has to be decoupled from the monitoring backend and deployed separately as an Agent. Even then a single-point bottleneck remains, because all agents need to query the service discovery module.

6. Comparison of monitoring capabilities

1. Monitoring target survivability

Liveness is the first and most basic thing monitoring has to establish. Monitoring the liveness of a target is relatively simple in Pull mode.
The Pull center sees directly whether the target's metrics can be requested. If a request fails, it also learns some simple error causes, such as a network timeout or a connection refused by the peer. The Push method is more troublesome. If an application stops reporting, it may have crashed, there may be a network problem, or it may have been migrated to another node. The Pull module interacts with service discovery in real time and the Push module does not, so in Push mode the server can determine the specific cause of a failure only by consulting service discovery itself.

2. Data completeness calculation

Data completeness is an important concept in large-scale monitoring systems. For example, to monitor the QPS of a trading application with 1,000 replicas, the metric has to be aggregated from 1,000 data streams. Suppose an alarm is configured for a 2% drop in QPS. Without a notion of data completeness, a network fluctuation that delays the data reported by more than 20 replicas for a few seconds will trigger a false alarm, because the missing replicas alone can account for a 2% drop in the aggregate. Alarm configuration therefore also needs to take data completeness into account.

Calculating data completeness also depends on the service discovery module. The Pull method collects data in rounds, so the data is complete once a round finishes; even if some pulls fail, the percentage of missing data is known. With the Push method, each agent and application pushes on its own. Push intervals and network delays differ from client to client, so the server has to estimate completeness from historical behavior, which is relatively costly.

3. Short-lifecycle/Serverless application monitoring

Real-world environments contain many short-lifecycle and serverless applications. In cost-sensitive scenarios especially, large numbers of jobs, elastic instances, and serverless applications are used.
For example, an elastic compute instance is started when a rendering task arrives and is destroyed and released as soon as the task finishes. Other examples are machine learning training jobs, event-driven serverless workflows, and periodically executed jobs (such as resource cleanup, capacity checks, and security scans). These applications usually have a very short life cycle (possibly seconds or even milliseconds), which the periodic Pull model finds extremely difficult to monitor. Generally the Push method is required, with the application actively pushing its monitoring data. To handle such short-lived applications, pure Pull systems provide an intermediate layer (such as Prometheus's Pushgateway) that accepts active Push requests from applications and then exposes a Pull port to the monitoring system. However, this adds the management and operations cost of additional intermediate layers, and because Push is being emulated on top of Pull, the reporting delay increases and the metrics of instances that have already disappeared need to be cleaned up promptly.

4. Flexibility and coupling

In terms of flexibility, the Pull mode has some advantage: the metrics to collect can be configured in the Pull module, along with simple calculations or secondary processing on them. The advantage is only relative, though. A Push SDK or Agent can be configured with the same parameters, and with a configuration center, configuration management is also simple.

In terms of coupling, the Pull model is much less coupled to the backend. The application only needs to provide an interface that the backend can understand; it does not need to care which backend it connects to or which metrics the backend needs. The division of labor is clear: application developers only expose their own application metrics, and the SREs (the monitoring system managers) collect them. The Push model is more tightly coupled.
The application needs to be configured with the backend address, authentication information, and so on. If a local Push Agent is used, however, the application only needs to push to a local address, and the cost is relatively low.

7. Operation and maintenance and cost comparison

1. Resource cost

In terms of overall cost, the two methods differ little, but the costs fall on different parties: in Pull mode, the core consumption is on the monitoring system side, and the cost on the application side is lower.

2. Operation and maintenance costs

From the operations perspective, the Pull mode costs more. The components that need to be operated in Pull mode include the various exporters, service discovery, the Pull Agent, and the monitoring backend, while the Push mode only requires the Push Agent, the monitoring backend, and a configuration center (optional, and generally deployed together with the monitoring backend).

One point to note is that in Pull mode the server actively initiates requests to the clients, so the network design has to consider cross-cluster connectivity and the network protection ACLs on the application side. Network connectivity for Push is simpler by comparison: the server only needs to provide a domain name or VIP that every node can reach.

8. How to choose Pull or Push

Among current open source solutions, the Pull mode is represented by the Prometheus family (a family because the default single-node Prometheus has limited scalability, and the community offers many distributed Prometheus solutions such as Thanos, VictoriaMetrics, and Cortex), and the Push mode is represented by the InfluxDB TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor). Both have their own advantages and disadvantages.
In the cloud native context, with the popularity of Prometheus, driven by CNCF and Kubernetes, much open source software has begun to expose Prometheus-style Pull ports. At the same time, many systems were designed in a way that makes a Pull port hard to provide, and monitoring them with a Push Agent is more reasonable.

There is still no clear conclusion on whether applications themselves should use Pull or Push. The choice has to be based on the actual scenarios within the company. For example, if the company's cluster network is very complex, Push is simpler; if there are many short-lifecycle applications, Push is needed; mobile applications can only use Push; and if the system already uses Consul for service discovery, simply exposing a Pull port is enough.

Considering all these factors, the best solution for a company's internal monitoring system is to have both Pull and Push capabilities: the Push Agent is used for host, process, and middleware monitoring.

9. SLS's strategy on Pull and Push

SLS currently supports unified storage and analysis of logs, time series monitoring (Metric), and distributed tracing (Trace). Its time series monitoring is compatible with the Prometheus format standard and also provides standard PromQL syntax. With hundreds of thousands of SLS users, application scenarios vary greatly, and a single Pull or Push approach cannot meet all customer needs. SLS therefore does not commit to a single route and is compatible with both the Pull and Push models. For the open source community and agents, SLS's strategy is to be fully compatible with the open source ecosystem rather than to create a closed ecosystem of its own.

Pull model: fully compatible with Prometheus's scrape capability.
Prometheus can act as the Pull Agent by forwarding data through its Remote Write; VMAgent, which has the same scrape capability as Prometheus, can be used in the same way; and SLS's own agent, Logtail, also implements Prometheus's scrape capability.

Compared with pull agents such as VMAgent and Prometheus, and with native Telegraf, SLS additionally provides the most urgently needed capabilities: an agent configuration center and agent monitoring. The collection configuration of every agent can be managed, and the running status of the agents monitored, on the server side, which keeps operations and management costs as low as possible.

Building a monitoring solution with SLS is therefore very simple: in the SLS console (web page), create a MetricStore to store monitoring data.

10. Conclusion

This article discusses the choice that causes the most indecision in monitoring systems: Pull or Push. The author compares Pull and Push along several dimensions, based on several years of practical experience and a variety of customer scenarios. It is offered only as a reference for building your own monitoring system. Comments and discussion are welcome.