Introduction

With the rapid development of cloud native, cloud-network integration, and big data, automated network operation and maintenance has emerged, and network telemetry (Telemetry) has become increasingly important. As the name suggests, telemetry is a technology for obtaining measurement data remotely. In aerospace, geology, and oceanography, for example, satellite sensor data is obtained through telemetry. Applied to networks, Telemetry remotely collects data at high speed from physical and virtual network devices, providing reliable, real-time, high-precision data for network analysis. Depending on the data source, network Telemetry can be divided into management-plane, control-plane, and data-plane Telemetry. Management-plane Telemetry collects network management data over gRPC/NETCONF, data-plane Telemetry collects forwarding-plane data based on IOAM, and control-plane Telemetry reports control-protocol data based on BMP (BGP Monitoring Protocol). This article introduces Telemetry technology in detail.

Telemetry Advantages

Comparison with traditional network monitoring technology

When it comes to managing and monitoring network devices, the first technology that comes to mind is SNMP (Simple Network Management Protocol), a widely used network monitoring technology. Taking the collection of device CPU usage as an example, the following figure shows how SNMP and Telemetry each interact with a network device to collect data:

As shown in the figure, the two differ in two main ways:

1. Collection model: Telemetry consumes very little device performance. The SNMP collector and the device interact in a request-response (question-and-answer) pattern: every time the collector wants data, it sends an SNMP Get request, and the device must answer each request. Telemetry needs only one subscription request. Once the subscription is in place, the device keeps pushing data to the collector at the interval specified in the subscription, which costs the device very little.
2. Collection interval: Telemetry offers higher precision and frequency. The SNMP Get polling interval depends on how long it takes to poll every monitored object in the network, and the recommended minimum is usually 5 minutes. Telemetry can collect data every second, and at best at sub-second intervals, which is 300 to 30,000 times finer than the 5-minute SNMP cycle.

Therefore, when monitoring the same objects, SNMP works in "pull mode" and the device CPU must answer many more Get requests, whereas Telemetry works in "push mode" and needs only one subscription request, so Telemetry consumes less device CPU. Telemetry also samples far more often, improving granularity from 5 minutes to the sub-second level. It yields more accurate monitoring data while keeping the load on the device under control, enabling precise monitoring of network status.

The following table compares Telemetry with common network monitoring technologies in terms of working mode and collection precision:

As the table shows, Telemetry works in push mode: the device actively pushes data, with sub-second precision. Another key point is that Telemetry data uses standard structures and encodings, which makes it easy to integrate with third-party devices and improves the efficiency and quality of network monitoring.
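To make the pull-versus-push difference concrete, the following Python sketch counts how many requests a device must answer, and how many samples reach the collector, under each mode. It is a deliberately simplified model assumed for illustration (one Get per object per polling cycle, one subscription for Telemetry, no batching or retransmission), and the 10 ms interval used to reproduce the 30,000x figure is an assumption about how that figure was derived.

```python
from dataclasses import dataclass

# Simplified cost model (an assumption for illustration, not vendor data).
SNMP_INTERVAL_MS = 5 * 60 * 1000  # recommended minimum SNMP polling cycle: 5 minutes


@dataclass
class Load:
    requests_answered: int  # requests the device has to parse and answer
    samples: int            # data points that reach the collector


def snmp_pull(objects: int, duration_ms: int, interval_ms: int) -> Load:
    """Pull mode: the device answers one Get request per object per polling cycle."""
    cycles = duration_ms // interval_ms
    return Load(requests_answered=objects * cycles, samples=objects * cycles)


def telemetry_push(objects: int, duration_ms: int, interval_ms: int) -> Load:
    """Push mode: one subscription request, then the device pushes every interval on its own."""
    cycles = duration_ms // interval_ms
    return Load(requests_answered=1, samples=objects * cycles)


def granularity_gain(interval_ms: int) -> int:
    """How many times finer a sampling interval is than the 5-minute SNMP cycle."""
    return SNMP_INTERVAL_MS // interval_ms


if __name__ == "__main__":
    hour_ms = 60 * 60 * 1000
    objects = 100  # hypothetical number of monitored objects on one device
    print("SNMP, 5 min:   ", snmp_pull(objects, hour_ms, SNMP_INTERVAL_MS))
    print("SNMP, 1 s:     ", snmp_pull(objects, hour_ms, 1000))
    print("Telemetry, 1 s:", telemetry_push(objects, hour_ms, 1000))
    # Assumed mapping: 1 s -> 300x, 10 ms -> 30,000x finer than 5 minutes.
    print("Gain at 1 s / 10 ms:", granularity_gain(1000), granularity_gain(10))
```

Under this model, one hour of 5-minute SNMP polling of 100 objects costs the device 1,200 Get requests for 1,200 samples; matching 1-second granularity with SNMP would cost 360,000 requests, while a single Telemetry subscription delivers the same 360,000 samples.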
Although SNMP Trap and Syslog also work in push mode, the range of data they push is limited, and monitoring data such as interface traffic cannot be collected with them in real time.

Technical advantages

The two main advantages of Telemetry, low consumption of device performance and high collection precision, address pain points that traditional network monitoring has always faced. According to the Nyquist sampling theorem, the information in the original signal is fully retained only when the sampling frequency is greater than twice the highest frequency in the signal. When a traditional technology such as SNMP uses a 5-minute collection cycle, detailed information is lost, as shown in the following figure:

If traditional operation and maintenance tools such as SNMP tried to solve this with a shorter collection cycle, their "pull mode" means the more intensive polling would keep driving up the CPU usage of network devices, possibly to the point of paralysis. As a result, operation and maintenance technologies represented by SNMP cannot meet today's need for real-time, end-to-end monitoring in IT operation and maintenance, nor can they detect network problems caused by the many microbursts in a network.

A microburst is a phenomenon in which a large amount of burst traffic arrives within a very short time (milliseconds), so that the instantaneous rate reaches tens or hundreds of times the average rate, or even saturates the port bandwidth. Network management systems and network performance monitoring software usually compute the average over a relatively long period (several minutes) and report it as the real-time bandwidth. This averaging "shaves the peaks and fills the valleys" and produces a fairly smooth curve, even though microbursts may already have caused packet loss on the device and affected application systems.
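The averaging effect can be reproduced with a toy simulation. SNMP interface counters (such as ifHCInOctets) are cumulative, so a rate computed from two polls is the average over the polling interval; the sketch below models that by averaging. All traffic numbers (a 1 Gbit/s port, a 100 Mbit/s baseline, one 300 ms burst at line rate) are assumptions chosen for illustration.

```python
"""Toy model: why a 5-minute average hides a microburst.

Assumed numbers: a 1 Gbit/s port, a steady 100 Mbit/s load, and one 300 ms
burst at line rate, recorded in 100 ms bins.
"""

BIN_MS = 100       # resolution of the simulated "ground truth"
PORT_MBPS = 1000   # assumed port bandwidth (1 Gbit/s)


def build_traffic(duration_s: int = 300, baseline: int = 100,
                  burst_start_bin: int = 1500, burst_bins: int = 3) -> list[int]:
    """Per-bin rates in Mbit/s: constant baseline plus a short burst at line rate."""
    bins = duration_s * 1000 // BIN_MS
    rates = [baseline] * bins
    for i in range(burst_start_bin, burst_start_bin + burst_bins):
        rates[i] = PORT_MBPS  # port saturated: this is where packets may be dropped
    return rates


def resample(rates: list[int], window_ms: int) -> list[float]:
    """Average the bins over fixed windows, as a counter-based collector would see them."""
    step = window_ms // BIN_MS
    return [sum(rates[i:i + step]) / step for i in range(0, len(rates), step)]


if __name__ == "__main__":
    traffic = build_traffic()
    for label, window_ms in [("5 min (SNMP-like)", 300_000), ("1 s", 1_000), ("100 ms", 100)]:
        peak = max(resample(traffic, window_ms))
        print(f"{label:>18}: peak {peak:7.1f} Mbit/s ({peak / PORT_MBPS:.0%} of port)")
```

With these assumed numbers, the 5-minute view peaks at 100.9 Mbit/s (about 10% of the port), the 1-second view at 370 Mbit/s, and only the 100 ms view shows the port saturated.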
The following figure compares traffic curves collected with minute-level SNMP and with sub-second Telemetry. As the figure shows, the port traffic statistics obtained with SNMP Get are relatively smooth, while the Telemetry statistics clearly show microbursts. With Telemetry's high-precision sampling, these microbursts, and the port packet loss they cause, can be detected.

Composition and mechanism of the Telemetry automated operation and maintenance system

Operation and maintenance system components

In a narrow sense, Telemetry is a network device feature. In a broad sense, Telemetry can be understood as a closed-loop automated operation and maintenance system made up of four components: network devices, collectors, analyzers, and controllers, as shown in the following figure:

1. Collector: receives and stores the raw monitoring data reported by network devices. According to the collector's configuration requirements, network devices report the second-level or sub-second-level monitoring data they collect to the collector for storage.
2. Analyzer: analyzes and processes the monitoring data received by the collector and presents the results to users intuitively in a graphical interface.
3. Controller: sends configurations to network devices through NETCONF and other methods to control them. Based on the analysis data provided by the analyzer, the controller adjusts the forwarding behavior of the network devices and controls which data they sample and report.
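The closed loop formed by these components can be sketched in a few lines of Python. This is a toy model, not any product's architecture: the class names, the 0.9 utilization threshold, the sample device and interface names, and the "actions" the controller records are all hypothetical. A real controller would deliver configuration via NETCONF or a similar protocol.

```python
"""Toy closed loop: devices push -> collector stores -> analyzer finds
problems -> controller sends configuration back.

Everything here is hypothetical: class names, the 0.9 utilization threshold
and the "actions" are illustrative only. A real controller would deliver
configuration via NETCONF or a similar protocol instead of recording strings.
"""
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    device: str
    interface: str
    timestamp_ms: int
    utilization: float  # fraction of port bandwidth, 0.0 to 1.0


class Collector:
    """Receives and stores the raw samples that devices push."""

    def __init__(self) -> None:
        self.store: dict[tuple[str, str], list[Sample]] = defaultdict(list)

    def on_push(self, sample: Sample) -> None:
        self.store[(sample.device, sample.interface)].append(sample)


class Analyzer:
    """Turns stored samples into findings: here, interfaces that crossed a threshold."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold

    def find_hotspots(self, collector: Collector) -> list[tuple[str, str]]:
        return sorted(
            key
            for key, samples in collector.store.items()
            if any(s.utilization >= self.threshold for s in samples)
        )


class Controller:
    """Turns findings into (hypothetical) configuration changes for the devices."""

    def __init__(self) -> None:
        self.sent: list[str] = []

    def act(self, hotspots: list[tuple[str, str]]) -> None:
        for device, interface in hotspots:
            # Adjust forwarding behaviour and ask for finer-grained data on this port.
            self.sent.append(f"{device}: shift traffic away from {interface}")
            self.sent.append(f"{device}: increase sampling rate on {interface}")


if __name__ == "__main__":
    collector, analyzer, controller = Collector(), Analyzer(), Controller()
    collector.on_push(Sample("leaf1", "GE1/0/1", 0, 0.35))
    collector.on_push(Sample("leaf1", "GE1/0/1", 100, 0.97))  # burst
    collector.on_push(Sample("leaf2", "GE1/0/3", 0, 0.20))
    controller.act(analyzer.find_hotspots(collector))
    print(controller.sent)
```

Keeping the three roles as separate objects mirrors the text: the collector only stores, the analyzer only interprets, and only the controller writes back to devices, which is what closes the loop.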
Operation and maintenance system working mechanism

The collector, analyzer, and controller all sit on the network management side, which works together with the network device side, as shown in the following figure:

On the network device side, Telemetry organizes data according to YANG models, encodes it in GPB (Google Protocol Buffers) format, and transmits it over gRPC (Google Remote Procedure Call). On the network management side, Telemetry collects, analyzes, and stores the data, and the analysis results provide the basis for adjusting network configuration, as shown in the following figure:

The following explains the concepts and terms involved on the network device side:

Raw data: The data sampled by Telemetry can come from the forwarding plane, control plane, and management plane of a network device. Currently supported data includes interface traffic statistics and device CPU and memory usage.

Data model: Telemetry organizes collected data according to YANG models. YANG is a data modeling language used to define configuration data, state data, remote procedure calls, and notifications, which can then be operated on over various transport protocols.

Encoding format: GPB (Google Protocol Buffers) and JSON (JavaScript Object Notation) are supported. Telemetry uses GPB encoding (GPB message structures are defined in files with the .proto suffix), which provides a flexible, efficient, and automatic serialization mechanism for structured data. GPB is a binary encoding with good performance and high efficiency.

Transport protocol: gRPC (Google Remote Procedure Call) and UDP (User Datagram Protocol) are supported. gRPC is a high-performance, general-purpose, open-source RPC framework released by Google and built on HTTP/2. Both communicating parties build their applications on this framework, so they can focus on business logic while the gRPC framework handles the underlying communication. Note that gRPC can be used for both static and dynamic Telemetry subscriptions, whereas UDP can be used only for static subscriptions.

Telemetry Application Scenarios

Intelligent operation and maintenance in bank data center networks

As far as we know, peer banks such as ZS Bank, BJ Bank, and GS Bank have piloted intelligent operation and maintenance analysis systems in the production and test networks of their data centers. They use Telemetry to collect second-level operation and maintenance data, overcoming the low collection precision of SNMP and monitoring the operating status of network devices in real time.

In such a deployment, an intelligent operation and maintenance system is deployed in the data center network: the collector gathers device performance data through Telemetry, and the analyzer receives the uploaded data and performs statistics, analysis, and presentation. Combined with technologies such as ERSPAN remote traffic mirroring, this enables "1-3-5" intelligent operation and maintenance of the data center network.

Traffic optimization in operators' mobile bearer networks

In an operator's mobile metro network, when traffic paths need to be optimized, Telemetry is used to collect device data and send it to the analyzer for comprehensive analysis and decision-making. The analyzer passes its decision to the controller, which adjusts device configuration and thereby the traffic forwarding paths. The detailed deployment process is as follows:

1. Configure the Telemetry function.
2. Each device actively establishes a gRPC channel with the intelligent operation and maintenance system, and the subscription is configured on the device.
3. Each device reports the subscribed data to the collector through the gRPC channel.
4. The collector receives, stores, and processes the data reported by each device.
5. The analyzer analyzes the data using the big data analysis system.
6. The controller sends tuning instructions to optimize the network.

Summary

With the continued rollout of cloud-native platforms and big data applications, the widespread use of Telemetry network telemetry has brought significant advantages: lower performance overhead on devices, real-time and high-precision monitoring of network data, and the ability to discover and locate network problems caused by microbursts. It offers a new approach to network operation and maintenance in the era of cloud-network integration.