[51CTO.com original article] The author joined WiFi***Key in 2016 and is currently a senior architect there, with 10 years of Internet R&D experience and a fondness for tinkering with technology. His main focus areas are distributed monitoring platforms, call chain tracing, unified logging platforms, application performance management, and stability assurance. In this article, I will share some practical experience in real-time monitoring and explain how WiFi***Key built an end-to-end, full-link APM platform to improve the fault detection rate, shorten the fault handling cycle, reduce the user complaint rate, and build a strong brand image for the company.

The troubles of WiFi***Key's development and operations teams

WiFi***Key originated from the Shanda Innovation Institute. By the end of 2016, our total number of users had exceeded 900 million, with 520 million monthly active users across 223 countries and regions, 400 million connectable hotspots worldwide, and an average of more than 4 billion connections per day.

With daily active users growing massively, the service teams behind WiFi***Key's product lines are fighting a war without gunpowder. More and more application services face traffic surges, architecture expansion, and performance bottlenecks. To cope with and support the rapid growth of the business, we entered the era of componentization and service orientation: SOA, microservices, API gateways.

As each system evolves toward microservices, the number of services and the machine scale keep growing, and the online environment becomes increasingly complex. Engineers face many troubles every day: they cannot perceive failures of online applications in real time; they are at a loss when digging through the massive logs those applications generate; and it is difficult to locate faults in the call links within and between systems.

In short, performance problems and abnormal errors in online applications have become the biggest challenge for developers and operations staff, and troubleshooting them often takes hours or even days, seriously hurting efficiency and business development. WiFi***Key urgently needed to improve its monitoring system to free developers and operations staff from these troubles and improve application performance.

Based on the company's product form and business development, we found that the monitoring system had to answer a series of questions:

◆ Facing WiFi connection requests from massive numbers of users in many regions around the world, how do we safeguard the user connection experience?
◆ How do we use full-link monitoring to improve the success rate of users connecting to WiFi?
◆ With microservices being promoted and implemented on a large scale, the server-side systems of the WiFi key products are becoming more and more complex, and discovering, locating, and handling online faults is getting harder. How do we use full-link monitoring to speed up fault handling?
◆ The mobile business has entered the second, deeper phase of overseas expansion. How can full-link monitoring keep up with the company's global business development?
◆ ……

Full-link monitoring

In the early days, to support business development quickly, we mainly relied on open source monitoring solutions to keep online systems stable: Cat and Zabbix. As the business grew, these open source solutions could no longer meet our needs, and we urgently needed to build a full-link monitoring system suited to our situation:

◆ Multi-dimensional monitoring (system monitoring, business monitoring, application monitoring, log search, call chain tracing, etc.)
◆ Multi-instance support (meeting the need to deploy several application instances on a single physical machine, etc.)
◆ Multi-language support (meeting the monitoring needs of teams working in several development languages, including Go, C++, PHP, etc.)
◆ Multi-data-center support (monitoring applications deployed in several data centers at home and abroad, data synchronization between data centers, etc.)
◆ Multi-channel alerting (alerting through several channels: integration with internal systems, email, phone, SMS, etc.)
◆ Call chain tracing (tracing call chains within and between applications, upgrading and adapting internal middleware, etc.)
◆ Unified log search (centralized search and management of online application logs, Nginx logs, etc.)
◆ ……

Monitoring targets

From the perspective of an "application", we divide monitoring into three scopes: outside the application, inside the application, and between applications, as shown in the following figure:

Outside the application: mainly monitors the application's runtime environment (hardware, network, operating system, etc.)
Inside the application: mainly follows user requests through the different layers inside the application (JVM, URL, Method, SQL, etc.)
Between applications: mainly monitors from the perspective of distributed call chain tracing (dependency analysis, capacity planning, etc.)

The birth of the Roma monitoring system

Based on these actual needs, the WiFi***Key R&D team built the Roma monitoring system. It was named Roma because:
1. Rome was not built in a day (the metrics for each monitoring target need to be improved gradually);
2. All roads lead to Rome (Roma collects data from all kinds of monitoring targets through a variety of collection methods);
3. According to mythology, after the Trojan War some descendants of the Trojans founded the ancient Roman Empire (the continuation of one story, and the birth of a new project).

A complete monitoring system covers every aspect of the monitoring targets in the IT domain. Judging from how monitoring has developed at Internet companies at home and abroad, many companies split different monitoring targets across different R&D teams, which brings problems: wasted manpower, duplicated system construction, inconsistent data assets, and difficulty implementing full-link monitoring. The solutions currently adopted by various companies in the monitoring field are shown in the figure below:

As shown in the figure, the Roma monitoring system aims to absorb the best architectural ideas from all of these approaches and integrate the different monitoring dimensions, making the monitoring system "integrated" and "full-link".

High-availability architecture

With more than 4 billion WiFi connection requests per day, each request passes through dozens of internal microservice systems.
The monitoring dimensions of each microservice cover metrics outside the application, inside the application, and between applications. At present, the Roma monitoring system has to process nearly 100 billion metric data points and nearly 100 TB of log data every day. How does Roma handle such a massive volume of monitoring data? Next, I will analyze this step by step from the perspective of system architecture design.

Architecture principles

For business applications connecting to it, the monitoring system needs to satisfy five requirements:

• Low performance impact: minimize the performance impact on business systems (CPU, load, memory, I/O, etc.)
• Low intrusiveness: easy for business systems to integrate (little or no coding required)
• No internal dependencies: do not depend on the company's internal core systems (avoiding cascading failures when a dependency goes down)
• Unitized deployment: the monitoring system must support unitized deployment (deployable unit by unit across multiple data centers)
• Data centralization: centralized processing, analysis, and storage of monitoring data (for data statistics, etc.)

Overall architecture

The Roma system architecture is shown in the figure below, along with the functions, responsibilities, and uses of each component. The overall architecture is divided into separate processing stages: data collection, data transmission, data synchronization, data analysis, data storage, data quality, data display, and so on. The main technology stack used at each stage of the data flow is shown in the figure below:

Data collection

In-application monitoring is mainly handled by establishing a long-lived TCP connection between the client (inside the application) and the agent on the same machine; the agent also obtains system performance metrics through scheduled scripts.

Facing massive volumes of metric data, Roma aggregates through pre-aggregation at every layer. For example, the metric data for the same URL request is aggregated in the client over a one-minute window, and the statistical result becomes a single record (requests for the same URL within that minute are accumulated, which uses very little memory and reduces the amount of data transmitted). A system that integrates with Roma can estimate the scale of its monitoring data from its number of instances, metric dimensions, collection frequency, and so on. Layered pre-aggregation reduces the amount of data that has to cross the network, lowers storage costs, and saves network bandwidth and disk space.

The implementation principle of in-application monitoring (shown in the figure below) is client-side collection: interception and statistics at each layer inside the application, producing metric data for dimensions such as URL, Method, Exception, and SQL. The collection process for these in-application metric dimensions is shown in the figure below: different counters are defined for different monitoring dimensions, and the data is ultimately collected through the JMX specification. A simplified sketch of the one-minute aggregation idea follows.
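The article does not show Roma's client-side code, so the following is only a minimal Java sketch of the one-minute pre-aggregation idea described above. The class and method names (MinuteUrlAggregator, UrlStats, record, flush) are invented for illustration and are not Roma's real API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: one-minute pre-aggregation of URL metrics on the client side.
public class MinuteUrlAggregator {

    /** Aggregated statistics for one URL within one minute. */
    public static class UrlStats {
        final LongAdder count = new LongAdder();        // number of requests
        final LongAdder totalMillis = new LongAdder();  // accumulated latency
        final LongAdder errors = new LongAdder();       // number of failed requests
    }

    // key = "minuteTimestamp|url" -> stats accumulated for that minute
    private final ConcurrentHashMap<String, UrlStats> window = new ConcurrentHashMap<>();

    /** Called by the interception layer once per finished request. */
    public void record(String url, long elapsedMillis, boolean failed) {
        long minute = System.currentTimeMillis() / 60_000;
        UrlStats stats = window.computeIfAbsent(minute + "|" + url, k -> new UrlStats());
        stats.count.increment();
        stats.totalMillis.add(elapsedMillis);
        if (failed) {
            stats.errors.increment();
        }
    }

    /**
     * Called by a scheduler once per minute: drains the window so that each URL in a
     * given minute yields exactly one record, which the client then hands to the agent.
     * Deliberately simplified; a production version would swap windows atomically so
     * that records arriving during the drain are not lost.
     */
    public Map<String, UrlStats> flush() {
        Map<String, UrlStats> snapshot = new HashMap<>(window);
        window.clear();
        return snapshot;
    }
}
```

In the same spirit, counters like these can be registered with the JVM's platform MBeanServer so that they are readable through the JMX specification mentioned above; the exact counters and collection mechanism Roma uses are not described in the article.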
Data transmission

Data is transmitted using a TLV protocol that supports multiple payload types, such as binary, JSON, and XML. An agent is deployed on each machine and maintains a long-lived TCP connection with the client. The agent's main responsibilities are data forwarding and data collection (reading log files, collecting system metrics, etc.). After obtaining performance metric data, the agent sends it to a Kafka cluster. An independent Kafka cluster is deployed in each data center as a send-side buffer for metric data, which makes it convenient for the back-end nodes to consume and store the data.

To transmit data efficiently, we compared the available message compression methods and finally chose GZIP for its high compression ratio, mainly to save network bandwidth and to keep the massive volume of monitoring data from saturating the bandwidth inside the data center. The sequence diagram for communication between the nodes is shown in the figure below: establish connection -> read configuration -> schedule collection -> report data, and so on.

Data synchronization

There are many overseas carriers and the quality of public network coverage varies; on top of that, carriers' different interconnection strategies mean the price paid is high latency and high packet loss. When key products go overseas, we must first set realistic expectations for the overall network quality. For example, monitoring applications in overseas data centers depends on establishing overseas sites (main data centers) and interconnecting the overseas master sites with the domestic ones. In addition, metric data needs to be processed in tiers, for example by classifying it into real-time, near-real-time, and offline categories (adjusting the sampling strategy for data with different requirements and different volumes).

Since each product line's applications are deployed in multiple data centers, and each application must be monitorable in every data center where it runs, the Roma platform has to support multi-data-center monitoring. To avoid deploying the full set of Roma components in every data center, and to allow unified storage and analysis of the metric data, the data from each data center is eventually synchronized to the main data center, where data analysis and storage take place.

For multi-data-center synchronization we mainly rely on a high-availability deployment of Kafka across data centers. The overall deployment is shown in the following figure:

After comparing MirrorMaker and uReplicator, we decided to do secondary development on top of uReplicator. The main reasons were that with MirrorMaker, replication lag is large when a node fails, the process has to be restarted to add topics dynamically, and blacklist/whitelist management is completely static. uReplicator fixes many of MirrorMaker's shortcomings, but after extensive testing we still ran into problems; we need the ability to manage the replication processes dynamically, without restarting them every time.

Data storage

To meet the storage requirements of the different kinds of monitoring data, we mainly use storage frameworks such as HBase, OpenTSDB, and Elasticsearch. We have stepped into many pitfalls around data storage, which can be summarized as follows:

• Cluster division: divide online storage resources sensibly according to the data scale of each product line's applications. For example, our ES clusters are planned and split by product line, core system, data size, etc.;
• Performance optimization: Linux system-level tuning, TCP tuning, storage parameter tuning, etc.;
• Write patterns: store data in batches (avoid writing single records). For example, for HBase, data can be buffered on the client and submitted in batches, avoiding frequent round trips between the client and the RegionServer (reducing the number of RPC requests), as illustrated in the sketch below.
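The article only names the batching technique, so here is a hedged sketch of one common way to buffer and batch writes with the standard HBase client's BufferedMutator. The table name roma_metrics, the column family, and the MetricRecord class are assumptions made for illustration, not Roma's actual schema.

```java
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical sketch of client-side buffering and batch submission of metrics to HBase.
public class MetricHBaseWriter implements AutoCloseable {

    /** Invented record type: one pre-aggregated metric value per minute. */
    public static class MetricRecord {
        final String rowKey;   // e.g. appId + metric + minute timestamp
        final String name;     // metric dimension, e.g. "url:/api/login"
        final long value;      // aggregated value for that minute
        public MetricRecord(String rowKey, String name, long value) {
            this.rowKey = rowKey; this.name = name; this.value = value;
        }
    }

    private static final byte[] FAMILY = Bytes.toBytes("m");

    private final Connection connection;
    private final BufferedMutator mutator;

    public MetricHBaseWriter() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        connection = ConnectionFactory.createConnection(conf);
        // BufferedMutator keeps Puts in a client-side buffer and sends them to the
        // RegionServers in batches instead of issuing one RPC per record.
        BufferedMutatorParams params =
                new BufferedMutatorParams(TableName.valueOf("roma_metrics"))
                        .writeBufferSize(4 * 1024 * 1024); // flush roughly every 4 MB
        mutator = connection.getBufferedMutator(params);
    }

    /** Buffer a batch of aggregated metric records; the actual RPCs happen in batches. */
    public void write(List<MetricRecord> records) throws IOException {
        for (MetricRecord r : records) {
            Put put = new Put(Bytes.toBytes(r.rowKey));
            put.addColumn(FAMILY, Bytes.toBytes(r.name), Bytes.toBytes(r.value));
            mutator.mutate(put);
        }
        mutator.flush(); // force out anything still buffered for this batch
    }

    @Override
    public void close() throws IOException {
        mutator.close();
        connection.close();
    }
}
```

Buffering on the client trades a little durability (buffered puts can be lost if the process dies before a flush) for far fewer RPCs, which is usually an acceptable trade-off for monitoring data.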
Data quality

Our systems continuously produce large numbers of events, inter-service link messages, and application logs, and all of this data passes through Kafka before it is processed. So how does the platform audit this data in real time? To monitor the health of the Kafka data pipelines and audit every message flowing through Kafka, we investigated Uber's open source audit system, Chaperone. After various tests we decided to build our own system instead, mainly because we want to be able to audit data at any code point on any node. Taking the characteristics of our own data pipeline into account, we set a series of goals: data integrity and latency; near-real-time data quality monitoring; fast localization of data problems (with diagnostic information to help solve them); highly reliable monitoring and auditing; and highly available, extremely stable monitoring platform services.

To meet these goals, the data quality audit system works as follows: audit data is aggregated by time window and the amount of data within each period is counted, so that data loss, delay, and duplication can be detected early and accurately. Corresponding logic removes duplicate, late, and out-of-order data, and various fault-tolerance measures ensure high availability.

Data display

To visualize monitoring metrics, we developed our own front-end visualization project and also integrated third-party open source visualization components (Grafana, Kibana). The main problem we ran into during integration was permission control (SSO integration with internal systems), which we solved mainly with a self-developed permission proxy system, by removing the relevant plug-ins that Kibana ships with, and by improving and extending the ES cluster monitoring plug-in.

Core functions and implementation practices

System monitoring

Our system monitoring mainly uses OpenTSDB for storage and Grafana for display, with read-write separation to reduce the pressure on the TSDB storage layer. While integrating TSDB with Grafana we also ran into the problem of grouped data display (querying the values of a grouping field over massive metric data, addressed by creating independent metric items to query). The following figure shows the monitoring view for one machine:

Application monitoring

For each Java application, we provide different monitoring types for measuring metric data inside the application. A simplified example of such URL-level measurement is sketched below.
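As an illustration of what URL-level measurement inside a Java application can look like, here is a minimal sketch using a servlet Filter. It assumes a servlet-based web application and reuses the hypothetical MinuteUrlAggregator from the earlier sketch; the article does not describe Roma's actual interception mechanism, which may well rely on framework hooks or bytecode instrumentation instead.

```java
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

// Hypothetical sketch of URL-level interception in a servlet-based Java application.
public class UrlMonitoringFilter implements Filter {

    private final MinuteUrlAggregator aggregator = new MinuteUrlAggregator();

    @Override
    public void init(FilterConfig filterConfig) {
        // nothing to initialize in this sketch
    }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        String url = ((HttpServletRequest) request).getRequestURI();
        long start = System.currentTimeMillis();
        boolean failed = false;
        try {
            chain.doFilter(request, response); // run the real request
        } catch (IOException | ServletException | RuntimeException e) {
            failed = true;
            throw e; // monitoring code must never swallow business exceptions
        } finally {
            // Fold this call into the current one-minute aggregation window.
            aggregator.record(url, System.currentTimeMillis() - start, failed);
        }
    }

    @Override
    public void destroy() {
        // nothing to clean up in this sketch
    }
}
```

In a real agent-based APM system this kind of measurement is usually injected automatically (for example via a Java agent), so business code does not need to register such filters by hand.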
Business monitoring

For business monitoring, business metrics can be collected in several ways, such as in code, through log output, or via an HTTP interface, with multi-dimensional report display supported, as shown in the following figure:

Our business monitoring lets each application team onboard easily in a self-service way, as shown in the following figure:

Log search

To support R&D staff in online troubleshooting, we built a unified log search platform that makes it easier to locate problems in massive volumes of logs.

Future outlook

With the rapid development of emerging IT technologies, the Roma monitoring system will evolve in the following directions:

• Multi-language support: meet monitoring needs across languages (performance monitoring, business monitoring, log search, etc.)
• Intelligent monitoring: improve the timeliness and accuracy of alerts and avoid alert storms (ITOA, AIOps)
• Container monitoring: as container technology is validated and rolled out, container monitoring is starting to be deployed

Summary

Roma is a full-link monitoring platform capable of deep application monitoring. It covers monitoring targets in different dimensions outside the application, inside the application, and between applications, such as application monitoring, business monitoring, system monitoring, middleware monitoring, unified log search, and call chain tracing. It helps developers quickly diagnose faults, locate performance bottlenecks, map out architecture, analyze dependencies, and evaluate capacity.

[51CTO original article, please indicate the original author and source as 51CTO.com when reprinting on partner sites]