How to choose between open source self-built, open source hosted, and commercial self-developed Trace?

With the rise of microservice architecture, call dependencies on the server side are becoming more and more complex. To quickly locate abnormal components and performance bottlenecks, adopting distributed tracing has become a consensus in the IT operations field. However, what are the differences between open source self-built, open source hosted, and commercial self-developed Trace products, and how should you choose? This is a question many users run into when researching Trace solutions, and it is also where the most misunderstandings arise.

To answer this question, we need to start from two angles. First, sort out the core risks and typical problem scenarios of online applications. Second, compare the capabilities of the three Trace solutions: open source self-built, open source hosted, and commercial self-developed. As the saying goes, "know yourself and know your enemy, and you will never be defeated in a hundred battles." Only by combining these with your actual situation can you choose the most suitable solution.

1. “Two types of risks” and “ten typical problems”

Online application risks fall mainly into two categories: "errors" and "slowness". "Errors" usually mean the program does not run as expected, for example the JVM loads the wrong version of a class, the code enters an abnormal branch, or the environment is misconfigured. "Slowness" is usually caused by insufficient resources, for example sudden traffic saturating the CPU, the microservice or database thread pool being exhausted, or memory leaks causing continuous FullGC.

Whether the problem is an "error" or "slowness", from the user's perspective the goal is the same: quickly locate the root cause, stop the loss in time, and eliminate hidden dangers. However, based on the author's more than five years of experience in Trace development, operations, and Double Eleven promotion preparation, most online problems cannot be effectively located and resolved with the basic capabilities of distributed tracing alone. The complexity of online systems means that an excellent Trace product must provide more comprehensive and effective data diagnostic capabilities, such as code-level diagnosis, memory analysis, and thread pool analysis; at the same time, to improve the usability and stability of the Trace components themselves, it also needs capabilities such as dynamic sampling, lossless statistics, and automatic convergence of interface names. This is why the mainstream Trace products in the industry are gradually evolving toward APM and application observability. For ease of understanding, this article still uses "Trace" to describe observability capabilities at the application layer.

To sum up, in order to ensure the business stability of online applications, when selecting a distributed tracing solution, in addition to the general basic capabilities of Trace (such as call chain, service monitoring, and link topology), you can also refer to the "ten typical problems" listed below (taking Java applications as an example) to compare the differences between open source self-built, open source hosted, and commercial self-developed Trace products.

1. [Code-level automatic diagnosis] The interface occasionally times out. The call chain only shows the name of the timed-out interface, not the internal methods, so the root cause cannot be located and the problem is difficult to reproduce. What should I do?

Anyone responsible for stability should be familiar with this scenario: the system occasionally has interface timeouts at night or during big promotions. By the time the problem is discovered and investigated, the abnormal scene is gone, it is difficult to reproduce, and it cannot be diagnosed with a manual jstack. Current open source tracing implementations generally only show the timed-out interface on the call chain; the specific cause and the offending code can never be located, and in the end nothing can be done. The scenario repeats until it causes a failure and, ultimately, huge business losses.

To solve this, an accurate and lightweight automatic slow-call monitoring function is needed. Without any prior instrumentation, it can faithfully restore the scene of code execution and automatically record the complete method stack of slow calls. As shown in the figure below, when an interface call exceeds a certain threshold (for example, 2 seconds), monitoring of the thread handling the slow request is started and continues until the request ends (at the 15th second in this example). The set of thread snapshots taken during the lifecycle of the request is retained, and the complete method stack and time consumption are reconstructed from them.
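As a minimal sketch of this idea (not the actual implementation of any product), the snippet below wraps a request, and once it runs past a threshold, periodically snapshots the worker thread's stack until the request ends. Class and constant names such as SlowCallSampler and SAMPLE_INTERVAL_MS are illustrative assumptions.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Queue;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class SlowCallSampler {
    private static final long THRESHOLD_MS = 2000;      // start sampling after 2s
    private static final long SAMPLE_INTERVAL_MS = 50;  // snapshot every 50ms
    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();
    private static final ScheduledExecutorService SCHEDULER =
            Executors.newSingleThreadScheduledExecutor();

    public static <T> T traceSlowCall(Callable<T> request) throws Exception {
        long workerId = Thread.currentThread().getId();
        Queue<ThreadInfo> snapshots = new ConcurrentLinkedQueue<>();
        // Once the threshold has elapsed, sample the worker thread's stack at a fixed interval.
        ScheduledFuture<?> sampler = SCHEDULER.scheduleAtFixedRate(() -> {
            ThreadInfo info = THREADS.getThreadInfo(workerId, Integer.MAX_VALUE);
            if (info != null) {
                snapshots.add(info);
            }
        }, THRESHOLD_MS, SAMPLE_INTERVAL_MS, TimeUnit.MILLISECONDS);

        long start = System.currentTimeMillis();
        try {
            return request.call();
        } finally {
            sampler.cancel(false);
            if (System.currentTimeMillis() - start > THRESHOLD_MS) {
                // A real agent would merge these snapshots into a method stack with
                // per-frame time estimates and attach them to the slow call's trace.
                snapshots.forEach(System.out::println);
            }
        }
    }
}
```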

2. [Pooled Monitoring] The microservice/database thread pool is often full, causing service timeouts. It is very difficult to troubleshoot. How to solve this problem?

A full microservice/database thread pool causing business requests to time out is a problem that occurs frequently every day. Engineers with rich diagnostic experience will instinctively check the logs of the corresponding component; for example, Dubbo outputs related exception records when its thread pool is full. However, if the component does not log thread pool information, or the operations engineer is less experienced in troubleshooting, this type of problem becomes very difficult. At present, open source Trace products generally provide only JVM overview monitoring; it is impossible to inspect the status of each thread pool, let alone determine whether a pool is exhausted.

The pooled monitoring provided by commercial self-developed Trace products can directly show the maximum number of threads, current number of threads, number of active threads, and so on for a specified thread pool, so the risk of thread pool exhaustion or a high water level is clear at a glance. In addition, you can set alerts on thread pool usage, for example a warning when the current number of Tomcat threads exceeds 80% of the maximum, and a phone alarm when it reaches 100%.
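A minimal sketch of the gauges behind such pooled monitoring, assuming direct access to a JDK ThreadPoolExecutor; the 80% warning mirrors the Tomcat example above, and the pool name and alert channel are illustrative.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ThreadPoolMonitor {
    public static void report(String poolName, ThreadPoolExecutor pool) {
        int max = pool.getMaximumPoolSize();   // configured upper bound
        int current = pool.getPoolSize();      // threads currently in the pool
        int active = pool.getActiveCount();    // threads executing tasks right now
        int queued = pool.getQueue().size();   // tasks waiting for a free thread
        System.out.printf("%s max=%d current=%d active=%d queued=%d%n",
                poolName, max, current, active, queued);
        if (active >= max * 0.8) {
            // A real product would raise a warning or phone alarm here.
            System.err.printf("ALERT: %s is at %.0f%% of capacity%n",
                    poolName, 100.0 * active / max);
        }
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 8, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>(100));
        report("biz-pool", pool);
    }
}
```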

3. [Thread Analysis] After a major promotion stress test or a release change, CPU usage is found to be very high. How do we analyze the application's performance bottleneck and optimize it in a targeted way?

When we run stress tests for big promotions or release major version changes (involving a lot of new code logic), we may see CPU usage suddenly spike without being able to pinpoint which section of code is responsible. We can only run jstack over and over, compare thread state changes by eye, and keep trying optimizations based on experience. In the end, a lot of effort is spent for mediocre results.

So is there a way to quickly analyze application performance bottlenecks? The answer is definitely yes, and there is more than one. The most common approach is to manually trigger a ThreadDump for a period of time (say 5 minutes) and then analyze the thread overhead and method stack snapshots collected during that window. The drawbacks of a manual ThreadDump are that its performance overhead is relatively high, it cannot be run continuously as a routine practice, and it cannot automatically preserve snapshots of incidents that have already happened. For example, if CPU spikes during a stress test, by the time the test is over and the review starts, the scene is gone and it is too late to run a ThreadDump.

The second approach is an always-on thread analysis function that automatically records the state, count, CPU time, and internal method stacks of each category of thread pool. For any time window, you can sort by CPU time to locate the thread category with the largest CPU overhead, then click into the method stack to see the exact code stuck point. As shown in the figure below, a large number of BLOCKED methods are stuck acquiring database connections, which can be optimized by enlarging the database connection pool.
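A minimal sketch of the underlying measurement, assuming only the JDK's ThreadMXBean: rank live threads by accumulated CPU time so the hottest thread (or pool) can be inspected first. A real agent would additionally diff CPU time between two sample points and retain the associated method stacks over time; this is a one-shot ranking.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Map;
import java.util.TreeMap;

public class ThreadCpuRanking {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported()) {
            return; // JVM does not expose per-thread CPU time
        }
        Map<Long, Long> cpuByThread = new TreeMap<>();
        for (long id : mx.getAllThreadIds()) {
            long cpuNanos = mx.getThreadCpuTime(id); // -1 if the thread has died
            if (cpuNanos > 0) {
                cpuByThread.put(id, cpuNanos);
            }
        }
        cpuByThread.entrySet().stream()
                .sorted(Map.Entry.<Long, Long>comparingByValue().reversed())
                .limit(10)
                .forEach(e -> {
                    ThreadInfo info = mx.getThreadInfo(e.getKey(), 5);
                    if (info == null) {
                        return; // thread terminated between samples
                    }
                    System.out.printf("%-40s state=%s cpu=%dms%n",
                            info.getThreadName(), info.getThreadState(),
                            e.getValue() / 1_000_000);
                });
    }
}
```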

4. [Exception Diagnosis] After a release or configuration change, the interface reports a large number of errors, but the cause cannot be located immediately, resulting in a business failure. What should I do?

The biggest culprit affecting online stability is change. Whether it is an application release or a dynamic configuration change, it may cause the program to behave abnormally. So how can we quickly assess the risk of a change, discover problems as early as possible, and stop losses in time?

Here, I would like to share an exception-release interception practice from Alibaba's internal release system. One of its most important monitoring indicators is the comparison of Java Exception/Error counts. Whether it is an NPE (NullPointerException) or an OOM (OutOfMemoryError), monitoring and alerting on the number of all exceptions, or of specific exceptions, allows online anomalies to be discovered quickly, especially when counts are compared before and after a change on the timeline.

On the independent exception analysis and diagnosis page, you can view the change trend and stack details of each type of exception, and further view the associated interface distribution, as shown in the following figure.
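A minimal sketch of the counting side of this practice, assuming a hypothetical record() entry point (a real agent would populate it via bytecode instrumentation of throw/catch sites rather than explicit calls):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class ExceptionCounter {
    private static final Map<String, LongAdder> COUNTS = new ConcurrentHashMap<>();

    /** Count every thrown exception by class name (e.g. java.lang.NullPointerException). */
    public static void record(Throwable t) {
        COUNTS.computeIfAbsent(t.getClass().getName(), k -> new LongAdder()).increment();
    }

    /** Snapshot and reset, e.g. once per minute, then compare against the pre-change baseline. */
    public static Map<String, Long> drain() {
        Map<String, Long> snapshot = new ConcurrentHashMap<>();
        COUNTS.forEach((name, adder) -> snapshot.put(name, adder.sumThenReset()));
        return snapshot;
    }
}
```

Comparing consecutive drain() snapshots before and after the change timeline is what makes a sudden jump in a specific exception type stand out.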

5. [Memory Diagnosis] The application triggers frequent FullGCs and a memory leak is suspected, but the abnormal objects cannot be located. What should I do?

FullGC is definitely one of the most common problems in Java applications. Objects being created too fast, memory leaks, and various other causes can all trigger FGC. The most effective way to troubleshoot it is to take a HeapDump of the heap memory, after which the memory usage of each kind of object is clear at a glance.

The console-based ("white screen") memory snapshot function can trigger a one-click HeapDump and analysis on a specified machine, greatly improving the efficiency of troubleshooting memory problems. It also supports automatically dumping and saving abnormal snapshots in memory leak scenarios, as shown in the following figure:
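Under the hood, a one-click HeapDump on a HotSpot JVM ultimately relies on the HotSpotDiagnostic MXBean; here is a minimal sketch, with the output path being an illustrative assumption. The resulting .hprof file can then be analyzed with tools such as Eclipse MAT.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    public static void dump(String filePath) throws Exception {
        HotSpotDiagnosticMXBean diagnostic = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // true = dump only live objects (forces a GC first), keeping the file smaller.
        diagnostic.dumpHeap(filePath, true);
    }

    public static void main(String[] args) throws Exception {
        dump("/tmp/app-heap.hprof"); // illustrative path
    }
}
```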

6. [Online Debugging] For the same code, the online running state and local debugging behavior are inconsistent. How to troubleshoot?

Code that passed local debugging throws all kinds of errors once it is deployed to the production environment. What went wrong? I believe every developer has experienced this nightmare. There are many possible reasons, such as Maven dependency conflicts, dynamic configuration parameters that differ between environments, and differences in dependent components across environments.

To deal with online code that does not behave as expected, we need an online debugging and diagnostic tool that can, in real time, view the source code, input and output parameters, executed method stack and time consumption, and static or instance field values of the running program, making online debugging as convenient as local debugging, as shown in the following figure:
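Beyond such a tool's interactive views, one simple check it typically offers is "which jar was this class actually loaded from", which quickly exposes the Maven dependency conflicts mentioned above. A minimal sketch of that single check (not a full debugging tool) follows; the class passed to main is just the example class itself.

```java
import java.security.CodeSource;

public class ClassOriginInspector {
    /** Return the jar or directory a class was loaded from, to compare prod vs. local. */
    public static String originOf(String className) throws ClassNotFoundException {
        Class<?> clazz = Class.forName(className);
        CodeSource source = clazz.getProtectionDomain().getCodeSource();
        // CodeSource (or its location) may be null for JDK bootstrap classes.
        return (source == null || source.getLocation() == null)
                ? "bootstrap/unknown"
                : source.getLocation().toString();
    }

    public static void main(String[] args) throws Exception {
        // Replace with the suspect class name, e.g. a conflicting Jackson or Dubbo class.
        System.out.println(originOf(ClassOriginInspector.class.getName()));
    }
}
```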

7. [Full-link tracking] Users report that the website opens very slowly. How can we implement full-link trace tracking from the web front end to the server side?

The key to connecting front-end and back-end links is for both to follow the same trace context propagation protocol. Currently, open source solutions generally only cover back-end applications and lack front-end instrumentation (such as Web/H5 and mini programs). The front-end and back-end full-link tracking solution is shown in the figure below:

Header propagation format: the Jaeger format is used uniformly; the Key is uber-trace-id and the Value is {trace-id}:{span-id}:{parent-span-id}:{flags} (see the sketch below).
Front-end access: two low-code access methods are available, CDN (script injection) or NPM, supporting Web/H5, Weex, and various mini-program scenarios.
Backend access: for Java applications, the ARMS Agent is recommended first; its non-intrusive instrumentation requires no code changes and supports advanced functions such as edge diagnosis, lossless statistics, and precise sampling. User-defined methods can be instrumented actively via the OpenTelemetry SDK. Non-Java applications are recommended to connect via Jaeger and report data to the ARMS Endpoint; ARMS is fully compatible with trace propagation and display across multi-language applications.
Alibaba Cloud ARMS's current full-link tracing solution is based on the Jaeger protocol, and support for the SkyWalking protocol is under development so that self-built SkyWalking users can migrate losslessly. The call chain effect of full-link tracing across the front end, Java applications, and non-Java applications is shown in the following figure:
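A minimal sketch of building and parsing the uber-trace-id header in the {trace-id}:{span-id}:{parent-span-id}:{flags} format described above, which is what lets front-end and back-end spans join the same trace; the sample IDs are illustrative.

```java
public class UberTraceId {
    public static String build(String traceId, String spanId, String parentSpanId, int flags) {
        return traceId + ":" + spanId + ":" + parentSpanId + ":" + Integer.toHexString(flags);
    }

    public static String[] parse(String header) {
        String[] parts = header.split(":");
        if (parts.length != 4) {
            throw new IllegalArgumentException("malformed uber-trace-id: " + header);
        }
        return parts; // [traceId, spanId, parentSpanId, flags]
    }

    public static void main(String[] args) {
        // flags = 1 means "sampled" in the Jaeger propagation format.
        String header = build("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", "0", 1);
        System.out.println("uber-trace-id: " + header);
        System.out.println("trace-id: " + parse(header)[0]);
    }
}
```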

8. [Lossless Statistics] The call chain log cost is too high. After client sampling is enabled, the monitoring chart is inaccurate. How to solve this problem?

The volume of call chain logs is directly proportional to traffic. For consumer-facing (to-C) businesses, traffic is very large, and reporting and storing the full call chain would be very expensive. However, if client-side sampling is enabled, the problem of inaccurate statistical indicators arises. For example, with a sampling rate of 1%, only 100 out of 10,000 requests are recorded; statistics aggregated from those 100 logs suffer from serious sample skew and cannot accurately reflect actual service traffic or latency.

To solve this, the client agent needs to support lossless statistics: each metric is pre-aggregated locally, and only one record per metric is reported per time window (usually 15 seconds), no matter how many requests occurred. This way, the statistical results are always accurate and unaffected by the call chain sampling rate. Users can adjust the sampling rate with confidence, and call chain cost can be reduced by more than 90%. The larger the user's traffic and cluster size, the more significant the cost savings.
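A minimal sketch of such agent-side pre-aggregation, assuming a window of the 15 seconds mentioned above and a scheduler (not shown) that calls flush() once per window; the class and field names are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class EndpointAggregator {
    static final class Stat {
        final LongAdder count = new LongAdder();
        final LongAdder errorCount = new LongAdder();
        final LongAdder totalMillis = new LongAdder();
    }

    private final Map<String, Stat> window = new ConcurrentHashMap<>();

    /** Called on every request, whether or not its call chain is sampled. */
    public void record(String endpoint, long durationMillis, boolean error) {
        Stat stat = window.computeIfAbsent(endpoint, k -> new Stat());
        stat.count.increment();
        stat.totalMillis.add(durationMillis);
        if (error) {
            stat.errorCount.increment();
        }
    }

    /** Flushed once per window; one aggregated record per endpoint regardless of traffic. */
    public void flush() {
        window.forEach((endpoint, s) -> System.out.printf(
                "%s count=%d errors=%d avgMs=%.1f%n",
                endpoint, s.count.sum(), s.errorCount.sum(),
                s.count.sum() == 0 ? 0.0 : (double) s.totalMillis.sum() / s.count.sum()));
        window.clear();
    }
}
```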

9. [Automatic interface name convergence] Because of parameters such as timestamps and UIDs, the URL names of RESTful interfaces diverge, and the monitoring charts are fragmented into meaningless single-point series. How to solve this problem?

When interface names contain variable parameters such as timestamps and UIDs, interfaces of the same type end up with different names, each appearing only rarely. They are not worth monitoring individually, and they create hot spots in storage and computing that affect cluster stability. In this case, we need to classify and aggregate the divergent interfaces to improve both the analytical value of the data and cluster stability.

What is needed is an automatic convergence algorithm for interface names that can proactively identify variable parameters, aggregate interfaces of the same type, and show the trend for each category, which better matches users' monitoring needs; at the same time it avoids the data hot spots caused by interface divergence and improves overall stability and performance. As shown in the following figure, /safe/getXXXInfo/xxxx is grouped into one category; otherwise each request becomes a chart with a single data point and readability suffers badly.
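A minimal rule-based sketch of the convergence idea: variable path segments such as numeric IDs or long hex/UUID tokens are replaced with a placeholder so that /safe/getXXXInfo/1234 and /safe/getXXXInfo/5678 aggregate into one series. The patterns are illustrative assumptions; a production algorithm would learn them from traffic rather than hard-code them.

```java
import java.util.regex.Pattern;

public class UrlNormalizer {
    private static final Pattern NUMERIC = Pattern.compile("^\\d+$");
    private static final Pattern HEX_TOKEN = Pattern.compile("^[0-9a-fA-F-]{16,}$");

    public static String normalize(String path) {
        StringBuilder out = new StringBuilder();
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue;
            }
            out.append('/');
            if (NUMERIC.matcher(segment).matches() || HEX_TOKEN.matcher(segment).matches()) {
                out.append("{var}");   // collapse variable segments into one bucket
            } else {
                out.append(segment);
            }
        }
        return out.length() == 0 ? "/" : out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("/safe/getXXXInfo/1630549200123")); // -> /safe/getXXXInfo/{var}
    }
}
```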

10. [Dynamic Configuration Delivery] Sudden online traffic leads to insufficient resources, and non-core functions need to be downgraded immediately. How can dynamic downgrade or optimization be achieved without restarting the application?

Accidents always come unexpectedly. Traffic bursts, external attacks, and data center failures can all lead to insufficient system resources. To protect the most important core business, we often need to dynamically downgrade some non-core functions to free resources without restarting the application, such as lowering the client call chain sampling rate or shutting down diagnostic modules with high performance overhead. Conversely, we sometimes need to dynamically enable high-overhead deep diagnostic functions, such as a memory dump, to analyze the current abnormal scene.

Whether downgrading or enabling features dynamically, configuration must be pushed down without restarting the application. Open source Trace products usually lack this capability; you would need to build a metadata configuration center and make corresponding code changes. Commercial Trace products not only support dynamic configuration push but can also refine it to independent per-application configuration. For example, if application A has occasional slow calls, the automatic slow-call diagnosis switch can be turned on for it; application B, which is sensitive to CPU overhead, can keep the switch off. Each application takes what it needs without affecting the other.
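A minimal sketch of the agent side of such dynamic configuration, assuming some push channel (or polling loop) that calls onPush() when operators change settings in a config center; the key names and defaults are illustrative assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

public class DynamicConfig {
    // Latest config snapshot; swapped atomically so readers never see a partial update.
    private static final AtomicReference<Map<String, String>> CURRENT =
            new AtomicReference<>(new ConcurrentHashMap<>());

    /** Invoked by the config-center push channel; takes effect without a restart. */
    public static void onPush(Map<String, String> newConfig) {
        CURRENT.set(new ConcurrentHashMap<>(newConfig));
    }

    /** Call chain sampling rate, lowered on the fly during traffic bursts. */
    public static double samplingRate() {
        return Double.parseDouble(CURRENT.get().getOrDefault("trace.sampling.rate", "0.1"));
    }

    /** Per-application switch for the high-overhead slow-call diagnosis module. */
    public static boolean slowCallDiagnosisEnabled() {
        return Boolean.parseBoolean(CURRENT.get().getOrDefault("diagnosis.slowcall.enabled", "false"));
    }
}
```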

2. Open source self-built vs. open source hosted vs. commercial self-developed
