Business Background

As the mobile development industry matures and growth shifts to existing users, the load capacity of the overall App architecture and the optimization of each link have gradually become a focus for developers. Stress testing is the main way to evaluate both. Generally, stress testing can be used to:

Test the load bottleneck of backend business;

Today, we will introduce the principles and implementation path of the full-link stress testing solution.

Full-link stress testing and principles

Usually we can make a rough estimate with the formula load performance = single-machine performance * total number of machines. In real scenarios, however, a request passes through many business nodes, such as DNS, the gateway, and the database, and any of them may become the bottleneck for overall business performance. The actual service capacity may therefore differ significantly from the estimate.

Generally, users rely on tools such as LoadRunner to stress test server performance in production environments. For mPaaS applications, however, this approach runs into difficulties such as complex deployment, the inability to pass through the MGS gateway, and high cost. To address these pain points, and based on requests from many customers, the mPaaS team provides the MGS full-link stress testing solution.

The biggest difference between the full-link stress testing solution and previous testing solutions is the perspective. The full-link solution takes the client as the entry point, treats the entire server link as a black box, and uses real requests and responses as the basis for evaluation. It simulates real business requests, real data traffic, and real user habits to obtain evaluation results that are as realistic as possible.
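To make the earlier point about the naive formula concrete, here is a minimal sketch that compares the estimate from load performance = single-machine performance * total number of machines with the capacity of the slowest link. The node names and all numbers are hypothetical and chosen only for illustration; they are not measurements from any mPaaS deployment.

```python
def naive_capacity(single_machine_tps: float, machines: int) -> float:
    """Estimate from the simple formula: single-machine performance * machine count."""
    return single_machine_tps * machines


def link_capacity(node_capacities: dict) -> tuple:
    """Return (bottleneck node, capacity) for a serial request path.

    Every request passes through every node, so the slowest node
    caps the throughput of the whole link.
    """
    bottleneck = min(node_capacities, key=node_capacities.get)
    return bottleneck, node_capacities[bottleneck]


# Hypothetical figures: 8 application servers at 500 TPS each.
estimate = naive_capacity(500, 8)

# Hypothetical per-node capacities (TPS) along the request path.
nodes = {"dns": 20000, "gateway": 2500, "app": estimate, "database": 3000}
bottleneck, actual = link_capacity(nodes)

print(f"naive estimate: {estimate} TPS")
print(f"link capacity:  {actual} TPS (limited by {bottleneck})")
```

In this made-up example the application tier alone suggests 4000 TPS, but the link as a whole is capped by the gateway. Only a test that drives traffic through every node, as the full-link approach does, would reveal that gap.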
Link Sorting

In a standard data link, the following model is generally used.

In the full-link stress test, we regard the entire server implementation as a black box, so we need to focus on the first half. The key points can be summarized as follows:

1. The client constructs the request;
2. The client request is sent and passes through the MGS gateway;
3. The client parses the response returned by the MGS gateway and handles it correctly;
4. A cluster of clients issues requests with high concurrency.

From these points, we can identify the following difficulties.

Difficulty 1: Client request construction

The mPaaS mobile gateway RPC communication is a standardized interface method implemented on top of the HTTP protocol. It reuses the HTTP request standard and defines its own data exchange format, split between Header and Body. Roughly speaking, the Operation-Type in the Header acts as the pointer to the real API, and the Body is encapsulated according to the gateway's rules and forwarded. For this step we use JMeter. JMeter's flexible scripting makes it well suited to simulating real client requests.

Difficulty 2: Data encryption and decryption

The data encryption method specific to mPaaS mobile gateway RPC requests is the more complex part of request construction. Existing client-side test solutions cannot cover this capability, so testers often disable signature verification and encryption on the gateway server in order to run a stress test. The hidden danger of this approach is that it becomes impossible to estimate the computing pressure that encryption and decryption place on the gateway server. In our experience, different encryption and decryption algorithm configurations change gateway throughput by 20% to 40%.
For this step, the Financial Line SRE team develops a customized JMeter plug-in, MGSJMeterExt, based on the user's production environment. The plug-in reproduces the encryption and decryption process of the request body, so that the stress test script can include the encryption step.

Difficulty 3: Request signature construction

The signature verification mechanism of mPaaS mobile gateway RPC requests is also quite special. As with data encryption and decryption, no existing client-side solution covers this capability, so customers often disable interface signature verification for testing. With the help of MGSJMeterExt, JMeter can sign the message correctly and pass server-side verification.

Difficulty 4: Deployment of the stress test cluster environment

To get realistic results, a stress test should use the real traffic entry point and a realistic number of concurrent connections. However, building the stress testing environment yourself and paying for cluster deployment add unnecessary expense. We therefore recommend Alibaba Cloud PTS as the stress testing platform. Compared with other solutions, it is easy to deploy, supports JMeter scripts, and generates real traffic. It also provides more detailed stress testing reports.

Overview

The above model can be summarized into the following structure.

Full-link solution and implementation

Part 1 Preliminary preparation and research

In the early stage, the goal is to prepare for the actual stress test, provide data support, and establish the stress testing goals and overall direction.

1.1 Objectives and data preparation

1. Customers need to clearly define their own stress testing goals and purposes.
2. Based on the stress testing goals and referring to past operational data, customers should provide the specific business categories involved, the likely user behavior habits, and the relative weight of each habit in the overall business.

1.2 Client Preparation

1. Based on the business goals, the customer needs to sort out the interfaces and data flows that the client implementation may involve, for example whether there are pre-steps such as login, or mandatory steps such as refreshing the homepage. Capture packets to collect the actual requests and responses for these steps, and determine which values count as an expected result.
2. Because business structures differ, this preparation can also be completed from the server interface side.

1.3 Server Preparation

1. On the server side, shield the data for the interfaces identified in 1.2 so that test data does not pollute the real database.
2. Since the server is regarded as a black box in the mPaaS full-link stress test, it is necessary to monitor the performance indicators of each service on the server to provide a basis for subsequent server tuning.

1.4 MGSJMeterExt plugin preparation

Since MGSJMeterExt needs to be customized for the actual gateway environment, users are required to provide the following data:

1. Workspace-related environment data
2. Encryption algorithm and public key

Q&A

Q: How to implement stress testing scripts?
A: Our expert team and on-site engineers will provide stress testing script training for simple scenarios. Real scenarios may involve multiple business links, such as obtaining a login token and other explicit pre-steps. Because these involve complex business scenarios, customers need to complete the scripts themselves with the assistance of the Alibaba expert team.

Q: Why is it a full-link?
A: Although our stress testing script is implemented based on client logic, it simulates real data requests and confirms whether the server's response meets expectations, so it exercises the entire data link and every node on it.

Q: How to track link indicators?
A: The stress testing solution is a black-box approach. It verifies the performance of the entire architecture from the user's perspective by checking the system's PTS indicators, request parameters, return rate, and the success rate of expected results. Backend indicators depend on the server architecture, which differs greatly between customers. The corresponding service providers can generally supply monitoring solutions for them, so mPaaS does not need to handle them.

Q: Why use PTS?
A: What the mPaaS team actually provides is the MGS communication solution, which helps customers write PTS scripts. Using PTS is not mandatory. Any JMeter cluster deployment environment will work, and if PTS is used, customers need to purchase the PTS resources themselves. However, after evaluating multiple cases, the mPaaS team found that PTS is relatively more cost-effective and provides a stress testing environment closer to expectations, along with complete stress testing reports. We therefore recommend PTS for stress testing.

Q: Are there any detailed standards, such as what performance indicators should be achieved in the case of 2c4g, or 4c8g?
A: The purpose of stress testing is precisely to establish what performance indicators can be achieved with a given set of system resources. Because server architectures differ and the actual business passes through different process nodes, performance varies greatly between environments. Stress testing is needed to establish the real indicators and evaluate the actual resource consumption of each node.
Part 2 JMeter development and script modification

We have summarized the special points of the MGS communication solution above, so these adaptations need to be completed in JMeter.

2.1 Header transformation

In the Header, we need to pay attention to the following points:

1. The MGS gateway protocol depends on some header fields, so make sure the gateway parameters are complete.
2. Some parameters are fixed values and can be hard-coded. For the related configuration, refer to the configuration file downloaded from the console.
3. If the business depends on other headers such as cookies, you can add them directly. The MGS gateway does not filter header information.

2.2 URL transformation

In the URL, we need to pay attention to the following points:

1. The actual URL should point to the MGS gateway, not the actual business server. For the related configuration, refer to the configuration file downloaded from the console.
2. Currently, all requests to the MGS gateway are POST. If the business interface is a GET, MGS converts the request to a GET when forwarding it, but the communication with MGS itself is still a POST.
3. If there is no special requirement for the Body part, it is recommended to follow the configuration shown in the picture.

2.3 Request transformation

In Request, we need to pay attention to the following points:

1. The encryption and signature logic here depends on the MGSJMeterExt file, which needs to be referenced.
2. In general, you only need to modify the //config part.
3. The remaining parts are generally a unified solution, mainly for encryption and signing, and do not need to be modified.

2.4 Response transformation

In Response, we need to pay attention to the following points:

1. Response handling consumes stress machine resources but does not affect the evaluation of server capacity. Therefore, if the data does not need to be reused and the result does not need to be judged, nothing needs to be written here.
2. If you do have such needs, you can complete the secondary processing of the Response return parameters here.

Part 3 Actual stress test

The general steps can be summarized as follows:

3.1 PTS and script performance tuning

Alibaba Cloud Performance Testing Service (PTS) provides convenient and fast cloud-based stress testing capabilities. In this stress testing service, PTS is used to generate stress traffic from the Internet. Interestingly, encryption and decryption calculations put computing pressure not only on the gateway but also on the stress machine. Therefore, after implementing the first version of the plug-in and the stress test script, we first ran a "stress test" on the stress machine itself.

First round of testing

PTS stress machine configuration:

1. PTS single IP unit configuration
2. Concurrency 500 (maximum concurrency for a single machine)
3. Fixed pressure flow model
4. Two-minute stress test duration

The collected stress test report shows that TPS is not high, yet the returned RT value is not high either.

Next, we observed the performance of the stress machine. Its CPU usage stayed relatively high throughout, so we had reason to suspect that the encryption calculations were limiting how much pressure the stress machine could generate. By caching the encryption results of repeated content, the computing pressure was greatly reduced. To avoid memory problems caused by the cache, an upper limit was set on the cache size.

Second round of testing

The test configuration was exactly the same as in the first round; only the encryption plug-in was replaced with the optimized version. According to the collected test report, the scenario TPS increased by 75%.

The CPU usage of the stress machine also improved noticeably.
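The cache optimization described in 3.1 can be sketched as follows. This is a minimal illustration, not the MGSJMeterExt implementation: the article does not state the plug-in's eviction policy or encryption algorithm, so the least-recently-used eviction, the class name `BoundedCache`, and the `expensive_transform` stand-in (a SHA-256 digest used only to represent costly work) are all assumptions.

```python
import hashlib
from collections import OrderedDict


class BoundedCache:
    """Cache results for repeated inputs, with an upper limit on entries.

    Assumption: least-recently-used eviction. The article only says
    that the cache has an upper limit.
    """

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key: bytes, compute):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._data[key]
        self.misses += 1
        value = compute(key)
        self._data[key] = value
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the least recently used entry
        return value

    def __len__(self):
        return len(self._data)


def expensive_transform(body: bytes) -> bytes:
    """Hypothetical stand-in for the request body encryption step."""
    return hashlib.sha256(body).digest()


cache = BoundedCache(max_entries=2)
for body in [b"a", b"a", b"b", b"c", b"a"]:
    cache.get_or_compute(body, expensive_transform)

print(cache.hits, cache.misses, len(cache))
```

One caveat with this design: caching only helps when request bodies actually repeat, and it is only valid when the same plaintext is allowed to map to the same ciphertext. Whether that holds for a given gateway encryption configuration needs to be checked before reusing results.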
Third round of testing

Building on the investigation in the first round and the optimization in the second, the third round used two stress machines to run a full-load stress test with the same configuration and observed the results.

Judging from the results, the stress testing script and orchestration process are in line with expectations, and formal PTS cloud stress testing can be carried out in the customer's production environment.

3.2 Stress Testing of Production Environment

At the beginning of the formal stress test, several rounds of small-scale stress tests were run to check whether the backend system was working as expected. The following problems were found during the investigation.

Problem 1: Uneven Nginx traffic forwarding

The logs of the MGS containers showed that some containers received no requests at all. The investigation found three causes:

1) The Nginx forwarding configuration in the DMZ zone was missing one MGS container IP;
2) The network policy from the DMZ zone to each MGS container IP had not been opened;
3) The Nginx forwarding rule was set to ip_hash, so in a test with a single source IP, traffic could only be forwarded to one container.

The problem was solved after configuring the correct IP list, opening the network permissions, and changing the forwarding rule.

Problem 2: The CPU load of a specific MGS container is too high

Preliminary tests found that the CPU load of one MGS container (mpaasgw-7) was 25% while idle, which was not in line with expectations. After logging into the container, we found a JPS process consuming a lot of CPU. We suspected that it had not been released properly after being invoked during early debugging. The problem was solved after killing the JPS process. To avoid other problems, we also restarted the container.
Note: JPS (Java Virtual Machine Process Status Tool) is a command provided by Java to display the PIDs of all current Java processes. See: https://docs.oracle.com/javase/7/docs/technotes/tools/share/jps.html.

Problem 3: The CoreWatch monitoring platform is inaccessible

The CoreWatch console could not be accessed, and the browser reported a 502 error. After restarting the CoreWatch container, the page could be opened but stayed in the loading state, and the request to http://corewatch.*.com/xflush/env.js remained pending. The cause turned out to be an error in the ALB instance monitoring configuration. After it was corrected, the problem was solved.

3.3 Production Environment Stress Test & Summary

After all the issues in 3.2 were resolved, the system was ready for stress testing. The formal stress test covers both the "encrypted" scenario and the "non-encrypted" scenario. Since production data is not disclosed, the following only lists some examples of the problems encountered.

Test under "encryption"

1. During stress testing, TPS stopped increasing when the number of concurrent connections reached around 500, which suggests that a bottleneck had been reached.
2. The load on the MGS gateway containers showed that overall CPU usage had reached its limit.
3. The CPU load of the MCUBE container in the same time period was healthy, and other performance indicators (IO, network, etc.) were also healthy.
4. From the above, in the encryption scenario the main performance bottleneck is the MGS gateway. Based on experience and process analysis, the main pressure comes from the intensive calculations during message encryption and decryption. To remove this bottleneck, the MGS containers need to be scaled out.

Testing without encryption

1. TPS growth stopped when the concurrency reached about 1000. Generally speaking, this indicates that a system capacity bottleneck has been reached.
2. The load on the MGS gateway containers was observed. Unlike in the encryption scenario, the overall CPU load was not high.
3. At the same time, the network team reported that during the stress test, the number of TCP sessions from the Internet to the DMZ area was 3 to 4 times the number from the DMZ area to the intranet area, and the CPU pressure on the firewall in the transaction intranet segment was relatively high.
4. Combining these three observations, we suspected that a network bottleneck had been reached. An on-site check found that the Nginx in the DMZ area did not keep long-lived connections when forwarding to the intranet. We modified the Nginx configuration, added keepalive 1000, and ran a second round of testing.

About the keepalive parameter: by default, Nginx uses short connections (HTTP/1.0) to access the backend. For each new request, Nginx opens a new connection to the backend from a new port and actively closes it after the backend finishes processing. The keepalive parameter sets the maximum number of idle long-lived connections to the backend servers that each Nginx worker process keeps cached. When a new request comes in, an existing TCP connection can be reused directly, which avoids the cost of establishing a new TCP connection. See: http://nginx.org/en/docs/http/ngx_http_upstream_module.html.

Summary

After these issues were optimized, performance improved by at least 70% in the non-encrypted scenario and by 10% in the encrypted scenario. A further significant improvement is expected once the MGS scale-out is completed. The results of the optimization far exceeded expectations.