Verse: Do not urge others to be kind without having lived through their suffering.

After more than two weeks of hard work, I finally have some results on why a certain online interface was timing out 100% of the time. Some doubts remain, but given limited time and my own limited ability, I am writing this summary now so that I can pick the problem up again another day.

Egress bandwidth maxed out

It was a fluke that I discovered this problem at all. I vaguely remember that it was a stormy night, and the wind and rain destined this night to be extraordinary. Sure enough, the root cause of the 100% online timeouts was found. Our online interface makes external requests, and when our outbound bandwidth is fully used, those requests naturally take a long time and time out. Of course, this is only the result. The hardships along the way are far beyond what Lao Xu can put into words.

Reflection

We have the result, but we still need to reflect on it. For example, why was there no early warning when the outbound bandwidth was maxed out? Whether that came from confidence that the bandwidth was sufficient or from lack of experience, it is worth remembering. Before the bandwidth problem was actually found, Lao Xu already had some private doubts about the bandwidth, but he did not verify them seriously and only listened to other people's speculation, which delayed the discovery.

httptrace

Sometimes I have to praise Go for its good support for http trace. Lao Xu built a demo on top of it that prints the time taken by each stage of an http request. The demo's output lists the time spent in each stage of one http request. If you are interested, the source code is at https://github.com/Isites/go-coder/blob/master/httptrace/trace.go.
Lao Xu's suspicion about the bandwidth came mainly from online analysis and from tests run with the source code of this demo.

Framework issues

This section is mostly for colleagues at Tencent; readers elsewhere can skip it.

Our company's framework is TarsGo. Online, we set handletimeout to 1500ms. This parameter is meant to cap the total time spent in an interface at 1500ms. Our timeout alarms all fire at 3s, so even with the bandwidth full, this 100% timeout alarm should not have appeared. To find out why, Lao Xu spent some spare time reading the source code and eventually found that the handletimeout control in [email protected] did not take effect. Let's take a look at the problematic source code:
When the total execution time of an interface exceeds handletimeout, the InvokeTimeout method is called to tell the client that the call timed out. However, that logic does not set IRequestId on the response. When the client receives the response packet, it cannot match it to any request, so it keeps waiting until its own timeout fires. The final changes are as follows:
Later, Lao Xu verified with a demo that handletimeout now takes effect. Lao Xu has also submitted an issue and a PR on GitHub for this change, and it has been merged into master. The relevant issue and PR are:

https://github.com/TarsCloud/TarsGo/issues/294
https://github.com/TarsCloud/TarsGo/pull/295

Still have doubts

At this point, the problem has not been fully resolved. The figure shows the maximum time taken by external requests. The spikes are severe and the durations are unreasonable: the red part in the figure takes about 881 seconds, even though we apply strict timeout control when initiating http requests. This is the problem that troubles Lao Xu the most, and the acne on his face these days is proof of the late nights spent on it.

What is even more frightening is that after we replaced the official http package with fasthttp, the spikes disappeared! Lao Xu thought he had at least a basic understanding of Go's http source code, but the cruel reality made him doubt his life. So far, Lao Xu has gone through the http source code again and still found no problem. This will most likely become an unsolved case. I hope experienced readers can share a pointer or two, so that this article can at least have both a beginning and an end. Note that when we switched to fasthttp, the bandwidth was not fully used.

A bright future

Finally, without further ado, here are the pictures!