A 100% timeout incident caused by maxed-out bandwidth!

Epigraph: Do not urge others to be kind until you have lived through their suffering.

After more than two weeks of hard work, I finally have some results from investigating why one of our online interfaces was timing out 100% of the time. Some doubts remain, but given limited time and my own limited ability, I am writing this summary now so that I can pick the problem up again another day.

Egress bandwidth maxed out

Discovering this problem was a fluke. I vaguely remember a stormy night; the wind and rain marked it as no ordinary night. Sure enough, that was when the root cause of the 100% online timeouts was found!

Our online interface makes requests to external services. When our egress bandwidth is saturated, those requests naturally take much longer, which causes the timeouts. Of course, this is only the conclusion. The hardships along the way are more than Lao Xu can put into words.

Reflection

We have the result, but we still need to reflect on it. For example, why was there no early warning when the egress bandwidth was maxed out? Whether the cause was overconfidence that the bandwidth was sufficient or simple lack of experience, it is a lesson worth remembering.

Before the bandwidth problem was confirmed, Lao Xu already had some suspicion about the bandwidth, but he did not verify it seriously and instead listened to other people's speculation, which delayed the discovery of the problem.

httptrace

I have to praise Go for its good support for HTTP tracing through the net/http/httptrace package. Lao Xu built a demo on top of it that prints the time taken by each stage of an HTTP request.

The demo outputs the time spent in each stage of an HTTP request. If you are interested, the source code is at https://github.com/Isites/go-coder/blob/master/httptrace/trace.go.

Lao Xu's suspicion about bandwidth came mainly from analyzing the production behavior and from testing with the source code of this demo.

Framework issues

This section is mainly for colleagues at Tencent; readers who do not use Tencent's technology stack can skip it.

Our company's framework is TarsGo. In production we set handletimeout to 1500ms. This parameter caps the total time an interface may spend handling a request at 1500ms. Our timeout alarm threshold is 3s, so even with the bandwidth saturated, a 100% timeout alarm should not have fired.

To find out why, Lao Xu spent some spare time reading the source code, and finally found that the handletimeout control did not take effect in the TarsGo version we were using.

Let's take a look at the problematic source code:

    func (s *TarsProtocol) InvokeTimeout(pkg []byte) []byte {
        rspPackage := requestf.ResponsePacket{}
        rspPackage.IRet = 1
        rspPackage.SResultDesc = "server invoke timeout"
        return s.rsp2Byte(&rspPackage)
    }

When the total execution time of an interface exceeds handletimeout, the InvokeTimeout method is called to tell the client that the call timed out. However, the logic above never sets IRequestId in the response. When the client receives the response packet, it cannot match it to any pending request, so it keeps waiting until its own timeout expires.

The final changes are as follows:

    func (s *TarsProtocol) InvokeTimeout(pkg []byte) []byte {
        rspPackage := requestf.ResponsePacket{}
        // The timeout response must carry the request's IRequestId so that
        // the client can match it to the pending call.
        reqPackage := requestf.RequestPacket{}
        is := codec.NewReader(pkg[4:])
        reqPackage.ReadFrom(is)
        rspPackage.IRequestId = reqPackage.IRequestId
        rspPackage.IRet = 1
        rspPackage.SResultDesc = "server invoke timeout"
        return s.rsp2Byte(&rspPackage)
    }

Lao Xu then verified with a demo that handletimeout now takes effect. He also filed an issue and a PR on GitHub for this change, and the PR has been merged into master. The issue and PR are:

https://github.com/TarsCloud/TarsGo/issues/294

https://github.com/TarsCloud/TarsGo/pull/295

Remaining doubts

At this point, the problem is still not completely resolved.

The monitoring graph of our external requests shows their maximum latency. The spikes are severe and the latencies are unreasonable: the red part of the graph is about 881 seconds. Yet we apply strict timeout control when issuing HTTP requests. This is the problem that troubles Lao Xu most, and the acne on his face these days is proof of the late nights spent on it.

Even more unsettling, after we replaced the standard library's net/http with fasthttp, the spikes disappeared! Lao Xu thought he had at least a basic understanding of Go's HTTP source code, but this cruel reality made him question everything.

So far, Lao Xu has skimmed the net/http source code once more and still found no problem. This will most likely remain an unsolved case. I hope experienced readers can share a pointer or two, so that this article can at least have a proper ending.

Note that at the time we switched to fasthttp, the bandwidth was not saturated.

A bright future

Finally, without further ado, here are the pictures!
