How to troubleshoot 502 issues? Have you learned how to do it?

When I first started working, a colleague whose service called mine came to me and said, "Your service is returning 502 errors. Go figure out why."

My service happened to keep a call log that recorded the usual 200 and 4xx status codes, so I searched the log for the number 502 and found nothing. I went back to him and said, "There's no trace of any 502 in my service log. Are you sure you're not mistaken?"

Thinking about it now, I feel a little embarrassed.

I suspect plenty of readers have been in the same situation I was in back then. So in this article, let's talk about what a 502 error actually is.

Let's start with what HTTP status codes are.

HTTP Status Codes

The Taobao and Baidu pages we browse every day are, in essence, front-end web pages.

Generally speaking, the front end does not store too much data, and most of the time it needs to get data from the back end server.

Therefore, the front-end and back-end need to establish a connection through the TCP protocol, and then transmit data based on TCP.

TCP is a stream-oriented protocol: it does not add boundaries between messages, so sending application data over raw TCP leads to the so-called "sticky packet" problem, where adjacent messages blur together.

Therefore, an agreed-upon protocol format is needed on top of TCP so the two sides can frame and parse the data, and HTTP was designed for exactly this. For details, see my earlier article "Since there is HTTP protocol, why do we need RPC?"

For example, if I want to see the details of a particular product, the front end sends an HTTP request carrying the product ID, and the back end returns an HTTP response containing the product's price, store name, shipping address, and so on.

Get product details by id
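To make this concrete, a hypothetical request/response pair might look roughly like this (the path, host, and JSON fields are made up purely for illustration):

 GET /product/detail?id=10086 HTTP/1.1
Host: shop.example.com
Accept: application/json

HTTP/1.1 200 OK
Content-Type: application/json

{"id": 10086, "price": 59.9, "store": "xiaobai's shop", "ships_from": "Shenzhen"}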

In this way, on the surface, we are browsing various web pages, but in fact, multiple HTTP messages are being sent and received behind the scenes.

Users browse products online

But here comes the problem. The above is the normal case; what about abnormal ones? For example, suppose the data the front end sends is not a product ID at all but a picture. The back-end server obviously cannot give a normal response. So a set of HTTP status codes is needed to indicate whether the request-response process went normally, which in turn drives how the browser behaves.

For example, if everything is normal, the server returns a 200 status code, and the front end can safely use the response data it receives. If the server finds that what the client sent is problematic, it responds with a 4xx status code, meaning a client error. The xx in 4xx is further subdivided by error type: 401 means the request lacks valid authentication, and 404 means the client asked for a page that does not exist at all. Conversely, if the problem lies on the server side, it returns a 5xx status code.

The difference between 4xx and 5xx
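As a rough sketch of how a back end picks these codes, here is a hypothetical handler written with Go's standard net/http package (the route and data are invented for illustration):

 package main

import (
    "fmt"
    "net/http"
)

// products stands in for a real data store.
var products = map[string]string{"10086": "mechanical keyboard"}

func productHandler(w http.ResponseWriter, r *http.Request) {
    id := r.URL.Query().Get("id")
    if id == "" {
        // The client sent something unusable: a 4xx (client error).
        http.Error(w, "missing product id", http.StatusBadRequest) // 400
        return
    }
    name, ok := products[id]
    if !ok {
        // The requested resource does not exist: 404.
        http.Error(w, "product not found", http.StatusNotFound) // 404
        return
    }
    // Everything is fine: 200 plus the response body.
    w.WriteHeader(http.StatusOK)
    fmt.Fprintf(w, "product %s: %s", id, name)
}

func main() {
    http.HandleFunc("/product", productHandler)
    http.ListenAndServe(":8080", nil)
}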

But here comes the problem.

If there is a problem on the server side, and in serious cases the server process crashes outright, how can it possibly return a status code to you?

Right: in that case the server cannot return any status code to the client. So, in general, gateway-related 5xx codes like 502 are not actually returned to the client by the server itself.

They are returned by a gateway, most commonly nginx.

The role of nginx

Back to front-end/back-end data exchange: with few users, the back end handles requests with ease. But as the number of users grows, a single back-end server runs short of CPU or memory. The fix is simple: add a few more identical servers and spread the front-end requests evenly across them, which raises the overall processing capacity.

But to achieve this, the front end would have to know exactly which servers exist on the back end and establish a TCP connection with each of them.

Establish connections between the front end and multiple servers

It's not impossible, but it's troublesome.

It would be much nicer if there were a middle layer between them: the client only needs to connect to the middle layer, and the middle layer then establishes connections with the servers.

Therefore, this middle layer becomes a proxy for these servers. The client will go to the proxy for everything. It just sends its request, and the proxy will find a server to complete the response. Throughout the whole process, the client only knows that its request has been handled by the proxy, but the client does not know which server the proxy found to complete it, nor does it need to know.

A proxy that hides the specific servers from the client like this is called a reverse proxy.

Reverse Proxy

Conversely, a proxy that hides the specific clients from the server is what we call a forward proxy.

The role of this middle layer is generally played by gateways such as nginx.

In addition, since the servers behind it may have different specs, say some with 4 cores and 8 GB of memory and others with 2 cores and 4 GB, nginx can assign them different weights and forward more requests to the servers with higher weights, implementing load-balancing strategies this way.

nginx returns 5xx status code

With nginx as the middle layer, instead of the client connecting directly to the server, the client connects to nginx and nginx connects to the server: one TCP connection becomes two.

So when something goes wrong on the server, the TCP connection between nginx and the server cannot produce a normal response, and nginx then returns a 5xx error code to the client. In other words, the 5xx error is generated by nginx and returned to the client; the server itself has no 5xx entry in its logs. That is exactly the scene from the beginning of the article: the upstream caller received a 502 from my service, but I could not find any trace of it in my service's own log.

Common causes of 502

The official definition of the 502 status code in RFC 7231 is:

 502 Bad Gateway
The 502 (Bad Gateway) status code indicates that the server, while acting as a gateway or proxy, received an invalid response from an inbound server it accessed while attempting to fulfill the request.

Did that explanation actually help?

For most programmers it raises more questions than it answers. For example, what exactly does the "invalid response" mentioned above mean?

Let me explain. It means that the 502 is issued by the gateway or proxy (nginx): the gateway forwarded the client's request to the server, but the server sent back an invalid response. Here, "invalid response" generally refers to a TCP RST message, or a FIN from the four-way connection close arriving too early.

Most readers are probably familiar with the four-way close, so let's skip it and focus on what an RST message is.

What is RST?

We all know that TCP normally tears down a connection with the four-way close, the graceful way to say goodbye under normal circumstances.

But when things go wrong, either side may be in no state to complete that handshake at all, so a mechanism is needed to forcibly close the connection.

That is what RST is for: it abnormally terminates a connection. RST is a flag bit in the TCP header; when a peer receives a packet with this flag set, the connection is closed immediately. The side receiving the RST typically sees a "connection reset" or "connection refused" error at the application layer.

TCP header RST bit
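A quick way to see such an error at the application layer is to dial a port that nothing is listening on: the kernel answers the SYN with an RST and the dial fails. A small Go sketch (the address is arbitrary, chosen only so that nothing is listening there):

 package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    // Assuming nothing listens on 127.0.0.1:9999, the kernel replies with RST
    // and Dial returns an error like "connect: connection refused".
    _, err := net.DialTimeout("tcp", "127.0.0.1:9999", 2*time.Second)
    if err != nil {
        fmt.Println("dial failed:", err)
    }
}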

There are two common reasons for an RST message to be sent.

The server disconnected prematurely

There is a TCP connection between nginx and the server. When nginx forwards the client request to the server, the two should maintain this connection until the server returns the result normally and then disconnects.

However, if the server disconnects too early while nginx is still sending data on that connection, nginx will receive either an RST from the server's kernel or a FIN from the four-way close, forcing the connection on the nginx side to end.

There are two common reasons for premature disconnection.

The first is that a timeout configured on the server is too short. Whatever the programming language, there is usually a ready-made HTTP library, and the server side usually exposes several timeout parameters. For example, Go's HTTP server framework has a write timeout (WriteTimeout). If it is set to 2s, the server must finish processing the request and write the response within 2 seconds of receiving it; if it cannot, the connection is closed.

So if your handler takes 5 seconds but WriteTimeout is only 2 seconds, the HTTP framework closes the connection before the response is written. nginx then receives a FIN from the four-way close (some frameworks send an RST instead), and the client ends up with a 502.

If you run into this, simply increase WriteTimeout.

The relationship between FIN and 502
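A minimal Go sketch of the problem, assuming a handler that needs 5 seconds while WriteTimeout is only 2 seconds (the port, path, and timings are illustrative):

 package main

import (
    "net/http"
    "time"
)

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/slow", func(w http.ResponseWriter, r *http.Request) {
        time.Sleep(5 * time.Second) // handler takes longer than WriteTimeout
        w.Write([]byte("done"))     // by now the write deadline has passed
    })

    srv := &http.Server{
        Addr:         ":8080",
        Handler:      mux,
        WriteTimeout: 2 * time.Second, // too short for the 5s handler above
    }
    // Raising WriteTimeout above the slowest expected handler avoids the
    // premature close that makes nginx report a 502.
    srv.ListenAndServe()
}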

The second reason, and the most common reason for the 502 status code, is that the server application process crashed.

When the server application crashes, no process is listening on that port anymore. If you then send data to the now-nonexistent port, the server's Linux kernel protocol stack responds with an RST packet, and nginx again returns a 502 to the client.

RST and 502

This situation is most common during the development process.

Nowadays most deployments automatically restart a service that has died, so the first thing to determine is whether the service ever crashed and was restarted.

If you have monitored the CPU or memory of the server, you can check whether the CPU or memory monitoring graph has a sudden drop. If so, it is likely that your server application has crashed.

CPU suddenly plummeted

In addition, you can also use the following command to see when the process was last started.

 ps -olstart {pid}

For example, if the process ID I want to see is 13515, the command needs to be as follows.

 # ps -olstart 13515
STARTED
Wed Aug 31 14:28:53 2022

You can see that the last startup time was August 31st. If this time is different from the operation time you remember, it means that the process may have crashed and then been restarted.

When you run into this kind of problem, the key is to find out why the process crashed. There are many possible causes, for example dereferencing an uninitialized pointer, or out-of-bounds memory access (the array arr only has length 2, but the code reads arr[3]).

This is almost always caused by a logic bug in the code. When a program crashes, it usually leaves a stack trace, which you can use to locate and fix the problem. For example, the picture below shows a Go error stack; other languages look similar.

Error stack
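As a toy reproduction of the out-of-bounds case above, the following program panics at runtime with "index out of range [3] with length 2" and prints a goroutine stack trace pointing at the offending line:

 package main

import "fmt"

func main() {
    arr := []int{1, 2} // slice of length 2
    i := 3
    // arr[i] is out of bounds; the runtime panics and dumps a stack trace.
    fmt.Println(arr[i])
}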

When no stack trace is printed

But in some cases, no stack trace is left behind at all.

For example, a memory leak may cause a process to occupy more and more memory, eventually exceeding the server's maximum memory limit and triggering OOM (out of memory), and the process is directly killed by the operating system.

There are even subtler cases where the code itself quietly exits the process. For example, Go's log package has a method called log.Fatalln(), which calls os.Exit() right after printing the log line and terminates the process on the spot. Newcomers who have not read the source can easily be caught out by this.

After printing, exit the process
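A toy reproduction: log.Fatalln prints the message and then calls os.Exit(1), so deferred cleanup and everything after it never runs.

 package main

import (
    "fmt"
    "log"
)

func main() {
    defer fmt.Println("cleanup")        // never runs: os.Exit skips deferred calls
    log.Fatalln("something went wrong") // prints, then exits with status 1
    fmt.Println("unreachable")          // never reached, the process has already exited
}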

If you are sure that your service has never crashed, then continue reading.

The gateway sent the request to a non-existent IP

Nginx proxies multiple back-end servers through its configuration file, which usually lives in /etc/nginx/nginx.conf.

Open it and you may see something like the following.

 upstream xiaobaidebug.top {
    server 10.14.12.19:9235 weight=2;
    server 10.14.16.13:8145 weight=5;
    server 10.14.12.133:9702 weight=8;
    server 10.14.11.15:7035 weight=10;
}

This configuration means that when a client accesses the xiaobaidebug.top domain, nginx forwards the request to one of the four server IPs listed. Each IP has a weight next to it: the higher the weight, the more requests get forwarded to it.
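For completeness, such an upstream block is normally paired with a server block that proxies incoming requests to it. A rough sketch (only the upstream name comes from the config above; the rest is illustrative):

 server {
    listen 80;
    server_name xiaobaidebug.top;

    location / {
        # Forward the client's request to one of the upstream servers,
        # chosen according to their weights.
        proxy_pass http://xiaobaidebug.top;
    }
}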

As you can see, nginx has quite rich configuration capabilities. Note, however, that these files have to be edited by hand, which is fine when you have only a few servers that rarely change.

But in today's cloud-native era, many companies have their own cloud platforms and naturally move their services onto them. In that world, every release may deploy the service onto a new machine, and its IP changes with it. Manually editing the nginx configuration on every release is clearly unrealistic.

If the service can actively tell nginx its own IP when the service starts, and then nginx generates such a configuration and reloads it, things will be much simpler.

To implement this kind of service registration, many companies build custom extensions on top of nginx.

However, if the service registration feature misbehaves, for example the new instance fails to register after startup while the old instance has already been destroyed, nginx will keep sending requests to the old instance's IP. Since that machine no longer runs the service, its kernel responds with an RST, and after receiving the RST, nginx replies 502 to the client.

The instance has been destroyed but the configuration has not deleted the IP

It is not difficult to troubleshoot this problem.

At this time, you can check whether there are relevant logs printed on the nginx side to see whether the forwarded IP port meets expectations.
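If you manage the nginx configuration yourself, one way to make this easy (a sketch, not necessarily how your gateway is set up) is to record the upstream address and status in the access log, then grep for the 502s:

 log_format upstream_info '$remote_addr -> $upstream_addr '
                         'status=$status upstream_status=$upstream_status '
                         'request="$request"';

access_log /var/log/nginx/access.log upstream_info;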

If it does not meet your expectations, you can find the colleague who worked on this basic component and have a friendly exchange with him.

Summary

The HTTP status code is used to indicate the status of the response result, where 200 is a normal response, 4xx is a client error, and 5xx is a server error.

Adding nginx between the client and the server can play the role of reverse proxy and load balancing. The client only requests data from nginx and does not care which server handles the request.

If the backend server application crashes, nginx will receive the RST message returned by the server when accessing the server, and then return a 502 error to the client. 502 is not sent by the server application, but by nginx. Therefore, when 502 occurs, the backend server may not have relevant 502 logs, and this 502 log can only be seen on the nginx side.

When you see a 502, first use your monitoring to check whether the server application crashed and restarted. If it did, look for a crash stack in the logs; if there is no stack, consider whether the process was killed by OOM or exited deliberately for some other reason. If the process never crashed, check the nginx logs to see whether requests were forwarded to a stale IP and port.
