The Socket and TCP connection process you must know

This article explains what sockets do at each stage of the TCP connection process, in the hope of helping readers with no background in network programming understand what sockets are and what roles they play. If you find any errors, please point them out.

1. Background

1. The complete socket format is {protocol, src_addr, src_port, dest_addr, dest_port}.

This is often called the socket's five-tuple. The protocol specifies whether it is a TCP or UDP connection, and the rest specify the source address, source port, destination address, and destination port. But where do these contents come from?

2. The TCP protocol stack maintains two socket buffers: send buffer and recv buffer.

The data to be sent through the TCP connection is first copied into the send buffer; it may come from the app buffer of a user-space process or from a kernel buffer. The copy is performed by the send() function. Since write() can also be used to write the data, this process is also called writing data, and the send buffer is accordingly also called the write buffer. Note that send() is not inherently more efficient than write(): the difference is that send() accepts socket-specific flags, and with a flags argument of 0 it behaves exactly like write().

The final data flows out through the network card, so the data in the send buffer needs to be copied to the network card. Since one end is the memory and the other end is the network card device, DMA can be used directly to copy without the involvement of the CPU. In other words, the data in the send buffer is copied to the network card through DMA and transmitted through the network to the other end of the TCP connection: the receiving end.

When receiving data through a TCP connection, the data must first flow in through the network card, and then be copied to the recv buffer through DMA, and then the data is copied from the recv buffer to the app buffer of the user space process through the recv() function.

The general process: app buffer → send buffer → (DMA) → network card → network → network card → (DMA) → recv buffer → app buffer.


3. Two types of sockets: listening sockets and connected sockets.

The listening socket is created by the socket() function (typically after the service process has read its configuration file) and is then bound to the configured address and port with bind(). After that, the process/thread can listen on the port (strictly speaking, monitor the listening socket) via listen().

A connected socket is the socket returned by accept() after a TCP connection request has been received on the listening socket and the three-way handshake has completed. Through this connected socket, the process/thread can then communicate with the client over TCP.

To distinguish the two socket descriptors returned by socket() and accept(), people often use listenfd and connfd to denote the listening socket and the connected socket respectively. The convention is vivid, and it is used occasionally below.

The following will explain the functions of various functions. Analyzing these functions is also the process of connecting and disconnecting.

2. Analysis of the connection process

In outline, the server calls socket() → bind() → listen() → accept(), the client calls socket() → connect(), and the three-way handshake takes place between connect() and accept(). Each function is analyzed below.


2.1 socket() function

The socket() function creates a socket file descriptor, sockfd, for communication (in the words of the man page, "socket() creates an endpoint for communication and returns a descriptor"). This descriptor is later used as the binding target of the bind() function.

2.2 bind() function

The service program parses its configuration file, extracts the address and port it should listen on, and then uses bind() to bind the sockfd produced by socket() to that "addr:port" combination. A socket bound to an address and port can then serve as the listening target of listen().

The socket bound to the address and port now carries a source address and source port (source from the server's own point of view); together with the protocol type specified in the configuration file, three of the five elements of the five-tuple are fixed. That is:

{protocol, src_addr, src_port}
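As a concrete illustration, here is a minimal C sketch of the socket() + bind() steps; the wildcard address 0.0.0.0 and port 8080 are assumptions for the example, not values from the article.

```c
/* A minimal sketch of socket() + bind() on the server side.
 * The address 0.0.0.0 and port 8080 are illustrative assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int make_server_fd(void)
{
    /* socket() creates the endpoint; at this point the socket has
     * only the protocol part of the five-tuple: {TCP, -, -, -, -}. */
    int sockfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sockfd < 0) { perror("socket"); exit(1); }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY); /* src_addr: 0.0.0.0 */
    addr.sin_port = htons(8080);              /* src_port: 8080 */

    /* bind() fills in src_addr and src_port:
     * the socket is now {TCP, 0.0.0.0, 8080, -, -}. */
    if (bind(sockfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind"); exit(1);
    }
    return sockfd;
}
```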

However, it is common to see that some service programs can be configured to listen to multiple addresses and ports to achieve multiple instances. This is actually achieved by generating and binding multiple sockets through multiple socket()+bind() system calls.

2.3 listen() and connect() functions

As the name implies, the listen() function listens on the socket that bind() has already bound to addr+port. Once listen() is called, the socket moves from the CLOSED state to the LISTEN state, and from then on it offers a window for TCP connections from the outside world.

The connect() function initiates a connection request to a listening socket, i.e., it starts TCP's three-way handshake. It is therefore the connection requester (e.g., the client) that calls connect(). Before calling connect(), the initiator must of course create its own sockfd, and that socket is usually bound to a random source port. Since connect() targets a specific listening socket, it must be given the destination of the connection: the target address and port, i.e., the addr+port bound to the server's listening socket. The request also carries the initiator's own address and port, which from the server's point of view are the source address and source port of the connection. As a result, the sockets on both ends of the TCP connection now hold a complete five-tuple.
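A minimal client-side sketch of connect() follows; the server address 127.0.0.1:8080 is an illustrative assumption.

```c
/* A minimal client sketch: connect() starts the three-way handshake.
 * The target 127.0.0.1:8080 is an illustrative assumption. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int sockfd = socket(AF_INET, SOCK_STREAM, 0);
    if (sockfd < 0) { perror("socket"); exit(1); }

    struct sockaddr_in srv;
    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port = htons(8080);                     /* dest_port */
    inet_pton(AF_INET, "127.0.0.1", &srv.sin_addr); /* dest_addr */

    /* The kernel picks a random source port here, completing the
     * client's five-tuple {TCP, src_addr, src_port, 127.0.0.1, 8080}. */
    if (connect(sockfd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
        perror("connect"); exit(1);
    }
    return 0;
}
```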

2.3.1 In-depth analysis of listen()

Let's look at listen() in more detail. If multiple addresses + ports are listened on, that is, multiple sockets must be monitored, the listening process/thread will use select() or poll() to poll these sockets (epoll() can be used as well). In fact, these mechanisms are used even when only one socket is monitored; in that case select() or poll() simply watches a single socket descriptor.

Whether select() or poll() is used (epoll's different monitoring mechanism needs no further elaboration here), the listening process/thread (the listener) blocks in select()/poll() while it waits. When data (a SYN) is written to the monitored sockfd (i.e., into its recv buffer), the listener is woken up, copies the SYN data into the app buffer it manages in user space, processes it, and sends back SYN+ACK; that reply is in turn copied from the app buffer into the send buffer (via send()) and then to the network card for transmission. At this point a new entry for the connection is created in the incomplete connection queue and set to the SYN_RECV state. The listener then goes back to monitoring listenfd with select()/poll() until data arrives again.

If the newly arrived data is an ACK, it is copied into the app buffer for processing, and the corresponding entry is moved from the incomplete connection queue to the completed connection queue and set to the ESTABLISHED state. If the data is not an ACK, it must be a SYN, i.e., a new connection request, which is placed into the incomplete connection queue exactly as above. This is the listener's processing cycle for TCP connections.
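At the API level, a minimal sketch of a listener blocking in select() might look as follows; a single listenfd is shown, and with multiple listening sockets each would be added to the fd_set.

```c
/* A sketch of monitoring a listening socket with select(). The
 * listener blocks in select() until listenfd becomes readable,
 * i.e., the completed connection queue is non-empty. */
#include <stdio.h>
#include <sys/select.h>
#include <sys/socket.h>

void listener_loop(int listenfd)
{
    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(listenfd, &rfds);

        /* Block until listenfd is readable (or an error occurs). */
        if (select(listenfd + 1, &rfds, NULL, NULL, NULL) < 0) {
            perror("select");
            break;
        }
        if (FD_ISSET(listenfd, &rfds)) {
            int connfd = accept(listenfd, NULL, NULL);
            /* ... hand connfd off for processing ... */
        }
    }
}
```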

In other words, listen() maintains two queues: the incomplete connection queue and the completed connection queue. When the listener receives a SYN from a client and replies with SYN+ACK, it creates an entry for that client at the tail of the incomplete connection queue and sets its state to SYN_RECV; obviously the entry must contain the client's address and port information (possibly hashed, I'm not sure). When the server then receives the client's ACK, the listener determines, by analyzing the data, which entry in the incomplete connection queue the reply belongs to, moves that entry to the completed connection queue, and sets its state to ESTABLISHED.

When the incomplete connection queue is full, the listener blocks and stops receiving new connection requests, waiting via select()/poll() for the queues to become writable again. When the completed connection queue is full, the listener likewise stops receiving new connection requests, and the action of moving entries into the completed connection queue blocks as well. Before Linux 2.2, the backlog parameter of listen() set the maximum combined length of the two queues. Since Linux 2.2, it bounds only the completed connection queue, while /proc/sys/net/ipv4/tcp_max_syn_backlog bounds the incomplete queue. /proc/sys/net/core/somaxconn is a hard upper limit on the completed queue, defaulting to 128; if backlog exceeds somaxconn, it is silently truncated to that value.
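The backlog is simply the second argument of listen(); the value 128 in the sketch below is illustrative.

```c
/* A sketch of listen() with an explicit backlog; 128 is illustrative.
 * On Linux >= 2.2 the backlog caps the completed connection queue
 * (truncated to /proc/sys/net/core/somaxconn); the incomplete queue
 * is capped by /proc/sys/net/ipv4/tcp_max_syn_backlog. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>

void start_listening(int sockfd)
{
    if (listen(sockfd, 128) < 0) {   /* CLOSED -> LISTEN */
        perror("listen");
        exit(1);
    }
}
```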

Once a connection in the completed connection queue is accept()ed, the TCP connection is established, and that connection uses its own socket buffers to exchange data with the client. These buffers and the listening socket's buffers all store TCP receive/send data, but their purposes differ: the listening socket's buffers only ever hold the SYN and ACK data of the connection-establishment process, whereas the buffers of an established TCP connection hold the "real" data exchanged by the two ends, such as response data constructed by the server and HTTP request data sent by the client.

The Recv-Q and Send-Q columns of the netstat command show socket-buffer-related information. The following is the explanation from man netstat.

Recv-Q
    Established: The count of bytes not copied by the user program connected to this socket.
    Listening: Since Kernel 2.6.18 this column contains the current syn backlog.

Send-Q
    Established: The count of bytes not acknowledged by the remote host.
    Listening: Since Kernel 2.6.18 this column contains the maximum size of the syn backlog.

For a socket in the LISTEN state, Recv-Q indicates the current syn backlog, i.e., the number of connections currently in the completed queue, and Send-Q indicates the maximum syn backlog, i.e., the maximum number of connections the completed connection queue can hold.

For an established TCP connection, Recv-Q is the amount of data in the recv buffer that has not yet been copied out by the user process, and Send-Q is the amount of sent data that the remote host has not yet acknowledged. The distinction between established-connection sockets and listening sockets matters because the two kinds of socket use their buffers differently: for a listening socket the interesting quantity is queue length, while for an established connection it is the amount of received and sent data.

```
[root@xuexi ~]# netstat -tnl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
tcp6       0      0 :::80                   :::*                    LISTEN
tcp6       0      0 :::22                   :::*                    LISTEN
tcp6       0      0 ::1:25                  :::*                    LISTEN

[root@xuexi ~]# ss -tnl
State      Recv-Q Send-Q Local Address:Port     Peer Address:Port
LISTEN     0      128    *:22                   *:*
LISTEN     0      100    127.0.0.1:25           *:*
LISTEN     0      128    :::80                  :::*
LISTEN     0      128    :::22                  :::*
LISTEN     0      100    ::1:25                 :::*
```

Note that for sockets in the LISTEN state, the Send-Q column of netstat and the Send-Q column of ss show different values, because netstat does not report the maximum length of the completed queue. So when judging whether the queue still has room for new TCP connection requests, prefer ss over netstat.

2.3.2 Impact of SYN flood

In addition, if the listener receives no ACK from the client after sending SYN+ACK, it is woken by the timeout set in select()/poll() and retransmits the SYN+ACK, in case the earlier packet was lost somewhere in the vast network. This retransmission is where the problem lies. If the client forged its source address when calling connect(), the listener's SYN+ACK can never reach the other host, so no ACK ever arrives and the listener keeps retransmitting SYN+ACK. Each retransmission costs CPU: the listener is woken again and again by the select()/poll() timeout, and the data is copied into the send buffer again and again; the SYN+ACK in the send buffer must also be copied to the network card each time (a DMA copy, which needs no CPU). If the client is an attacker sending thousands or tens of thousands of SYNs continuously, the listener all but collapses and the network card becomes severely congested. This is the so-called SYN flood attack.

There are many ways to mitigate SYN flood: shrinking the maximum lengths of the two queues maintained by listen(), reducing the number of SYN+ACK retransmissions, increasing the retransmission interval, shortening the timeout for waiting for the ACK, enabling SYN cookies, and so on. However, no approach that merely tweaks TCP options strikes a good balance between protection and performance, so filtering out such packets before they ever reach the listener thread is extremely important.

2.4 accept() function

The accept() function takes the first entry from the completed connection queue (removing it from the queue) and generates, for that entry, a new socket descriptor for the subsequent connection, call it connfd. Through this connected socket, the working process/thread (the worker) can exchange data with the client, while the listening socket (sockfd) remains under the listener's watch.
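A minimal sketch of the accept() loop might look like this; the function name is made up for the example.

```c
/* A minimal sketch of the accept() loop. accept() dequeues the head
 * of the completed connection queue and returns a new connected
 * socket, connfd; listenfd itself stays in the LISTEN state. */
#include <stdio.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

void listener(int listenfd)
{
    for (;;) {
        struct sockaddr_in cli;
        socklen_t clilen = sizeof(cli);

        /* Blocks while the completed connection queue is empty. */
        int connfd = accept(listenfd, (struct sockaddr *)&cli, &clilen);
        if (connfd < 0) {
            perror("accept");
            continue;
        }

        /* ... hand connfd to a worker for send()/recv() ... */
        close(connfd);
    }
}
```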

For example, in httpd's prefork mode, each child process is both a listener and a worker. When a client initiates a connection request, one child process receives it and releases its hold on the listening socket so that other children may listen on it. After several rounds, a new connected socket is finally produced by accept(), and the child can concentrate on interacting with the client through it, though it may still block or sleep many times along the way due to various I/O waits. This is quite inefficient: considering only the stages from receiving the SYN to producing the connected socket, the child blocks over and over. The listening socket can of course be put into non-blocking I/O mode, but even then it must repeatedly check its status.

Consider the worker/event processing mode. Each child process uses a dedicated listening thread and N worker threads. The listening thread is responsible for listening and establishing new connection socket descriptors and putting them into the Apache socket queue. In this way, the listener and the worker are separated. During the listening process, the worker can still work freely. If we only look at the listening aspect, the performance of the worker/event mode is much higher than that of the prefork mode.

When the listener issues the accept() system call and the completed connection queue is empty, the listener blocks. The socket can instead be set to non-blocking mode, in which case accept() returns the EWOULDBLOCK or EAGAIN error when there is nothing to take. One can use select(), poll(), or epoll to wait for readable events on the completed connection queue, or set the socket to signal-driven I/O mode so that a new entry in the completed connection queue notifies the listener, which then copies the data to the app buffer and processes it with accept().
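A sketch of the non-blocking case, with helper names made up for the example:

```c
/* A sketch of accept() on a non-blocking listening socket. With an
 * empty completed connection queue, accept() fails with
 * EWOULDBLOCK/EAGAIN instead of blocking. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/socket.h>

/* Switch a descriptor to non-blocking mode (call once after listen()). */
void set_nonblocking(int fd)
{
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
}

/* Returns a connfd, or -1 when the completed queue is empty. */
int try_accept(int listenfd)
{
    int connfd = accept(listenfd, NULL, NULL);
    if (connfd < 0) {
        if (errno == EWOULDBLOCK || errno == EAGAIN)
            return -1;   /* nothing to accept yet: poll again later */
        perror("accept");
        return -1;
    }
    return connfd;
}
```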

We often hear the notions of synchronous connection and asynchronous connection. How do they differ? A synchronous connection means that from the moment the listener sees a client's SYN, it must see the connection through: establishing the connected socket and completing all data exchange with that client, and accepting no other client's connection request until this one is closed. Put more precisely, in synchronous handling the socket buffer and app buffer data must stay consistent. Synchronous handling usually means the listener and the worker are the same process, as in httpd's prefork model.

An asynchronous connection, by contrast, can receive and process other connection requests at any stage of connection establishment and data exchange. Asynchronous handling usually means the listener and the worker are different processes/threads. In httpd's worker model, for example, listener and workers are separated, yet the connection is still handled synchronously: after the listener receives a request and the connected socket is created, it is handed straight to a worker thread, which serves only that client until the connection is closed. In the event model, only when a worker thread handles a special connection (such as one in a keep-alive state) can it hand the connection back to the listener thread for safekeeping; ordinary connections are still handled essentially synchronously. The "asynchrony" of httpd's event model is therefore really pseudo-asynchrony. In plain terms: a synchronous connection means one process/thread handles one connection; an asynchronous connection means one process/thread handles multiple connections.

2.5 send() and recv() functions

The send() function copies data from the app buffer into the send buffer (the data may also be copied directly from a kernel buffer), and the recv() function copies data from the recv buffer into the app buffer. There is nothing wrong with using write() and read() instead; send()/recv() are simply socket-specific.

Both functions involve the socket buffers, and a call to send() or recv() must consider two conditions: whether the source buffer has data to copy, and whether the target buffer is full and cannot be written. If either condition fails, a process/thread calling send()/recv() blocks (assuming the socket uses the blocking I/O model). With the socket in non-blocking I/O mode, a send()/recv() call made while the buffer does not satisfy the condition returns the EWOULDBLOCK or EAGAIN error instead. Whether a buffer has data, or is too full to be written, can be monitored with select()/poll()/epoll on the corresponding descriptor (monitoring the socket descriptor that owns the buffer), and send()/recv() is then called once the condition holds. The socket can also be set to the signal-driven I/O or asynchronous I/O model, so that no wasted send()/recv() calls are made before the data is ready and copied.
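A sketch of blocking recv()/send() on a connected socket; the echo behavior and buffer size are illustrative assumptions.

```c
/* A sketch of blocking recv()/send() on a connected socket; the
 * echo behavior is just for illustration. */
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

void echo_once(int connfd)
{
    char buf[4096];

    /* recv() copies from the recv buffer into the app buffer (buf);
     * it blocks while the recv buffer is empty. */
    ssize_t n = recv(connfd, buf, sizeof(buf), 0);
    if (n <= 0)
        return;  /* 0: peer closed the connection; <0: error */

    /* send() copies from the app buffer into the send buffer;
     * it blocks while the send buffer is full. */
    if (send(connfd, buf, (size_t)n, 0) < 0)
        perror("send");
}
```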

2.6 close(), shutdown() functions

The generic close() function closes a file descriptor, including connection-oriented network socket descriptors. When close() is called, the remaining data in the send buffer is still sent out. However, close() only decrements the socket's reference count by 1, much like rm, which removes only one hard link when "deleting" a file. Only when every reference to the socket has been dropped is the socket descriptor really closed, and only then does the four-way teardown (the "four waves") begin. In a concurrent server where parent and child processes share a socket, calling close() in the child does not really close the socket, because the parent's copy is still open; if the parent never calls close(), the socket stays open and the four-way teardown is never entered.

The shutdown() function is dedicated to closing a network socket's connection. Unlike close(), which merely decrements a reference count, it cuts the socket's connection directly and thereby triggers the four-way teardown. Three modes of closing can be specified (see the sketch after this list):

1. Disable writing (SHUT_WR). No more data can be written into the send buffer; the data already in the send buffer is sent until it is drained.

2. Disable reading (SHUT_RD). Data can no longer be read from the recv buffer; whatever is already in the recv buffer can only be discarded.

3. Disable both reading and writing (SHUT_RDWR). Neither reading nor writing is possible: the data in the send buffer is sent until drained, but the data in the recv buffer is discarded.
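A sketch of the common half-close pattern, with the function name made up for the example:

```c
/* A sketch of a half-close with shutdown(): stop writing but keep
 * reading until the peer finishes (illustrative usage). */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

void half_close(int connfd)
{
    /* SHUT_WR: the data already in the send buffer is still sent,
     * then a FIN goes out; the read side stays open. */
    if (shutdown(connfd, SHUT_WR) < 0)
        perror("shutdown");

    /* ... drain the peer's remaining data with recv() here ... */

    close(connfd);  /* finally drop our reference to the descriptor */
}
```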

Whether via shutdown() or close(), when the teardown is actually initiated a FIN is sent first, and that FIN is what starts the four-way teardown.

3. Address/Port Reuse Technology

Under normal circumstances, one addr+port can be bound by only one socket: addr+port cannot be reused, and different sockets can only bind different addr+port combinations. For example, to start two sshd instances, the configuration files of the two instances must not specify the same addr+port. Likewise, when configuring web virtual hosts, unless they are name-based, no two virtual hosts may be configured with the same addr+port. Name-based virtual hosts can share the same addr+port because the HTTP request carries the host name (the Host header); such connection requests are still accepted through the same socket, after which the httpd worker process/thread dispatches the connection to the matching virtual host.

The above describes the normal case; the abnormal case is address reuse and port reuse, together known as socket reuse. Current Linux kernels provide the socket options SO_REUSEADDR for address reuse and SO_REUSEPORT for port reuse. With port reuse enabled, binding the socket no longer errors out. Moreover, after an instance binds two sockets to the same addr+port (more are possible; two is just the example here), two listening processes/threads can listen at the same time, and incoming client connections are distributed among them in round-robin fashion.
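A sketch of enabling SO_REUSEPORT before bind(); port 8080 and the function name are illustrative assumptions, and SO_REUSEPORT requires Linux 3.9 or later.

```c
/* A sketch of SO_REUSEPORT (Linux >= 3.9): several sockets bind the
 * same addr+port and the kernel balances incoming connections among
 * them. Port 8080 is an illustrative assumption. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>

int reuseport_listen_fd(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); exit(1); }

    /* Must be set before bind(), on every socket sharing the port. */
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on)) < 0) {
        perror("setsockopt(SO_REUSEPORT)");
        exit(1);
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind"); exit(1);
    }
    if (listen(fd, 128) < 0) { perror("listen"); exit(1); }
    return fd;
}
```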

From the viewpoint of a listening process/thread, each such reused socket is called a listener bucket; that is, each listening socket is a listening bucket.

Taking the worker or event model of httpd as an example, suppose there are currently 3 child processes, each of which has a listening thread and N worker threads.

Then, without address/port reuse, the listening threads compete for the right to listen: at any moment only one thread may listen on the listening socket (the right is obtained by acquiring a mutex). When that thread receives a request, it gives up the listening right, the other threads compete for it, and again only one wins. As shown in the following figure:


When address reuse and port reuse technologies are used, multiple sockets can be bound to the same addr+port. For example, in the figure below, when one more listening bucket is used, there are two sockets, so two listening threads can listen at the same time. When a listening thread receives a request, it gives up the qualification and allows other listening threads to compete for the qualification.


If one more socket is bound, the three listening threads do not need to give up their listening qualifications and can listen indefinitely. See the figure below.


This looks like a clear performance win: it reduces contention for the listening right (the mutex), avoids the "starvation problem", and makes listening more efficient, since load-balancing across sockets lowers the pressure on each listening thread. In practice, however, each listening thread consumes CPU while it listens. With only a single CPU core, reuse brings no benefit; performance actually drops because of switching between listening threads. So to use port reuse, consider whether each listening process/thread can be isolated on its own CPU core. In other words, whether to reuse, and how many sockets to reuse, should be decided in light of the number of CPU cores and whether processes are pinned to CPUs.
