Illustrated | A deep dive into a stumbling block on the road to high-performance network development: synchronous blocking network IO


This article is reprinted from the WeChat public account "Kaida Neigong Xiuxian", written by Zhang Yanfei (allen). To reprint this article, please contact the "Kaida Neigong Xiuxian" public account.

In the network development model, there is a method that is very easy for developers to use, that is, synchronous blocking network IO (usually called BIO in Java).

For example, if we want to request a piece of data from the server, then a code demo in C language might look like this:

    int main()
    {
        int sk = socket(AF_INET, SOCK_STREAM, 0);
        connect(sk, ...);
        recv(sk, ...);
    }
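For readers who want to run it, here is a slightly fleshed-out sketch of the same blocking client. It is my own illustration rather than code from the original article; the address 127.0.0.1 and port 8080 are made up, and error handling is omitted for brevity.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main()
    {
        int sk = socket(AF_INET, SOCK_STREAM, 0);

        struct sockaddr_in srv;
        memset(&srv, 0, sizeof(srv));
        srv.sin_family = AF_INET;
        srv.sin_port = htons(8080);                      /* illustrative port */
        inet_pton(AF_INET, "127.0.0.1", &srv.sin_addr);  /* illustrative address */

        /* connect() blocks until the three-way handshake completes */
        connect(sk, (struct sockaddr *)&srv, sizeof(srv));

        /* recv() blocks the whole process until data arrives in the
           socket's receive queue -- this is the path the article dissects */
        char buf[4096];
        ssize_t n = recv(sk, buf, sizeof(buf), 0);
        if (n > 0)
            printf("received %zd bytes\n", n);

        close(sk);
        return 0;
    }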

However, in high-concurrency server development, the performance of this network IO is extremely poor.

1. The process is likely to be blocked during recv, resulting in a process switch

2. When the connection data is ready, the process will be awakened again, and it is another process switch

3. A process can only wait on one connection at a time. If there are many concurrent connections, many processes are required (see the sketch after this list).
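To make point 3 concrete, the sketch below shows roughly what a blocking, process-per-connection server looks like. It is my own illustration, not code from the article; port 8080 is made up and error handling is omitted. Every accepted connection gets its own process, purely so that one connection's blocked recv does not stall all the others.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <signal.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main()
    {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(8080);          /* illustrative port */
        bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
        listen(lfd, 128);

        signal(SIGCHLD, SIG_IGN);   /* let the kernel reap exited children */

        for (;;) {
            int cfd = accept(lfd, NULL, NULL);       /* blocks until a client connects */
            if (fork() == 0) {                       /* one process per connection */
                close(lfd);
                char buf[4096];
                ssize_t n;
                /* each child spends most of its life blocked in recv() */
                while ((n = recv(cfd, buf, sizeof(buf), 0)) > 0)
                    send(cfd, buf, n, 0);            /* echo the data back */
                close(cfd);
                exit(0);
            }
            close(cfd);   /* parent immediately goes back to accept() */
        }
    }

With ten thousand concurrent connections this means ten thousand processes, each asleep in recv most of the time, which is exactly the scaling problem described above.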

To sum it up in one sentence: synchronous blocking network IO is a stumbling block on the road to high-performance network development! As the saying goes, know yourself and know your enemy, and you will win every battle. So today we will not talk about optimization; instead, we will deeply analyze the internal implementation of synchronous blocking network IO.

Although there are only two or three lines of code in the demo above, the user process and the kernel actually do a lot of work together. First, the user process issues the call to create a socket and switches into kernel state to initialize the kernel objects. Later, when Linux receives a data packet, it is processed in hard interrupt context and then by the ksoftirqd kernel thread; when ksoftirqd finishes processing, it notifies the relevant user process.

From the creation of a socket by a user process to the arrival of a network packet at the network card and its receipt by the user process, the overall flow chart is as follows:

Today we will use diagrams and source code analysis to break down each of the above steps in detail and see how they are implemented in the kernel. After reading this article, you will deeply understand why synchronous blocking network IO performs so poorly!

1. Create a socket

When the socket function call from the demo at the beginning executes, the kernel creates a series of socket-related kernel objects (yes, more than one). Their relationship is shown in the figure. Of course, these objects are more complicated than the figure shows; the figure only includes the parts relevant to today's topic.

Let's look through the source code to see how the above structure is created.

    //file: net/socket.c
    SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
    {
        ......
        retval = sock_create(family, type, protocol, &sock);
    }

sock_create is the main place to create a socket, and sock_create calls __sock_create.

    //file: net/socket.c
    int __sock_create(struct net *net, int family, int type, int protocol,
                      struct socket **res, int kern)
    {
        struct socket *sock;
        const struct net_proto_family *pf;

        ......

        //Allocate the socket object
        sock = sock_alloc();

        //Get the operation table of the protocol family
        pf = rcu_dereference(net_families[family]);

        //Call the creation function of the protocol family; for AF_INET it is inet_create
        err = pf->create(net, sock, protocol, kern);
    }

In __sock_create, sock_alloc is called first to allocate a struct socket object. Then the protocol family's operation function table is obtained and its create method is called. For the AF_INET protocol family, the inet_create method is executed.

    //file: net/ipv4/af_inet.c
    static int inet_create(struct net *net, struct socket *sock, int protocol,
                           int kern)
    {
        struct sock *sk;

        //Find the matching protocol. For TCP SOCK_STREAM we get
        // static struct inet_protosw inetsw_array[] =
        // {
        //     {
        //         .type     = SOCK_STREAM,
        //         .protocol = IPPROTO_TCP,
        //         .prot     = &tcp_prot,
        //         .ops      = &inet_stream_ops,
        //         .no_check = 0,
        //         .flags    = INET_PROTOSW_PERMANENT |
        //                     INET_PROTOSW_ICSK,
        //     },
        // }
        list_for_each_entry_rcu(answer, &inetsw[sock->type], list) {
            ......
        }

        //Assign inet_stream_ops to socket->ops
        sock->ops = answer->ops;

        //Get tcp_prot
        answer_prot = answer->prot;

        //Allocate the sock object and assign tcp_prot to sock->sk_prot
        sk = sk_alloc(net, PF_INET, GFP_KERNEL, answer_prot);

        //Initialize the sock object
        sock_init_data(sock, sk);
    }

In inet_create, based on the type SOCK_STREAM, the TCP-specific operation sets inet_stream_ops and tcp_prot are looked up and assigned to socket->ops and sock->sk_prot respectively.

Further down we reach sock_init_data. In this method, the sk_data_ready function pointer on the sock is initialized to the default sock_def_readable().

    //file: net/core/sock.c
    void sock_init_data(struct socket *sock, struct sock *sk)
    {
        sk->sk_data_ready   = sock_def_readable;
        sk->sk_write_space  = sock_def_write_space;
        sk->sk_error_report = sock_def_error_report;
    }

When a data packet is received in the soft interrupt, the process waiting on the sock is awakened by calling the sk_data_ready function pointer (which, as set above, points to sock_def_readable()). We will come back to this when introducing soft interrupts; for now, just remember it.

At this point, a TCP object, more precisely a SOCK_STREAM object under the AF_INET protocol family, has been created. This consumes the overhead of a socket system call.

2. Waiting to receive messages

Next, let's look at the underlying implementation that the recv function depends on. First, tracing with the strace command shows that the C library function recv ends up issuing the recvfrom system call.
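If you want to reproduce that observation yourself, a minimal invocation might look like the following (assuming the demo above was compiled into a binary named ./client, which is just an illustrative name):

    # print every network-related system call the demo makes
    strace -f -e trace=network ./client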

After entering the system call, the user process enters the kernel state, executes a series of kernel protocol layer functions, and then checks whether there is data in the receiving queue of the socket object. If not, it adds itself to the waiting queue corresponding to the socket. Finally, the CPU is released, and the operating system will select the next ready process to execute. The whole flow chart is as follows:

After reading the whole flowchart, let's look at the source code in more detail. The focus of today's study is how recvfrom blocks its own process in the end (if we do not use the O_NONBLOCK flag).

    //file: net/socket.c
    SYSCALL_DEFINE6(recvfrom, int, fd, void __user *, ubuf, size_t, size,
                    unsigned int, flags, struct sockaddr __user *, addr,
                    int __user *, addr_len)
    {
        struct socket *sock;

        //Find the socket object based on the fd passed in by the user
        sock = sockfd_lookup_light(fd, &err, &fput_needed);
        ......
        err = sock_recvmsg(sock, &msg, size, flags);
        ......
    }

sock_recvmsg => __sock_recvmsg => __sock_recvmsg_nosec

    static inline int __sock_recvmsg_nosec(struct kiocb *iocb, struct socket *sock,
                                           struct msghdr *msg, size_t size, int flags)
    {
        ......
        return sock->ops->recvmsg(iocb, sock, msg, size, flags);
    }

Call recvmsg in the socket object ops. Recall the socket object diagram above. You can see from the diagram that recvmsg points to the inet_recvmsg method.

    //file: net/ipv4/af_inet.c
    int inet_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
                     size_t size, int flags)
    {
        ...

        err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT,
                                   flags & ~MSG_DONTWAIT, &addr_len);

Here we encounter another function pointer, this time the recvmsg method under sk_prot in the sock object. As before, we can conclude from the diagram that this recvmsg corresponds to the tcp_recvmsg method.

    //file: net/ipv4/tcp.c
    int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
                    size_t len, int nonblock, int flags, int *addr_len)
    {
        int copied = 0;
        ...
        do {
            //Traverse the receive queue to collect data
            skb_queue_walk(&sk->sk_receive_queue, skb) {
                ...
            }
            ...

            if (copied >= target) {
                release_sock(sk);
                lock_sock(sk);
            } else {
                //Not enough data received; call sk_wait_data to block the current process
                sk_wait_data(sk, &timeo);
            }
            ...
        } while (len > 0);
    }

Finally we see what we want to see. skb_queue_walk is accessing the receive queue under the sock object.

If no data is received, or not enough data is received, sk_wait_data is called to block the current process.

    //file: net/core/sock.c
    int sk_wait_data(struct sock *sk, long *timeo)
    {
        //Define a wait queue item associated with the current process (current)
        DEFINE_WAIT(wait);

        //Call sk_sleep to get the wait queue head of the sock object,
        //prepare to wait, and set the process state to TASK_INTERRUPTIBLE
        prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
        set_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);

        //Give up the CPU via sk_wait_event (which calls schedule_timeout) and sleep
        rc = sk_wait_event(sk, timeo, !skb_queue_empty(&sk->sk_receive_queue));
        ...

Let's take a closer look at how sk_wait_data blocks the current process.

First, under the DEFINE_WAIT macro, a wait queue item wait is defined. On this new wait queue item, the callback function autoremove_wake_function is registered, and the current process descriptor current is associated with its .private member.

    //file: include/linux/wait.h
    #define DEFINE_WAIT(name) DEFINE_WAIT_FUNC(name, autoremove_wake_function)

    #define DEFINE_WAIT_FUNC(name, function)                    \
        wait_queue_t name = {                                   \
            .private   = current,                               \
            .func      = function,                              \
            .task_list = LIST_HEAD_INIT((name).task_list),      \
        }

Then, sk_sleep is called in sk_wait_data to obtain the wait queue list head wait_queue_head_t under the sock object. The source code of sk_sleep is as follows:

    //file: include/net/sock.h
    static inline wait_queue_head_t *sk_sleep(struct sock *sk)
    {
        BUILD_BUG_ON(offsetof(struct socket_wq, wait) != 0);
        return &rcu_dereference_raw(sk->sk_wq)->wait;
    }

Then call prepare_to_wait to insert the newly defined wait queue item wait into the wait queue of the sock object.

    //file: kernel/wait.c
    void
    prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
    {
        unsigned long flags;

        wait->flags &= ~WQ_FLAG_EXCLUSIVE;
        spin_lock_irqsave(&q->lock, flags);
        if (list_empty(&wait->task_list))
            __add_wait_queue(q, wait);
        set_current_state(state);
        spin_unlock_irqrestore(&q->lock, flags);
    }

In this way, when the kernel later receives the data and the ready event fires, it can find the wait queue item on the socket's wait queue, and through it the callback function and the process that is waiting for the socket to become ready.

Finally, sk_wait_event is called to give up the CPU, and the process goes to sleep, which costs one process context switch.

In the next section we will see how the process is woken up.

3. Soft interrupt module

Next, let's change our perspective and look at the soft interrupt responsible for receiving and processing data packets. I won't go into details about how the network packet is received by the network card and finally handed over to the soft interrupt for processing. If you are interested, please read the previous article "Illustrated Linux Network Packet Receiving Process". Today we will start directly from the TCP protocol receiving function tcp_v4_rcv.

When the soft interrupt (that is, the ksoftirqd kernel thread in Linux) receives a data packet and it turns out to be a TCP packet, the tcp_v4_rcv function is executed. If the socket is in the ESTABLISHED state, the data is eventually extracted and placed into the receive queue of the corresponding socket, and then sk_data_ready is called to wake up the user process.

Let's look at the code in more detail:

    //file: net/ipv4/tcp_ipv4.c
    int tcp_v4_rcv(struct sk_buff *skb)
    {
        ......
        th = tcp_hdr(skb);  //Get the TCP header
        iph = ip_hdr(skb);  //Get the IP header

        //Find the corresponding socket based on the IP and port information in the packet header
        sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
        ......

        //The socket is not locked by a user process
        if (!sock_owned_by_user(sk)) {
            {
                if (!tcp_prequeue(sk, skb))
                    ret = tcp_v4_do_rcv(sk, skb);
            }
        }
    }

In tcp_v4_rcv, the corresponding local socket is first looked up based on the source and dest information in the header of the received network packet. Once it is found, we go straight into the main receive function tcp_v4_do_rcv.

    //file: net/ipv4/tcp_ipv4.c
    int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
    {
        if (sk->sk_state == TCP_ESTABLISHED) {

            //Handle data in the connected state
            if (tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len)) {
                rsk = sk;
                goto reset;
            }
            return 0;
        }

        //Handle packets in other, non-ESTABLISHED states
        ......
    }

We assume the socket being processed is in the ESTABLISHED state, so the packet enters the tcp_rcv_established function for processing.

    //file: net/ipv4/tcp_input.c
    int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
                            const struct tcphdr *th, unsigned int len)
    {
        ......

        //Place the received data into the queue
        eaten = tcp_queue_rcv(sk, skb, tcp_header_len,
                              &fragstolen);

        //Data is ready; wake up the process blocked on the socket
        sk->sk_data_ready(sk, 0);

In tcp_rcv_established, the received data is placed on the socket's receive queue by calling the tcp_queue_rcv function, as shown in the following source code:

    //file: net/ipv4/tcp_input.c
    static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen,
                                          bool *fragstolen)
    {
        //Append the received data to the tail of the socket's receive queue
        if (!eaten) {
            __skb_queue_tail(&sk->sk_receive_queue, skb);
            skb_set_owner_r(skb, sk);
        }
        return eaten;
    }

After tcp_queue_rcv finishes receiving the data, sk_data_ready is called to wake up the user process waiting on the socket. This is again a function pointer. Recall the sock_init_data function executed during socket creation above: there, sk_data_ready was set to sock_def_readable (you can press ctrl + f to search the text above). It is the default data-ready handler.

    //file: net/core/sock.c
    static void sock_def_readable(struct sock *sk, int len)
    {
        struct socket_wq *wq;

        rcu_read_lock();
        wq = rcu_dereference(sk->sk_wq);

        //There is a process waiting on this socket's wait queue
        if (wq_has_sleeper(wq))
            //Wake up the process on the wait queue
            wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI |
                                            POLLRDNORM | POLLRDBAND);
        sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
        rcu_read_unlock();
    }

In sock_def_readable, the wait queue under sock->sk_wq is accessed again. Recall that earlier, at the end of the recvfrom call, the wait queue item defined by DEFINE_WAIT(wait) and associated with the current process was added to this same wait queue under sock->sk_wq.

The next step is to call wake_up_interruptible_sync_poll to wake up the process that is blocked on the socket because of waiting for data.

    //file: include/linux/wait.h
    #define wake_up_interruptible_sync_poll(x, m)                       \
        __wake_up_sync_key((x), TASK_INTERRUPTIBLE, 1, (void *) (m))

    //file: kernel/sched/core.c
    void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode,
                            int nr_exclusive, void *key)
    {
        unsigned long flags;
        int wake_flags = WF_SYNC;

        if (unlikely(!q))
            return;

        if (unlikely(!nr_exclusive))
            wake_flags = 0;

        spin_lock_irqsave(&q->lock, flags);
        __wake_up_common(q, mode, nr_exclusive, wake_flags, key);
        spin_unlock_irqrestore(&q->lock, flags);
    }

__wake_up_common performs the actual wakeup. Note that the nr_exclusive parameter passed in this call is 1, which means that even if multiple processes are blocked on the same socket, only one of them will be woken up. Its purpose is to avoid the thundering herd problem.

    //file: kernel/sched/core.c
    static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
                                 int nr_exclusive, int wake_flags, void *key)
    {
        wait_queue_t *curr, *next;

        list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
            unsigned flags = curr->flags;

            if (curr->func(curr, mode, wake_flags, key) &&
                (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
                break;
        }
    }

In __wake_up_common, a wait queue item curr is picked from the list, and its curr->func is called. Recall that when the recv function was executed earlier, DEFINE_WAIT() defined the wait queue item and set curr->func to autoremove_wake_function.

    //file: include/linux/wait.h
    #define DEFINE_WAIT(name) DEFINE_WAIT_FUNC(name, autoremove_wake_function)

    #define DEFINE_WAIT_FUNC(name, function)                    \
        wait_queue_t name = {                                   \
            .private   = current,                               \
            .func      = function,                              \
            .task_list = LIST_HEAD_INIT((name).task_list),      \
        }

In autoremove_wake_function, default_wake_function is called.

    //file: kernel/sched/core.c
    int default_wake_function(wait_queue_t *curr, unsigned mode, int wake_flags,
                              void *key)
    {
        return try_to_wake_up(curr->private, mode, wake_flags);
    }

The task_struct passed to try_to_wake_up is curr->private, i.e. the process that was blocked waiting on the socket. When this function runs, that process is placed back on the run queue, which incurs another process context switch.

Summary

OK, let's summarize the above process. The kernel's work of receiving a network packet and notifying the waiting process is split into two parts:

The first part is the process our own code runs in. The socket() call enters kernel state and creates the necessary kernel objects. The recv() call then enters kernel state again, checks the receive queue, and, when there is no data to process, blocks the current process and gives up the CPU.

The second part is the hard interrupt and soft interrupt context (the ksoftirqd kernel thread). There, after the packet has been processed, it is placed in the socket's receive queue. Then, through the socket kernel object, the process blocked in its wait queue is found and woken up.

Every time a process has to wait for data on a socket, it is taken off the CPU and another process is switched in; when the data is ready, the sleeping process is woken up again. That is two process context switches in total, and according to earlier tests each switch costs roughly 3-5 us (microseconds). As a rough back-of-the-envelope figure (not a measurement), a core serving 100,000 blocking receives per second would spend on the order of 0.6-1 second of CPU time per second on switching alone. For a network IO-intensive application, the CPU ends up doing a lot of useless work such as process switching.

This model is completely unusable in the server role, because in this simple model socket and process are one-to-one. Today a single machine needs to carry tens of thousands, even hundreds of thousands or millions of user connections. If we used the above method, we would have to create a process for every user connection. I believe you have never seen anyone actually do this in real server network programming.

If I were to give it a name, it would be single-channel non-multiplexing (a term coined by Fei Ge himself). So is there a more efficient network IO model? Of course there is, and that is the select, poll and epoll that you are familiar with. Next time, Fei Ge will start to disassemble the implementation source code of epoll, so stay tuned!

This model is still used in the client role, because your process may simply have to wait for MySQL to return its data before it can render the page and return it to the user; there is nothing else useful for it to do in the meantime.

Please note that I am talking about roles, not specific machines. For example, for your php/java/golang interface machine, when you receive user requests, you are in the server role. But when you request redis, you become the client role.

However, there are some well-encapsulated network frameworks such as Sogou Workflow, Golang's net package, etc., which have already abandoned this inefficient model in the role of network client!
