Kubernetes network technology analysis: Pod communication based on routing mode

Preface

Pod-to-Pod communication within a cluster is one of the key scenarios that any Kubernetes network implementation must support.

Kubernetes defines a unified interface and protocol through CNI (the Container Network Interface), so that we can choose different network components and modes according to our needs. Common choices include Flannel's VXLAN or host-gw mode, and Calico's IPIP or BGP mode.

How do these network components achieve Pod-to-Pod communication, and what are the underlying technical principles? This article takes Calico's BGP communication model (a pure Layer 3 model based on routing) as an example and analyzes how it is implemented.

Since we are talking about container networks, we cannot avoid the Network Namespace. It is the core technology for implementing virtual networks (I will not go into details here; interested readers can read up on Linux Namespaces) and is widely used in container networking. A container created by Docker gets an independent Network Namespace, and all containers in a Kubernetes Pod share a single Network Namespace.

So how do Network Namespaces communicate with each other?

Today, we will build, step by step, multiple Network Namespaces that communicate across hosts using the routing mode.

Create a Network Namespace

1. Prepare a Linux host (node03 -- 100.100.198.250) and check whether the ip command is available (if not, install iproute2)

2. Execute the following command to create a new net-ns demo01

  1. ip netns add demo01 #Create demo01
  2.  
  3. ip netns list # Check the result. If the returned value contains `demo01(id:xxx)`, it means the creation is successful.

3. Check the network card resources under demo01 and enable the lo network card

  1. ip netns exec demo01 ip addr #`ip netns exec demo01 <command>` runs a command inside demo01. You can also enter demo01's network environment via `ip netns exec demo01 /bin/bash` and run commands directly; after finishing, execute `exit` to return to the host network space
  2.  
  3. ip netns exec demo01 ip link set lo up #Turn on the lo network card, which will automatically bind to 127.0.0.1  
  4.  
  5. #Through the first command, you can see that there is only one lo network card under this namespace, and the network card is in a closed state (you can directly execute `ip addr` in the host network space to observe the difference between the two)

demo01-lo network card has been started

At this point, a new Network Namespace demo01 has been created.

Currently, demo01 only has a local lo network card, so how can it communicate with the Host network space?

Configure network card pair and routing for demo01

To achieve communication between demo01 and the host network space, we can treat it as communication between two independent Network Namespaces. We therefore create a veth pair, place one end in the host network space and the other in demo01, and bring both interfaces up.

  1. ip link add vethhost01 type veth peer name vethdemo01 #Create a virtual network card pair ip link add <network card name> type veth peer name <paired network card name>
  2.  
  3. ip addr #Check the network card information and you can see that there are two new network cards under the host network space.

You can see the new network card pair created under the host network space

Assign one end of the NIC pair to demo01 and turn it on:

  1. ip link set vethdemo01 netns demo01 #Assign the virtual network card vethdemo01 to demo01 ip link set <network card name> netns <net-ns name>
  2.  
  3. ip link set vethhost01 up #Turn on the network card on the host network space
  4.  
  5. ip netns exec demo01 ip link set vethdemo01 up #Turn on the network card on demo01. Now both network cards are turned on and are in two network namespaces.

To achieve communication, you also need to give demo01's network card an IP address (the network card vethhost01 in the host network space does not need to be set with an IP address):

  1. ip netns exec demo01 ip addr add 10.0.1.2/24 dev vethdemo01 #Set the IP of vethdemo01 to 10.0.1.2 (the address can be chosen freely, as long as it is not in the host network's segment). The command format for adding an IP is `ip addr add <IP address/prefix length> dev <network card>`

The demo01-vethdemo01 network card IP has been configured

Can we achieve communication after setting the IP address? For example, can we ping successfully?

Verify

  • Execute ping 10.0.1.2 in the host network space. Result: no response, all packets are lost, and mtr shows the traffic flowing to the host's default gateway.
  • From demo01, ping 100.100.198.250 in the reverse direction. Result: connect: Network is unreachable

Why doesn't it work?

  1. The host routing table has no route for 10.0.1.2, so the packets are sent to the default gateway and lost.
  2. demo01 has no route for any destination outside its own 10.0.1.0/24 segment, hence `Network is unreachable`.

Conclusion: routes are missing.

Adding a bidirectional route

For a ping to succeed, traffic must flow in both directions, so routes pointing in both directions are required.

1. In the host network space, add a route for 10.0.1.2 that points the fixed IP 10.0.1.2 to the local network card vethhost01 (its peer is demo01's network card)

2. In demo01, add a route pointing back to 100.100.198.250: a default route whose egress device is demo01's own network card vethdemo01 (its peer, vethhost01, sits in the host network space)

  1. route add -host 10.0.1.2 dev vethhost01 #Add the route to demo01 in the host network space. The routing command format is `route <action - usually add or del> <destination type - net or host> <destination; for a segment, append /prefix length> <next-hop type: dev for a network card, gw for an IP address> <network card name or IP address>`. This route means: when the host receives a packet destined for 10.0.1.2, forward it to the local vethhost01 network card
  2.  
  3. ip netns exec demo01 route add default dev vethdemo01 #Add a default route in demo01. This route exists so that the next route can take effect.
  4.  
  5. ip netns exec demo01 route add -net 0.0.0.0 gw 100.100.198.250 #For demonstration purposes, this route treats the host network space as a router: the default next hop is 100.100.198.250, which forwards requests. This route means: when a packet matches no other routing rule, forward it to 100.100.198.250 by default

Verify again: bidirectional communication is now working.

Two-way ping success

At this point, a Network Namespace that can communicate with the Host network space has been created.

Create demo02 and make demo01 communicate with it

We can create another Network Namespace by referring to the steps in demo01.

  1. Name it demo02
  2. Set its IP address to 10.0.2.2
  3. Add bidirectional routes so that 100.100.198.250 and 10.0.2.2 can reach each other
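By analogy with the demo01 walkthrough above, the demo02 setup can be condensed into the following command sequence (a sketch of the steps, assuming the same host node03 and the interface names vethhost02/vethdemo02 used later in this article; all commands require root):

```shell
# Network configuration sketch: create demo02, mirroring the demo01 steps (run as root)
ip netns add demo02                                    # create the namespace
ip netns exec demo02 ip link set lo up                 # enable loopback
ip link add vethhost02 type veth peer name vethdemo02  # create the veth pair
ip link set vethdemo02 netns demo02                    # move one end into demo02
ip link set vethhost02 up                              # bring up the host-side end
ip netns exec demo02 ip link set vethdemo02 up         # bring up the demo02-side end
ip netns exec demo02 ip addr add 10.0.2.2/24 dev vethdemo02
route add -host 10.0.2.2 dev vethhost02                # host route to demo02
ip netns exec demo02 route add default dev vethdemo02  # demo02 default device route
ip netns exec demo02 route add -net 0.0.0.0 gw 100.100.198.250
```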

Current network model

After all of the above are implemented, verify the network connectivity:

  1. Ping between demo01/02 and host network space
  2. The expected results are all successful, indicating that demo02 has also been successfully created

Verify whether demo01<---->demo02 are interoperable:

  • Result: no response, all packets are lost. mtr shows the traffic flowing to the host's default gateway 100.100.198.250, after which the packets disappear.
  • The symptom is very similar to the earlier situation before the host had a route to demo01. However, the host network space does have routes to 10.0.1.2 and 10.0.2.2, and packets can travel from 100.100.198.250 to demo01/02.
    • The only difference is that this time the source addresses of the packets are 10.0.1.2 and 10.0.2.2, and the packets are forwarded through the host at 100.100.198.250.

Two-way ping between demo01 and 02

mtr result of demo01

mtr result of demo02

Where did the data packets go?

There is already a route, why are the data packets not forwarded by the host network space?

To understand how packets are processed by the kernel and finally reach the protocol stack, you first need to understand the packet processing flow of iptables.

What are the four tables and five chains?

They describe the logical path a packet takes through the kernel as it is matched against iptables rules.

Chains

  • In layman's terms, each chain is a checkpoint (the red parts in the packet-flow diagram: PREROUTING, INPUT, OUTPUT, FORWARD, and POSTROUTING)
  • Rules can be configured at each checkpoint
    • A checkpoint may hold more than one rule; when these rules are strung together, they form a chain.
    • Every packet passing through a checkpoint is matched against the rules on that chain in order. When a rule matches, its action is executed (the packet may be discarded, may be modified and passed downstream, or may pass downstream unchanged)

Tables

  • A collection of rules with the same function is called a table. Rules with different functions are placed in different tables for management.
  • The rules on the chains fall into four functional categories (filtering, network address translation, packet modification, and disabling connection tracking)
    • filter table: responsible for filtering -- the firewall. Kernel module: iptable_filter. It can filter packets and block certain traffic (typical use cases: firewall blacklists, forbidding certain IPs from accessing, or restricting a host's outbound communication)
    • nat table: network address translation. Kernel module: iptable_nat. It rewrites the source or destination address of packets (typical use case: Docker's default bridge network, where a container's outbound traffic carries a source address that is not the container's IP)
    • mangle table: unpacks a packet, modifies it, and repacks it. Kernel module: iptable_mangle. It can set marks on packets so that other rules or programs can use those marks for filtering or policy routing (typical use case: the policy implementation of Kubernetes Calico for cluster container networks)
    • raw table: turns off the connection-tracking mechanism used by the nat table. Kernel module: iptable_raw. Packets matched here skip NAT and ip_conntrack processing, i.e. no address translation or connection tracking is performed (typical use case: improving performance where NAT is not needed -- for example, on a heavily visited web server, disabling connection tracking for port 80 traffic can speed up user access)
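As a concrete illustration of the filter table's blacklist scenario mentioned above, a minimal sketch (the blocked IP 192.0.2.10 is hypothetical; commands require root):

```shell
# Network configuration sketch: drop all inbound traffic from a hypothetical blacklisted IP
iptables -t filter -A INPUT -s 192.0.2.10 -j DROP
# Verify the rule was appended to the INPUT chain
iptables -t filter --line-numbers -nvL INPUT
```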

Relationship between tables and chains

  • The responsibilities of the five chains are different, so some chains are bound to not contain certain types of rules.
  • In actual use, the table is used as the operation entry to define the rules. When adding rules to the table, you need to specify the chain to which it belongs.

The type of rules contained in each chain

Tracing the packet flow

Taking the host network space as the object of analysis, let's trace what happens when the host network space initiates a ping

Host network space---->demo01 (sending data):

1. When a ping is initiated from 100.100.198.250 , the starting position of the message is at the top layer of the protocol stack in the figure.

2. The routing table is consulted and a route matching the destination address is chosen (if no route matches and the next hop is unknown, the packet is discarded); the packet is marked with its next hop - vethhost01

3. The packet passes the OUTPUT chain check (the default policy for all four rule types is ACCEPT - no blocking); since there are no other rules, it flows straight downstream

  1. iptables -t raw --line-numbers -nvL OUTPUT #View the rules of a given chain in a given table: `iptables -t <table name> --line-numbers -nvL <chain name>`. This command prints all raw-table rules in the OUTPUT chain; the default policy of that rule set is shown in the policy value
  2.  
  3. iptables -t mangle --line-numbers -nvL OUTPUT
  4.  
  5. iptables -t nat --line-numbers -nvL OUTPUT
  6.  
  7. iptables -t filter --line-numbers -nvL OUTPUT #The OUTPUT chain is special: rules from all four tables can be attached to it.

Rules for the OUTPUT chain

4. The packet passes the POSTROUTING chain check (the default policy for the mangle and nat types is ACCEPT - do not block); since there are no other rules, it is finally handed to the egress device chosen by the route, vethhost01

  1. iptables -t mangle --line-numbers -nvL POSTROUTING #The command usage is the same as above. POSTROUTING is the "exit checkpoint" of the network namespace.
  2.  
  3. iptables -t nat --line-numbers -nvL POSTROUTING

Rules for the POSTROUTING chain

5. vethhost01 hands the packet to its peer network card, vethdemo01, and the packet arrives in demo01's kernel.

Host network space<----demo01 (receiving data):

1. When demo01 receives the packet and sends a reply, the reply arrives back in the host network space through vethhost01 (the journey before that point is the sending-data flow described above, starting from demo01's protocol stack)

2. The kernel receives the message from the network card and first checks it through the PREROUTING chain (the policy defaults to ACCEPT - do not block). Since there are no other rules, it flows directly to the downstream.

  1. iptables -t raw --line-numbers -nvL PREROUTING #The command usage is the same as above. PREROUTING is the entry checkpoint of the network namespace.
  2.  
  3. iptables -t mangle --line-numbers -nvL PREROUTING
  4.  
  5. iptables -t nat --line-numbers -nvL PREROUTING

Rules for the PREROUTING chain

3. Determine whether the destination address of the message is the local machine (the destination address is the local machine)

4. After the INPUT chain check (policy defaults to ACCEPT - no blocking), since there are no other rules, it finally reaches the protocol stack, and the initiator of ping receives the return message, completing a complete ICMP communication

  1. iptables -t mangle --line-numbers -nvL INPUT #The command is used in the same way as above. The message passing through the INPUT chain can enter the protocol stack.
  2.  
  3. iptables -t nat --line-numbers -nvL INPUT
  4.  
  5. iptables -t filter --line-numbers -nvL INPUT

Rules for the INPUT chain

Taking the host network space acting as a router forwarding ping packets as an example

demo01 ----> Host network space ----> demo02 (forwarding data):

1. The initial packet is sent from demo01 and arrives in the host kernel through the veth pair (vethdemo01 -> vethhost01)

2. The kernel receives the message from the network card and first checks it in the PREROUTING chain (same as above)

3. The kernel checks whether the packet's destination address is the local machine (it is not), then consults the routing table and matches the destination to a route (if no route matches and the next hop is unknown, the packet is discarded); the packet is marked with its next hop - 10.0.2.2

4. The packet must pass the FORWARD chain check (the default policy for filter-type rules is DROP - block); it can flow downstream only if it matches a rule that allows it

  1. iptables -t mangle --line-numbers -nvL FORWARD #The default policy for mangle-type rules is ACCEPT
  2.  
  3. iptables -t filter --line-numbers -nvL FORWARD #The filter type rule policy defaults to DROP
  4.  
  5. # You can see that this type of rule has dropped many packets. If you keep testing ping, you can see that the number of dropped packets keeps increasing.

FORWARD chain filter type rules

5. After the POSTROUTING chain check (same as above), the packet is finally handed to the egress device chosen by the route, vethhost02

6. vethhost02 hands the packet to its peer network card, vethdemo02, and the packet arrives in demo02's kernel.

7. The demo02 kernel processes the packet following the receiving-data flow described above, then sends an ICMP reply in the opposite direction following the sending-data flow, with demo01's IP address 10.0.1.2 as the destination.

8. The host network space forwards the data once more following the same process, this time forwarding the reply sent by demo02 (the host network space forwards packets in both the outbound and the return direction)

9. Finally, demo01's protocol stack receives the reply, completing an ICMP exchange forwarded through 100.100.198.250

Three paths for packets processed by the kernel

  1. Sending data - the kernel handles data sent by the local protocol stack: routing judgment (match the next hop) --> OUTPUT --> POSTROUTING --> network card --> next hop
  2. Receiving data - the kernel handles external data received by the network card: PREROUTING --> routing judgment (destination is the local machine) --> INPUT --> protocol stack
  3. Forwarding data - the kernel handles external data received by the network card: PREROUTING --> routing judgment (not local; match the next hop) --> FORWARD --> POSTROUTING --> network card --> next hop

Based on the above, we can conclude that demo01/02 cannot ping each other because the host kernel, acting as the router, intercepts and discards the packets.

The interception happens at the FORWARD chain, whose filter rules do not let packets through by default. (Note also that the host must have IP forwarding enabled, e.g. `sysctl -w net.ipv4.ip_forward=1`, or it will not forward packets at all.)

  • Issue a "pass" to both demo01 and demo02
  • Communication is bidirectional, so demo01 and demo02 each need two rules, one inbound and one outbound - four rules in total.
  1. iptables -t filter -A FORWARD -o vethhost01 -j ACCEPT #Command format: `iptables -t <table> -A <chain> -o <NIC> -j ACCEPT`. Interpretation: add a filter-type rule to the FORWARD chain allowing all packets whose egress device (per the routing table) is vethhost01
  2.  
  3. iptables -t filter -A FORWARD -i vethhost01 -j ACCEPT #Command format: `iptables -t <table> -A <chain> -i <NIC> -j ACCEPT`. Interpretation: add a filter rule to the FORWARD chain allowing all packets received from vethhost01
  4.  
  5. iptables -t filter -A FORWARD -i vethhost02 -j ACCEPT #Same as above
  6.  
  7. iptables -t filter -A FORWARD -o vethhost02 -j ACCEPT

The filter type rule for the FORWARD chain has been added

Verify:

demo01/02 is interconnected

Cross-host communication (to achieve intercommunication between demo01/02 and other hosts)

Prepare a Linux host (node02--100.100.198.253).

  • Node02 does not yet know how to reach 10.0.1.2/10.0.2.2 on node03; we need to add routes on node02 that point the next hop for those destinations to node03 (100.100.198.250).
  • Node02 and node03 can already reach each other (they are in the same network segment), so no additional routing or iptables configuration is needed on node03.
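The routes to be added on node02 can be sketched as follows (assuming the same net-tools `route` syntax used earlier; commands require root):

```shell
# Network configuration sketch, on node02 (100.100.198.253):
# send traffic for both demo segments via node03
route add -net 10.0.1.0/24 gw 100.100.198.250   # demo01 segment -> node03
route add -net 10.0.2.0/24 gw 100.100.198.250   # demo02 segment -> node03
```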

Node02's route

Verify:

node02-->demo01/02

demo01/02-->node02

When capturing packets on node02, the source address displayed is 10.0.1.2

Current network model

At this point, the Network Namespace that enables cross-host communication has been created and configured.

The creation and configuration of N3 and N4 on node02 is not demonstrated here; the steps can be found above.

Conclusion

There is more than one communication model for network namespaces, and each has its own advantages and disadvantages. The model implemented here is based on the routing mode (a pure Layer 3 implementation).

Network solutions built on this model perform close to the host network. The BGP mode of the Kubernetes network component Calico is based on this model, and its performance is greatly improved compared to the IPIP mode (an overlay-type network).

A distinctive feature of this type of solution is that packets received by the endpoint have not been through NAT. For example, the packet capture on node02 shows that the source address of the packets is the IP of the initiator demo01: 10.0.1.2. In some scenarios the receiving end must see the real source IP, and overlay-type network solutions cannot satisfy that requirement.

I hope this article can help you understand some of the underlying technical implementations of the Kubernetes network.
