At last, a clear and unconventional explanation of the K8S network

[51CTO.com original article] This article summarizes the author's process of learning the K8S network, explaining K8S's complex network architecture step by step: the K8S network design principles, the network inside a Pod, the network between Pods, and so on.


Image source: "An Illustrated Guide to Kubernetes That Even Your Daughter Can Understand"

K8S network design principles

The K8S network design principles are as follows:

  • Each Pod has an independent IP address, and all containers in the Pod share the IP address.
  • All Pods in the cluster are in a directly connected flat network and can be directly accessed via IP.

That is: all containers can directly access each other without NAT; all nodes and all containers can directly access each other without NAT; the IP seen by the container itself is the same as that seen by other containers.

K8S Network Specifications

CNI is a container network specification proposed by CoreOS and adopted by Apache Mesos, Cloud Foundry, Kubernetes, Kurma, and rkt.

In addition, projects such as Contiv Networking, Project Calico and Weave also provide plugins for CNI.

CNI specifies a simple contract between the container runtime and the network plug-in. This contract defines the input and output that the CNI plug-in needs to provide through JSON syntax.

The CNI plugin provides two functions:

  • One adds a container's network interface to the specified network (the ADD operation).
  • The other removes it (the DEL operation).

These two interfaces are called when the container is created and destroyed respectively. The container runtime first needs to obtain a network namespace and a container ID, and then pass them to the network driver along with some CNI configuration parameters.

The network driver then connects the container to the network and returns the assigned IP address to the container runtime in JSON format.
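To make this contract concrete, here is a minimal sketch of invoking a CNI plug-in by hand the way a runtime would. The config file path, container ID, and subnet are hypothetical, and the bridge plug-in is assumed to be installed under /opt/cni/bin:

# Hypothetical CNI config consumed on stdin by the plug-in
cat /etc/cni/net.d/10-mynet.conf
# {"cniVersion":"0.3.1","name":"mynet","type":"bridge","bridge":"cni0",
#  "ipam":{"type":"host-local","subnet":"10.244.1.0/24"}}

# The runtime passes the operation, namespace and container ID via environment variables
CNI_COMMAND=ADD CNI_CONTAINERID=abc123 CNI_NETNS=/var/run/netns/abc123 \
CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin \
/opt/cni/bin/bridge < /etc/cni/net.d/10-mynet.conf
# On success the plug-in prints a JSON result containing the assigned IP;
# CNI_COMMAND=DEL with the same parameters removes the interface again.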

K8S network plugin requirements

In general, K8S has two basic requirements for network plug-ins:

  • It is necessary to be able to assign non-conflicting IP addresses to the Pods on each Node.
  • All Pods must be able to access each other.

K8S network implementation solution

There are several K8S network implementation solutions:

Tunnel Solution

The tunnel solution is also widely used in IaaS-layer networks: all Pods are placed on one large Layer 2 network. The network topology is simple, but complexity increases as the number of nodes grows.

Representative solutions:

  • Weave: UDP broadcast; the local machine creates a new bridge and communicates via PCAP.
  • Open vSwitch: Based on VxLan and GRE protocols, but with severe performance loss.
  • Flannel: UDP broadcast, VxLan.
  • Rancher: IPsec.

Routing Solution

Routing solutions generally achieve isolation and cross-host container interoperability from Layer 3 or Layer 2, and problems are easy to troubleshoot.

Representative solutions:

  • Calico: A routing solution based on the BGP protocol that supports very detailed ACL control and has a high affinity for hybrid clouds.
  • Macvlan: A solution with excellent isolation and performance, implemented at the kernel level. It relies on Layer 2 isolation, so it requires support from Layer 2 devices; most cloud service providers do not support it, so it is difficult to implement in a hybrid cloud.

K8S Pod network creation process

The network creation process of K8S Pod is as follows:

  • In addition to the container specified when creating a Pod, each Pod has a base container specified when Kubelet is started.
  • Kubelet creates the basic container and generates the Network Namespace.
  • Kubelet calls the CNI network driver, which invokes the specific CNI plug-in according to the configuration.
  • The CNI plugin configures the network for the base container.
  • Other containers in the Pod share the network of the base container.

Networking in Pods

Pod is the smallest working unit of K8S. Each Pod contains one or more containers. K8S also manages Pods instead of directly managing containers. The containers in the Pod will be scheduled as a whole by the Master to run on a Node.

The design concept of Pod is to support multiple containers sharing network addresses and file systems in one Pod, and to combine services in a simple and efficient way through inter-process communication and file sharing.

A Pod can contain multiple containers, but a Pod has only one IP address. So how do multiple containers access each other and the Internet using this one IP address?

The answer is: the containers share the same underlying network namespace (network devices, network stack, ports, etc.).

The following is a small example to create a Pod containing two containers. The yaml file is as follows:

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: pod-two-container
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: busybox
        image: busybox
        command:
        - "/bin/sh"
        - "-c"
        - "while true; do echo hello; sleep 1; done"
      - name: nginx
        image: nginx
When creating a Pod that contains two containers, three containers will actually be created. The extra one is the "Pause" container.

This Container is the basic container of the Pod and provides network functions for other containers.

View the basic information of the Pause container:

Use the docker inspect container_ID command to view the Nginx container's details: its network namespace is the Pause container's namespace, and so is its IPC (inter-process communication) namespace.

Looking at Busybox, we find the same thing: its network namespace and IPC namespace are also the Pause container's namespaces.
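If you want to check this yourself, docker inspect can print just the network mode; the container IDs below are placeholders, and both business containers are expected to report the Pause container's ID:

docker inspect --format '{{ .HostConfig.NetworkMode }}' <nginx_container_ID>
docker inspect --format '{{ .HostConfig.NetworkMode }}' <busybox_container_ID>
# Expected output for both (illustrative): container:<pause_container_ID>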

Implementation: Nginx and Busybox can join the Pause container's namespaces because Docker supports specifying an existing container's network namespace when a new container is created.

There is a description on Docker's official website:

https://docs.docker.com/engine/reference/run/

So if you want to manually complete the above Pod, you can first create Pause, then create Nginx and Busybox, and specify the network as the network namespace of Pause.

docker run -d --name pause mirrorgooglecontainers/pause-amd64:3.1
docker run -d --name=nginx --network=container:pause nginx
docker run -dit --name=busybox --network=container:pause busybox

The above steps are completed by K8S, so the Pod namespace should be like this:

Now that we have discussed the network implementation within a Pod, let's look at how a Pod obtains an IP address and how Pods communicate with each other.

Flannel Network

Introduction to Flannel

Flannel is a network planning service designed by the CoreOS team for Kubernetes. Simply put, its function is to allow Docker containers created by different node hosts in the cluster to have a unique virtual IP address for the entire cluster.

In the default Docker configuration, the Docker service on each node is responsible for allocating IP addresses for containers on that node. This leads to a problem that containers on different nodes may obtain the same IP address.

Flannel is designed to re-plan the IP address usage rules for all nodes in the cluster, so that containers on different nodes can obtain "non-duplicate" IP addresses that belong to the same intranet, and containers on different nodes can communicate directly through the intranet IP.

Flannel is essentially an "overlay network", which means that TCP data is packaged in another network packet for routing, forwarding and communication.

Currently, data forwarding methods such as UDP, Vxlan, Host-gw, Aws-vpc, Gce and Alloc routing are supported. The default data communication method between nodes is UDP forwarding.

Applicable scenarios: No need to isolate Pods and the cluster size is small.

Design idea: Allocate an IP segment to each Node so that the IPs of the Nodes are not repeated and the Pods can be accessed directly using the IPs.

Design advantages: The network model is simple, installation and configuration are relatively easy, and the environment is mature and suitable for most use cases.

Flannel's solution to network requirements

Non-conflicting IPs:

  • Flannel uses the Kubernetes API or etcd to store the network configuration of the entire cluster and records the network segments used by the cluster according to the configuration.
  • Flannel runs Flanneld as an agent in each host. It obtains a small network segment Subnet from the network address space of the cluster for the host. The IP addresses of all containers in the host will be allocated from it.

For example, IP allocation in the test environment:

①Master Node

②Node1

③Node2

In the Flannel Network, each Pod is assigned a unique IP address, and the Subnet of each K8S Node does not overlap or intersect.
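When etcd is used as the store, the information behind this allocation can be inspected directly. The keys below are flanneld's default etcd paths and the values are illustrative (with the Kubernetes API backend, the per-node subnet is recorded in the Node object's spec.podCIDR instead):

# Cluster-wide network configuration read by every flanneld agent
etcdctl get /coreos.com/network/config
# {"Network": "192.20.0.0/16", "SubnetLen": 24, "Backend": {"Type": "vxlan"}}

# Per-node subnet leases registered by each flanneld agent
etcdctl ls /coreos.com/network/subnets
# /coreos.com/network/subnets/192.20.0.0-24
# /coreos.com/network/subnets/192.20.1.0-24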

Pods access each other:

  • Flanneld stores the Subnet obtained by the host and the Public IP used for communication between hosts through etcd, and sends them to the corresponding module when needed.
  • Flannel uses various backend mechanisms, such as UDP, Vxlan, etc., to forward network traffic between containers across hosts and complete cross-host communication between containers.

Flannel Architecture Principles

Flannel architecture diagram:

The components are explained below:

Cni0: Bridge device. Each time a Pod is created, a Veth Pair is created. One end is eth0 in the Pod, and the other end is the port (network card) in the cni0 bridge.

All traffic sent from the eth0 network card in the Pod will be sent to the port (network card) of the cni0 bridge device.

Note: The IP address obtained by the cni0 device is the first address of the network segment to which the node is assigned.
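On a node this structure can be seen directly; device names and addresses below are illustrative:

# cni0 holds the first address of the node's Pod subnet
ip addr show cni0
# inet 192.20.0.1/24 scope global cni0

# Each Pod contributes one veth port to the bridge
bridge link show | grep cni0
# 7: veth1a2b3c4d@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> ... master cni0 ...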

Flannel.1: Overlay network device, used to process Vxlan messages (packaging and unpacking). Pod data traffic between different nodes is sent from the Overlay device to the peer end in the form of a tunnel.

Flanneld: Flannel runs Flanneld as an agent in each host. It obtains a small network segment Subnet from the network address space of the cluster for the host, and the IP addresses of all containers in the host will be allocated from it.

At the same time, Flanneld monitors the K8S cluster database and provides the Flannel.1 device with the necessary Mac, IP and other network data information when encapsulating data.
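You can confirm on the node that Flannel.1 is a Vxlan endpoint; the output below is abbreviated and illustrative, but a VNI of 1 and UDP port 8472 are Flannel's defaults:

ip -d link show flannel.1
# flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 ...
#     vxlan id 1 local <node-ip> dev eth0 srcport 0 0 dstport 8472 nolearning ...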

The communication process of Pods on different Nodes:

  • Data is generated in the Pod and sent to cni0 based on the Pod's routing information.
  • cni0 sends the data to the tunnel device Flannel.1 according to the node's routing table.
  • Flannel.1 checks the destination IP address of the data packet, obtains the necessary information about the peer tunnel device from Flanneld, and encapsulates the data packet.
  • Flannel.1 sends the data packet to the peer device. The node's network card receives the data packet, finds that the data packet is an Overlay data packet, decapsulates the outer layer, and sends the inner layer encapsulation to the Flannel.1 device.
  • The Flannel.1 device looks at the data packet, matches it according to the routing table, and sends the data to the cni0 device.
  • cni0 matches the routing table and sends data to the corresponding port on the bridge.
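Seen from a single node, the routing table that drives these steps looks roughly like this (the Pod subnets match the example below; the rest of the output is illustrative):

ip route
# default via 172.16.0.1 dev eth0                                # node's physical network
# 192.20.0.0/24 dev cni0 proto kernel scope link src 192.20.0.1  # local Pod subnet -> cni0 bridge
# 192.20.1.0/24 via 192.20.1.0 dev flannel.1 onlink              # remote Pod subnet -> vxlan tunnel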

Communication Process

①Container in Pod1 to cni0

Pod1 and Pod3 can ping each other:

The Dst IP of the Ping packet is 192.20.1.43. It matches the last routing table entry: all packets destined for 192.20.0.0/12 are forwarded to 192.20.0.1.

192.20.0.1 is the IP address of cni0.

②cni0 to Flannel1.1

When the ICMP packet reaches cni0, cni0 finds that the Dst is 192.20.1.43 and looks up the host routing table for a match.

According to the longest-prefix-match rule, the routing table entry in the figure is matched: packets destined for the 192.20.1.0/24 segment are sent to the gateway 192.20.1.0, and the gateway device is Flannel.1.

③Flannel.1

Inner encapsulation: Flannel.1 is the endpoint of the Vxlan tunnel. When a data packet arrives at Flannel.1, it needs to be encapsulated. At this time:

  • The source IP src ip is 192.20.0.51.
  • The destination IP dst ip is 192.20.1.43.

The data packet needs the destination IP 192.20.1.43 and the Mac address corresponding to that IP before encapsulation can continue. At this point, Flannel.1 does not send an ARP request to resolve the destination IP's Mac address; instead, it hands the request to the Flanneld program in user space.

After receiving the kernel's request event, Flanneld searches etcd for the Mac address of the Flannel.1 device in the subnet that matches the address, that is, the Mac address of the Flannel.1 device on the host where the destination Pod is located.

When Flannel assigns IP segments to Node nodes, it records all network segments and Mac information. The interactive process is shown in the following figure:

Flanneld puts the queried information into the ARP cache table of the Master node:

At this point, the Vxlan inner data packet has been encapsulated. The format is as follows:

To briefly summarize this process:

  • The data packet arrives at Flannel.1. By looking up the routing table, it is known that the data packet must be sent to 192.20.1.0 through Flannel.1.
  • Through the ARP Cache table, we know the Mac address of the destination IP 192.20.1.0.

Outer encapsulation: At this point, the inner encapsulation is ready, and the outer encapsulation of Vxlan needs to be found. The kernel needs to check the FDB (forwarding database) on the Node to obtain the Node address of the destination Vtep device in the inner packet.

Because the Mac address of the destination device has been found in the ARP Table as 52:77:71:e6:4f:58, and the IP address of the Node corresponding to the Mac address exists in the FDB.

If this information is not available in the FDB, the kernel sends an "L2 MISS" event to the Flanneld program in user space. After receiving the event, Flanneld queries etcd for the "Public IP" of the Node corresponding to the Vtep device and registers the information in the FDB.

Once the kernel has the destination Node's IP address and ARP has resolved the corresponding Mac address, the outer Vxlan encapsulation can be completed.
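Both kernel tables involved here can be inspected on the node; with the addresses used in this example they would contain entries like the following (output illustrative):

# ARP cache: gateway IP of the remote subnet -> MAC of the remote Flannel.1 device
ip neigh show dev flannel.1
# 192.20.1.0 lladdr 52:77:71:e6:4f:58 PERMANENT

# FDB: MAC of the remote Flannel.1 -> public IP of the remote Node (outer destination)
bridge fdb show dev flannel.1
# 52:77:71:e6:4f:58 dst <Node1-public-IP> self permanent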

④ Peer Flannel.1

When the eth0 network card of the peer Node receives the Vxlan packet, the kernel recognizes it as a Vxlan packet, decapsulates it, and hands the inner packet to the Flannel.1 device on that node.

In this way, the data packet arrives at the destination node from the sending node, and the Flannel.1 device will receive a data packet as follows:

The destination address is 192.20.1.43. Flannel.1 searches its own routing table and completes the forwarding according to the routing table.

According to the matching principle, Flannel.1 forwards the traffic destined for 192.20.1.0/24 to cni0.

⑤cni0 to Pod

cni0 is a bridge device. When cni0 receives a data packet, it sends the data packet to the Pod through Veth Pair. View the bridge in the Node node.

Through ARP resolution on the Node, we can find out that the Mac address of 192.20.1.43 is 66:57:8e:3d:00:85:

This address is the address of the Pod's network card eth0.

At the same time, through the pairing relationship of the Veth Pair, we can see that eth0 in the Pod is one end of the Veth Pair, and the other end is on the Node; the corresponding network card is vethd356ffc1@if3:

Therefore, the Veth Pair of the Pod mounted on the cni0 bridge is vethd356ffc1, that is:

eth0@if50 and vethd356ffc1@if3 form a Veth Pair, which is equivalent to plugging eth0 in the Pod directly into cni0.

Briefly summarize the principle of cni0 forwarding traffic:

  • First, use ARP to resolve the Mac address corresponding to the destination IP.
  • Forward the traffic to the bridge port that owns that Mac address, i.e. the Veth Pair peer of the Pod's eth0.
  • The Veth Pair port receives the traffic and injects it directly into the Pod's eth0 network card.

Summarize the characteristics of Flannel:

  • The Docker containers created by different Node hosts in the cluster all have unique virtual IP addresses for the entire cluster.
  • Establish an overlay network, through which the data packet is delivered intact to the target container.
  • Create a new virtual network card Flannel0 to receive data from the Docker bridge, and package and forward the received data by maintaining the routing table.
  • etcd ensures that the configuration seen by Flanneld on all Nodes is consistent. At the same time, Flanneld on each Node watches for data changes in etcd and perceives changes to the cluster's Nodes in real time.

Packaging of different backends

Flannel can specify different forwarding backend networks, commonly used ones are Hostgw, UDP, Vxlan, etc. The above uses the Vxlan network.

①Hostgw: It is a simple Backend. Its principle is very simple. It directly adds routing, treats the destination host as the gateway, and directly routes the original packet.

For example, we listen to an EventAdded event from etcd. Subnet 10.1.15.0/24 is assigned to the host Public IP 192.168.0.100.

What Hostgw needs to do is add a route on the host: destination 10.1.15.0/24, gateway 192.168.0.100, and output device the network card selected above for cluster communication.
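In other words, for the event above, Hostgw effectively installs the equivalent of the following route (illustrative; eth0 stands for the cluster-facing network card):

ip route add 10.1.15.0/24 via 192.168.0.100 dev eth0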

Advantages: simple, direct and efficient.

Disadvantages: all hosts are required to be in the same Layer 2 network segment; once hosts cross network segments, they cannot communicate.

②UDP: How to deal with the scenario where the Pod is not in the same subnet? Treat the Pod's network packet as an application layer data packet, encapsulate it with UDP, and transmit it in the cluster, which is Overlay.

The picture above comes from the official Flannel documentation. The packet encapsulation format on the right is the format of the Overlay using UDP.

When container 10.1.15.2/24 wants to communicate with container 10.1.20.2/24:

  • Because the destination of the packet is not within the host subnet, the packet will first be forwarded to the host through the bridge.
  • After routing matching on the host, it enters the network card Flannel0. (Note that Flannel0 is a Tun device, a virtual network device working at Layer 3, and Flanneld is a proxy that listens on Flannel0 and forwards traffic.)
  • When a packet enters Flannel0, Flanneld can read the packet from Flannel0. Since Flannel0 is a Layer 3 device, the packet it reads contains only the IP header and its payload.
  • Finally, Flanneld uses the packet it read as payload data and sends it to the destination host through a UDP socket.
  • Flanneld on the destination host listens on the device where the Public IP is located, reads the payload of the UDP packet from it, and puts it into the Flannel0 device.
  • The container network packet reaches the destination host and can then be forwarded to the destination container through the bridge.

Advantages: Pods can be accessed across network segments.

Disadvantages: Insufficient isolation, UDP cannot isolate two network segments.

③Vxlan: The packet structure is very similar to the UDP backend mentioned above. The difference is that the original message carries an additional Vxlan header and an additional Layer 2 header.

When the cluster is initialized, the Vxlan network is initialized: when host B joins the Flannel network, it writes three pieces of information into etcd: its Subnet 10.1.16.0/24, its Public IP 192.168.0.101, and the Mac address MAC B of its Vtep device Flannel.1.

Afterwards, host A will receive the EventAdded event and obtain the various information added to etcd by B mentioned above.

At this time, it will add three pieces of information on the local machine (expressed as commands in the sketch after this list):

  • Routing information: all packets destined for 10.1.16.0/24 are sent through the Vtep device Flannel.1, with gateway address 10.1.16.0, which is the Flannel.1 device on host B.
  • FDB information: packets whose destination Mac address is MAC B are sent via Vxlan to the destination address 192.168.0.101, that is, host B.
  • ARP information: the gateway address 10.1.16.0 has the Mac address MAC B.
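Expressed as roughly equivalent commands (MAC-B is a placeholder for host B's Flannel.1 Mac address), the three entries are:

ip route add 10.1.16.0/24 via 10.1.16.0 dev flannel.1 onlink        # routing information
bridge fdb append <MAC-B> dst 192.168.0.101 dev flannel.1           # FDB information
ip neigh add 10.1.16.0 lladdr <MAC-B> dev flannel.1 nud permanent   # ARP information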

In fact, Flannel only uses part of the functions of Vxlan. Since VNI is fixed to 1, its working method is essentially similar to UDP Backend. The only difference is that the UDP Proxy is replaced by the Vxlan processing module in the kernel.

The original payload is extended from a Layer 3 packet to a Layer 2 frame, which is meaningless for a Layer 3 solution like Flannel; it is done only to fit the Vxlan model.

Problems

The problems are as follows:

  • Network isolation between Pods is not supported. The design concept of Flannel is to put all Pods in a large Layer 2 network, so there is no isolation strategy between Pods.
  • The device chain is complex and efficiency is low. There are three types of devices in the Flannel model, and encapsulating and parsing packets across multiple devices inevitably reduces transmission efficiency.


Calico Network

About Calico

Calico is an open source networking and network security solution for containers, virtual machines, and native host-based workloads.

Calico supports a wide range of platforms, including Kubernetes, OpenShift, Docker EE, OpenStack, and bare metal services.

In virtualization platforms, such as OpenStack and Docker, workloads need to be interconnected, but containers also need to be isolated and controlled, just like services on the Internet only open port 80 and public clouds have multiple tenants, providing isolation and control mechanisms.

In most virtualization platform implementations, Layer 2 isolation technology is usually used to implement container networks. These Layer 2 technologies have some disadvantages, such as the need to rely on VLAN, Bridge, and tunnel technologies.

Bridges add complexity, while VLAN isolation and tunnels consume more resources and place requirements on the physical environment. As the network scale increases, the overall solution becomes more and more complex.

Calico is a pure Layer 3 solution. It treats each Node as a router and all containers as network endpoints connected to that router. It runs the standard BGP routing protocol between the routers and lets them learn the network topology and how to forward packets.

Calico allocates a subnet to each host as the IP range that can be allocated to the container, so that relatively fixed routing rules can be generated for each host based on the CIDR of the subnet, and the traffic sent by the Pod to another host can be forwarded according to the matching of the routing rules.

Applicable scenarios: The cluster is large and Pods in the environment need to be isolated.

Design concept: Calico does not use tunnels or NAT to achieve forwarding, but cleverly converts all layer 2 and 3 traffic into layer 3 traffic, and completes cross-host forwarding through routing configuration on the Host.

The design advantages are as follows:

① Better resource utilization: Layer 2 network communication relies on a broadcast message mechanism, and the overhead of broadcast messages grows exponentially with the number of hosts. The Layer 3 routing method used by Calico completely suppresses Layer 2 broadcasts and reduces resource overhead.

In addition, a Layer 2 network uses VLAN isolation technology, which is inherently limited to 4096 VLANs. Even if Vxlan can be used to get around this limit, Vxlan brings the new problem of tunnel overhead. Calico uses neither VLAN nor Vxlan, which makes resource utilization more efficient.

② Scalability: Calico uses a solution similar to the Internet. The Internet's network is larger than any data center, and Calico is also naturally scalable.

③ Simple and easier to debug: Because there is no tunnel, the path between workloads is shorter and simpler, with less configuration, and it is easier to debug on the host.

④ Fewer dependencies: Calico only relies on Layer 3 routing to be reachable.

⑤ Adaptability: Calico's fewer dependencies enable it to adapt to all VM, Container, white box or hybrid environment scenarios.

Calico Architecture Principles

The architecture diagram is as follows:

The main working components of the Calico network model are:

  • Felix: An agent process running on each host, responsible for network interface management and monitoring, routing, ARP management, ACL management and synchronization, status reporting, etc.
  • etcd: Distributed key-value storage, mainly responsible for network metadata consistency, ensuring the accuracy of Calico network status, and can be shared with Kubernetes.
  • BGP Client (BIRD): Calico deploys a BGP Client on each Host, implemented with BIRD. BIRD is a separate open-source project that implements many dynamic routing protocols such as BGP, OSPF, and RIP.

Its role in Calico is to listen for the routing information injected by Felix on the Host and then advertise it to the remaining Host nodes via the BGP protocol, thereby achieving network interconnection.

  • BGP Route Reflector (BIRD): In a large network, if only BGP Client is used to form a Mesh network interconnection solution, it will lead to scale limitations, because all nodes are interconnected, which requires N^2 connections.

To solve this scale problem, BGP's Route Reflector mechanism can be used so that all BGP Clients interconnect only with specific RR nodes and synchronize routes through them, greatly reducing the number of connections.

Felix: It monitors the etcd datastore and obtains events from it, for example when a user adds an IP to the machine or creates a container.

After the user creates a Pod, Felix is responsible for setting up its network card, IP, and MAC, and then writes an entry in the kernel's routing table to indicate that this IP should go out through this network card.

Similarly, if the user has established an isolation policy, Felix will also create the policy into the ACL to achieve isolation.

BIRD: It is a standard routing program. It obtains the IP routes that have changed from the kernel, and then spreads them to other host machines through the standard BGP routing protocol, so that other Node nodes know that the IP is here, making it easier to generate routing entries.

Since Calico is a pure three-layer implementation, it can avoid the data packet encapsulation operations related to the two-layer solution. There is no NAT or overlay in the middle.

Therefore, its forwarding efficiency may be the highest among all solutions, because its packets go directly through the native TCP/IP protocol stack, and its isolation is also easier to implement because of this stack.

Because the TCP/IP protocol stack provides a complete set of firewall rules, it can achieve more complex isolation logic through IPTABLES rules.
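The division of labor between Felix and BIRD is visible in the node routing table. A typical Calico node ends up with entries like the following (interface names and addresses are illustrative):

ip route
# 192.168.236.3 dev cali12345abcde scope link                 # local Pod route injected by Felix
# blackhole 192.168.236.0/26 proto bird                       # local address block (drop unknown IPs)
# 192.168.190.192/26 via 172.171.5.96 dev ens160 proto bird   # remote block learned from BIRD via BGP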

Two networks between Calico network nodes

IPIP: An IP data packet is placed inside another IP packet; that is, the IP layer is encapsulated into a tunnel that itself runs at the IP layer.

It basically acts as an IP-based bridge! Generally speaking, ordinary bridges are Mac-based and do not require IP at all.

IPIP creates a tunnel through the routers at both ends, connecting two originally unconnected networks point-to-point.
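As a standalone illustration of the idea (plain Linux IPIP, not Calico's exact configuration), two hosts can be connected point-to-point like this:

# On host A (172.171.5.95); host B mirrors this with local/remote swapped
modprobe ipip
ip tunnel add mytun mode ipip local 172.171.5.95 remote 172.171.5.96
ip link set mytun up
ip route add 192.168.190.192/26 dev mytun   # send the remote Pod block through the tunnel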

BGP: Border Gateway Protocol (BGP) is a core decentralized autonomous routing protocol on the Internet.

It achieves reachability between autonomous systems (AS) by maintaining an IP routing table or 'prefix' table and is a vector routing protocol.

BGP does not use traditional interior gateway protocol (IGP) metrics, but instead uses path-based, network policies or rule sets to make routing decisions.

IPIP working mode

①Test environment

One Master node, IP 172.171.5.95, and one Node node, IP 172.171.5.96:

Create a Daemonset application, Pod1 is located on the Master node with an IP address of 192.168.236.3, and Pod2 is located on the Node node with an IP address of 192.168.190.203:

Pod1 pings Pod2:

②Ping package trip

Routing information on Pod1:

According to the routing information, pinging 192.168.190.203 matches the first route. The first route means that data packets destined for any network segment are sent to the gateway 169.254.1.1 and then sent out from the eth0 network card.

The meaning of the Flags in the routing table:

  • U: Up, the route is currently active.
  • H: Host, the destination is a single host rather than a network.
  • G: Gateway, the route goes through a gateway; if this flag is absent, the destination is directly connected.
  • D: The route was created dynamically (by a daemon or a redirect).
  • M: The route has been modified by a redirect message.

Routing information on the Master node:

When the Ping packet arrives at the Master node, it matches the tunl0 route. This route means that all data packets destined for the network segment 192.168.190.192/26 are sent to the gateway 172.171.5.96.

Because Pod1 is at 5.95 and Pod2 is at 5.96, the data packet is sent to the Node through this route.

Routing information on Node:

When the Node network card receives the data packet, it finds that the destination IP is 192.168.190.203, so it matches the red route (the green line is the route from the Node to the Master).

This route means: 192.168.190.203 is a directly connected device on this machine, and data packets destined for the device are sent to caliadce112d250. This device is one end of the Veth Pair of Pod2.

When creating Pod2, Calico will create a Veth Pair device for Pod2. One end is the network card of Pod2, and the other end is the caliadce112d250 we see.

You can verify this by installing the ethtool tool in Pod2 and then using ethtool -S eth0 to view the device number of the other end of the Veth Pair.

The device number of the other end of the Pod2 network card is 18. When checking the network device numbered 18 on the Node, you can find that the network device is caliadce112d250.
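The check itself is short; the interface index 18 and device names come from the example above, and the remaining output is illustrative:

# Inside Pod2: eth0's statistics include the interface index of its Veth Pair peer on the host
ethtool -S eth0
# NIC statistics:
#      peer_ifindex: 18

# On the Node: interface 18 is the host side of the Veth Pair
ip link show | grep '^18:'
# 18: caliadce112d250@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> ...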

Therefore, the data sent by the router on the Node to caliadce112d250 is actually sent to the network card of Pod2. The Ping packet reaches its destination here.

Check the routing information in Pod2 and find that it is the same as that in Pod1.

As the name implies, an IPIP network encapsulates an IP packet inside another IP packet. The characteristic of the IPIP network is that all Pod traffic is sent through the tunnel device tunl0, which adds an extra layer of IP encapsulation.

Capture packets on the Master network card to analyze the process:

Open ICMP packet No. 285, the packet of Pod1 pinging Pod2. You can see that the packet has five layers in total, two of which are IP network layers: the network between Pods and the encapsulation between hosts.

According to the encapsulation order of the data packets, there should be an extra layer of data packets between hosts encapsulated outside the ICMP packet of Pod1 Ping Pod2.

The reason for doing this is that tunl0 is a tunnel endpoint device, and a layer of encapsulation is added when the data arrives to facilitate cross-segment access.

Specific contents of two-layer IP encapsulation:

IPIP connection method:

BGP Working Mode

①Modify configuration

When installing the Calico network, the default installation is the IPIP network. In the calico.yaml file, change the value of CALICO_IPV4POOL_IPIP to "off" to replace it with the BGP network.
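The change is a single environment variable on the calico-node container; a sketch of the relevant fragment and of re-applying the manifest (the default value is "Always"):

# In calico.yaml, the calico-node container carries this env entry:
#   - name: CALICO_IPV4POOL_IPIP
#     value: "Always"
# Change the value to "off", then re-apply the manifest:
kubectl apply -f calico.yaml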

②Comparison

The biggest difference between a BGP network and an IPIP network is that there is no tunnel device tunl0, and traffic is not sent through the tunnel device.

As mentioned earlier, traffic between Pods in an IPIP network is sent through tunl0, which then sends it to the peer device. In a BGP network, traffic between Pods is sent directly from the node's network card to the destination, eliminating the tunl0 step.

Routing information on the Master node. From the routing information, there is no tunl0 device.

Similarly, create a Daemonset, with Pod1 on the Master node and Pod2 on the Node node.

③Ping package journey

Pod1 pings Pod2:

According to the routing information in Pod1, the Ping packet is sent to the Master node through the eth0 network card.

Routing information on the Master node. According to the matched 192.168.190.192 route, the route means: data packets destined for the network segment 192.168.190.192/26 are sent to the network segment 172.171.5.96.

And 5.96 is the Node. Therefore, the data packet is sent directly to the 5.96 machine:

Routing information on the Node. According to the matching route of 192.168.190.192, the data will be sent to the cali6fcd7d1702e device, which is the same as the one analyzed above and is one end of the Veth Pair of Pod2. The data is directly sent to the network card of Pod2.

When Pod2 responds to the Ping packet, the data reaches the Node and matches the route 192.168.236.0.

This route says: data destined for network segment 192.168.236.0/26 is sent to gateway 172.171.5.95. The data packet is directly sent to the Master node through network card ens160.

By capturing packets on the Master node, you can view the traffic passing through, filter out ICMP, and find the data packet of Pod1 Ping Pod2.

It can be seen that in the BGP network, the IPIP mode is not used and the data packets are encapsulated normally.

It is worth noting the encapsulation of the Mac address. 192.168.236.0 is the IP of Pod1, and 192.168.190.198 is the IP of Pod2.

The source Mac address is the Mac of the Master node network card, and the destination Mac is the Mac of the Node node network card.

This means that when the Master node's router receives data and reconstructs the data packet, it uses an ARP request to obtain the Node node's Mac and then encapsulates it into the data link layer.

The BGP connection method is as follows:

Network comparison

IPIP Network:

  • Traffic: The tunl0 device encapsulates data, forming a tunnel that carries the traffic.
  • Applicable network types: Applicable to scenarios where Pods accessing each other are not in the same network segment and access across network segments. The outer encapsulated IP can solve the routing problem across network segments.
  • Efficiency: Traffic needs to be encapsulated by the tunl0 device, which has a slightly lower efficiency.

BGP Network:

  • Traffic flow: Directs traffic using routing information.
  • Applicable network types: Applicable to Pods that access each other in the same network segment. Cross-segment access requires support from an upstream switch or router. Applicable to large networks.
  • Efficiency: Native HostGW, high efficiency.

Problems

①Tenant isolation problem

Calico's three-layer solution is to perform routing addressing directly on the Host, so if multiple tenants use the same CIDR network, they will face the problem of address conflicts.

② Routing scale issue

It can be seen from the routing rules that the routing scale is related to the Pod distribution. If the Pods are discretely distributed in the Host cluster, more routing items will inevitably be generated.

③IPtables rule scale issue

A host may run dozens of virtualized container instances. Too many iptables rules make the setup complex and hard to debug, and also cause performance loss.

④ Gateway routing issues across subnets

When the peer is not reachable at Layer 2 and traffic must be routed through Layer 3, the gateway needs to support custom route configuration, i.e. a route whose destination is the Pod's network segment and whose next hop is the Node hosting that segment, so that the gateway can forward the traffic across Layer 3.

K8S Network Solution Comparison

The following comparison is an excerpt:

http://www.sohu.com/a/256113338_764649

[51CTO original article, please indicate the original author and source as 51CTO.com when reprinting on partner sites]
