Understanding the IPIP network mode of Calico

Preface

This article analyzes the IPIP network mode of Calico, the networking component used in Kubernetes. The goal is to understand the calixxxx, tunl0 and other devices created in IPIP mode, and how cross-node communication works. It may seem a bit dry, but please take a few minutes to read it; if you forget the earlier part after reading the later part, read it twice. These few minutes will definitely be worth it.

1. Introduction to calico

Calico is another popular network choice in the Kubernetes ecosystem. While Flannel is recognized as the simplest choice, Calico is known for its performance and flexibility. Calico is more comprehensive, not only providing network connectivity between hosts and pods, but also network security and management. The Calico CNI plugin encapsulates the functionality of Calico within the CNI framework.

Calico is a pure layer-3 networking solution based on BGP that integrates well with cloud platforms such as OpenStack, Kubernetes, AWS and GCE. Calico uses the Linux kernel to implement an efficient virtual router (vRouter) on each compute node, which is responsible for data forwarding. Each vRouter advertises the routes of the containers running on its node to the entire Calico network over the BGP protocol, and the forwarding rules needed to reach other nodes are set up automatically. Calico ensures that traffic between all containers is interconnected through plain IP routing. When networking Calico nodes, the existing L2 or L3 structure of the data center can be used directly; no extra NAT, tunnels or overlay networks are needed, so there is no additional encapsulation and decapsulation, which saves CPU cycles and improves network efficiency.

In addition, Calico also provides a variety of network policies based on iptables, implements Kubernetes' Network Policy policy, and provides the function of limiting network accessibility between containers.
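
As a quick illustration of the NetworkPolicy support mentioned above, here is a minimal deny-all-ingress policy that Calico would enforce. This is only a sketch: the file and policy names are hypothetical, and the paas namespace is borrowed from the examples later in this article.

  # deny-all-ingress.yaml (hypothetical file name)
  apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    name: deny-all-ingress
    namespace: paas
  spec:
    podSelector: {}        # selects every pod in the namespace
    policyTypes:
    - Ingress              # no ingress rules are listed, so all inbound traffic is denied

  # kubectl apply -f deny-all-ingress.yaml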

Calico official website: https://www.projectcalico.org/

2. Calico architecture and core components

The architecture diagram is as follows:

Calico core components:

  • Felix: an agent that runs on every node that hosts workloads. It is mainly responsible for configuring routes and ACLs (access control lists) to ensure endpoint connectivity and provide connectivity for containers across hosts;
  • etcd: a highly consistent, highly available key-value store that persists Calico's data. It is responsible for keeping network metadata consistent and ensuring the accuracy of the Calico network state;
  • BGP Client (BIRD): reads the kernel routing state programmed by Felix and distributes it throughout the data center;
  • BGP Route Reflector (BIRD): a BGP route reflector, used in large-scale deployments. A full mesh of BGP clients limits scale, because every node must peer with every other node, requiring on the order of N^2 connections and a complicated topology. Route reflectors peer with the clients instead, so the nodes no longer have to peer with each other.
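
A quick way to see these components in a running cluster (a sketch: the kube-system namespace and the k8s-app=calico-node label are the defaults used by the official Calico manifests, and calicoctl must be installed on the node):

  # Felix and BIRD both run inside the calico-node pod on every node
  kubectl -n kube-system get pods -l k8s-app=calico-node -o wide

  # BIRD's view of its BGP peers (run as root on a node with calicoctl installed)
  calicoctl node status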

3. Working principle of calico

Calico treats each operating system's protocol stack as a router and all containers as network endpoints connected to that router. It runs a standard routing protocol, BGP, between the routers and lets them learn how packets should be forwarded in this topology. The Calico solution is therefore a pure layer-3 solution: each machine's layer-3 stack is used to provide layer-3 connectivity between containers, including containers on different hosts.

4. Two network modes of calico

1) IPIP

IPIP is a tunnel that encapsulates an IP packet inside another IP packet. Its function is roughly equivalent to a bridge that works at the IP layer. An ordinary bridge works at the MAC layer and does not need IP at all, whereas IPIP uses the routers at the two ends to build a tunnel, connecting two otherwise unconnected networks point to point. The IPIP source code is in the kernel at net/ipv4/ipip.c.
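
To get a feel for what this kernel module does, here is a minimal hand-built IPIP tunnel between two hosts, independent of Calico. All addresses here are hypothetical.

  # On host A (192.168.1.10):
  modprobe ipip
  ip tunnel add ipip0 mode ipip local 192.168.1.10 remote 192.168.1.20
  ip addr add 10.0.0.1/30 dev ipip0
  ip link set ipip0 up
  # On host B (192.168.1.20), mirror the setup: swap local/remote, use 10.0.0.2/30.
  # Traffic between 10.0.0.1 and 10.0.0.2 then travels as IP-in-IP (protocol 4)
  # inside packets exchanged between 192.168.1.10 and 192.168.1.20.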

2) BGP

Border Gateway Protocol (BGP) is the core decentralized routing protocol of the Internet. It achieves reachability between autonomous systems (AS) by maintaining IP routing tables or 'prefix' tables. BGP does not use traditional interior gateway protocol (IGP) metrics; instead it makes routing decisions based on paths, network policies or rule sets, so it is more accurately described as a path-vector protocol than a traditional routing protocol.

5. IPIP Network Mode Analysis

Since the IPIP mode is used in my personal environment, I will analyze this mode here.
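
If you want to confirm which mode your own cluster uses, the IP pool records it; a sketch, assuming calicoctl is installed (column layout varies slightly by calicoctl version):

  # The IPIPMODE column reads Always or CrossSubnet in IPIP mode, Never in pure BGP mode
  calicoctl get ippool -o wide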

  # kubectl get po -o wide -n paas | grep hello
  demo-hello-perf-d84bffcb8-7fxqj   1/1   Running   0   9d   10.20.105.215   node2.perf   <none>   <none>
  demo-hello-sit-6d5c9f44bc-ncpql   1/1   Running   0   9d   10.20.42.31     node1.sit    <none>   <none>

Perform a ping test

Here, ping the demo-hello-sit pod from the demo-hello-perf pod.

  root@demo-hello-perf-d84bffcb8-7fxqj:/# ping 10.20.42.31
  PING 10.20.42.31 (10.20.42.31) 56(84) bytes of data.
  64 bytes from 10.20.42.31: icmp_seq=1 ttl=62 time=5.60 ms
  64 bytes from 10.20.42.31: icmp_seq=2 ttl=62 time=1.66 ms
  64 bytes from 10.20.42.31: icmp_seq=3 ttl=62 time=1.79 ms
  ^C
  --- 10.20.42.31 ping statistics ---
  3 packets transmitted, 3 received, 0% packet loss, time 6ms
  rtt min/avg/max/mdev = 1.662/3.015/5.595/1.825 ms

Enter the pod demo-hello-perf to view the routing information in this pod

  root@demo-hello-perf-d84bffcb8-7fxqj:/# route -n
  Kernel IP routing table
  Destination   Gateway       Genmask           Flags  Metric  Ref  Use  Iface
  0.0.0.0       169.254.1.1   0.0.0.0           UG     0       0    0    eth0
  169.254.1.1   0.0.0.0       255.255.255.255   UH     0       0    0    eth0

According to the routing information, ping 10.20.42.31 will match the first entry.

The first route means that packets to any destination are sent to the gateway 169.254.1.1 and then out through the eth0 interface.
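
Note that 169.254.1.1 does not actually exist on any interface; Calico enables proxy ARP on the host-side cali* interface, so the host itself answers ARP for 169.254.1.1 with the fixed MAC ee:ee:ee:ee:ee:ee. A quick check, as a sketch (substitute the cali* device paired with your pod):

  # Inside the pod: the "gateway" resolves to Calico's fixed veth MAC
  ip neigh show 169.254.1.1                           # expected: ... lladdr ee:ee:ee:ee:ee:ee ...

  # On the node: proxy ARP is enabled on the pod's cali* interface
  CALI_IF=caliXXXXXXXXXXX                             # host-side veth of this pod, name is environment-specific
  cat /proc/sys/net/ipv4/conf/${CALI_IF}/proxy_arp    # expected: 1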

The routing information on the host of node node2.perf where demo-hello-perf is located is as follows:

  # route -n
  Kernel IP routing table
  Destination       Gateway        Genmask           Flags  Metric  Ref  Use  Iface
  0.0.0.0           172.16.36.1    0.0.0.0           UG     100     0    0    eth0
  10.20.42.0        172.16.35.4    255.255.255.192   UG     0       0    0    tunl0
  10.20.105.196     0.0.0.0        255.255.255.255   UH     0       0    0    cali4bb1efe70a2
  169.254.169.254   172.16.36.2    255.255.255.255   UGH    100     0    0    eth0
  172.16.36.0       0.0.0.0        255.255.255.0     U      100     0    0    eth0
  172.17.0.0        0.0.0.0        255.255.0.0       U      0       0    0    docker0

You can see a route with a destination of 10.20.42.0.

This means that when the ping packet reaches the node node2.perf, it matches the tunl0 route. That route says: all packets destined for the 10.20.42.0/26 network segment are sent to the gateway 172.16.35.4 through the tunl0 device. Since the node hosting demo-hello-perf is 172.16.36.5 and the node hosting demo-hello-sit is 172.16.35.4, the packet is sent to the peer node through the tunl0 device.
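
The tunl0 device on the node can be inspected directly; a sketch using the addresses from this environment:

  # tunl0 carries a /32 address taken from this node's pod CIDR
  ip addr show tunl0
  # Calico (Felix/BIRD) installs one route per remote node's pod CIDR over tunl0,
  # e.g. 10.20.42.0/26 via 172.16.35.4 onlink
  ip route show dev tunl0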

The routing information on the host of node node1.sit where demo-hello-sit is located is as follows:

  # route -n
  Kernel IP routing table
  Destination     Gateway       Genmask           Flags  Metric  Ref  Use  Iface
  0.0.0.0         172.16.35.1   0.0.0.0           UG     100     0    0    eth0
  10.20.15.64     172.16.36.4   255.255.255.192   UG     0       0    0    tunl0
  10.20.42.31     0.0.0.0       255.255.255.255   UH     0       0    0    cali04736ec14ce
  10.20.105.192   172.16.36.5   255.255.255.192   UG     0       0    0    tunl0

When the node network card receives the data packet, it finds that the destination IP is 10.20.42.31, so it matches the route with Destination 10.20.42.31.

This route means that 10.20.42.31 is a directly connected device on this host, and packets destined for it are handed to the interface cali04736ec14ce.

Why is there such a strange device named cali04736ec14ce? What is it?

In fact, this device is one end of a veth pair. When demo-hello-sit was created, Calico created a veth pair for it: one end is the pod's network card (eth0 inside demo-hello-sit), and the other end is the cali04736ec14ce we see on the node.

Let's verify this. Enter the demo-hello-sit pod and note the interface index after "@if" on device 4: 122964.

  root@demo-hello-sit-6d5c9f44bc-ncpql:/# ip a
  1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
      inet 127.0.0.1/8 scope host lo
         valid_lft forever preferred_lft forever
  2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
      link/ipip 0.0.0.0 brd 0.0.0.0
  4: eth0@if122964: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1380 qdisc noqueue state UP group default
      link/ether 9a:7d:b2:26:9b:17 brd ff:ff:ff:ff:ff:ff link-netnsid 0
      inet 10.20.42.31/32 brd 10.20.42.31 scope global eth0
         valid_lft forever preferred_lft forever

Then log in to the host where the demo-hello-sit pod runs and check:

  # ip a | grep -A 5 "cali04736ec14ce"
  122964: cali04736ec14ce@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1380 qdisc noqueue state UP group default
      link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 16
      inet6 fe80::ecee:eeff:feee:eeee/64 scope link
         valid_lft forever preferred_lft forever
  120918: calidd1cafcd275@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1380 qdisc noqueue state UP group default
      link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 2

The peer index shown inside the pod demo-hello-sit (122964) matches the index of cali04736ec14ce seen here on the node.

Therefore, the route on the node delivers packets for 10.20.42.31 to the cali04736ec14ce device, and through the veth pair they arrive inside the demo-hello-sit pod. At this point, the ping packet has reached its destination.
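
A quicker way to pair a pod's eth0 with its host-side cali* interface, without eyeballing the ip a output (a sketch using the indexes from this environment):

  # Inside the pod: read the peer interface index of eth0
  cat /sys/class/net/eth0/iflink        # here: 122964
  # On the node: find the interface that owns that index
  ip -o link | grep '^122964:'          # -> cali04736ec14ce@if4: ...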

Note the route of the host where the demo-hello-sit pod is located. There is a route with a destination of 10.20.105.192.

  # route -n
  Kernel IP routing table
  Destination     Gateway       Genmask           Flags  Metric  Ref  Use  Iface
  ...
  0.0.0.0         172.16.35.1   0.0.0.0           UG     100     0    0    eth0
  10.20.105.192   172.16.36.5   255.255.255.192   UG     0       0    0    tunl0
  ...

Check the routing information in the demo-hello-sit pod again. It is the same as that in the demo-hello-perf pod.

Therefore, based on the examples above, the IPIP network mode wraps the pod IP network in an additional IP layer. Its characteristic is that all cross-node pod traffic is sent through the tunl0 tunnel device, and tunl0 adds a layer of IP encapsulation on top of the original packet.
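
The extra IP header is also why the tunnel MTU is smaller than the physical MTU; a quick check on a node (a sketch: the 1500-byte physical MTU is an assumption about the NIC, not taken from this environment):

  # IPIP prepends one extra 20-byte IPv4 header, so on a standard 1500-byte NIC:
  #   1500 (eth0) - 20 (outer IP header) = 1480 (tunl0)
  ip -d link show tunl0                 # link type "ipip", typically mtu 1480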

6. Packet capture analysis

Ping the demo-hello-sit pod from the demo-hello-perf pod, and then perform a tcpdump on the host where the demo-hello-sit pod is located.

  # tcpdump -i eth0 -nn -w icmp_ping.cap
  tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
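
If the node also carries other traffic, the capture can be narrowed to IPIP packets only, since IP-in-IP is IP protocol number 4 (the output file name here is just an example):

  # Capture only IP-in-IP encapsulated traffic
  tcpdump -i eth0 -nn 'ip proto 4' -w ipip_only.cap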

Ping demo-hello-sit in the demo-hello-perf pod

  root@demo-hello-perf-d84bffcb8-7fxqj:/# ping 10.20.42.31
  PING 10.20.42.31 (10.20.42.31) 56(84) bytes of data.
  64 bytes from 10.20.42.31: icmp_seq=1 ttl=62 time=5.66 ms
  64 bytes from 10.20.42.31: icmp_seq=2 ttl=62 time=1.68 ms
  64 bytes from 10.20.42.31: icmp_seq=3 ttl=62 time=1.61 ms
  ^C
  --- 10.20.42.31 ping statistics ---
  3 packets transmitted, 3 received, 0% packet loss, time 6ms
  rtt min/avg/max/mdev = 1.608/2.983/5.659/1.892 ms

After the capture is finished, download icmp_ping.cap to a local Windows machine and analyze it in Wireshark.

It can be seen that the data packet has five layers in total, two of which are network layers carrying IP (Internet Protocol): the pod-to-pod network and the host-to-host encapsulation.

The red boxes mark the hosts of the two pods, and the blue boxes mark the IP addresses of the two pods. src shows the IP of the host and the pod that initiate the ping, and dst shows the IP of the host and the pod being pinged.

Following the encapsulation order, the ICMP packet that demo-hello-perf sends to demo-hello-sit is wrapped in an additional host-to-host IP packet.

Each datagram therefore has two IP network layers: the inner layer is the IP message between the pod containers, and the outer layer is the IP message between the two host nodes. This is because tunl0 is a tunnel endpoint device: when data arrives, an extra layer of encapsulation is added so that it can be delivered to the tunnel device on the peer node.

The specific contents of the two-layer packet are as follows:

Communication between pods is forwarded over the layer-3 IPIP tunnel. Compared with a layer-2 VXLAN tunnel, IPIP has lower overhead, but its security is also somewhat weaker.
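
The overhead difference comes straight from the header sizes; roughly, per packet:

  # IPIP : one extra IPv4 header                       -> 20 bytes
  # VXLAN: outer Ethernet + IPv4 + UDP + VXLAN headers -> 14 + 20 + 8 + 8 = 50 bytes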

7. Access from pod to svc

View service

  # kubectl get svc -o wide -n paas | grep hello
  demo-hello-perf   ClusterIP   10.10.255.18   <none>   8080/TCP   10d   appEnv=perf,appName=demo-hello
  demo-hello-sit    ClusterIP   10.10.48.254   <none>   8080/TCP   10d   appEnv=sit,appName=demo-hello

Capture packets on the host of pod demo-hello-sit

  # tcpdump -i eth0 -nn -w svc.cap
  tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes

Test access: from the demo-hello-perf pod, curl the ClusterIP and port of the demo-hello-sit service.

  root@demo-hello-perf-d84bffcb8-7fxqj:/# curl -I http://10.10.48.254:8080/actuator/health
  HTTP/1.1 200
  Content-Type: application/vnd.spring-boot.actuator.v3+json
  Transfer-Encoding: chunked
  Date: Fri, 30 Apr 2021 01:42:56 GMT

  root@demo-hello-perf-d84bffcb8-7fxqj:/# curl -I http://10.10.48.254:8080/actuator/health
  HTTP/1.1 200
  Content-Type: application/vnd.spring-boot.actuator.v3+json
  Transfer-Encoding: chunked
  Date: Fri, 30 Apr 2021 01:42:58 GMT

  root@demo-hello-perf-d84bffcb8-7fxqj:/# curl -I http://10.10.48.254:8080/actuator/health
  HTTP/1.1 200
  Content-Type: application/vnd.spring-boot.actuator.v3+json
  Transfer-Encoding: chunked
  Date: Fri, 30 Apr 2021 01:42:58 GMT

After the capture is finished, download the svc.cap file and open it in Wireshark.


In Wireshark, the Src and Dst are the same as in the pod-to-pod case above: the outer layer still carries the intranet IPs of the two hosts, and the inner layer carries the two pods' own IP addresses. The traffic is still carried over IPIP.
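
The reason the ClusterIP 10.10.48.254 never shows up on the wire is that kube-proxy translates it to a backend pod IP before the packet is encapsulated. Assuming kube-proxy runs in iptables mode, this can be seen on the node:

  # The service address is DNATed to a pod IP via the KUBE-SERVICES chain,
  # so only node IPs and pod IPs appear in the IPIP capture
  iptables -t nat -S KUBE-SERVICES | grep 10.10.48.254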

Through the examples above, you should now understand how communication works in the IPIP network mode!
