Understanding how keepalived works in one article

keepalived is a high-availability component for the access layer. It is built on the VRRP protocol and protects a system against single points of failure.

How it works

To understand how keepalived works, you first need to understand how the VRRP protocol works.

Concept explanation:

VRRP (Virtual Router Redundancy Protocol) is a fault-tolerance protocol designed to prevent a router from becoming a single point of failure.

The network architecture is as follows:

How it works

  • Multiple routers form a router group, that is, a virtual router. As shown in the figure, RouterA and RouterB form a virtual router.
  • The virtual router uses a virtual IP to interact with the external network (such as VIP in the figure) and a virtual MAC to interact with the internal network (such as VMAC in the figure).
  • The router that holds the VIP is the master router (Master state), and the other routers are backup routers (Backup state).
  • The master router sends an advertisement message (that is, a heartbeat) to the other routers in the group every advertisement_interval seconds (the advert_int configuration item), informing them of its priority and other information.
  • The advertisement is sent via multicast to the address 224.0.0.18.
  • Only the master router responds to ARP requests for the virtual IP; the other routers in the group discard them.
  • Only the master router responds to traffic addressed to the virtual IP; the other routers in the group discard it.

Active/standby switchover

The Master role in the virtual router can move to another router (that is, the VIP is switched to a backup router). Three situations trigger a switchover:

The master router exits the router group

  • The master router sets its priority to 0 in the VRRP message, declaring that it no longer participates in the VRRP group.
  • After receiving this message, the Backup router waits for skew_time (the skew time, (256 - backup_priority) / 256 seconds) and then switches to the Master state.

The Master router lowers its own priority

  • The master router lowers the priority it advertises in the VRRP message to a value below the priority of the backup router (but not to 0).
  • In preemptive mode, the Backup router discards the message, so its master-down timer is no longer reset and it takes over as Master when that timer expires. In non-preemptive mode, it accepts the message and remains in the Backup state.
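
The Backup router's reaction to an incoming advertisement can be sketched as a small decision function. This is only an illustration that follows the Backup-state rules of RFC 3768, not keepalived's actual code; the function name `backup_reaction` and the action strings it returns are made up for this example.

```python
def backup_reaction(local_priority: int, advert_priority: int, preempt: bool) -> str:
    """Return the action a Backup router takes for a received advertisement.

    Hypothetical helper for illustration; the action names are not
    keepalived identifiers.
    """
    if advert_priority == 0:
        # The Master is leaving the group: shorten the master-down timer
        # to skew_time so that this router takes over quickly.
        return "set_timer_to_skew"
    if not preempt or advert_priority >= local_priority:
        # Valid heartbeat from an acceptable Master: keep following it.
        return "reset_timer"
    # Preemption is enabled and the Master's priority is lower than ours:
    # ignore the heartbeat, so the master-down timer keeps running.
    return "discard_advert"
```

With a local priority of 100, an advertisement with priority 90 is discarded in preemptive mode but resets the timer in non-preemptive mode.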

The Backup router times out without receiving a VRRP message

If the Backup router receives no VRRP message from the Master within Master_down_interval (= 3 * advert_int + skew_time), it switches to the Master state.
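
The two timers used in the switchover rules can be worked out with a few lines of Python. This is a minimal sketch of the formulas quoted in this article; the function names are chosen for the example, and advert_int is assumed to be given in seconds.

```python
def skew_time(priority: int) -> float:
    """Skew time in seconds: (256 - priority) / 256."""
    return (256 - priority) / 256


def master_down_interval(priority: int, advert_int: float = 1.0) -> float:
    """Time in seconds a Backup waits without heartbeats before taking over."""
    return 3 * advert_int + skew_time(priority)


# A Backup with priority 100 and the default 1-second advertisement interval
print(skew_time(100))             # 0.609375
print(master_down_interval(100))  # 3.609375
```

Because the skew time shrinks as the priority grows, the Backup with the highest priority times out first and becomes the new Master.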

Frequently asked questions

When Router A and Router B cannot communicate normally, there may be two Master routers, which is called "split brain".

Solution:

  1. Check the network between A and B: make sure the firewall allows VRRP traffic (or turn it off) and that the IP configuration is correct, so that the routers can reach each other.
  2. Connect the two routers with two links, so that if one link fails the other serves as a backup.
  3. Run a script on the master node that tests the network status; if the network is unreachable, the script stops the keepalived process.
  4. When the Master router goes down, raise an alarm immediately so that an operator can intervene.
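
The network test script from item 3 could look roughly like the sketch below. It is an assumption-laden example rather than a recommended implementation: the function names are hypothetical, the `ping -c 1 -W <seconds>` flags assume the Linux iputils version of ping, and the caller has to supply its own list of hosts to probe (for example the gateway and the peer router).

```python
import subprocess
from typing import Callable, Iterable


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Return True if a single ICMP echo to host succeeds (system ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def network_is_healthy(targets: Iterable[str],
                       probe: Callable[[str], bool] = ping_once) -> bool:
    """The network counts as healthy if at least one target answers."""
    return any(probe(target) for target in targets)


def exit_code(targets: Iterable[str],
              probe: Callable[[str], bool] = ping_once) -> int:
    """Exit status for a wrapper script: 0 = healthy, 1 = unreachable."""
    return 0 if network_is_healthy(targets, probe) else 1
```

A wrapper script could call `sys.exit(exit_code(targets))` and let the surrounding tooling decide what to do on a non-zero status, such as stopping keepalived as described above. Passing the probe as a parameter keeps the decision logic testable without sending real packets.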

Expand your knowledge

VRRP packet format

The fields of a VRRP packet are described below. The key ones also appear in the keepalived configuration file.

  • Version: VRRP protocol version number. RFC3768 defines version 2.
  • Type: This field indicates the type of VRRP message. RFC3768 only defines one type of VRRP message, which is the VRRP announcement message, so this field is always set to 1. If the received VRRP announcement message has a type value other than 1, it will be discarded.
  • Virtual Rtr ID: This is the VRID. A VRID uniquely identifies a virtual router. The value range is [1,255], so a router interface can run up to 255 VRRP instances simultaneously. This field has no default value and must be set manually.
  • Priority: Used to elect the Master and Backup routers within a virtual router. The larger the value, the higher the priority. The field is 8 bits wide; configurable values are in the range [1,254], and the default is 100. The value 255 is reserved for the IP address owner: VRRP always sets this field to 255 for the owner, and a manually configured value does not change that behavior. The value 0 is reserved for a Master that is leaving the group: when the Master stops participating, it immediately sends a VRRP announcement with Priority set to 0. A Backup router that receives this announcement waits for the skew time and then switches to Master. Skew time = (256 - priority of the Backup router) / 256, in seconds. For example, a Backup router with priority 100 has a skew time of 156/256 = 0.609 seconds. For the master router the skew time has no practical meaning, although Cisco routers still calculate and display it.
  • Count IP Addrs: The number of IP addresses contained in the VRRP announcement message. This field is actually the number of IP addresses allocated to a VRRP virtual router.
  • Auth Type: Authentication type, an 8-bit unsigned integer. A virtual router can use only one authentication type. If the authentication type in an announcement received by the Backup router is unknown or does not match the local configuration, the packet is discarded. RFC2338 defined three types: no authentication (0), simple text password (1), and IP Authentication Header (2). RFC3768 keeps only type 0 and reserves types 1 and 2 for backward compatibility. Some vendor implementations additionally offer MD5 authentication.
  • Adver Int: This field specifies the time interval for the Master router to send VRRP advertisement messages. The unit is seconds and the value range is [1,255]. If there is no manual configuration, the default value is 1 second.
  • Checksum: The checksum of the entire VRRP message. During the calculation process, the Checksum field is set to 0. After the calculation is completed, the result is filled in this field. If you want to learn more about the calculation of Checksum, you can refer to RFC1071 (CKSM).
  • IP Address: This field stores the virtual IP addresses of the VRRP virtual router. Every configured virtual IP address is encapsulated, and their number is given by Count IP Addrs. For example, if three virtual IP addresses are configured, the announcement message carries three addresses.
  • Authentication Data: RFC3768 stipulates that this field exists only for compatibility with RFC2338. In actual encapsulation, all of its bytes are set to 0, and the receiver ignores the field.
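
To make the field layout concrete, the following sketch packs a VRRPv2 announcement as laid out in RFC3768 and fills in the checksum described above. It only builds the VRRP payload and does not send anything; the function names are invented for this example, and the VRID and the address from the 192.0.2.0/24 documentation range are placeholder values.

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum (RFC1071) over data."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_vrrpv2_advert(vrid: int, priority: int, vips: list,
                        advert_int: int = 1, auth_type: int = 0) -> bytes:
    """Build a VRRPv2 announcement payload (hypothetical helper)."""
    version_type = (2 << 4) | 1  # Version 2, Type 1 (announcement)
    header = struct.pack("!BBBBBBH", version_type, vrid, priority,
                         len(vips), auth_type, advert_int, 0)  # checksum = 0
    addresses = b"".join(socket.inet_aton(ip) for ip in vips)
    auth_data = b"\x00" * 8  # always zero under RFC3768
    message = header + addresses + auth_data
    checksum = internet_checksum(message)
    return message[:6] + struct.pack("!H", checksum) + message[8:]


packet = build_vrrpv2_advert(vrid=51, priority=100, vips=["192.0.2.10"])
print(len(packet), hex(packet[0]))  # 20 0x21
```

A receiver can validate the packet by running the same checksum over the whole message: for an intact packet the result is 0.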

Virtual MAC address

The structure is 00-00-5E-00-01-{VRID}. The first three bytes, 00-00-5E, are assigned by IANA. The next two bytes, 00-01, are reserved for the VRRP protocol. The last byte is the VRID, the virtual router identifier, written in hexadecimal; its value range is [1, 255].
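
As a quick illustration, the virtual MAC address can be derived from the VRID as follows. The function name is made up for this example.

```python
def vrrp_virtual_mac(vrid: int) -> str:
    """Return the VRRP virtual MAC address for the given VRID."""
    if not 1 <= vrid <= 255:
        raise ValueError("VRID must be in the range 1..255")
    # The last byte is the VRID, written as two hexadecimal digits.
    return f"00-00-5E-00-01-{vrid:02X}"


print(vrrp_virtual_mac(51))  # 00-00-5E-00-01-33
```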

VRID

Virtual router identifier. Routers in the same VRRP group must have the same VRID.

Other

keepalived also has a built-in module that, driven by the configuration file, works with the kernel to add IPVS rules and set up LVS. This is another key component, but it is not discussed in this section.

keepalived processes

keepalived runs three processes once it has started:

  • The main process monitors the child processes.
  • The VRRP child process is responsible for VRRP communication.
  • The checker child process checks the service status; if the service is unavailable, it notifies the VRRP child process, which issues a downgrade notice.

Final Thoughts

For common components, we need to know not only how to use them but also how they work underneath, so that we can get twice the result with half the effort when problems arise. I hope this article gives you a deeper understanding of keepalived.
