A brief introduction to ZAB protocol in Zookeeper

A brief introduction to ZAB protocol in Zookeeper

The full name of the ZAB protocol is Zookeeper Atomic Broadcast Protocol.

Function: The ZAB protocol can be used to synchronize data between active and standby nodes in a cluster to ensure data consistency.

Before explaining the ZAB protocol, we must understand the role of each Zookeeper node.

The role of each Zookeeper node

Leader:

  • Responsible for processing read and write transaction requests sent by the client. The transaction request here can be understood as this request has the ACID characteristics of the transaction.
  • Synchronously write transaction requests to other nodes, and ensure the order of transactions.
  • The status is LEADING.

Follower:

  • Responsible for processing read requests sent by the client
  • Forward the write transaction request to the Leader.
  • Participate in the Leader election.
  • The status is FOLLOWING.

Observer:

  • Same as Follower, the only difference is that it does not participate in the Leader election and its status is OBSERING.
  • Can be used to linearly scale read QPS.

How to choose a Leader during the startup phase?

When Zookeeper is first started, multiple nodes need to find a leader. How do they find one? By voting.

For example, there are two nodes in the cluster, A and B, and the schematic diagram is as follows:

  • Node A votes for itself first. The voting information includes the node id (SID) and a ZXID, such as (1,0). The SID is configured and unique, and the ZXID is a unique increasing number.
  • Node B votes for itself first, and the voting information is (2,0).
  • Then nodes A and B will vote their votes to all nodes in the cluster.
  • After receiving the voting information from node B, node A checks whether the status of node B is in this round of voting and whether it is in the LOOKING state.
  • Voting PK: Node A will compare its vote with others' votes. If the ZXID sent by other nodes is larger, it will update its voting information to the voting information sent by other nodes. If the ZXIDs are equal, the SIDs will be compared. Here, the ZXIDs of nodes A and B are the same, and the SID of node B is larger, so node A updates the voting information to (2, 0) and then sends the voting information out again. Node B does not need to update the voting information, but it needs to send out the vote again in the next round.

At this time, the voting information of node A is (2, 0), as shown in the following figure:

  • Counting votes: In each round of voting, the voting information received by each node is counted to determine whether more than half of the nodes have received the same voting information. The voting information received by node A and node B is (2, 0), and the number is greater than the number of half of the nodes, so node B is selected as the leader.
  • Update node status: Node A acts as a Follower and updates its status to FOLLOWING; Node B acts as a Leader and updates its status to LEADING.

What should I do if the Leader crashes during operation?

During the operation of Zookeeper, the Leader will remain in the LEADING state until the Leader crashes. At this time, the Leader must be re-elected, and the election process is basically the same as the election process in the startup phase.

Points to note:

  • The remaining Followers conduct elections, and Observers do not participate in the election.
  • The zxid in the voting information is from the local disk log file. If the zxid on this node is larger, it will be elected as the leader. If the zxids of the followers are the same, the follower with the larger node id will be elected as the leader.

How to synchronize data between nodes?

Different clients can connect to the primary node or the backup node separately.

When the client sends a read or write request, it does not know whether it is connected to the Leader or the Follower. If the client is connected to the master node and sends a write request, the Leader will execute 2PC (two-phase commit protocol) to synchronize with other Followers and Observers. However, if the client is connected to a Follower and sends a write request, the Follower will forward the write request to the Leader, and then the Leader will perform 2PC to synchronize data to the Follower.

Two-phase commit protocol:

  • Phase 1: The leader sends a proposal to the follower, and the follower sends an ack response to the leader. If more than half of the acks are received, the next phase begins.
  • Phase 2: Leader loads data from disk log files into memory, Leader sends commit message to Follower, and Follower loads data into memory.

Let's take a look at the process of Leader synchronizing data:

  • ① The client sends a write transaction request.
  • ② After receiving the write request, the Leader converts it into a "proposal01:zxid1" transaction request and saves it to the disk log file.
  • ③ Send proposal to other followers.
  • ④ After receiving the proposal, the Follower writes the disk log file.

Next, let's see how the Follower handles the proposal transaction request sent by the Leader:

  • ⑤ Follower returns ack to Leader.
  • ⑥ Leader receives more than half of the acks and proceeds to the next stage
  • ⑦ Leader loads the proposal of the log file in the disk into the znode memory data structure.
  • ⑧ Leader sends commit message to all Followers and Observers.
  • ⑨ After receiving the commit message, Follower loads the data on disk into the znode memory data structure.

Now the data of the Leader and Follower are all in the memory and are consistent. The data read by the client from the Leader and Follower are consistent.

How does ZAB achieve sequential consistency?

When the Leader sends a proposal, it actually creates a queue for each Follower and sends the proposal to their respective queues.

The following figure shows the message broadcast process of Zookeeper:

The client sent three write transaction requests, and the corresponding proposals are:

 proposal01 : zxid1
proposal02 : zxid2
proposal03 : zxid3

After receiving the request, the Leader puts it into the queue one by one, and then the Follower gets the request from the queue one by one, thus ensuring the order of the data.

Is Zookeeper strongly consistent?

Official definition: sequential consistency.

Strong consistency is not guaranteed, why?

Because after the Leader sends the commit message to all Followers and Observers, they do not complete the commit at the same time.

For example, due to network reasons, different nodes receive commits later, so the submission time is also later, resulting in inconsistent data among multiple nodes. However, after a short period of time, when all nodes commit, the data will be synchronized.

In addition, Zookeeper supports strong consistency, which means manually calling the sync method to ensure that all nodes are committed for success.

Here is a question: If a node fails to commit, will the leader retry? How to ensure data consistency? Welcome to discuss.

Leader downtime data loss issue

First case:

Assume that the Leader has written the message to the local disk but has not yet sent a proposal to the Follower. At this time, the Leader crashes.

Then you need to select a new leader. When the new leader sends a proposal, the zxid auto-increment rule included in it will change:

  • The upper 32 bits of zxid are incremented by 1 once, and the upper 32 bits represent the version number of the Leader.
  • The lower 32 bits of zxid are incremented by 1, and continue to increase in size.

When the old Leader recovers, it will become a Follower. When the Leader sends the latest proposal to it, it finds that the high 32 bits of the zxid of the proposal on the local disk are smaller than the proposal sent by the new Leader, so it discards its own proposal.

Second case:

If the Leader successfully sends a commit message to the Follower, but all or some of the Followers have not had time to commit the proposal, that is, load the proposal from the disk into the memory, then the Leader crashes.

Then you need to select the Follower with the largest zxid in the disk log. If the zxids are the same, compare the node IDs and use the one with the larger node ID as the Leader.

This article tries to use plain language + drawings to explain, hoping to inspire everyone.

<<:  No exaggeration or criticism! A rational view of the value and application challenges of cyberspace mapping technology

>>:  5G investment steadily declines: CAPEX spending of the three major operators collectively "shifts"

Recommend

LoRa and 5G: Can they be used for IoT network connectivity at the same time?

There is no doubt that 5G is the new technology o...

How packets travel through the various layers of the TCP/IP protocol stack

All Internet services rely on the TCP/IP protocol...

Ping command advanced usage

ping command The ping command is used to test the...

New report identifies progress and benefits across the 5G network lifecycle

Infovista welcomes TM Forum’s new industry survey...

Linode's 18th Anniversary and Future Outlook

In the past two days, Linode released a blog post...

Let you understand the MQTT protocol

Author: Wang Yingyue, Unit: China Mobile Smart Ho...

Sparks from blockchain and the Internet of Things

The Internet of Things is the application that is...

In the 5G era, smart services will become the new normal

More than a year after its official commercial la...

Game lag? Be careful to use the wrong WiFi frequency at home

When you use WiFi at home to play games, you alwa...