The interviewer asked about the ZAB protocol right away, and I was trembling...


ZooKeeper achieves eventual consistency for distributed transactions through the ZAB consistency protocol.

ZAB Protocol Introduction

ZAB stands for ZooKeeper Atomic Broadcast.

ZAB is a consistency protocol with crash-recovery support, designed specifically for the distributed coordination service ZooKeeper. Based on this protocol, ZooKeeper implements a master-slave architecture that keeps the data replicas in the cluster consistent.

ZAB's message broadcast uses an atomic broadcast protocol that is similar to two-phase commit. For each client request, the Leader server generates a transaction proposal and sends it to all Follower servers in the cluster, then collects their votes and finally commits the transaction. As shown in the figure:


ZAB's version of two-phase commit drops the abort logic: every Follower server either acknowledges the Leader's transaction proposal normally or abandons the Leader server. As a result, the Leader can start committing a transaction proposal as soon as more than half of the Follower servers have responded with an ACK.

The Leader server assigns each transaction proposal a globally monotonically increasing ID, called the transaction ID (ZXID). Because the ZAB protocol must preserve the strict causal order of messages, transaction proposals are processed strictly in ZXID order.

During message broadcast, the Leader server allocates a separate queue for each Follower server, puts the transaction proposals into these queues in order, and sends them according to a FIFO strategy.
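The per-follower queues can be pictured with a small sketch. This is not ZooKeeper's code: the class `FollowerQueues` and its methods are hypothetical names made up for illustration, and a proposal is reduced to its ZXID.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: one FIFO queue per follower, filled in ZXID order.
class FollowerQueues {
    private final Map<Long, Queue<Long>> queues = new LinkedHashMap<>();

    // Register a follower and give it its own empty queue.
    void addFollower(long followerId) {
        queues.put(followerId, new ArrayDeque<>());
    }

    // Enqueue the same proposal (identified by its ZXID) for every follower.
    void broadcast(long zxid) {
        for (Queue<Long> q : queues.values()) {
            q.add(zxid);
        }
    }

    // Next proposal to send to a registered follower, or null if its queue is empty.
    Long nextFor(long followerId) {
        return queues.get(followerId).poll();
    }
}
```

Because each follower drains its own queue from the head, a slow follower does not reorder or block the proposals seen by the others.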

After receiving a transaction proposal, each Follower server writes it to local disk as a transaction log entry and, once the write succeeds, returns an ACK to the Leader server.

When the Leader server has received ACKs from more than half of the Follower servers, it broadcasts a COMMIT message and commits the transaction itself. On receiving the COMMIT message, each Follower server commits the transaction as well.
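The "more than half" rule can be sketched as a small ACK counter for a single proposal. The class `AckTracker` is a hypothetical name, and the sketch assumes that the majority is computed over the whole ensemble and that the Leader's own log write counts as one ACK; the text above only speaks of Follower ACKs, so treat that detail as an assumption.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: tracks ACKs for one proposal and reports when it may be committed.
class AckTracker {
    private final int ensembleSize;
    private final Set<Long> acks = new HashSet<>();
    private boolean committed = false;

    AckTracker(int ensembleSize, long leaderId) {
        this.ensembleSize = ensembleSize;
        // Assumption: the leader has logged the proposal itself, which counts as an ACK.
        acks.add(leaderId);
    }

    // Records an ACK; returns true exactly once, when a majority is first reached.
    boolean onAck(long serverId) {
        acks.add(serverId); // a duplicate ACK from the same server is ignored by the set
        if (!committed && acks.size() > ensembleSize / 2) {
            committed = true;
            return true; // the caller would broadcast COMMIT at this point
        }
        return false;
    }
}
```

With five servers, the second Follower ACK is the one that tips the count past half, so the Leader does not wait for the remaining two.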

The atomic broadcast protocol is adopted to keep the distributed data consistent: a transaction is committed only after more than half of the nodes have saved it.

Message Broadcast

You can think of the message broadcast mechanism as a simplified version of the 2PC protocol. It guarantees that transactions are applied in a consistent order through the following mechanism.


When a client submits a transaction request, the Leader node generates a transaction Proposal for the request and sends it to all Follower nodes in the cluster. Once it has received feedback from more than half of the Follower nodes, it commits the transaction.

Because only more than half of the Follower nodes need to return an Ack before a transaction is committed, the data on different nodes can be left inconsistent if the Leader node crashes. ZAB uses crash recovery to handle this data inconsistency.

Message broadcast communicates over TCP, which preserves the order in which transactions are sent and received. When broadcasting, the Leader node assigns a globally increasing ZXID (transaction ID) to each transaction Proposal, and Proposals are processed in ZXID order.

The Leader node allocates a queue for each Follower node, puts transactions into the queue in ZXID order, and sends them according to the queue's FIFO rule. After receiving a transaction Proposal, the Follower node writes the transaction to local disk as a transaction log entry and returns an Ack message to the Leader node once the write succeeds. After receiving Acks from more than half of the Follower nodes, the Leader commits the transaction and at the same time broadcasts a Commit message to all Follower nodes. On receiving the Commit, each Follower node commits the transaction.

Crash recovery

If the Leader crashes during message broadcast, can data consistency still be guaranteed? When the Leader crashes, the cluster enters crash recovery mode, which mainly has to handle the following two situations.

  1. What happens if the Leader crashes after replicating the data to all Followers?
  2. What happens if the Leader crashes after receiving the Acks and committing locally, but having sent the Commit to only some of the Followers?

To address these situations, ZAB defines two principles:

  1. The ZAB protocol ensures that transactions already committed by the Leader are eventually committed by all servers.
  2. The ZAB protocol ensures that transactions that were only proposed/replicated by the Leader but never committed are discarded.

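How does ZAB ensure that transactions committed by the Leader are committed everywhere and that transactions that must be skipped are discarded? The core is the ZXID. During crash recovery, the server holding the largest ZXID is selected, and its state is used as the basis for recovery. Since every committed transaction was acknowledged by more than half of the servers, the server with the largest ZXID is guaranteed to hold all committed transactions. The advantage is that the separate work of checking which transactions were committed and which must be discarded can be omitted, which improves efficiency.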
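A minimal sketch of "pick the server with the largest ZXID" follows. The class `RecoveryChoice` and the method `pickLeader` are hypothetical names, and breaking ties by the larger server id is an assumption made here so that the result is deterministic; the sketch also ignores voting rounds and compares ZXIDs as plain signed longs.

```java
import java.util.Map;

// Hypothetical sketch: choose the recovery leader from each server's last logged ZXID.
class RecoveryChoice {
    // Returns the id of the server with the largest last ZXID, or -1 for an empty map.
    // Assumption: ties are broken in favour of the larger server id.
    static long pickLeader(Map<Long, Long> lastZxidByServer) {
        long bestSid = -1;
        long bestZxid = Long.MIN_VALUE;
        for (Map.Entry<Long, Long> e : lastZxidByServer.entrySet()) {
            long sid = e.getKey();
            long zxid = e.getValue();
            if (zxid > bestZxid || (zxid == bestZxid && sid > bestSid)) {
                bestSid = sid;
                bestZxid = zxid;
            }
        }
        return bestSid;
    }
}
```

Because the epoch sits in the upper bits of the ZXID, a server that has seen a newer epoch always compares higher than one with more transactions from an older epoch.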

Data Synchronization

After the leader election completes and before normal work starts, the Leader server confirms whether all transaction proposals in its transaction log (that is, the committed ones) have been committed by more than half of the machines in the cluster, in other words, whether data synchronization is complete. The data synchronization process of the ZAB protocol is as follows.

The Leader server prepares a queue for each Follower server. It sends the transactions that the Follower has not yet synchronized one by one in the form of transaction proposals, and follows each proposal message with a commit message to indicate that the transaction has been committed.

Once a Follower server has synchronized all of its missing transaction proposals from the Leader server and applied them to its local database, the Leader server adds it to the list of truly available Followers.
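The synchronization step described above can be sketched as a function that builds the message sequence for one Follower. The class `SyncPlanner`, the method `plan`, and the string message format are all hypothetical; the sketch assumes the Leader's committed ZXIDs are available in ascending order and that the Follower reports the last ZXID it holds.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: build the proposal/commit pairs a lagging follower still needs.
class SyncPlanner {
    // leaderLog: the leader's committed ZXIDs in ascending order (assumption).
    // followerLastZxid: the last ZXID the follower already has.
    static List<String> plan(List<Long> leaderLog, long followerLastZxid) {
        List<String> messages = new ArrayList<>();
        for (long zxid : leaderLog) {
            if (zxid > followerLastZxid) {
                // Each missing transaction is sent as a proposal followed by its commit.
                messages.add("PROPOSAL 0x" + Long.toHexString(zxid));
                messages.add("COMMIT 0x" + Long.toHexString(zxid));
            }
        }
        return messages;
    }
}
```

A Follower that is already up to date gets an empty list and can be added to the available Followers right away.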

ZXID’s Design

ZXID is a 64-bit number, as shown in the figure below.


The lower 32 bits are a simple monotonically increasing counter. Each time the Leader server generates a new transaction proposal, it increments the counter by 1.

The upper 32 bits distinguish different Leader servers. Each time a new Leader server is elected, it takes the largest ZXID from its local log, extracts the epoch value from it, adds 1, and uses the result as the new epoch. The lower 32 bits then restart from 0 for generating new ZXIDs. (As I understand it, the epoch marks a Leader server's term: each time a Leader is elected the epoch is updated, indicating that the new Leader handles transaction requests during this period.)

The ZAB protocol uses the epoch number to distinguish changes of leadership, which effectively prevents different Leader servers from using the same ZXID.

Below is the core code the Leader node uses to generate ZXIDs. You can take a look at it.

    // Leader.java
    void lead() throws IOException, InterruptedException {
        // ....
        long epoch = getEpochToPropose(self.getId(), self.getAcceptedEpoch());
        // A new epoch always starts with the counter at 0
        zk.setZxid(ZxidUtils.makeZxid(epoch, 0));
        // ....
    }

    public long getEpochToPropose(long sid, long lastAcceptedEpoch) throws InterruptedException, IOException {
        synchronized (connectingFollowers) {
            // ....
            if (isParticipant(sid)) {
                // Record this server as connected so the quorum check below can count it
                connectingFollowers.add(sid);
            }
            QuorumVerifier verifier = self.getQuorumVerifier();
            // Once a quorum (including the leader itself) has connected, the election is valid:
            // stop waiting and wake up the other waiting threads, similar to a barrier
            if (connectingFollowers.contains(self.getId()) && verifier.containsQuorum(connectingFollowers)) {
                waitingForNewEpoch = false;
                self.setAcceptedEpoch(epoch);
                connectingFollowers.notifyAll();
            } else {
                // ....
                // Not enough followers yet: wait, with initLimit as the timeout
                while (waitingForNewEpoch && cur < end && !quitWaitForEpoch) {
                    connectingFollowers.wait(end - cur);
                    cur = Time.currentElapsedTime();
                }
                // Timed out without a quorum: exit so that a new election can take place
                if (waitingForNewEpoch) {
                    throw new InterruptedException("Timeout while waiting for epoch from quorum");
                }
            }
            return epoch;
        }
    }

    // ZxidUtils
    public static long makeZxid(long epoch, long counter) {
        // Epoch in the upper 32 bits, counter in the lower 32 bits
        return (epoch << 32L) | (counter & 0xffffffffL);
    }
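To make the bit layout concrete, here is a small standalone sketch. `makeZxid` repeats the formula from the listing above; `epochOf`, `counterOf`, and `firstZxidOfNextEpoch` are hypothetical helper names added for illustration, not methods quoted from the ZooKeeper source.

```java
// Hypothetical sketch of the ZXID layout: epoch in the upper 32 bits, counter in the lower 32.
class ZxidLayout {
    // Same formula as makeZxid in the listing above.
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32L) | (counter & 0xffffffffL);
    }

    // Upper 32 bits: the leader epoch.
    static long epochOf(long zxid) {
        return zxid >> 32L;
    }

    // Lower 32 bits: the per-epoch proposal counter.
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    // A newly elected leader takes the epoch of its largest ZXID, adds 1, and restarts the counter at 0.
    static long firstZxidOfNextEpoch(long lastZxid) {
        return makeZxid(epochOf(lastZxid) + 1, 0);
    }
}
```

For example, `makeZxid(2, 5)` is `0x200000005`, and a Leader elected after that ZXID would start from `0x300000000`, so two Leaders from different epochs can never produce the same ZXID.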

ZAB protocol implementation

The process of writing data

I traced the data-writing process through the ZooKeeper source code and summarized it in the following figure:


