1 Concepts

1.1 Model

Node. In a concrete engineering project, a node is usually a process running on an operating system. In the model used in this article, a node is treated as a complete, indivisible whole; if a program process actually consists of several relatively independent parts, it can instead be modeled as multiple nodes.
1.2 Replicas

A replica (copy) is the redundancy provided for data or services in a distributed system. A data replica means persisting the same data on different nodes: when the data stored on one node is lost, it can be read back from a replica. Data replication is the only means by which a distributed system can handle the anomaly of data loss. The other kind of replica is the service replica, where multiple nodes provide the same service; such a service usually does not depend on the node's local storage, and the data it needs generally comes from other nodes. The replica protocol is the theoretical core of the entire distributed system.

Replica consistency. A distributed system uses a replica control protocol to guarantee that, under certain constraints, the data read from the different replicas of the system is the same as seen from outside; this property is called replica consistency. Replica consistency is a property of the distributed system as a whole, not of a particular replica.
1.3 Metrics for measuring distributed systems
2 Principles of Distributed Systems

2.1 Data Distribution

A distributed system, as the name suggests, uses multiple computers to cooperatively solve the computing, storage and other problems that a single computer cannot handle. The biggest difference between a stand-alone system and a distributed system lies in the scale of the problem, that is, in the amount of data to be computed and stored. To solve a problem in a distributed way, the first step is to decompose it so that multiple machines can work on it, with each machine in the distributed system responsible for a subset of the original problem. Since the input of both computing and storage is data, how to partition the input data becomes the basic problem of distributed systems.

Hash method. Hash distribution computes a hash value from some feature of the data (for example, a user id or a key) and maps the data to a machine, typically by taking the hash value modulo the number of machines. Its disadvantages are obvious, above all its poor scalability: once the cluster needs to grow, almost all of the data has to be migrated and redistributed. In engineering practice, a hash-distributed system is therefore often expanded by doubling the cluster size and rehashing the data, so that only half of the data on each machine has to be migrated to one corresponding new machine. One way to address the poor scalability of the hash method is to stop mapping hash values to machines by simple modulo arithmetic and instead manage the mapping as metadata on a dedicated metadata server; the number of hash buckets is then made larger than the number of machines, so that each machine serves several buckets. The price is that a large amount of metadata has to be maintained with a more complex mechanism. Another disadvantage of hash distribution is that once the data for some feature value is severely uneven, the "data skew" problem easily appears.
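As a concrete illustration, the sketch below contrasts plain modulo placement with the bucket-plus-metadata variant just described (a toy example; the hash function, bucket count and machine numbering are arbitrary choices, not taken from any particular system):

```python
import hashlib

def feature_hash(key: str) -> int:
    """Stable hash of a data feature (e.g. a user id)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

# 1) Plain hash-modulo placement: adding a machine changes almost every mapping.
def machine_by_modulo(key: str, num_machines: int) -> int:
    return feature_hash(key) % num_machines

# 2) Bucket + metadata placement: hash into many fixed buckets, then look the
#    bucket up in a metadata table; expansion only remaps some buckets.
NUM_BUCKETS = 1024                                           # much larger than the cluster size
bucket_to_machine = {b: b % 4 for b in range(NUM_BUCKETS)}   # 4 machines initially

def machine_by_bucket(key: str) -> int:
    bucket = feature_hash(key) % NUM_BUCKETS
    return bucket_to_machine[bucket]

# Expansion: move a subset of buckets to a new machine 4, leaving other data untouched.
for b in range(0, NUM_BUCKETS, 5):
    bucket_to_machine[b] = 4
```

With plain modulo placement, growing the cluster from 4 to 5 machines would change the mapping of roughly four fifths of the keys, which is why doubling the cluster size is the usual expansion strategy for that scheme.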
Distribution by data range. Distribution by data range is another common method: the data is divided into different intervals according to the range of its feature value, and each server (or server group) in the cluster handles the data of a different interval. In engineering practice, to make load-balancing operations such as data migration easier, intervals are usually divided dynamically so that the amount of data served in each interval stays roughly equal: when the data in an interval grows too large, the interval is "split" into two, keeping the amount of data per interval below a relatively fixed threshold. A dedicated server is generally needed to maintain this data distribution information in memory, often called meta-information. For a large-scale cluster the meta-information itself becomes so large that a single machine cannot maintain it alone, and several machines must be used as meta-information servers.

Distribution by data volume. Distribution by data volume is independent of any specific data feature. The data is treated as a sequentially growing file, which is divided into data blocks (chunks) of a relatively fixed size, and different blocks are distributed to different servers. As with distribution by data range, the specific placement of the blocks has to be recorded and managed as metadata by a metadata server. Since it is independent of the data content, distribution by data volume generally does not suffer from data skew: the data is always evenly divided and spread across the cluster. When the cluster needs to be rebalanced, this can be done by migrating data blocks, and cluster expansion is likewise unrestricted, requiring only that some blocks be migrated to the newly added machines. The disadvantage is, again, the need to manage fairly complex metadata: as with range distribution, when the cluster is large the amount of metadata also becomes large, and managing it efficiently becomes a new problem.

Consistent hashing. Consistent hashing is another data distribution method widely used in engineering; it was originally the standard data distribution algorithm of distributed hash tables (DHT) in P2P networks. The basic idea is to hash the data (or a data feature) with a hash function whose output range forms a closed ring, that is, the maximum output value of the hash function is the immediate predecessor of the minimum value. Nodes are distributed randomly on this ring, and each node is responsible for the data whose hash values fall in the range from itself clockwise to the next node. Consistent hashing requires managing the nodes' positions on the ring as metadata, which is more complicated than plain hash distribution; however, the amount of this metadata depends only on the number of machines in the cluster and is usually much smaller than the metadata needed for distribution by data range or by data volume.

A drawback of the basic scheme is that when a node fails, its entire load falls on its single clockwise successor, and a newly added node can only relieve one neighboring node. A common improvement is therefore to introduce virtual nodes: many virtual nodes, generally far more than the number of machines the cluster will ever have, are created when the system starts and are distributed evenly on the consistent hash ring, playing the same role as the nodes in the basic algorithm. Each real node is assigned several virtual nodes. To operate on a piece of data, its hash value is first used to find the corresponding virtual node on the ring, and the metadata is then consulted to find the corresponding real node. Using virtual nodes has several advantages: when a node fails, several virtual nodes become unavailable at once, so several neighboring real nodes share the failed node's load; likewise, a newly added node can be assigned several virtual nodes and thus take over load from several existing nodes, making it easier to keep the load balanced during expansion.
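A minimal sketch of consistent hashing with virtual nodes in the spirit of this description (the hash function, node names and virtual-node count are illustrative choices):

```python
import bisect
import hashlib

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, num_virtual: int = 100):
        self.num_virtual = num_virtual      # virtual nodes per real node
        self.ring = []                      # sorted virtual-node positions on the ring
        self.vnode_to_node = {}             # metadata: position -> real node

    def add_node(self, node: str):
        for i in range(self.num_virtual):
            pos = h(f"{node}#vn{i}")
            bisect.insort(self.ring, pos)
            self.vnode_to_node[pos] = node

    def remove_node(self, node: str):
        for i in range(self.num_virtual):
            pos = h(f"{node}#vn{i}")
            self.ring.remove(pos)
            del self.vnode_to_node[pos]

    def locate(self, key: str) -> str:
        # The first virtual node clockwise from the key's hash owns the key;
        # the ring is closed, so wrap around past the maximum position.
        pos = h(key)
        idx = bisect.bisect_right(self.ring, pos) % len(self.ring)
        return self.vnode_to_node[self.ring[idx]]

ring = ConsistentHashRing()
for n in ("node-A", "node-B", "node-C"):
    ring.add_node(n)
print(ring.locate("user:42"))
```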
Replication and data distribution. The basic means by which a distributed system tolerates faults and improves availability is the replica, and the way data replicas are distributed mainly affects the system's scalability. A basic replica strategy takes the machine as the unit: several machines act as replicas of one another and hold exactly the same data. This strategy works with all of the data distribution methods above. Its advantage is simplicity; its disadvantages are that data recovery is inefficient and scalability is poor.

The more suitable approach is not to replicate whole machines but to split the data into more reasonably sized data segments and replicate those. In practice the segments are kept as close to equal in size as possible and below a certain limit; they go by names such as segment, fragment, chunk or partition. The choice of data segments is directly tied to the data distribution method: for hash distribution, for example, the remainder class of each hash bucket can serve as a data segment, and to keep segments small the number of buckets is made larger than the cluster size. Once the data is divided into segments, replicas can be managed per segment, so that replicas are no longer rigidly tied to machines and each machine can hold replicas of many different segments.

When replica placement is decoupled from machines, recovery after data loss becomes very efficient: if a machine loses its data, replicas of its segments are scattered across all machines of the cluster rather than sitting on a few fixed mirror machines, so recovery can copy from the whole cluster in parallel, with each source machine contributing only a small amount of bandwidth. Even if every source machine is throttled to 1 MB/s, 100 participating machines give a recovery speed of 100 MB/s. Decoupling replicas from machines also helps fault tolerance: when a machine crashes, its load is naturally spread over the whole cluster because its replicas are scattered everywhere. Finally, it helps cluster expansion: in theory, if the cluster has N machines, adding a new machine only requires migrating a fraction 1/N − 1/(N+1) of the data segments from each existing machine to reach a new balance, and since the data is migrated from every machine in the cluster, this is as efficient as recovery.

In engineering practice, managing replicas purely at the granularity of individual data segments increases the amount of metadata and the difficulty of replica maintenance. A common compromise is to group several data segments into a data segment group and manage replicas at the granularity of the group, which keeps the replica granularity within a reasonable range.

Localized computing. In a distributed system, the way data is distributed also deeply influences the way computation is distributed. The computing node and the storage node holding its input data may be on the same physical machine or on different ones. If they are on different machines, the data to be computed has to be transferred over the network.
This is expensive, and network bandwidth can even become the bottleneck of the whole system. The alternative is to schedule the computation, as far as possible, onto the machine that stores its input data; this is called localized computing. Localized computing is an important optimization of computation scheduling and embodies an important idea of distributed scheduling: "moving computation is cheaper than moving data."

Choosing a data distribution method. In practice, a data distribution method is chosen according to the requirements and the implementation complexity. The methods can also be combined flexibly, often retaining the advantages of each and achieving a better overall result. Example: solving data skew. On top of hash distribution, distribution by data volume can be introduced to deal with data skew. Data is distributed by the hash of the user id; when the data volume of one user id is exceptionally large, that user's data always lands on a single machine. At that point the data-volume method is brought in: the user's data volume is measured, and once it exceeds a threshold the user's data is cut into multiple evenly sized data segments that are spread across the cluster. Since most users stay below the threshold, only the segment placement of the users that exceed it needs to be kept in the metadata, so the metadata stays small. This combination of hash distribution with distribution by data volume has been used in a real system and worked well.

2.2 Basic Replica Protocols

A replica control protocol is a distributed protocol that controls the reading and writing of replica data according to a specific procedure so that the replicas satisfy certain availability and consistency requirements. A replica control protocol must tolerate abnormal conditions to some degree, so that the system retains a certain level of availability, and it must provide a certain level of consistency. By the CAP principle (analyzed in detail in Section 2.9), it is impossible to design a replica protocol that is strongly consistent and remains available under every possible network anomaly, so practical replica control protocols always strike a compromise among availability, consistency and performance according to the specific requirements. Replica control protocols fall into two classes: centralized replica control protocols and decentralized replica control protocols.

Centralized replica control protocols. The basic idea of a centralized replica control protocol is to let a central node coordinate the updating of replica data and maintain consistency among the replicas. The advantage of this approach is that the protocol is relatively simple: all replica-related control is concentrated on the central node, and concurrency control is also performed there, reducing a distributed concurrency control problem to a single-machine concurrency control problem.
Concurrency control means resolving conflicts such as "write-write" and "read-write" when multiple nodes want to modify the replica data at the same time. On a single machine this is usually done with locking and similar techniques; locking is also common for distributed concurrency control, but without a central node to manage the locks a fully distributed lock system is required, which makes the protocol much more complex. The disadvantage of a centralized replica control protocol is that the availability of the system depends on the central node: when the central node fails or communication with it is interrupted, the system loses some of its services (usually at least the update service), so the centralized approach implies a certain amount of service downtime.

primary-secondary protocol. In a primary-secondary protocol the replicas are divided into two classes: exactly one replica is the primary, and all others are secondaries. The node that holds the primary replica acts as the central node; it is responsible for ordering updates, concurrency control and coordinating replica consistency. Primary-secondary protocols generally have to solve four classes of problems: the data update flow, the way data is read, the determination and switching of the primary replica, and data synchronization (reconcile).

Basic data update flow
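A minimal sketch of this update flow, under the usual primary-secondary assumptions (the all-acks success rule and the class names are illustrative choices, not a specific system's design):

```python
class SecondaryReplica:
    def __init__(self):
        self.data = {}

    def apply(self, seq: int, key, value) -> bool:
        # A secondary applies updates in the order decided by the primary.
        self.data[key] = value
        return True

class PrimaryReplica:
    """Sketch of the primary-secondary update flow: the primary orders updates,
    forwards them to the secondaries, and decides the overall result."""

    def __init__(self, secondaries):
        self.secondaries = secondaries   # objects exposing apply(seq, key, value) -> bool
        self.seq = 0                     # update sequence number decided by the primary
        self.data = {}

    def update(self, key, value) -> bool:
        # 1. The external node sends the update to the primary (this call).
        # 2. The primary performs concurrency control and determines the order.
        self.seq += 1
        self.data[key] = value
        # 3. The primary forwards the update to all secondary replicas.
        acks = [s.apply(self.seq, key, value) for s in self.secondaries]
        # 4. The primary decides success according to the secondaries' responses
        #    (here: all must acknowledge, a WARO-like rule) and replies to the client.
        return all(acks)

primary = PrimaryReplica([SecondaryReplica(), SecondaryReplica()])
print(primary.update("user:42", "profile-v1"))   # True
```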
In engineering practice, if the primary sends the data to the other N replicas simultaneously by itself, the update throughput of each secondary is limited by the primary's total outbound network bandwidth, at most 1/N of it. To solve this, some systems (for example GFS) relay the data instead: the primary sends the update to the first secondary, the first secondary forwards it to the second, and so on.

The way data is read. How data is read is also closely tied to consistency. If only eventual consistency is required, reading any replica is sufficient. If session consistency is required, a version number can be kept per replica and incremented on every update; by checking the version number when reading, the system can guarantee that the data a user reads grows monotonically within a session. The harder part of primary-secondary is achieving strong consistency. Since the update flow is controlled by the primary, the data on the primary replica is always the newest, so strong consistency can be achieved simply by reading only the primary replica; the secondaries then serve no reads. In practice, if replicas are not bound to machines but maintained per data segment, having only the primary serve reads does not necessarily waste machine resources: with the replicas spread over the cluster and the primaries placed, say, randomly, each machine holds the primary replicas of some data segments and secondary replicas of others, so every server actually serves both reads and writes.

Another option is to let the primary control the availability of the secondaries: when the primary fails to update a secondary replica, it marks that secondary as unavailable, so users no longer read from it; the unavailable secondary keeps trying to synchronize with the primary, and once it has caught up the primary marks it as available again. In this way every available replica, primary or secondary, is readable, and within a bounded time a secondary is either up to date with the primary or marked unavailable, which meets a stronger consistency requirement. This approach depends on a central metadata service that records which replicas are available and which are not, and in a sense it raises consistency by lowering availability.

Determining and switching the primary replica. Another core problem of primary-secondary protocols is how to determine the primary replica, especially how to switch to a new primary when the machine holding the original primary crashes. Usually, which replica is primary is itself meta-information maintained by a dedicated metadata server; an update operation first queries the metadata server for the primary and then carries out the update flow.
Since reliably detecting node failure in a distributed system takes time, typically on the order of 10 seconds, it also takes on the order of 10 seconds after the primary fails before the failure is detected and a primary switch can begin. During those 10 seconds the system cannot serve updates, because there is no primary; if the system only reads from the primary, it cannot even serve reads. The biggest drawback of primary-secondary replica protocols is therefore the service downtime caused by primary switching.

Data synchronization. An inconsistent secondary replica needs to be synchronized (reconciled) with the primary. There are three common forms of inconsistency: first, due to anomalies such as network partitions, the data on a secondary lags behind the primary; second, under some protocols the data on a secondary may be dirty and must be discarded; dirty data arises when the secondary performs an update that the primary never actually committed, leaving wrong data on the secondary; third, the secondary is a newly added replica with no data at all and must copy data from other replicas.

For the first case, the usual approach is to replay the primary's operation log (typically the redo log) to catch up with the primary's progress. For dirty data, the best approach is to design the distributed protocol so that dirty data cannot arise; if the protocol can produce dirty data, the probability should be driven so low that a replica found to contain dirty data can simply be discarded, which reduces the situation to a replica with no data. Alternatively, undo-log-based methods can be designed to remove the dirty data. If the secondary has no data at all, the usual practice is to copy the primary's data directly, which is often much faster than replaying logs; but since the primary must continue serving updates while the copy is in progress, it has to support snapshots: a snapshot of the replica data is taken at some moment, the snapshot is copied, and the updates made after the snapshot are then caught up by log replay.

Decentralized replica control protocols. A decentralized replica control protocol has no central node; all nodes are completely equal and reach agreement by negotiating with one another as peers, so there is no service outage caused by the failure of a central node. Their biggest drawback is that the protocols are usually complicated, especially when strong consistency is required, in which case they become intricate and hard to understand, and because of this complexity their efficiency and performance are generally lower than those of centralized protocols. A loose analogy: a centralized replica control protocol resembles an autocratic system, efficient but heavily dependent on the central node;
once the central node fails, the system is badly affected. A decentralized replica control protocol resembles a democratic system: the nodes negotiate collectively, which is inefficient, but the failure of a few individual nodes has little effect on the system as a whole.

2.3 Lease Mechanism

The lease mechanism is one of the most important distributed protocols and is widely used in practical distributed systems.

A lease-based distributed cache system. The background of the basic problem is as follows. In a distributed system there is a central server node that stores and maintains some data, which is the metadata of the system. The other nodes read and modify the metadata by accessing the central server. Since every operation in the system depends on the metadata, the central server would become the performance bottleneck if every metadata read went to it. A metadata cache is therefore designed: metadata is cached on the individual nodes, reducing the load on the central server and improving performance. On the other hand, the correct operation of the system strictly depends on correct metadata, so the cached data must always be consistent with the data on the central server and must never be stale or dirty. Finally, the cache design should tolerate anomalies such as node crashes and network interruptions as far as possible, to maximize the availability of the system.

A cache system based on leases is designed as follows. When the central server sends data to a node, it also issues a lease to the node. Each lease has an expiry time, like the expiry date on a credit card, and the expiry is an absolute point in time, for example 12:00:10: once real time passes that point, the lease is invalid. The validity of a lease therefore does not depend on when the node receives it; the lease may even already be expired on arrival. (Here we first assume that the clocks of the central server and the nodes are synchronized; the effect of clock skew is discussed in the next section.) The meaning of the lease issued by the central server is: within the lease's validity period, the central server guarantees not to modify the value of the corresponding data. So after receiving the data and the lease, the node adds the data to its local cache, and once the lease expires, the node removes it from the cache. When the central server wants to modify the data, it first blocks all new read requests, waits for all previously issued leases on that data to expire, and only then modifies the value.

Reading metadata from the lease-based cache
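A single-process simulation of this scheme, with synchronized clocks as assumed above (the names and the 10-second period are illustrative):

```python
import time

LEASE_SECONDS = 10.0   # illustrative; see the discussion of lease duration below

class MetadataServer:
    def __init__(self):
        self.value = {"chunk_map": "v1"}
        self.latest_expiry = 0.0           # latest lease expiry issued so far

    def read(self):
        # Issue the data together with a lease: a promise not to modify the
        # value before the (absolute) expiry time.
        expiry = time.time() + LEASE_SECONDS
        self.latest_expiry = max(self.latest_expiry, expiry)
        return dict(self.value), expiry

    def modify(self, new_value):
        # Basic scheme: wait until every previously issued lease has expired,
        # then modify.  (The optimisations discussed next keep serving reads,
        # without leases, while waiting.)
        while time.time() < self.latest_expiry:
            time.sleep(0.1)
        self.value = new_value

class CacheNode:
    def __init__(self, server):
        self.server = server
        self.cached = None
        self.expiry = 0.0

    def get(self):
        # Use the local cache only while the lease is still valid.
        if self.cached is None or time.time() >= self.expiry:
            self.cached, self.expiry = self.server.read()   # refresh cache + lease
        return self.cached

server = MetadataServer()
node = CacheNode(server)
print(node.get())
```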
This mechanism ensures that the cache on every node is always consistent with the central server. The central server grants a lease along with the data, and during the lease's validity period the server will not modify the data, so the client node can safely cache it for that period. The key to the fault tolerance of the lease mechanism is this: once the server has sent the data and the lease, it does not matter whether the client actually received them, whether the client subsequently crashes, or whether the network later fails; the server only needs to wait until the lease has expired to be sure that the client node is no longer caching the data, and can then modify it without breaking cache consistency.

The basic scheme above has some performance and availability problems, but they are easy to optimize away. Optimization 1: while modifying metadata, the server first blocks all new reads, so no read service is available. This is done to prevent newly issued leases from letting ever more clients hold leases and cache the data, which would postpone the modification indefinitely (a form of livelock). The fix is simple: once the server has entered the modification flow, it answers read requests with the data only, without issuing a lease, so during the modification clients can still read the metadata but cannot cache it. A further optimization: when entering the modification flow, the server continues issuing leases but caps their expiry at the latest expiry it has already issued. Clients can then keep caching metadata while the server is in the modification flow, yet the time the server has to wait for all leases to expire is not continually extended by newly issued leases.

Finally, the difference between the cache mechanism and the replica mechanism. Both store a piece of data on more than one node, but the cache mechanism is much simpler: cached data can be dropped at any time, and the only consequence of a cache miss is that the data has to be fetched from its source. A replica, by contrast, cannot be casually discarded; every lost replica degrades the service, and once the number of replicas falls below a certain level the service often becomes unavailable altogether.

Analysis of the lease mechanism. Definition: a lease is a promise granted by an issuer that holds for a certain validity period. Once the issuer has issued a lease it must strictly keep the promise for as long as the lease has not expired, regardless of whether the recipient received it and regardless of the recipient's subsequent state. The recipient, on the other hand, may rely on the issuer's promise during the lease's validity period, but must stop relying on it once the lease expires.

The lease mechanism is highly fault tolerant. First, by introducing a validity period, it tolerates network anomalies very well: issuing a lease only requires the network to work in one direction, from issuer to recipient.
Even if the recipient cannot send messages back to the issuer, the issuing of the lease is unaffected. Since the lease's validity is an absolute point in time, its meaning does not depend on when it was sent, so the issuer can send the same lease to the recipient repeatedly; occasional transmission failures are handled simply by resending. Once the recipient holds the lease, the mechanism no longer depends on network communication at all, and it keeps working even if the network is completely cut off.

Second, the lease mechanism tolerates node crashes well. If the issuer crashes, the crashed issuer normally cannot change its earlier promise, so the correctness of the lease is not affected. After the issuer recovers, if it can restore the previously issued lease information it can simply continue honoring its promises; if it cannot, it only needs to wait one maximum lease period for all outstanding leases to expire, and the mechanism remains intact. For example, in the cache system of the previous section, once the server crashes the metadata certainly cannot be modified; after recovery, the server merely waits one maximum lease period, after which the cached copies on all nodes have been dropped. If the recipient crashes, the issuer needs no extra fault handling: it simply waits for the lease to expire and then takes back the promise, in practice revoking whatever permissions or identity it had granted.

Finally, the lease mechanism does not depend on persistent storage. The issuer may persist the leases it has issued so that they remain valid across a crash, but that is only an optimization; as in the analysis above, even without persistence the issuer can invalidate all previously issued leases by waiting one maximum lease period.

The lease mechanism does depend on the validity period, which requires the issuer's and recipient's clocks to be synchronized. If the issuer's clock is slower than the recipient's, the issuer still considers a lease valid after the recipient has already treated it as expired; the recipient can handle this by renewing the lease some time before it expires. If the issuer's clock is faster than the recipient's, the issuer considers a lease expired while the recipient still treats it as valid; the issuer might then issue the lease to another node, breaking its promise and affecting the correctness of the system. The usual practice for this kind of clock skew is to make the issuer's notion of the validity period slightly longer than the recipient's, by more than the clock error, so that skew cannot affect the validity of the lease.

Using leases to determine node state. Distributed protocols often rely on a globally consistent view of node state: for example, once node Q considers node A failed, node A must also consider itself failed and stop acting as primary, to avoid the "dual master" problem. There are two ways to achieve this.
The first is to design the distributed protocol so that it tolerates "dual master" errors, that is, it does not depend on a globally consistent view of node state, or the globally consistent view is itself the outcome of a global negotiation; the second is to use the lease mechanism. The first approach means abandoning the centralized design in favor of a decentralized one, which is beyond the scope of this section; the following focuses on using leases to determine node state. The central node issues leases to the other nodes; a node that holds a valid lease is considered able to serve normally. In example 2.3.1, nodes A, B and C still send heartbeats periodically to report their state; on receiving a heartbeat, node Q replies with a lease, confirming the node's state and allowing it to work normally during the lease's validity period. Node Q can give the primary node a special lease indicating that it may act as primary. When Q wants to switch to a new primary, it only needs to wait for the previous primary's lease to expire, after which it can safely issue a primary lease to the new node without risking a "dual master" situation.

In a real system, having a single central node issue leases is itself risky: if that node crashes or the network fails, every node loses its lease and the system becomes highly unavailable. Real systems therefore use a small cluster of mutually replicated central nodes to provide the lease-issuing service with high availability; Chubby and ZooKeeper are both based on this kind of design.

Choosing the lease period. In practice a lease period of 10 seconds is a commonly chosen, well-proven value; it can serve as a reference when choosing a suitable period for a given system.

2.4 Quorum Mechanism

Let us adopt the following conventions: update operations (writes) form a sequential series whose order is determined by some other mechanism (for example, by the primary in a primary-secondary architecture). Each update is denoted wi, where i is the monotonically increasing sequence number of the update. After each wi executes successfully the replica data changes; we call the result a new data version and denote it vi. Assume each replica keeps all historical versions of the data.

write-all-read-one. Write-all-read-one (WARO) is the simplest replica control rule. As the name suggests, a write must update all replicas, and the update is considered successful only if it succeeds on every replica, so all replicas stay consistent and a read can use the data on any single replica. Since an update can only succeed if it succeeds on all N replicas, the failure of any one replica makes updates fail, so for the update service the system cannot tolerate the failure of even one replica despite having N of them. On the other hand, as long as one of the N replicas is healthy the system can serve reads, so for the read service the system tolerates N−1 replica failures. In short, WARO gives the read service high availability but the update service low availability;
even though replicas are used, the availability of the update service is no better than with no replicas at all.

Definition of Quorum. Under the quorum mechanism, an update wi is called a "successfully committed update", and the corresponding data a "successfully committed version", once it has succeeded on W of the N replicas. If we require W + R > N (that is, R > N − W), then when reading data it suffices to read R replicas to be sure of seeing the new version vi: since wi succeeded on W replicas and W + R > N, any set of R replicas must intersect the set of W replicas on which the update succeeded, so reading R replicas is guaranteed to encounter vi. The principle can be pictured with a Venn diagram (Figure 2-10). Suppose a system has 5 replicas with W = 3 and R = 3, the 5 replicas initially all hold v1, and an update w2 succeeds on the first 3 replicas, so the replicas become (v2 v2 v2 v1 v1); then any 3 replicas are guaranteed to include v2. Setting W = N and R = 1 yields WARO, so WARO is a special case of the quorum mechanism.

As with WARO, we can analyze the availability of the quorum mechanism; restrict the parameters to W + R = N + 1. Since an update must succeed on W replicas to be committed, the update service becomes unavailable once N − W + 1 replicas fail. On the other hand, once N − R + 1 replicas fail, it can no longer be guaranteed that a set of replicas intersecting the W written replicas can be read, and the consistency of the read service degrades.

It must be stressed again that the quorum mechanism alone does not guarantee strong consistency, because with the quorum mechanism alone the latest successfully committed version cannot always be determined, unless that version number is managed as metadata by a dedicated metadata server or metadata cluster. The cases in which the latest committed version can be determined by the quorum mechanism alone are discussed below. The three parameters N, W and R control the availability of the system and are also the system's service promise to its users: the data has at most N replicas, and once an update has succeeded on W replicas, success is returned to the user. A quorum system with high consistency requirements should additionally promise never to return data that has not been successfully committed, that is, to return only data that has succeeded on W replicas.

Reading the latest successfully committed data. The quorum mechanism only requires an update to succeed on W of the N replicas, and reading R replicas is guaranteed to include the latest committed version; but because of updates that did not complete, reading R replicas by itself may not reveal which version is the latest committed one. For a strongly consistent quorum system, the rule is: among the R replicas read, take the highest version; if it appears on X replicas with X < W, keep reading further replicas until W copies of that version have been found, in which case it is the latest successfully committed version; if, after all replicas have been examined, that version still appears on fewer than W replicas, then the second-highest version among the R replicas read is the latest successfully committed version.
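A toy simulation of the W/R rule just described (in-memory "replicas", a single writer, no failures; the parameters follow the five-replica example above):

```python
import random

N, W, R = 5, 3, 3          # W + R > N, so read and write sets must intersect

# Each replica stores its highest data version as (version, value).
replicas = [(1, "v1-data") for _ in range(N)]

def write(version, value):
    """A write is 'successfully committed' once it lands on W replicas."""
    targets = random.sample(range(N), W)
    for i in targets:
        replicas[i] = (version, value)
    return targets

def read():
    """Read R replicas and return the highest version seen among them.
    Because any R replicas intersect the W replicas of a committed write,
    the highest committed version is always seen - although, as discussed
    above, an uncommitted higher version may be seen as well."""
    sample = random.sample(range(N), R)
    return max(replicas[i] for i in sample)

write(2, "v2-data")
print(read())   # always reports version >= 2 once v2 is committed on W replicas
```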
For example, suppose the R replicas read are (v2 v1 v1), and the remaining replicas are then read as well. If the remaining two replicas are (v2 v2), then v2 is the latest committed version; if they are (v2 v1) or (v1 v1), then v1 is the latest committed version; if reading the remaining two replicas fails or times out, it is impossible to determine which version is the latest committed one. It follows that when the quorum mechanism is used on its own, determining the latest committed version may require reading up to R + (W − 1) = N replicas, and the ability to do so may be lost as soon as any single replica fails. In practice, other techniques should be used wherever possible to avoid having to determine the latest committed version through the quorum mechanism itself; for example, when the quorum mechanism is combined with a primary-secondary control protocol, the latest committed data can be obtained simply by reading the primary.

Selecting the primary replica with the quorum mechanism. When reading, different methods can be used depending on the consistency requirement: if strong consistency is required and the latest committed data must be read immediately, one can either read only the primary replica or read as described in the previous paragraphs; if session consistency is required, replicas can be chosen selectively based on the version numbers the session has already seen; if only weak consistency is required, any replica can be read.

In a primary-secondary protocol, when the primary fails a new primary must be chosen, after which the secondaries synchronize with it. Normally the choice of the new primary is made by a central node. After introducing the quorum mechanism, the usual method mirrors the read procedure: the central node reads R replicas and picks the one with the highest version number among them as the new primary; the new primary then synchronizes with at least W replicas before serving reads and writes as primary. The replica with the highest version among the R read is guaranteed to contain the latest committed data; and although it cannot be known whether that highest version was itself successfully committed, the new primary's subsequent synchronization with the secondaries brings the number of replicas holding that version up to W, which makes it a successfully committed version.

Example: in a system with N = 5, W = 3, R = 3, suppose the replicas' highest versions are (v2 v2 v1 v1 v1). Then v1 is the latest successfully committed version and v2 is an uncommitted, intermediate version. Suppose the original primary fails at this moment and the central node performs a primary switch. Whether this kind of "intermediate" data is discarded as dirty data or becomes effective data after synchronization depends entirely on whether it happens to take part in the election of the new primary. The two cases are analyzed below. First (Figure 2-12), if the central node successfully contacts three replicas and the versions it reads are (v1 v1 v1), then any of these replicas may be selected as the new primary.
The new primary takes v1 as the latest committed version and synchronizes the other replicas to it. When synchronizing the first and second replicas, their version number is higher than the primary's, so it is dirty data and is handled by the methods for dealing with dirty data introduced in Section 2.2.2.4. In practice, the new primary may also start serving after synchronizing with the last two replicas and then bump its version to v2; but if the system cannot guarantee that the new v2 is identical to the earlier v2, then when synchronizing with the first and second replicas it must compare not only version numbers but also the actual content of the update operations. Second, if the central node contacts a different set of three replicas and reads (v2 v1 v1), it selects the replica with version v2 as the new primary; once the new primary has synchronized with two other replicas, the number of replicas holding v2 reaches W and v2 becomes the latest successfully committed version, and the new primary can serve reads and writes normally.

2.5 Log Technology

Log technology is one of the main techniques for crash recovery. It originated in database systems. Strictly speaking, logging is not a distributed-systems technique, but in distributed practice it is widely used for crash recovery, and systems such as BigTable even store their logs in a distributed system, further strengthening fault tolerance.

Redo log and checkpoint. Consider designing a high-speed single-machine query system that keeps all data in memory for fast queries, where each update operation modifies a small piece of data (for example one key in a key-value store). The problem is to use log technology to make this in-memory query system recoverable after a crash. Unlike database transactions, every successful update in this problem model takes effect immediately; this is equivalent to a database in which every transaction contains a single update and every update is committed immediately (auto commit).

Redo Log
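A toy key-value store makes the process concrete (the file name and record format are illustrative):

```python
import json

LOG_PATH = "redo.log"    # illustrative path

data = {}                # in-memory store

def update(key, value):
    # 1. Append the *result* of the update to the redo log first ...
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps({"key": key, "value": value}) + "\n")
        f.flush()
    # 2. ... and only then modify the in-memory data, so that anything a
    #    reader can observe is already recoverable from the log.
    data[key] = value

def recover():
    # Crash recovery: replay every logged update result from the beginning.
    data.clear()
    try:
        with open(LOG_PATH) as f:
            for line in f:
                entry = json.loads(line)
                data[entry["key"]] = entry["value"]
    except FileNotFoundError:
        pass
```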
As this process shows, the redo log records the result of an update operation after it has been applied (one of the differences from the undo log, which is not discussed in this article), and since it appends to the log file sequentially, it is very efficient on storage devices such as disks that favor sequential writes. Crash recovery with a redo log is equally simple: the log only has to be replayed.

Process 2.5.2: crash recovery with the redo log. Read the result of each update operation from the log file from the beginning and apply these results to the data in memory.

The recovery process also shows that only updates already written to the log file can be recovered after a crash, which is why the redo-log procedure writes the log first and updates the in-memory data afterwards. If memory were updated first, a user could immediately read the updated data; if the system then crashed between modifying memory and writing the log, the last update could not be recovered, yet the user might already have read it, causing an inconsistency.

Checkpoint. In this simplified model, the checkpoint technique consists of dumping the in-memory data to disk in an easily reloadable form, thereby reducing the amount of log data that has to be replayed during crash recovery.

Process: checkpoint
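Continuing the toy store above, the dump step can be sketched as follows (the marker format is an illustrative choice):

```python
import json

LOG_PATH = "redo.log"
DUMP_PATH = "checkpoint.dump"     # illustrative paths

def checkpoint(data):
    with open(LOG_PATH, "a") as log:
        # 1. Record that a checkpoint has begun; updates may continue meanwhile,
        #    which is why recovery later replays the log from this marker on.
        log.write(json.dumps({"mark": "begin_check_point"}) + "\n")
        log.flush()
        # 2. Dump the in-memory data to disk in an easily reloadable form.
        with open(DUMP_PATH, "w") as f:
            json.dump(data, f)
        # 3. Record that the dump completed successfully.
        log.write(json.dumps({"mark": "end_check_point"}) + "\n")
```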
Process: crash recovery based on a checkpoint. Load the data dumped to disk back into memory; scan the log file backwards to find the last "End Check Point" record; find the closest "Begin Check Point" record before it, and replay all update records written after that record.

No Undo/No Redo log (0/1 directories). Suppose the data is maintained on disk and a batch of updates consists of several update operations that must take effect atomically, that is, either all of them take effect or none of them does. The 0/1 directory technique uses two directory structures, called directory 0 and directory 1, plus a further structure called the master record; the directory currently in use is the active directory. The master record stores either "use directory 0" or "use directory 1", and the directory records the location of each piece of data in the log file. Data updates in the 0/1 directory scheme are always performed on the inactive directory; just before the data is to take effect, the 0/1 value in the master record is flipped, switching the active directory.

Process: 0/1 directory data update
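A file-based sketch of the idea (the switch of the single master record is assumed atomic here; real systems obtain that guarantee with a rename or a single-sector write):

```python
import json, os

MASTER = "master_record"          # stores a single character, "0" or "1"
DIRS = ["dir_0.json", "dir_1.json"]

def init():
    if not os.path.exists(MASTER):
        for d in DIRS:
            with open(d, "w") as f:
                json.dump({}, f)
        with open(MASTER, "w") as f:
            f.write("0")

def active_index() -> int:
    with open(MASTER) as f:
        return int(f.read().strip())

def read_directory() -> dict:
    with open(DIRS[active_index()]) as f:
        return json.load(f)

def batch_update(changes: dict):
    # 1. Build the new directory contents on the *inactive* side only.
    new_dir = read_directory()
    new_dir.update(changes)
    inactive = 1 - active_index()
    with open(DIRS[inactive], "w") as f:
        json.dump(new_dir, f)
        f.flush()
        os.fsync(f.fileno())
    # 2. Flip the master record; the whole batch becomes visible at once, or
    #    (if we crash before this point) not at all.
    with open(MASTER, "w") as f:
        f.write(str(inactive))

init()
batch_update({"record-1": "offset 0", "record-2": "offset 128"})
print(read_directory())
```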
The update process of the 0/1 directory is very simple, and the switch of the master record between directories 0 and 1 makes the whole batch of modifications take effect atomically. The 0/1 directory reduces the atomicity of a batch of operations to the atomic switch of the master record: atomically modifying many records is generally hard, whereas atomically modifying a single record can usually be implemented, which lowers the difficulty of the problem. In engineering the 0/1-directory idea is widely used and is not limited to the form above; it can switch back and forth between two data structures in memory, or between two file directories on disk.

2.6 Two-Phase Commit Protocol

The two-phase commit protocol is a classic strongly consistent, centralized replica control protocol. Although the protocol has many problems in engineering practice, studying it gives a good understanding of several typical issues in distributed systems.

Process description. Two-phase commit is a typical "centralized replica control" protocol, in which the participating nodes are divided into two classes: a central coordinator and N participants; each participant manages a database replica. The idea of two-phase commit is simple: in the first phase the coordinator asks all participants whether they can commit the transaction (asks them to vote), and all participants send their votes to the coordinator; in the second phase the coordinator decides, based on all the votes, whether the transaction can be committed globally, and notifies all participants to carry out that decision. During a two-phase commit, participants may not change their votes. The precondition for a global commit is that every participant agrees to commit; if any one participant votes to abort, the transaction must be aborted.

Process: two-phase commit, coordinator
Process: two-phase commit, participant
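A compact simulation of this flow (no timeouts, logging or recovery, which is precisely where the engineering difficulties discussed below lie; all names are illustrative):

```python
from enum import Enum

class Vote(Enum):
    COMMIT = "commit"
    ABORT = "abort"

class Participant:
    """Holds one replica; votes in phase one and carries out the decision."""
    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit
        self.state = "initial"

    def prepare(self) -> Vote:
        # Phase 1: vote; conceptually the vote is logged locally, after which
        # the participant may no longer change it.
        self.state = "prepared" if self.can_commit else "abort-voted"
        return Vote.COMMIT if self.can_commit else Vote.ABORT

    def decide(self, decision: Vote):
        # Phase 2: carry out the coordinator's global decision.
        self.state = "committed" if decision is Vote.COMMIT else "aborted"

class Coordinator:
    def __init__(self, participants):
        self.participants = participants

    def run_transaction(self) -> Vote:
        votes = [p.prepare() for p in self.participants]          # phase 1
        decision = Vote.COMMIT if all(v is Vote.COMMIT for v in votes) else Vote.ABORT
        for p in self.participants:                               # phase 2
            p.decide(decision)
        return decision

parts = [Participant("A"), Participant("B"), Participant("C", can_commit=False)]
print(Coordinator(parts).run_transaction())   # ABORT: one participant voted no
```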
Exception handling

Crash recovery
Protocol analysis. The two-phase commit protocol is rarely used in engineering practice, mainly for the following reasons:
Although there are improved two-phase commit protocols that increase fault tolerance and performance, they remain rarely used in engineering, and their theoretical value outweighs their practical significance.

2.7 MVCC

MVCC (Multi-Version Concurrency Control) was originally proposed in database systems, but the idea is not limited to stand-alone systems; it is just as effective in distributed systems. MVCC is a technique for concurrency control over multiple versions of data. Its basic idea is to generate a new version of the data for each transaction; by choosing which version to read, a transaction observes a complete, consistent view of the data. With MVCC each transaction updates the data based on an effective base version, and transactions can proceed in parallel, which gives rise to a graph structure.

For example, the base data starts at version 1, and two transactions, A and B, start concurrently. Each makes some local modifications to the data (visible only to the transaction itself, without affecting the real data). Transaction A commits first, producing data version 2; based on version 2, transaction C starts and then commits, producing version 3; finally transaction B commits, and its result must be merged with the result of transaction C. If there is no conflict, that is, transaction B did not modify the variables modified by transactions A and C, then B can commit; otherwise B's commit fails.

The flow of MVCC is very similar to that of version control systems such as SVN; indeed such version control systems use the MVCC idea. When a transaction makes local modifications based on a base version without affecting the real data, there are usually two approaches: one is to copy the data of the base version completely and then modify it, which is what SVN does (check-out is the copy); the other is for each transaction to record only its update operations rather than complete data, and to apply those operations to the base version when reading in order to compute the result, which resembles SVN's incremental commits.

2.8 Paxos Protocol

The Paxos protocol is one of the few decentralized distributed protocols with strong consistency and high availability that has been proven in engineering practice. Its process is relatively complex, but its basic idea is not hard to grasp and resembles voting in human society. In the Paxos protocol there is a set of completely peer-to-peer participating nodes (called acceptors), each of which decides on a proposal; a resolution takes effect once it is approved by more than half of the nodes. As long as more than half of the nodes are healthy, the Paxos protocol keeps working and can withstand anomalies such as crashes and network partitions.
Roles. Proposer: proposes a value for the acceptors to approve. Acceptor: votes on proposals; a value is approved once more than half of the acceptors accept it. Learner: learns which value has been approved. Recall the quorum mechanism (Section 2.4): the similarity is easy to see. A value needs the approval of W = N/2 + 1 acceptors, so a learner needs to read at least N/2 + 1 acceptors, and after reading the results of at most N acceptors it can learn a value that has been passed. The three roles are only a logical division; in practice one node can play all three roles at the same time.

Process. The Paxos protocol proceeds round by round, each round having a number. In each round a value may or may not be approved. If a value is approved in some round, then every later round can only approve that same value. Those rounds together form one Paxos protocol instance, that is, one Paxos instance can approve only one value; this is an important manifestation of the strong consistency of the Paxos protocol. Each round is divided into two phases, a preparation phase and an approval phase, in which the proposer and the acceptors each follow their own procedures.

Process: proposer (preparation phase)
Process: acceptor (preparation phase)
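A single-instance sketch in the same spirit (simplified: no learners, no message loss, all calls are local; the variable names loosely follow the B, V and b used in the example below):

```python
class Acceptor:
    """Each acceptor remembers the highest round it has promised (B) and the
    value it has accepted (V) together with the round of acceptance (VB)."""
    def __init__(self):
        self.B = 0          # highest round number promised so far
        self.VB = 0         # round in which V was accepted
        self.V = None       # accepted value, if any

    def prepare(self, b):
        if b > self.B:
            self.B = b
            return True, self.VB, self.V      # promise + previously accepted value
        return False, None, None

    def accept(self, b, v):
        if b >= self.B:
            self.B, self.VB, self.V = b, b, v
            return True
        return False

def propose(acceptors, b, my_value):
    """One round of single-instance Paxos for a proposer using round number b."""
    quorum = len(acceptors) // 2 + 1
    # Preparation phase: collect promises from a majority.
    promises = [a.prepare(b) for a in acceptors]
    granted = [(vb, v) for ok, vb, v in promises if ok]
    if len(granted) < quorum:
        return None                            # need to retry with a larger b
    # If any acceptor already accepted a value, adopt the one accepted in the
    # highest round; this is what makes an approved value impossible to change.
    prev_round, prev_value = max(granted, key=lambda t: t[0])
    value = prev_value if prev_value is not None else my_value
    # Approval phase: ask the acceptors to accept (b, value).
    if sum(a.accept(b, value) for a in acceptors) >= quorum:
        return value                           # value approved by a majority
    return None

acceptors = [Acceptor() for _ in range(5)]     # the 5-acceptor example below
print(propose(acceptors, b=1, my_value="v1"))  # -> 'v1'
print(propose(acceptors, b=2, my_value="v2"))  # -> 'v1': the approved value sticks
```

Running the sketch shows the property discussed next: once "v1" has been approved by a majority, a later proposer with a higher round number can only get "v1" approved again.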
Example. The basic example has 5 acceptors and 1 proposer, with no network or crash anomalies. We focus on how the variables B and V change on each acceptor and how the variable b changes on the proposer.
Within one Paxos instance the approved value cannot be changed; even if a later proposer runs the protocol with a higher round number, the value stays the same. The core of the Paxos protocol is that "an approved value cannot be changed", which is the basis of the correctness of the whole protocol. The Paxos protocol was deliberately designed, and its design process is also its derivation. Paxos uses the quorum mechanism with W = R = N/2 + 1. Put simply, the protocol is a process in which proposers update the acceptors; once a proposer succeeds in updating more than half of the acceptors, the update has succeeded. Learners read the acceptors quorum-style: once a value is successfully read from more than half of the acceptors, it is an approved value. The protocol introduces rounds so that higher-round proposals take precedence over lower-round ones and deadlock is avoided. The crux of the design is how to satisfy the constraint that "only one value is approved within one Paxos instance".

2.9 CAP

The definition of the CAP theory is very simple; the three letters of CAP stand for three mutually conflicting properties of distributed systems: Consistency (consistency of the replicas), Availability (of the service), and Partition tolerance (tolerance of network partitions).
CAP theory states that it is impossible to design a distributed protocol that fully possesses all three CAP properties at once, that is, a protocol under which 1) the replicas are always strongly consistent, 2) the service is always available, and 3) the protocol tolerates any network partition anomaly; a distributed system protocol can only make trade-offs among the three. The second law of thermodynamics shows that a perpetual motion machine cannot exist, so one should not attempt to design one. Similarly, the significance of CAP theory is to state explicitly that one should not attempt to design a perfect system that fully possesses all three CAP properties, because such a system has been proven, in theory, not to exist.