Difficulties and solutions faced by ONOS dynamic expansion

1. ONOS Consistency Guarantee

ONOS provides two kinds of consistency: eventual consistency and strong consistency. Eventual consistency is achieved through optimistic asynchronous replication combined with Gossip-based anti-entropy (entropy reduction). Optimistic asynchronous replication reaches eventual consistency efficiently, but when a node leaves the cluster or restarts, the replicas can drift further and further apart. Gossip-based anti-entropy addresses this problem: each node periodically (usually every three to five seconds) picks another node at random and synchronizes data with it. In most cases these exchanges are uneventful, because each controller already knows every event that has occurred in the network. When a controller's state drifts slightly, however, the mechanism quickly detects the difference and resynchronizes that controller. It also brings a newly added controller up to date quickly: its first anti-entropy exchange with an existing controller synchronizes it, with no need for a separate backup or discovery mechanism.
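The anti-entropy exchange described above can be illustrated with a minimal sketch. This is not ONOS code: the `Node` class, the timestamped key-value store, and the "newest timestamp wins" merge rule are all assumptions chosen for illustration, and the real three-to-five-second interval is replaced by a simple loop.

```python
import random


class Node:
    """A toy controller holding an eventually consistent key-value store."""

    def __init__(self, name):
        self.name = name
        # key -> (timestamp, value); a larger timestamp means a newer write
        self.store = {}

    def put(self, key, value, timestamp):
        """Apply a write only if it is newer than what is already stored."""
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def anti_entropy(self, peer):
        """Exchange state with one peer so both end up with the newest entries."""
        for key, (ts, value) in list(peer.store.items()):
            self.put(key, value, ts)
        for key, (ts, value) in list(self.store.items()):
            peer.put(key, value, ts)


def gossip_round(nodes, rng):
    """Each node picks one random peer and synchronizes with it."""
    for node in nodes:
        peer = rng.choice([n for n in nodes if n is not node])
        node.anti_entropy(peer)


def converged(nodes):
    return all(n.store == nodes[0].store for n in nodes)


rng = random.Random(0)
nodes = [Node("a"), Node("b"), Node("c")]
nodes[0].put("device:1", "up", timestamp=1)
nodes[1].put("link:1-2", "active", timestamp=2)

# A newly added controller starts empty and catches up through gossip alone.
nodes.append(Node("d"))
for _ in range(100):  # upper bound so the sketch always terminates
    gossip_round(nodes, rng)
    if converged(nodes):
        break
```

The new node "d" needs no dedicated bootstrap step here; it receives the full state the first time it takes part in an exchange with a node that already holds it.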

With dynamic expansion, adding nodes affects eventual consistency: a new node joins the cluster and then converges with the rest of the cluster through optimistic replication and anti-entropy exchanges with other nodes. The subsystems involved here include the Device and Link subsystems.

The Device and Link subsystems in turn affect the Topo (topology) subsystem. Therefore, when nodes are added dynamically, the impact is small as long as the newly added nodes carry no service workload until they have reached eventual consistency.

Strong consistency is ensured through the Raft algorithm. To balance fault tolerance and performance, ONOS combines partitioning with replicated backups.

Partitioning means that ONOS can partition any distributed primitive that supports strong consistency (mainly its distributed data structures), and each partition is replicated across multiple nodes for redundancy. This is a deliberate trade-off within the constraints of the CAP theorem.
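As a rough illustration of partitioning with replicated backups, the sketch below assigns three members to each partition and routes a key to a partition by hashing. The round-robin layout, the replica count of three, the CRC32 hash, and the node and key names are assumptions made for this example, not a description of how ONOS actually places partitions.

```python
import zlib


def build_partitions(nodes, num_partitions, replicas=3):
    """Assign `replicas` consecutive nodes (wrapping around) to each partition."""
    replicas = min(replicas, len(nodes))
    return {
        p: [nodes[(p + i) % len(nodes)] for i in range(replicas)]
        for p in range(num_partitions)
    }


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32, for illustration only)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


nodes = ["onos-1", "onos-2", "onos-3", "onos-4", "onos-5"]
partitions = build_partitions(nodes, num_partitions=5)

# Every write for this key goes to the members of a single partition.
pid = partition_for("flow:of:0000000000000001", len(partitions))
members = partitions[pid]
```

Because each key belongs to one partition, a write only needs agreement among that partition's members rather than the whole cluster, which is where the performance benefit comes from.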

2. ONOS Logical Clock

Clocks are an important concept in distributed systems. ONOS uses the mastership term together with a local event sequence number as its logical clock. The rationale is as follows:

  1. A network controller exists to control devices, so every network event can ultimately be associated with a device.
  2. The mastership term is globally consistent, so a clock built on it is highly reliable.
  3. A controller issues network events based on the information it receives from devices, and only the device's master actually emits the event. The master maintains a sequence number for the events reported by each device, which increases monotonically from 0 within each term.
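The three points above amount to ordering events by a (term, sequence number) pair. The following is a minimal sketch of that idea; `LogicalTimestamp` and `DeviceClock` are hypothetical names introduced here and do not correspond to ONOS classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class LogicalTimestamp:
    """Orders events by mastership term first, then by per-term sequence number."""
    term: int
    sequence: int


class DeviceClock:
    """Hypothetical per-device clock kept by the device's current master."""

    def __init__(self):
        self.term = 0
        self.next_sequence = 0

    def new_term(self, term):
        """A mastership change starts a new term and resets the sequence to 0."""
        if term <= self.term:
            raise ValueError("mastership terms must increase")
        self.term = term
        self.next_sequence = 0

    def tick(self):
        """Stamp the next event reported by the device."""
        stamp = LogicalTimestamp(self.term, self.next_sequence)
        self.next_sequence += 1
        return stamp


clock = DeviceClock()
clock.new_term(1)
first, second = clock.tick(), clock.tick()
clock.new_term(2)  # mastership moves to another controller
after_failover = clock.tick()
```

Even though the sequence number restarts at 0 after the mastership change, `after_failover` still sorts after `second`, because the term is compared first.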

3. Impact of Dynamic Expansion on Strong Consistency

Currently, most ONOS subsystems use strong consistency, including FlowRule, Host, and Mastership. Mastership is strongly consistent across the whole cluster, while the other subsystems are strongly consistent among the member nodes of a partition. The risk of an ONOS outage therefore depends on the number of members in each partition. If a partition has only three members, the failure of two of them leaves the partition without a majority, and it can no longer serve requests.
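The arithmetic behind this is the standard majority rule used by Raft. A small sketch, with function names chosen only for this example:

```python
def quorum_size(members):
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can fail while a majority is still alive."""
    return members - quorum_size(members)


def partition_available(members, failed):
    """A partition can make progress only while a majority is alive."""
    return members - failed >= quorum_size(members)


three_member_ok = partition_available(members=3, failed=1)
three_member_down = partition_available(members=3, failed=2)
five_member_ok = partition_available(members=5, failed=2)
```

A three-member partition tolerates one failure, and a five-member partition tolerates two, so larger partitions lower the outage risk at the cost of more replication work per write.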

When nodes are dynamically added to the cluster, the most important issue is preventing split-brain, in which two leaders appear in the cluster at the same time. This does not occur when the number of cluster nodes decreases, but it can occur when nodes are added, as shown in the following figure:

In the scenario shown in the preceding figure, if the new servers start with the new configuration while the old servers switch to the new configuration one by one, there is a period in which a majority under the new configuration and a majority under the old configuration coexist. Careless operation can then leave the cluster with two leaders, that is, a split-brain.

ONOS implements Raft with Copycat, which supports adding nodes dynamically. It does not use the two-phase (joint consensus) scheme described in the Raft paper; instead, it adds a single node at a time to avoid split-brain. This keeps the scheme simpler but makes the operation more tedious. In addition, writes should be avoided as far as possible while a newly added node is synchronizing data, so that read and write performance is not affected.
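Why adding one node at a time avoids split-brain can be checked by brute force on small clusters. The sketch below is an illustration only (the server names and helper functions are assumptions, and it says nothing about Copycat's internals): it searches for a majority of the old configuration and a majority of the new configuration that share no node.

```python
from itertools import combinations


def majorities(members):
    """All smallest majority subsets of a configuration."""
    size = len(members) // 2 + 1
    return [set(c) for c in combinations(sorted(members), size)]


def find_disjoint_majorities(old, new):
    """Return one (old_majority, new_majority) pair with no shared node, or None."""
    for old_q in majorities(old):
        for new_q in majorities(new):
            if not old_q & new_q:
                return old_q, new_q
    return None


old = {"s1", "s2", "s3"}

# Jumping straight from 3 to 5 servers: two independent majorities can exist.
jump = find_disjoint_majorities(old, old | {"s4", "s5"})

# Adding one server at a time: every old majority overlaps every new majority.
step1 = find_disjoint_majorities(old, old | {"s4"})
step2 = find_disjoint_majorities(old | {"s4"}, old | {"s4", "s5"})
```

In the direct jump, two old servers can elect one leader while three servers under the new configuration elect another. In each single-node step no such pair exists, so any two majorities share at least one server and at most one leader can be elected.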

4. ONOS Stateful Restart

Stateful restart is also very important in production environments. Most distributed data structures in ONOS support persistence; those that do not are mainly the eventually consistent ones. An ECMap must be configured with a persistence option to write its entries to disk; otherwise they are lost when the cluster is shut down.

Most distributed primitives (the strongly consistent ones), however, run on the Raft cluster and are persistent. ConsistentMap, ConsistentTreeMap, DocumentTree, DistributedSet, LeaderElector, and all Async* versions of these primitives use either a single Raft partition or all Raft partitions. These primitives are effectively backed by a persistent replicated log, which is read from the /data directory and replayed when the cluster restarts.
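The idea of rebuilding state by replaying a persistent log can be sketched in a few lines. This is a single-node toy, not the Raft log that ONOS uses: the `PersistentMap` class, the JSON-lines file format, and the file name are all assumptions made for illustration, and replication is left out entirely.

```python
import json
import os
import tempfile


class PersistentMap:
    """Toy map whose state is rebuilt by replaying an append-only log on disk."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()

    def _replay(self):
        """Re-apply every logged operation in order to restore the map."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                self._apply(json.loads(line))

    def _apply(self, entry):
        if entry["op"] == "put":
            self.data[entry["key"]] = entry["value"]
        elif entry["op"] == "remove":
            self.data.pop(entry["key"], None)

    def _append(self, entry):
        """Write the operation to disk before applying it in memory."""
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(entry) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self._apply(entry)

    def put(self, key, value):
        self._append({"op": "put", "key": key, "value": value})

    def remove(self, key):
        self._append({"op": "remove", "key": key})


with tempfile.TemporaryDirectory() as data_dir:
    path = os.path.join(data_dir, "partition-1.log")
    before = PersistentMap(path)
    before.put("host:1", "10.0.0.1")
    before.put("host:2", "10.0.0.2")
    before.remove("host:1")

    after_restart = PersistentMap(path)  # simulates a restart
    recovered = dict(after_restart.data)
```

After the simulated restart, the second instance holds only the entry that survived all logged operations, which is the behavior a stateful restart relies on.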
