Relationship-aware routing and global traffic scheduling · SOSP 2019


"Read the Papers" is a series of articles analyzing papers in the field of computer and software engineering. In each article in this series, we will read a paper from top conferences such as OSDI and SOSP. We will not introduce all the details here, but will screen the key content in the paper. If you are very interested in the relevant papers, you can directly click on the link to read the original text.

This article is about a paper published at the SOSP 2019 conference, Taiji: Managing Global User Traffic for Large-Scale Internet Services at the Edge[^1]. Taiji, introduced in the paper, is Facebook's system for dynamically managing user traffic to large-scale Internet services. The system balances resource utilization across data centers and reduces the network latency of user requests.

Taiji uses the relationships between users to route groups of strongly connected users to the same data center; this relationship-aware routing strategy reduces backend storage load by 17%. Most services use a static mapping to route user requests from edge nodes to data centers, but static mapping struggles at the scale of global data centers. The differing loads and peak hours of edge nodes in different regions bring the following challenges:

  • Capacity crunch — if demand for a service exceeds the procurement plan, we may not be able to serve all user requests, and static scheduling can leave some data centers short of resources while others sit over-provisioned;
  • Product heterogeneity — as products evolve, we may improve the interactive user experience; when requests require the user's device to maintain a fixed session with a data center, static mapping makes user traffic difficult to manage;
  • Hardware heterogeneity — evolution of the underlying infrastructure changes the amount of traffic a data center can handle, for example through server refreshes, capacity increases and decreases, and performance improvements in network equipment;
  • Fault tolerance — as infrastructure grows more complex, edge nodes and data centers become more likely to suffer network errors, power outages, and software faults, and static mapping makes it difficult to respond quickly and mitigate failures.

To solve these problems, Facebook's engineers designed a global user traffic management system that treats routing from edge nodes to data centers as an assignment problem. A constraint optimizer generates a routing table that determines which data center processes each portion of user requests, and edge nodes forward requests according to that table.
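
To make the shape of this assignment problem concrete, here is a minimal sketch in Python; the edge names, capacities, and latency numbers are invented for illustration, and the paper's optimizer is of course far more sophisticated:

```python
# Inputs the optimizer reads (illustrative numbers, not from the paper).
edge_traffic = {"edge_a": 100.0, "edge_b": 60.0}   # traffic per edge node
dc_capacity = {"dc_1": 90.0, "dc_2": 120.0}        # safe capacity per data center
latency_ms = {                                      # edge -> data center latency
    ("edge_a", "dc_1"): 20, ("edge_a", "dc_2"): 55,
    ("edge_b", "dc_1"): 45, ("edge_b", "dc_2"): 25,
}

# Output: the routing table, of the form {edge: {datacenter: fraction}}.
# Here edge_a keeps 90% of its traffic on low-latency dc_1 and spills the
# rest to dc_2 so that dc_1's capacity of 90 is not exceeded.
routing_table = {
    "edge_a": {"dc_1": 0.9, "dc_2": 0.1},
    "edge_b": {"dc_2": 1.0},
}

# The fractions for each edge must sum to one.
for edge, split in routing_table.items():
    assert abs(sum(split.values()) - 1.0) < 1e-9
```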

Architecture

In a mature content delivery network (CDN), edge nodes can handle most user requests, serving static resources quickly and at low cost. Distribution and traffic scheduling for static resources have relatively mature solutions. Dynamic content, however, usually has to be processed by data centers, which have far more computing and storage resources, and balancing load across data centers quickly and stably is a comparatively complex problem. Taiji uses the following architecture to control traffic from edge nodes to data centers and solve this problem:

Figure 1 - Taiji Architecture

The Taiji system consists of two main components, the runtime and the traffic pipeline, which play the following roles in the system:

  • The runtime generates a routing table based on traffic load, data center utilization, latency between edge nodes and data centers, and service-level policies;
  • The traffic pipeline translates the routing table into fine-grained rules, using relationship-aware routing to generate routing configurations that the load balancers on edge nodes follow to forward user requests to specific data centers.

As shown in the figure above, the load balancer on an edge node parses each user request, maps the user to the appropriate bucket, and then forwards the request to the data center associated with that bucket according to the routing configuration.

Runtime

The runtime component is responsible for generating a routing table that specifies what fraction of user traffic each edge node forwards to each data center. The routing table consists of a set of entries of the following form:

  {edge:{datacenter:fraction}}

To generate a routing table of this form, Taiji's runtime reads two types of dynamic data from the monitoring system:

  • Operational state of the infrastructure, such as the capacity, health, and utilization of edge nodes and data centers;
  • Measurement data, such as the traffic volume at edge nodes and the access latency from edge nodes to data centers.

All of this data arrives as unprocessed raw data. The system introduces a read-layer abstraction to read, normalize, and aggregate the raw data, which reduces the workload of the rest of the runtime and makes the problem easier to model:
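
A minimal sketch of what such a read layer might do, assuming the monitoring system emits per-server (datacenter, used, capacity) samples; the function name and input shape are our inventions, and the paper's read layer is richer than this:

```python
from collections import defaultdict

def normalize_utilization(raw_samples):
    """Aggregate raw per-server samples into per-datacenter utilization."""
    used = defaultdict(float)
    capacity = defaultdict(float)
    for dc, u, c in raw_samples:
        used[dc] += u
        capacity[dc] += c
    return {dc: used[dc] / capacity[dc] for dc in capacity}

samples = [("dc_1", 40, 100), ("dc_1", 55, 100), ("dc_2", 30, 100)]
print(normalize_utilization(samples))  # {'dc_1': 0.475, 'dc_2': 0.3}
```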

Figure 2 - Runtime architecture

Taiji's runtime measures the traffic of stateless services in requests per second (RPS) and the traffic of stateful services in user sessions. This model allows stateless traffic to be routed to any available data center, while stateful traffic is pinned to the data center that holds its session so that established sessions are not broken.
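
As a minimal sketch of that constraint (the function and its arguments are invented for illustration, assuming each stateful session records the data center that owns it):

```python
def can_route(service_kind, target_dc, session_dc=None):
    if service_kind == "stateless":   # load measured in requests per second
        return True                    # any available data center will do
    # Stateful load is measured in user sessions and stays put.
    return target_dc == session_dc

assert can_route("stateless", "dc_2")
assert not can_route("stateful", "dc_2", session_dc="dc_1")
```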

In each time period (epoch), Taiji re-evaluates the global traffic allocation based on the resource utilization of the data centers. During re-evaluation, the system repeatedly applies the following step to move routing toward the service-specified policy: swap one unit of traffic between a pair of data centers, keep the swap that yields the best result, and iterate until no further improvement is possible. In addition, to keep the whole system stable, the amount of traffic moved per epoch is bounded, so that no single data center experiences large swings that ripple through the system.
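
The following sketch illustrates that swap-based local search; the objective function (variance of data center utilization) and the traffic unit are our stand-ins, not the paper's actual policy objective:

```python
def imbalance(load, capacity):
    """Objective: variance of utilization across data centers (assumed)."""
    utils = [load[dc] / capacity[dc] for dc in load]
    mean = sum(utils) / len(utils)
    return sum((u - mean) ** 2 for u in utils)

def rebalance(load, capacity, unit=1.0, max_swaps=100):
    """Move one unit of traffic between a pair of data centers per step,
    keeping the best-improving swap, until no swap improves the objective.
    max_swaps also caps per-epoch churn, for stability."""
    for _ in range(max_swaps):
        best, best_score = None, imbalance(load, capacity)
        for src in load:
            for dst in load:
                if src == dst or load[src] < unit:
                    continue
                load[src] -= unit; load[dst] += unit   # try the swap
                score = imbalance(load, capacity)
                load[src] += unit; load[dst] -= unit   # undo it
                if score < best_score:
                    best, best_score = (src, dst), score
        if best is None:
            break  # local optimum reached
        load[best[0]] -= unit
        load[best[1]] += unit
    return load

print(rebalance({"dc_1": 90.0, "dc_2": 30.0},
                {"dc_1": 100.0, "dc_2": 100.0}))  # converges to 60/60
```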

Traffic pipeline

The traffic pipeline consumes the routing table produced by the runtime and uses relationship-aware routing to generate concrete routing rules in configuration files. Once new rules are generated, the module synchronizes them to all edge load balancers through a Configuration Management Service (CMS); the whole process takes only about one minute.

Relationship-aware routing builds on a locality property of user traffic: strongly connected users tend to request the same content, which benefits data caching and other backend systems. It is precisely this property that makes it worthwhile to route highly connected users to the same data center, reducing system latency.

Figure 3 - Relationship-aware routing

As shown in the figure above, relationship-aware routing divides users into buckets according to the communities they belong to. It uses a balanced tree to keep the number of users in each bucket similar while maximizing the degree of connection among users within a bucket. Bucket size has a significant impact on system performance: buckets that are too small may split large communities and thereby reduce cache effectiveness, while buckets that are too large improve cache effectiveness but reduce the precision with which traffic can be routed. Taiji sizes each bucket to carry about 0.01% of global traffic, striking a balance between cache effectiveness and routing precision.
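
For scale: at 0.01% of global traffic per bucket, that implies on the order of 1 / 0.0001 = 10,000 buckets worldwide; this is our back-of-the-envelope arithmetic, not a figure quoted in the paper.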

Figure 4 - Users, buckets, and data centers

The system not only needs to divide users into buckets, it also needs to distribute those buckets across data centers, which requires a very large amount of computation. To make relationship-aware routing tractable, the traffic pipeline balances locality and stability by splitting the work into two parts:

  • Offline task — assigning users to buckets;
  • Online task — assigning buckets to data centers.

This approach increases the share of traffic from the same community that lands in the same data center from 55% to 75%, which significantly improves the resource utilization of backend services. Next, we introduce these two tasks and the request-handling flow of edge load balancing.

Offline module

The community hierarchy is represented as a complete binary tree. The leaf nodes of the tree are buckets of users, and each internal node represents all the buckets in its subtree. The tree is constructed using social hashing; because construction runs offline and is extremely computation-intensive, the system refreshes the tree only on a weekly cadence.
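
Below is a toy sketch of building such a balanced hierarchy; note that real social hashing chooses each split to minimize the number of social edges cut across partitions, whereas this sketch simply splits the user list in half:

```python
def build_hierarchy(users, bucket_size):
    """Recursively bisect `users` into a complete binary tree whose
    leaves are buckets of at most `bucket_size` users."""
    if len(users) <= bucket_size:
        return {"bucket": users}          # leaf node: one bucket
    mid = len(users) // 2                 # naive split; social hashing
    return {                              # would minimize cut edges here
        "left": build_hierarchy(users[:mid], bucket_size),
        "right": build_hierarchy(users[mid:], bucket_size),
    }

tree = build_hierarchy([f"user_{i}" for i in range(8)], bucket_size=2)
```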

Although segmenting users with a binary tree is effective, it does not fit every scenario. It works poorly in two situations:

  • Users accessing highly connected entities, such as stars and celebrities, where the interaction looks more like publish/subscribe than a social relationship;
  • One-off interactions, such as payments and other transient operations, which the system does not treat as relationships between users.

Online module

The online module of the traffic pipeline assigns buckets to appropriate data centers based on the tree structure generated by the offline module, the traffic of each bucket, and the capacity of each data center. The process tries to assign groups of adjacent buckets to the same data center to improve the locality of data access.

Figure 5 - Buckets and segments

We can group the buckets of the tree into segments of 2^h consecutive buckets, where h is the height within the community hierarchy at which segments are taken. The online module then assigns traffic to data centers segment by segment. As shown in the figure above, if segment 2 is assigned to a data center, then buckets 1 and 2 beneath it are assigned to that data center. The choice of segment height must weigh stability against locality: larger segments keep more closely related buckets together and thus improve locality, but they are less stable, because any reassignment moves a large chunk of traffic between data centers at once.
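
A toy sketch of this segment grouping in Python; the helper and the example values are ours, and real Taiji assigns segments subject to data center capacity rather than just enumerating them:

```python
def segments(buckets, height):
    """Group the leaf buckets of a complete binary tree into segments
    of 2**height consecutive buckets."""
    size = 2 ** height
    return [buckets[i:i + size] for i in range(0, len(buckets), size)]

buckets = list(range(8))      # leaves of a tree of depth 3
print(segments(buckets, 1))   # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(segments(buckets, 2))   # [[0, 1, 2, 3], [4, 5, 6, 7]]
```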

Edge load balancing

Since edge nodes are responsible for forwarding user requests to specific data centers, every edge node needs routing state to determine which data center handles the current request. However, storing a full user-to-data-center routing table would require a very large amount of memory, so edge nodes store only the bucket-to-data-center mapping. When a user sends a dynamic-content request to an edge node, the following steps are performed (see the sketch after this list):

  1. The initial request carries no bucket information, so the edge load balancer routes it to the nearest data center;
  2. That data center looks up the bucket the user belongs to and writes the bucket into a cookie;
  3. All subsequent requests carry the cookie, and the edge load balancer routes them directly to the specific data center responsible for that bucket.
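
A minimal sketch of the edge load balancer's decision, assuming the bucket travels in a cookie; the cookie name, the mapping values, and the helper are our inventions:

```python
bucket_to_dc = {0: "dc_1", 1: "dc_1", 2: "dc_2"}   # synced from the pipeline

def route_request(cookies, nearest_dc):
    bucket = cookies.get("taiji_bucket")
    if bucket is None:
        # First request: no bucket yet. Fall back to the nearest data
        # center, which will look up the user's bucket and set the cookie.
        return nearest_dc
    return bucket_to_dc[int(bucket)]

print(route_request({}, "dc_2"))                     # 'dc_2' (first visit)
print(route_request({"taiji_bucket": "1"}, "dc_2"))  # 'dc_1' (sticky bucket)
```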

Relationship-aware routing adds negligible overhead to the original request-processing path: the bucket-to-data-center mapping file is synchronized every five minutes and is only about 15KB, so its transmission cost on the network is almost negligible.
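
Back-of-the-envelope: with roughly 10,000 buckets (0.01% of traffic each, as noted above), a 15KB mapping file comes to well under two bytes per bucket entry, which suggests a very compact encoding; again, this is our estimate, not a figure from the paper.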

Summary

The Taiji system introduced in this paper adjusts the overall load of the system at the granularity of data centers. As a social-networking giant, much of Facebook's business revolves around social relationships between users, and the relationship-aware routing built on this system is tailored to that business. Precisely because highly connected users request similar content, relationship-based routing can route user requests to nearby, lightly utilized clusters, improving the system's access locality and reducing the consumption of computing resources.

This article is reprinted from the WeChat public account "Really No Logic". To reprint it, please contact that account.
