"Read the Papers" is a series of articles analyzing papers in the field of computer and software engineering. In each article in this series, we will read a paper from top conferences such as OSDI and SOSP. We will not introduce all the details here, but will screen the key content in the paper. If you are very interested in the relevant papers, you can directly click on the link to read the original text. This article is about a paper published in the 2019 SOSP journal, Taiji: Managing Global User Traffic for Large-Scale Internet Services at the Edge[^1]. Taiji, introduced in the paper, is Facebook's system for dynamically managing user traffic in large-scale Internet services. Through this system, we can balance the resource utilization of each data center and reduce the network latency of user requests. Taiji will use the relationship between users to route a group of strongly related users to the same data center. This relationship-aware routing strategy can reduce the backend storage load by 17%. Most services use static mapping to route user requests from edge nodes to data centers, but static mapping is difficult to handle large-scale global data centers. The different loads and peak hours of edge nodes in different regions will bring the following challenges:
To solve these problems, Facebook's engineers designed a global user traffic management system. It treats the routing of traffic from edge nodes to data centers as an assignment problem: a constraint optimizer generates a routing table that determines which data center handles each slice of user traffic, and edge nodes forward user requests according to that table.

Architecture

In a mature content delivery network (CDN), edge nodes can handle most user requests, serving static resources quickly and at low cost, and there are fairly mature solutions for distributing static resources and scheduling their traffic. Dynamic content, however, usually has to be processed by data centers with more computing and storage resources, and balancing load across data centers quickly and stably is a comparatively hard problem. Taiji uses the following architecture to control the traffic from edge nodes to data centers:

Figure 1 - Taiji Architecture

The Taiji system consists of two main components, the runtime and the traffic pipeline, which play the following roles:

- The runtime generates routing tables that specify what fraction of the user traffic at each edge node is sent to each data center.
- The traffic pipeline consumes those routing tables and expands them into fine-grained bucket-to-data-center routing rules that are distributed to all edge load balancers.
As shown in the figure above, the load balancer at each edge node parses incoming user requests, maps each user to the appropriate bucket, and forwards the request to the data center that the routing configuration assigns to that bucket.

Runtime

The runtime component is responsible for generating the routing table, which specifies what share of the user traffic at each edge node is forwarded to each data center. The routing table is a set of tuples of the form {edge: {datacenter: fraction}}.
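As a concrete illustration, the sketch below models such a routing table as a nested dictionary and picks a data center for a request by weighted random choice. All names and numbers here (routing_table, pick_datacenter, the edge and data center labels) are hypothetical, not from the paper.

```python
import random

# A routing table of the form described above: for each edge node, the
# fraction of its traffic that should go to each data center.
routing_table = {
    "edge-lax": {"dc-west": 0.7, "dc-central": 0.2, "dc-east": 0.1},
    "edge-fra": {"dc-eu": 0.8, "dc-east": 0.2},
}

def pick_datacenter(edge: str, rng: random.Random) -> str:
    """Pick a data center for one request arriving at `edge`,
    weighted by the fractions in the routing table."""
    targets = routing_table[edge]
    dcs = list(targets)
    return rng.choices(dcs, weights=[targets[dc] for dc in dcs], k=1)[0]

rng = random.Random(42)
print(pick_datacenter("edge-lax", rng))  # e.g. "dc-west"
```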
To generate a routing table made of these tuples, Taiji's runtime reads two types of dynamic data from the monitoring system:

- infrastructure operational state, such as the capacity, health, and utilization of edge nodes and data centers;
- measurement data, such as the traffic volume at each edge node and the access latency from edge nodes to data centers.

Both kinds of data arrive as unprocessed raw data. The system introduces a reader layer abstraction that reads, normalizes, and aggregates the raw data, which reduces the workload of the other parts of the runtime and makes the problem easier to model:

Figure 2 - Runtime architecture

Taiji's runtime measures the traffic of stateless services in requests per second (RPS) and the traffic of stateful services in user sessions. This model allows stateless traffic to be routed to any available data center, while stateful traffic is pinned to the same machines to avoid disturbing established sessions.

In each epoch, Taiji re-evaluates the global traffic allocation based on the resource utilization of the data centers. During re-evaluation, the system adjusts routing to satisfy the service-specified policy by swapping one unit of traffic between two data centers, keeping the best resulting assignment, and iterating until no further improvement can be found (a simplified sketch of this search appears at the end of this section). In addition, to keep the whole system stable, the runtime limits how often and how much traffic it moves, so that large swings at a single data center do not ripple through the entire system.

Traffic pipeline

The traffic pipeline consumes the routing table produced by the runtime and uses relationship-aware routing to generate concrete routing rules in a configuration file. Once new routing rules are generated, the module synchronizes them to all edge load balancers through the Configuration Management Service (CMS); the whole process takes only about one minute.

Relationship-aware routing builds on the locality of user traffic: closely connected users tend to request the same content, which benefits caches and other backend systems. It is precisely because of this property that routing highly associated users to the same data center reduces system latency.

Figure 3 - Relationship-aware routing

As shown in the figure above, relationship-aware routing partitions users into buckets according to the communities they belong to. It uses a balanced tree to keep the number of users in each bucket similar while maximizing the connectedness of the users within a bucket. Bucket size has a significant impact on performance: buckets that are too small can split large communities and reduce cache effectiveness, while buckets that are too large improve cache effectiveness but make traffic assignment coarser and less precise. Taiji sizes buckets so that each carries roughly 0.01% of global traffic, striking a balance between cache effectiveness and routing precision.

Figure 4 - Users, buckets, and data centers

The system must not only partition users into buckets but also distribute those buckets across data centers, which requires a very large amount of computation.
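Before turning to how the traffic pipeline tackles this assignment problem, here is the promised sketch of the runtime's swap-and-iterate search. The objective used here, minimizing the spread of data center utilization, is an assumption for illustration; the real optimizer also honors latency and per-service policies.

```python
# Minimal sketch of the runtime's local search: repeatedly move one unit
# of traffic between two data centers, keep the best resulting
# assignment, and stop once no single move improves the objective.

CAPACITY = {"dc-a": 100.0, "dc-b": 100.0, "dc-c": 80.0}  # made-up capacities

def utilization_spread(load: dict) -> float:
    """Objective (an assumption): gap between the most and least
    utilized data centers; smaller means better balanced."""
    utils = [load[dc] / CAPACITY[dc] for dc in CAPACITY]
    return max(utils) - min(utils)

def rebalance(load: dict, unit: float = 1.0) -> dict:
    load = dict(load)
    while True:
        best_score, best_move = utilization_spread(load), None
        for src in CAPACITY:               # try every single-unit move
            for dst in CAPACITY:
                if src == dst or load[src] < unit:
                    continue
                load[src] -= unit
                load[dst] += unit
                score = utilization_spread(load)
                if score < best_score:
                    best_score, best_move = score, (src, dst)
                load[src] += unit           # undo the trial move
                load[dst] -= unit
        if best_move is None:
            return load                     # local optimum reached
        src, dst = best_move
        load[src] -= unit
        load[dst] += unit

print(rebalance({"dc-a": 90.0, "dc-b": 30.0, "dc-c": 40.0}))
```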
To implement relationship-aware routing at this scale, the traffic pipeline balances locality and stability with two cooperating parts:

- an offline module that uses social hashing to organize users into a hierarchy of buckets, rebuilt on a timescale of weeks;
- an online module that assigns those buckets to data centers according to current traffic and capacity.
This approach increases the fraction of traffic from the same community that lands in the same data center from 55% to 75%, which significantly improves the resource utilization of backend services. The rest of this section introduces the two modules and the request-handling process at the edge load balancers.

Offline module

The community hierarchy is represented as a complete binary tree. The leaves of the tree are buckets of users, and each internal node stands for all the buckets in its subtree. The tree is built with social hashing; because construction is an offline job with a very large computational cost, the system only rebuilds the tree on a timescale of weeks. Although partitioning users with such a binary tree is effective, it is not applicable to all scenarios and works poorly in some situations.
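To make the tree structure concrete, here is a small sketch using the usual implicit array encoding of a complete binary tree; the encoding and all names are assumptions for illustration, not the paper's implementation.

```python
# Community hierarchy as a complete binary tree whose leaves are user
# buckets, in the implicit array encoding: node i has children
# 2*i + 1 and 2*i + 2. An internal node at height h covers 2**h buckets.

HEIGHT = 3                       # toy tree: 2**HEIGHT = 8 leaf buckets
NUM_BUCKETS = 2 ** HEIGHT

def leaf_index(bucket: int) -> int:
    """Array index of the leaf node that holds `bucket`."""
    return (NUM_BUCKETS - 1) + bucket

def buckets_under(node: int) -> range:
    """All buckets in the subtree rooted at `node`."""
    first = last = node
    while first < NUM_BUCKETS - 1:   # descend to the leaf level
        first = 2 * first + 1        # leftmost child
        last = 2 * last + 2          # rightmost child
    return range(first - (NUM_BUCKETS - 1), last - (NUM_BUCKETS - 1) + 1)

print(list(buckets_under(1)))  # left child of the root covers buckets 0..3
```

The same index arithmetic is what makes segments (next section) cheap to compute: an internal node at height k identifies the 2^k consecutive buckets beneath it.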
Online module

The online module of the traffic pipeline assigns buckets to appropriate data centers based on the tree generated by the offline module, the traffic of each bucket, and the capacity of each data center. In doing so it tries to assign groups of adjacent buckets to the same data center to improve data access locality.

Figure 5 - Buckets and segments

The buckets in the tree can be grouped into segments: a segment rooted at a node of height k covers the 2^k buckets in that node's subtree, where k is less than the height H of the community hierarchy. The online module allocates traffic to data centers segment by segment; as shown in the figure above, if the segment covering buckets 1 and 2 is assigned to a data center, both buckets are routed to that data center. The choice of segment height trades off stability against locality: higher segments keep more of a community together and thus preserve locality, but reassigning one moves a large amount of traffic at once and hurts stability, while lower segments are easier to rebalance but are more likely to split a community across data centers.

Edge load balancing

Since edge nodes are responsible for forwarding user requests to specific data centers, every edge node needs routing state to determine which data center handles the current request. Storing a full user-to-data-center routing table would require a very large amount of memory, so edge nodes store only the mapping from buckets to data centers. When a user sends a dynamic content request to an edge node, the following process takes place (a sketch of this flow appears at the end of the article):

- The initial request carries no bucket information, so it is routed to the nearest data center.
- That data center looks up the bucket the user belongs to and writes it into a cookie.
- All subsequent requests carry the cookie, and the edge load balancer routes them directly to the specific data center for that bucket.

Relationship-aware routing adds negligible overhead to the original request-processing path. The bucket-to-data-center mapping file is synchronized every five minutes and is only about 15 KB, so its transmission cost over the network is almost negligible.

Summary

The Taiji system that Facebook introduces in this paper adjusts the overall load of the system at the granularity of data centers. As a social networking giant, much of Facebook's business revolves around social and user relationships, and the relationship-aware routing built on this system is tailored to that business. It is precisely because highly related users request similar content that relationship-aware routing can send user requests to nearby, lightly utilized clusters, improving the system's access locality and reducing the consumption of computing resources.
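As a closing illustration of the edge request flow described above, here is a minimal sketch of cookie-based bucket routing. The cookie name, data structures, and functions are all hypothetical, not taken from the paper.

```python
# Minimal sketch of an edge load balancer's cookie-based routing.
# All names (BUCKET_COOKIE, bucket_to_dc, route_request) are hypothetical.

BUCKET_COOKIE = "taiji_bucket"

# Bucket-to-data-center map, synced from the traffic pipeline every few
# minutes (about 15 KB in the paper). Entries here are made up.
bucket_to_dc = {0: "dc-west", 1: "dc-west", 2: "dc-east"}

def route_request(cookies: dict, nearest_dc: str) -> str:
    """Return the data center that should handle this request."""
    bucket = cookies.get(BUCKET_COOKIE)
    if bucket is None:
        # First request: no bucket cookie yet. Route to the nearest
        # data center, which looks up the bucket and sets the cookie.
        return nearest_dc
    return bucket_to_dc.get(int(bucket), nearest_dc)

print(route_request({}, "dc-west"))                     # first request
print(route_request({BUCKET_COOKIE: "2"}, "dc-west"))   # cookie present
```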