DAGW: Exploration and Practice of Data Aggregation Gateway

Business Background

Bilibili is a video community built around PUGV (professional user-generated video), and the primary user scenario is watching videos on the video detail page. As the business grows, more and more extended features crowd onto this "main battlefield", such as topics, video honors, notes, and user costumes.

picture

(Figure 1: All traffic will be aggregated to the video details page)

As Figure 1 shows, the functional pages in the app fall into two categories. The first is the list page (ListView), such as recommendation, search, dynamic, and partition pages; most pages are list-style and give users rich content filtering and preview scenarios. The other is the detail page (DetailView): when a user taps content of interest on any list page, they are taken to the detail page to view it.

picture

(Figure 2: The video details page gathers various information and function entrances related to the video)

As shown in Figure 2, the video detail page gathers the attributes and feature entry points related to the video, such as video honors (trending, site-wide rankings, weekly must-sees), shooting templates, video collections, soundtracks, and related topics. This information and these entry points help users further explore related content and features.

Current situation and problems

In terms of technical implementation, Bilibili's user-facing application architecture is divided into four layers:

  • Terminal layer: clients that interact directly with users, including mobile apps, H5, Web and PC clients, and other screen terminals such as TVs, cars, smart speakers, and the PlayStation.
  • Access gateway: generally an LB (load balancer) plus an AGW (API gateway). The AGW is mainly responsible for request routing, protocol conversion, protocol offloading, rate limiting and circuit breaking, security blocking, and so on.
  • BFF (Backend for Frontend): as the number of terminals grows, applications are usually split by terminal to keep client-specific logic well isolated, e.g. web-interface (web pages), app-interface (apps), tv-interface (TVs). In addition, as page logic grows more complex and traffic increases, the BFF logic for a page may be further split into separate applications to isolate release and deployment, e.g. app-feed (home feed) and app-view (video detail page).
  • Business services: services responsible for a business domain or capability, usually split by function/capability and business area.

picture

(Figure 3: Application architecture layering)

As shown in Figure 3, the main logic of the video detail page is concentrated in the BFF layer. With DAU growth and continuous business expansion, we face two problems:

Problem 1: As the business expands, the read fan-out grows, bringing a huge traffic load and extra complexity to the BFF itself and to downstream services. As shown in the figure below, to display the entry point of a feature associated with a video, the business service must carry the traffic of all video detail requests, along with the resulting CPU consumption; it also has to implement a Bloom-filter-like mechanism to avoid a flood of back-to-source queries caused by requests for unassociated videos.

picture

(Figure 4: The BFF's fan-out reads amplify load indiscriminately onto all services, complicating the services' implementations)

Problem 2: We might solve problem 1 by adding machines and accepting extra implementation complexity, but as the read fan-out keeps growing, the latency of a single video detail request keeps deteriorating until it becomes unacceptable to users. As Figure 4.a (from reference 1) shows, increasing the fan-out greatly increases the probability that the overall request times out. Figure 4.b is the actual fan-out request topology of the Bilibili app's video detail BFF; it is already so large that the figure is barely legible, and the fan-out keeps growing as the business grows.

picture

(Figure 4.a: Correlation between fan-out count and timeout rate, excerpted from "The Tail at Scale")
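The fan-out/timeout relationship in Figure 4.a follows from basic probability. A quick back-of-the-envelope sketch (ours, not from the article), assuming each downstream call independently misses the deadline with probability p:

```python
# Probability that a fan-out request misses its deadline, assuming each of
# n downstream calls independently exceeds the deadline with probability p.
# This is the effect described in "The Tail at Scale".

def timeout_probability(n: int, p: float) -> float:
    """P(at least one of n independent calls times out)."""
    return 1.0 - (1.0 - p) ** n

# With a 1-in-100 per-call timeout rate, fanning out to 100 services
# makes a slow overall request more likely than a fast one.
for n in (1, 10, 100):
    print(n, round(timeout_probability(n, 0.01), 3))
```

Even a per-call tail of 1% turns into a ~63% overall timeout rate at a fan-out of 100, which is why the fan-out itself, not any single service, becomes the bottleneck.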

picture

(Figure 4.b: The actual fan-out of the video detail page BFF, drawn by the internal trace system)

Analysis and Modeling

As mentioned above, many downstream business services of the video detail page only cover a subset of videos, i.e. only some videos have associated data, so a BloomFilter-like mechanism is often used to filter out requests for unassociated videos.
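The Bloom-filter pre-check that downstream services resort to can be sketched as follows. This is an illustrative toy implementation; the hash scheme, sizes, and key names are our choices, not Bilibili's:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k salted-hash positions per key over a fixed bit array."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        # Derive k bit positions by salting the key; blake2b is an arbitrary choice.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means definitely absent -> the service can skip the
        # back-to-source query entirely.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bf = BloomFilter()
bf.add("vid1234")  # index only the videos that do have associated data
print(bf.might_contain("vid1234"))  # True
```

Note that this filtering still happens inside each downstream service, after the request has already been sent; the per-service complexity and the fan-out traffic remain.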

We bucketed the downstream response sizes of video detail BFF requests (using a Prometheus Histogram). The analysis showed that the responses of many business services follow the distribution in the figure below:

picture

(Figure 5: Distribution of packet sizes returned by BFF requesting a service)

More than 90% of the responses from many of the service interfaces the BFF calls are "empty", meaning the requested video has no association with that service. Yet in practice, the video detail BFF calls these services on every request for video details. The root cause is that the BFF layer does not know, when handling a request, which services the video is associated with.

If the BFF layer could know in advance which services the requested video is associated with, we could significantly reduce the BFF's read fan-out and the load on business services, achieving on-demand access.

We can build, for each video, a sparse vector of its associated services, called the video-service index, as shown in the following figure:

picture

(Figure 6: Index model of video ID and associated services)

In the actual implementation, the video-service index does not have to store the video-to-service relationship as sparse vectors; an existing KV system can serve. For example, we implement it with a redis hash key. Another consideration is that when the relationship between a service and a video changes, there must be a mechanism to notify the index service of the change, both in full (during bootstrap) and incrementally.

Implementation

Based on the analysis and modeling above, we optimized the architecture of the video detail BFF as shown in the following figure:

picture

(Figure 7: Optimized architecture and processing flow)

In the BFF request flow, ① a business association index service is introduced: before the BFF calls downstream business services, it fetches the index of businesses associated with the video, and ② obtains in advance the set of services this request should actually call, filtering out irrelevant requests. The index is implemented with a redis hash, with the company's internal KV storage used for persistence and for degradation when redis fails. An example redis key looks like this:

 HMSET index_vid1234 biz1 0 biz2 1 bizM "hot"
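Given that hash, the BFF-side filtering step amounts to intersecting the hash's fields with the candidate service list. A minimal sketch (the service names follow the example key above; the function name is ours, and a plain dict stands in for the HGETALL result):

```python
# Filter the BFF's downstream call list using the video-service index hash.
# index_hash is the field map of the "index_<vid>" redis hash, i.e. what
# HGETALL index_vid1234 would return: {"biz1": "0", "biz2": "1", "bizM": "hot"}.

def associated_services(index_hash: dict, all_services: list) -> list:
    """Return only the downstream services this video is associated with."""
    return [svc for svc in all_services if svc in index_hash]

hash_fields = {"biz1": "0", "biz2": "1", "bizM": "hot"}
print(associated_services(hash_fields, ["biz1", "biz2", "biz3", "bizM"]))
# -> ['biz1', 'biz2', 'bizM']  (biz3 is skipped: no association, no request)
```

A single HGETALL per request thus replaces the unconditional fan-out to every business service.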

The video-service index is built by importing the full and incremental association data of downstream services. To let downstream services import heterogeneous data more efficiently, we provide a backend system that supports writing online cleaning and import functions for business change messages, as shown in the following figure:

picture

(Figure 8: Business change event processing function and index update push backend)
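The full/incremental import described above can be sketched as a small event handler. The event shape and the in-memory dict standing in for the redis hashes are assumptions for illustration:

```python
# Apply a business change event to the video-service index, mirroring the
# full/incremental import pipeline. An in-memory dict stands in for redis;
# the event shape (vid, biz, value) is a hypothetical simplification.

INDEX = {}  # key "index_<vid>" -> {biz: value}, like the redis hashes

def apply_event(vid: str, biz: str, value) -> None:
    """value=None removes the association; any other value upserts it."""
    bucket = INDEX.setdefault(f"index_{vid}", {})
    if value is None:
        bucket.pop(biz, None)   # HDEL in the redis implementation
    else:
        bucket[biz] = value     # HSET in the redis implementation

apply_event("vid1234", "biz2", "1")     # incremental add
apply_event("vid1234", "bizM", "hot")
apply_event("vid1234", "biz2", None)    # business dissociates from the video
print(INDEX["index_vid1234"])
```

A full (bootstrap) import is just this handler replayed over a dump of every existing association.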

Extending the solution

Investigating further, we found that not only video details but also the detail pages of Story (short video), live broadcast, dynamic, and the "My" page present similar aggregation scenarios; and, as shown in Figure 3, these aggregation scenarios also appear in the BFFs of multiple terminals such as app, TV, and Web. Could a more standard, universal solution uniformly solve aggregation problems like video details?

As shown in Figure 3 above, the main processing logic of a BFF divides into parameter handling, aggregation logic, and assembly of the returned view object (VO). We can abstract complex aggregation logic for videos, live broadcasts, users, and so on into a more general aggregation service available to all BFFs. To achieve this, the general aggregation service needs the following capabilities:

  1. Support different terminals' BFFs fetching the aggregation model on demand.
  2. Support flexibly extending the aggregation model, i.e. keep the cost of adding a new business as low as possible while still satisfying point 1.
  3. Support load reduction based on the business association index.

For point 1, common industry practices include the following:

  • GraphQL: select the required information through field selectors. Although GraphQL is comprehensive and flexible, introducing it dramatically increases the complexity of system implementation and troubleshooting, which hurts long-term maintenance and iteration (see reference 2).
  • Protobuf field mask: Google APIs proposed adding a field of type google.protobuf.FieldMask to request parameters to specify the required response range, aiming to reduce the network transfer and server computation wasted on unneeded fields. However, Google APIs has since deprecated read_mask.
  • View enum: as a better alternative to the field mask's on-demand mechanism (see reference 3), the service provider defines common on-demand access scenarios through a view enum, e.g. BASIC returns basic information for list scenarios, while ALL returns full details for detail page scenarios. Richer enum definitions are also supported, which exactly matches our needs.

The following is our View Enum definition for the video details page:

 enum ArchiveView {
   // Unspecified: return no data
   UNSPECIFIED = 0;
   // Views for the most common scenarios follow.
   // Return brief video info (for info lookups)
   SIMPLE = 1;
   // Return basic video info (for home page and search list queries)
   BASIC = 2;
   // Basic info plus per-part info (minimal detail, for sharing etc.)
   BASIC_WITH_PAGES = 3;
   // Return all video detail info for the APP
   ALL_APP = 4;
   // Return all video detail info for the Web
   ALL_WEB = 5;
   // Return all video detail info for TV
   ALL_TV = 6;
   // New scenarios can be added over time
 }

For point 2, we abstract the aggregation logic into a DAG. We use a DAG model because some business services depend on one another; for example, some video attributes depend on the video author information contained in the basic video info (obtained by calling the basic video info service). With this model, adding a new business only requires: 1. declaring which other nodes it depends on; 2. writing the logic inside the node, including the service call and business processing; 3. configuring which view enums the node participates in.
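The node model in these three steps can be sketched with a toy executor. Node names and view sets here are hypothetical, and Python's standard-library `graphlib` stands in for a real scheduler:

```python
# Toy DAG executor in the spirit of DAGW: each node declares its dependencies
# and the views it serves; the executor runs nodes in topological order and
# skips nodes outside the requested view. Node names are hypothetical.

from graphlib import TopologicalSorter

NODES = {
    "archive_base": {"deps": [],               "views": {"SIMPLE", "BASIC", "ALL_APP"}},
    "author_info":  {"deps": ["archive_base"], "views": {"BASIC", "ALL_APP"}},
    "honors":       {"deps": ["archive_base"], "views": {"ALL_APP"}},
}

def execution_plan(view: str) -> list:
    """Return the nodes needed for a view, dependencies first."""
    wanted = {name for name, cfg in NODES.items() if view in cfg["views"]}
    # Pull in transitive dependencies, so a node's inputs are always computed
    # even when the dependency is not directly tied to the requested view.
    stack = list(wanted)
    while stack:
        for dep in NODES[stack.pop()]["deps"]:
            if dep not in wanted:
                wanted.add(dep)
                stack.append(dep)
    order = TopologicalSorter({n: NODES[n]["deps"] for n in wanted})
    return list(order.static_order())

print(execution_plan("SIMPLE"))   # only the base node runs for the lean view
```

Nodes with no ordering constraint between them (here `author_info` and `honors`) can be dispatched concurrently in a real implementation; `static_order` only fixes the dependency order.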

For point 3, the principle was introduced above; we only need to extend the index from a video-service index to live-broadcast-service and user-service indexes.

In summary, we named the general data aggregation service DAGW (Data Aggregation Gateway). DAGW's internal structure and its interactions with the BFF layer and the services are shown in the following figure:

picture

(Figure 9: Introducing the universal data aggregation gateway layer DAGW to uniformly meet the needs of aggregation scenarios)

Effect

Since the launch of the DAGW general data aggregation gateway and the business association index, they have supported the aggregation of video, user, and other information: nearly 30 business services have been connected, cutting the traffic and load on business services by more than 90% on average. Below are the results for the video "high-energy highlights" business and the user fan-medal business:

1. For the video high-energy highlights service, traffic from the playback page (app-view) peaks at 100k+ QPS. After connecting to DAGW, the effect is dramatic: the monitoring in the figure below shows request QPS reduced by 99%.

picture

2. The fan medal is a wearable honor for hardcore fans, earned by watching a streamer's live broadcasts over a long period and participating in interactions. Because the threshold for earning one is high and it is only displayed on specific streamers' content, connecting to DAGW effectively cuts more than 85% of its access traffic.

References

1. The Tail at Scale: https://research.google/pubs/pub40801/

2. GraphQL: From Excitement to Deception: https://betterprogramming.pub/graphql-from-excitement-to-deception-f81f7c95b7cf

3. View Enum: https://google.aip.dev/157

Authors

Huang Shancheng

Senior Development Engineer at Bilibili

Xia Linjuan

Senior Development Engineer at Bilibili

Zhao Dandan

Senior Development Engineer at Bilibili
