Exploration and practice of full-link grayscale solution based on Istio

Background

Under a microservice architecture, building a complete test environment to verify new business functions before launch is time-consuming and labor-intensive, and the difficulty grows with the number of microservices. The machine cost of such an environment is often substantial, and it must be maintained separately so that new application versions can be verified efficiently before launch. When the business becomes large and complex, several such environments are often needed. This cost and efficiency challenge is common across the industry. If pre-launch functional verification of a new version could be completed in the same production system, the savings in manpower and money would be considerable.

In addition to functional verification in the development phase, introducing grayscale release in the production environment helps control the risk and blast radius of a new software version. Grayscale release allocates production traffic with certain characteristics, or a certain proportion of production traffic, to the service version under verification, so that you can observe whether the new version behaves as expected after launch.

Alibaba Cloud ASM Pro (see the end of the article for related links) provides a full-link grayscale solution built on Service Mesh, which can help solve the problems in the two scenarios above.

ASM Pro product functional architecture diagram:

The core capabilities used are the extended traffic labeling, label-based routing, and traffic fallback capabilities shown in the preceding figure. They are described in detail below.

Scenario Description

Common scenarios for full-link grayscale release are as follows:

Taking Bookinfo as an example, ingress traffic carries the expected tag group. The sidecar reads the expected tag from the request context (Header or Context) and routes the traffic to the corresponding tag group. If that tag group does not exist, the traffic falls back to the base group by default. The fallback strategy is configurable. The implementation details are described below.

Ingress traffic is generally tagged at the gateway layer by a tagging plug-in. For example, requests from userids within a certain range are marked with a grayscale tag. Because gateway choices and implementations vary widely in real environments, the gateway implementation is outside the scope of this article.

Next, we will focus on how to achieve full-link traffic labeling and full-link grayscale based on ASM Pro.

Implementation principle

Inbound refers to traffic that arrives at the App as incoming requests, and Outbound refers to traffic that the App sends out as requests to other services.

The figure above shows a typical traffic path of a business application after the mesh is enabled: the business App receives an external request p1 and then calls the interface of another service it depends on. The traffic path of the request is p1->p2->p3->p4, where p2 is the Sidecar forwarding p1, and p4 is the Sidecar forwarding p3. To achieve full-link grayscale, both p3 and p4 need to obtain the traffic label that arrived with p1 so that the request can be routed to the backend service instance corresponding to the label, and p3 and p4 must also carry the same label. The key is to make label propagation completely imperceptible to the application, that is, to achieve transparent full-link label transmission. ASM Pro implements this on top of the traceId used in distributed tracing technology (such as OpenTracing and OpenTelemetry).

In distributed tracing, a traceId uniquely identifies a complete call chain. Every fanout call issued by an application on the chain carries the source traceId, which is propagated by the distributed tracing SDK. The ASM Pro full-link grayscale solution builds on this widely adopted practice in distributed application architectures.

In the figure above, the inbound and outbound traffic that the Sidecar sees are originally completely independent. The Sidecar cannot perceive any relationship between the two, nor can it tell whether one inbound request leads to multiple outbound requests. In other words, the Sidecar does not know whether the requests p1 and p3 in the figure are related.

In the ASM Pro full-link grayscale solution, the p1 and p3 requests are associated through the traceId, specifically through the x-request-id trace header handled by the Sidecar. The Sidecar maintains a mapping table that records the correspondence between traceId and label. When the Sidecar receives the p1 request, it stores the traceId and label from the request in this table. When it receives the p3 request, it looks up the label for that traceId in the table and adds the label to the p4 request, thereby achieving full-link labeling and label-based routing. The following figure roughly illustrates this principle.
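The mapping-table idea can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the ASM Pro Sidecar implementation: the class `LabelTable` and the label header name `x-asm-traffic-label` are hypothetical, and only `x-request-id` comes from the text.

```python
from typing import Dict

TRACE_HEADER = "x-request-id"          # trace header named in the text
LABEL_HEADER = "x-asm-traffic-label"   # hypothetical header name, for illustration only


class LabelTable:
    """Toy stand-in for the Sidecar's traceId -> label mapping table."""

    def __init__(self) -> None:
        self._by_trace: Dict[str, str] = {}

    def on_inbound(self, headers: Dict[str, str]) -> None:
        """p1: remember which label arrived with which traceId."""
        trace_id = headers.get(TRACE_HEADER)
        label = headers.get(LABEL_HEADER)
        if trace_id and label:
            self._by_trace[trace_id] = label

    def on_outbound(self, headers: Dict[str, str]) -> Dict[str, str]:
        """p3 -> p4: copy the request headers and attach the label found for the traceId."""
        forwarded = dict(headers)
        label = self._by_trace.get(headers.get(TRACE_HEADER, ""))
        if label is not None:
            forwarded[LABEL_HEADER] = label
        return forwarded


table = LabelTable()
table.on_inbound({"x-request-id": "t-1", "x-asm-traffic-label": "gray"})
p4 = table.on_outbound({"x-request-id": "t-1"})         # same trace: label is attached
unrelated = table.on_outbound({"x-request-id": "t-2"})  # unknown trace: no label added
```

A real table would also need eviction (for example, by age), since trace IDs are unbounded; that detail is omitted here.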

In other words, the full-link grayscale function of ASM Pro requires the application to use distributed tracing. An application that does not yet use distributed tracing will need some modification before it can use this function. For Java applications, you can consider using a Java Agent with AOP-style instrumentation so that the traceId is passed transparently from inbound to outbound without modifying business code.
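For applications without a tracing SDK, the required change amounts to copying the trace headers of the inbound request onto every outbound call. A minimal sketch, assuming header names are already lower-cased and that the helper name `outbound_headers` is hypothetical:

```python
from typing import Dict, Optional

# Trace headers mentioned in this article; a real tracing SDK may propagate more.
TRACE_HEADERS = ("x-request-id", "x-b3-traceid")


def outbound_headers(inbound: Dict[str, str],
                     extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Build headers for a downstream call, carrying over the inbound trace headers."""
    headers = dict(extra or {})
    for name in TRACE_HEADERS:
        if name in inbound:
            headers[name] = inbound[name]
    return headers


inbound = {"x-request-id": "t-1", "content-type": "application/json"}
downstream = outbound_headers(inbound, {"accept": "application/json"})
```

Only the trace headers are forwarded; unrelated inbound headers such as `content-type` are deliberately not copied.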

Implementing traffic labeling

ASM Pro introduces a new TrafficLabel CRD that defines where the Sidecar obtains the traffic label it needs to pass through. The YAML below defines the source of the traffic label and specifies that the label is stored in OpenTracing (specifically, the x-trace header). The traffic label is named trafficLabel. Its value is read first from $getContext(x-request-id) and, failing that, from $(localLabel) in the local environment.

    apiVersion: istio.alibabacloud.com/v1beta1
    kind: TrafficLabel
    metadata:
      name: default
    spec:
      rules:
      - labels:
        - name: trafficLabel
          valueFrom:
          - $getContext(x-request-id)  # If aliyun arms is used, the corresponding value is x-b3-traceid
          - $(localLabel)
        attachTo:
        - opentracing
        # Indicates the effective protocols. Empty means none of them are effective, and * means all of them are effective
        protocols: "*"

The CR definition consists of two parts: label acquisition and label storage.

Acquisition logic: The Sidecar first tries to read the traffic label from the field defined in the protocol context or header (the Header part). If the label is not there, the Sidecar looks up the traceId in its locally recorded map, which stores the mapping from traceId to traffic label. If a mapping is found, the traffic is marked with the corresponding label. If no label can be obtained this way either, the traffic label is set to the localLabel of the local deployment environment. The localLabel is the label attached to the local deployment, and the label name is ASM_TRAFFIC_TAG.

The label name of the local deployment environment is "ASM_TRAFFIC_TAG". In an actual deployment, it can be set by the CI/CD system.
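The acquisition order described above (request header, then the traceId map, then the local deployment label) can be sketched as follows. The function `resolve_label` and the header name `x-asm-traffic-label` are hypothetical names chosen for illustration; `x-request-id` and `ASM_TRAFFIC_TAG` come from the text.

```python
from typing import Mapping, Optional

LABEL_HEADER = "x-asm-traffic-label"  # hypothetical header name, for illustration only
TRACE_HEADER = "x-request-id"         # trace header named in the text
LOCAL_LABEL_KEY = "ASM_TRAFFIC_TAG"   # label name of the local deployment, per the text


def resolve_label(headers: Mapping[str, str],
                  trace_table: Mapping[str, str],
                  workload_labels: Mapping[str, str]) -> Optional[str]:
    """Return the traffic label for a request, following the three-step order."""
    # 1. A label carried explicitly in the request context wins.
    label = headers.get(LABEL_HEADER)
    if label:
        return label
    # 2. Otherwise, look up the label recorded earlier for this traceId.
    trace_id = headers.get(TRACE_HEADER)
    if trace_id is not None and trace_id in trace_table:
        return trace_table[trace_id]
    # 3. Finally, fall back to the label of the local deployment.
    return workload_labels.get(LOCAL_LABEL_KEY)


table = {"t-1": "gray"}
local = {"ASM_TRAFFIC_TAG": "testing"}
from_header = resolve_label({"x-asm-traffic-label": "dev-x", "x-request-id": "t-1"}, table, local)
from_trace = resolve_label({"x-request-id": "t-1"}, table, local)
from_local = resolve_label({"x-request-id": "t-9"}, table, local)
```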

Storage logic: attachTo specifies the field of the protocol context in which the label is stored, such as a Header field for HTTP or the RPC context for Dubbo. The specific field is configurable.

With TrafficLabel defined, we know how to label traffic and pass the label along, but this alone is not enough to achieve full-link grayscale. We also need to route based on the trafficLabel, that is, "routing by label", together with logic such as routing fallback, so that traffic can be degraded to another environment when the route destination does not exist.

Routing by traffic label

The implementation of this feature extends Istio's VirtualService and DestinationRule.

Defining Subsets in DestinationRule

The custom subset names correspond to the values of trafficLabel.

    ---
    apiVersion: networking.istio.io/v1alpha3
    kind: DestinationRule
    metadata:
      name: myapp
    spec:
      hosts: myapp/*
      subsets:
      - name: myproject      # Project environment
        labels:
          env: abc
      - name: isolation      # Isolation environment
        labels:
          env: xxx           # Machine grouping
      - name: testing-trunk  # Main trunk environment
        labels:
          env: yyy
      - name: testing        # Daily environment
        labels:
          env: zzz
    ---
    apiVersion: networking.istio.io/v1alpha3
    kind: ServiceEntry
    metadata:
      name: myapp
    spec:
      hosts: myapp/*
      ports:
      - number: 12200
        name: http
        protocol: HTTP
      endpoints:
      - address: 0.0.0.0
        labels:
          env: abc
      - address: 1.1.1.1
        labels:
          env: xxx
      - address: 2.2.2.2
        labels:
          env: zzz
      - address: 3.3.3.3
        labels:
          env: yyy

Subset supports two specification forms:

  • labels matches the nodes (endpoints) of the application that carry specific labels.
  • ServiceEntry specifies the IP addresses that belong to a specific subset. Unlike labels, the addresses are given directly in the configuration instead of being fetched from the registry (K8s or other). This form suits Mock environments, where nodes are not registered with the service registry.
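To make the relationship between subsets and endpoints concrete, the following sketch selects the endpoints of a subset by label matching, using the values from the sample DestinationRule and ServiceEntry. The function `endpoints_for` is a hypothetical helper, and the matching rule (every selector label must be present on the endpoint) is the usual label-selector convention, assumed here rather than taken from ASM Pro.

```python
from typing import Dict, List

# Subset selectors and endpoints copied from the sample configuration.
SUBSETS: Dict[str, Dict[str, str]] = {
    "myproject": {"env": "abc"},
    "isolation": {"env": "xxx"},
    "testing-trunk": {"env": "yyy"},
    "testing": {"env": "zzz"},
}
ENDPOINTS: List[Dict] = [
    {"address": "0.0.0.0", "labels": {"env": "abc"}},
    {"address": "1.1.1.1", "labels": {"env": "xxx"}},
    {"address": "2.2.2.2", "labels": {"env": "zzz"}},
    {"address": "3.3.3.3", "labels": {"env": "yyy"}},
]


def endpoints_for(subset: str) -> List[str]:
    """Addresses whose labels contain every key/value pair of the subset selector."""
    selector = SUBSETS[subset]
    return [
        ep["address"]
        for ep in ENDPOINTS
        if all(ep["labels"].get(key) == value for key, value in selector.items())
    ]


trunk = endpoints_for("testing-trunk")
daily = endpoints_for("testing")
```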

Routing based on subset in VirtualService

1) Global default configuration

  • The route part can specify multiple destinations in order, and traffic is distributed among them in proportion to their weight values.
  • You can specify a fallback strategy for each destination. case identifies the circumstances under which the fallback is executed. The values are: noinstances (no service resources) and noavailabled (service resources exist but the service is unavailable). target specifies the target environment for the fallback. If fallback is not specified, the request is forced to execute in the environment of the destination.
  • The label routing logic of VirtualService is extended so that subset supports the placeholder $trafficLabel. The placeholder indicates that the target environment is taken from the traffic label of the request, corresponding to the definition in the TrafficLabel CR.

The global default mode corresponds to swimlanes: traffic stays closed within a single environment, and the fallback strategy is specified at the environment level. The custom subset names correspond to the values of trafficLabel.

The configuration sample is as follows:

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: default-route
    spec:
      hosts:  # Valid for all applications
      - "*/*"
      http:
      - name: default-route
        route:
        - destination:
            subset: $trafficLabel
          weight: 100
          fallback:
            case: noinstances
            target: testing-trunk
        - destination:
            host: "*/*"
            subset: testing-trunk  # Main environment
          weight: 0
          fallback:
            case: noavailabled
            target: testing
        - destination:
            subset: testing  # Daily environment
          weight: 0
          fallback:
            case: noavailabled
            target: mock
        - destination:
            host: "*/*"
            subset: mock  # Mock center
          weight: 0
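The fallback chain in the sample above can be traced with a small simulation. This is a sketch of one plausible reading of the semantics described in this article, not ASM Pro's actual algorithm: the function `resolve_subset`, the status value "ok", and the assumption that an unknown subset counts as noinstances are all illustrative.

```python
from typing import Dict, Optional, Tuple

# subset -> (case, target), copied from the sample default-route.
FALLBACKS: Dict[str, Tuple[str, str]] = {
    "$trafficLabel": ("noinstances", "testing-trunk"),
    "testing-trunk": ("noavailabled", "testing"),
    "testing": ("noavailabled", "mock"),
}


def resolve_subset(traffic_label: str, status: Dict[str, str]) -> str:
    """Follow fallback rules until a subset is usable or no rule applies.

    status maps a subset name to "ok", "noinstances" or "noavailabled";
    a subset missing from status is treated as having no instances.
    """
    subset = traffic_label
    rule: Optional[Tuple[str, str]] = FALLBACKS["$trafficLabel"]
    seen = set()
    while subset not in seen:  # guard against cyclic fallback rules
        seen.add(subset)
        state = status.get(subset, "noinstances")
        # Without a matching fallback rule, the request stays in this environment.
        if state == "ok" or rule is None or rule[0] != state:
            return subset
        subset = rule[1]
        rule = FALLBACKS.get(subset)
    return subset


to_trunk = resolve_subset("dev-x", {"testing-trunk": "ok"})
to_daily = resolve_subset("dev-x", {"testing-trunk": "noavailabled", "testing": "ok"})
forced = resolve_subset("dev-x", {"dev-x": "noavailabled"})
```

In the last call the dev-x subset has instances but is unavailable; because the first rule only covers noinstances, the request is not redirected.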

2) Personal development environment customization

Traffic is first routed to the daily environment. When the daily environment has no service resources, it is routed to the trunk environment.

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: projectx-route
    spec:
      hosts:  # only valid for myapp
      - myapp/*
      http:
      - name: dev-x-route
        match:
          trafficLabel:
          - exact: dev-x  # dev environment: x
        route:
        - destination:
            host: myapp/*
            subset: testing  # daily environment
          weight: 100
          fallback:
            case: noinstances
            target: testing-trunk
        - destination:
            host: myapp/*
            subset: testing-trunk  # trunk environment
          weight: 0

3) Support weight configuration

For traffic that is labeled with the trunk environment and originates from a workload whose local environment is dev-x, 80% is sent to the trunk environment and 20% is sent to the daily environment. When the trunk environment has no available service resources, the traffic is sent to the daily environment.

sourceLabels matches the label of the local (source) workload.

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: dev-x-route
    spec:
      hosts:  # For which applications this route is effective (multi-application configuration is not supported)
      - myapp/*
      http:
      - name: dev-x-route
        match:
          trafficLabel:
          - exact: testing-trunk  # Trunk environment label
          sourceLabels:
          - exact: dev-x  # Traffic comes from a certain project environment
        route:
        - destination:
            host: myapp/*
            subset: testing-trunk  # 80% of the traffic goes to the trunk environment
          weight: 80
          fallback:
            case: noavailabled
            target: testing
        - destination:
            host: myapp/*
            subset: testing  # 20% of the traffic goes to the daily environment
          weight: 20
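The weighted split with fallback in this sample can be approximated as follows. The sketch assumes that a destination is first chosen in proportion to its weight and that the fallback is applied afterwards; the helper names `matches` and `pick_subset` are hypothetical, and a seeded random generator is used only to keep the example reproducible.

```python
import random
from typing import Dict, List, Optional, Tuple

# (subset, weight, fallback) copied from the dev-x-route sample; fallback is (case, target).
Route = Tuple[str, int, Optional[Tuple[str, str]]]
ROUTES: List[Route] = [
    ("testing-trunk", 80, ("noavailabled", "testing")),
    ("testing", 20, None),
]


def matches(traffic_label: str, source_label: str) -> bool:
    """The match block: trunk-labeled traffic coming from a dev-x workload."""
    return traffic_label == "testing-trunk" and source_label == "dev-x"


def pick_subset(rng: random.Random, status: Dict[str, str]) -> str:
    """Choose a destination by weight, then apply its fallback if the case matches."""
    subset, _weight, fallback = rng.choices(
        ROUTES, weights=[route[1] for route in ROUTES], k=1
    )[0]
    if fallback is not None and status.get(subset, "ok") == fallback[0]:
        return fallback[1]
    return subset


rng = random.Random(0)
healthy = [pick_subset(rng, {}) for _ in range(10_000)]
trunk_share = healthy.count("testing-trunk") / len(healthy)  # close to 0.8
degraded = {pick_subset(rng, {"testing-trunk": "noavailabled"}) for _ in range(1_000)}
```

When the trunk environment is marked noavailabled, every request ends up in the daily environment, which matches the behavior described for this route.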

Routing by (environment) label

This solution relies on the business deploying applications with the relevant labels (in the example, ASM_TRAFFIC_TAG: xxx), which are usually environment labels. The labels can be understood as metadata about the service deployment, and they are supplied through integration with the upstream CI/CD deployment system. The schematic diagram is as follows:

  • In the K8s scenario, the corresponding environment/group label can be added automatically during business deployment. That is, K8s itself serves as the metadata management center.
  • In non-K8s scenarios, integration can be achieved through the service registration center or the metadata configuration management service (metadata server) that the microservices already use.

Note: ASM Pro has developed its own ServiceDirectory component (see the ASM Pro product functional architecture diagram), which connects to multiple registration centers and dynamically obtains deployment metadata.

Application scenario extension

The following is a typical example of governing multiple development environments with traffic labeling and label-based routing. Each developer's Dev X environment only needs to deploy the services whose versions have changed. If joint debugging with other developers is required, service requests can be directed to the corresponding development environment by configuring fallback, as shown in the figure below: B of the Dev Y environment -> C of the Dev X environment.

Similarly, the Dev X environment can be treated as the online grayscale version environment, which solves the problem of full-link grayscale release in the online environment.

Summary

The "traffic tagging" and "routing by tag" capabilities introduced in this article form a general solution to problems such as test environment management and online full-link grayscale release. Because the solution is based on service mesh technology, it is independent of the development language. It also works with different Layer 7 protocols and currently supports HTTP/gRPC and Dubbo.

Other vendors also have some solutions for full-link grayscale. Compared with other solutions, the advantages of ASM Pro are:

  • Supports multiple languages and multiple protocols.
  • The unified configuration template TrafficLabel is simple and flexible to configure, and supports multi-level configuration (global, namespace, and pod levels).
  • Supports routing fallback to achieve degradation.

The capabilities of "traffic tagging" and "routing by tag" can also be used in other related scenarios:

  1. Performance stress testing before a big promotion. In online stress testing, a common way to isolate stress test data from official online data is to use shadow message queues, caches, and databases. This requires traffic tagging, which uses tags to distinguish test traffic from production traffic. It also requires the Sidecar to support middleware such as Redis and RocketMQ.
  2. Unitized routing. In a common unitized routing scenario, the target unit is determined through configuration from some metadata in the request traffic, such as the uid. In this scenario, we can extend TrafficLabel with a function that obtains a "unit label", add that label to the traffic, and then route the traffic to the corresponding service unit based on the "unit label".
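As an illustration of the unitized routing idea, the sketch below derives a "unit label" from the uid and attaches it to the request. Everything here is hypothetical: the bucket ranges, the unit names, the header name `x-unit-label`, and the assumption that the uid arrives as a `uid` header. The article does not specify how TrafficLabel would be extended.

```python
from typing import Dict, List, Tuple

# Hypothetical configuration: uid buckets (uid modulo 100) mapped to service units.
UNIT_RULES: List[Tuple[range, str]] = [
    (range(0, 50), "unit-a"),
    (range(50, 100), "unit-b"),
]
UNIT_HEADER = "x-unit-label"  # hypothetical header name, for illustration only


def unit_for(uid: int) -> str:
    """Map a uid to its unit according to the configured buckets."""
    bucket = uid % 100
    for buckets, unit in UNIT_RULES:
        if bucket in buckets:
            return unit
    raise ValueError(f"no unit configured for bucket {bucket}")


def label_request(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the headers with the unit label attached."""
    labeled = dict(headers)
    labeled[UNIT_HEADER] = unit_for(int(headers["uid"]))
    return labeled


request = label_request({"uid": "1275"})
```

Once the label is on the request, routing to the unit works like any other label-based route described earlier.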
