1. Background

On July 9, 2018, I joined Alibaba Cloud through campus recruitment and began my career there. I was fortunate to take part in the full design, development, and testing of the resource orchestration service from 1.0 to 2.0, which shaped my understanding of cloud services. This article is based on my thoughts and insights from that design and development process.

In a traditional software architecture, beyond the business-layer code, you must provision compute nodes, storage, and network resources, then install and configure operating systems, and so on. Cloud services are, in essence, the software-ization of the IT architecture and the intelligence-ization of the IT platform: hardware resources are defined in software, and their operational interfaces are fully abstracted and encapsulated, so any resource can be created, deleted, modified, or queried directly through the relevant API. Thanks to Alibaba Cloud's thorough abstraction of resources and its highly unified OpenAPI, it is possible to build a complete IT architecture on Alibaba Cloud and manage the lifecycle of every resource in it. Customers supply a resource template describing what they need, and the orchestration service automatically creates and configures all of the resources according to the orchestration logic.

2. Architecture Design

As business scenarios multiplied and business scale grew exponentially, the original architecture gradually exposed problems such as coarse-grained tenant isolation, low concurrency, and heavy dependence on other services. Rebuilding the service architecture became urgent, and the three most important aspects were topology design, concurrency model design, and workflow design.

1. Topology design

The core issue in topology design is to clarify the product form and user needs and to solve the data-path problem. From the product perspective, the first point to consider is resource ownership, which is divided between service accounts and user accounts. The mode in which resources belong to the service account is also called the big-account mode. Its advantages are stronger control over resources and simpler billing; its bottlenecks are resource quotas and the flow control imposed on the interfaces of dependent services. Clearly, hosting every resource under the service account is unrealistic: resources such as VPCs, VSwitches, SLB instances, and security groups often need to connect with the customer's other systems, so they are usually provided by users, while ECS instances are better suited to creation under the big account.

Multi-tenant isolation is a critical issue in the big-account mode: a given user's resources must be able to reach one another, while no traffic may cross the boundary between customers. A common example: if every user's ECS instances are launched in the same service VPC, instances in that VPC can reach each other by default, which is a security risk. Such issues must be planned for early in the system design. Our design was to create ECS instances in a resource VPC under the service account, in big-account mode, and isolate different users' instances from one another with enterprise-level security groups. When an operation needs to access user data (NAS, RDS, etc.), the user provides the VPC and VSwitch where those access points live, and we reach the user's data by creating an ENI (elastic network interface) on the instance and binding it into the user's VPC. The specific data path is shown in the figure.

(Figure: Common service architecture)
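To make that last step concrete, below is a minimal sketch of the CreateNetworkInterface / AttachNetworkInterface call sequence using the Alibaba Cloud Go SDK. All IDs are placeholders and error handling is reduced to the bare minimum; treat it as an outline of the data-path setup, not production code.

```go
package main

import (
	"fmt"
	"log"

	"github.com/aliyun/alibaba-cloud-sdk-go/services/ecs"
)

func main() {
	// Client under the service (big) account; region and credentials are placeholders.
	client, err := ecs.NewClientWithAccessKey("cn-hangzhou", "<accessKeyId>", "<accessKeySecret>")
	if err != nil {
		log.Fatal(err)
	}

	// 1. Create an ENI in the VSwitch and security group supplied by the user,
	//    so the interface itself lives inside the user's VPC.
	createReq := ecs.CreateCreateNetworkInterfaceRequest()
	createReq.VSwitchId = "vsw-user-provided"      // placeholder ID
	createReq.SecurityGroupId = "sg-user-provided" // placeholder ID
	createResp, err := client.CreateNetworkInterface(createReq)
	if err != nil {
		log.Fatal(err)
	}

	// 2. Attach the ENI to the ECS instance created under the service account,
	//    giving that instance a NIC inside the user's VPC.
	attachReq := ecs.CreateAttachNetworkInterfaceRequest()
	attachReq.NetworkInterfaceId = createResp.NetworkInterfaceId
	attachReq.InstanceId = "i-service-owned" // placeholder ID
	if _, err := client.AttachNetworkInterface(attachReq); err != nil {
		log.Fatal(err)
	}
	fmt.Println("attached ENI:", createResp.NetworkInterfaceId)
}
```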
2. Concurrency model design

The core of the concurrency model is to solve for high concurrency, high performance, and high availability.

The main indicator of high concurrency is QPS (queries per second). For orchestration logic whose run time is often measured in minutes, a synchronous model clearly cannot support highly concurrent requests. The main indicator of high performance is TPS (transactions per second). When orchestrating resources from a user's template, the resources depend on one another to some degree; creating them strictly one after another wastes a great deal of time busy-waiting, and service throughput is severely limited. The main indicator of high availability is the SLA (service level agreement). If, on top of an HA deployment, the CRUD path can be decoupled from dependencies on internal services, then upgrades and internal failures have less impact on the SLA.

Our design for these problems: the service front end performs only a simple parameter check on the user's template and immediately writes it to the persistence layer; once the write succeeds, the resource ID is returned right away. The persisted template is treated as an unfinished task awaiting scheduling. Worker nodes then periodically scan the task table, create resources in dependency order, and synchronize their status. If a resource's status does not yet allow the task to move forward, the worker returns immediately, and the desired state is reached after multiple rounds of processing. A simplified distributed model is shown in the figure.

(Figure: Distributed concurrency model)

To avoid lock contention when tasks are numerous, we designed a task-discovery-plus-lease-renewal mechanism. Once a node grabs a cluster from the database pool, the cluster is added to that node's scheduling pool and given a lease, and the lease manager renews (re-locks) leases that are about to expire. This guarantees that a cluster is processed by exactly one node until the next time the service starts. If the service restarts, its tasks are automatically unlocked when their leases time out and are then grabbed by other nodes.
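The sketch below outlines this scan-grab-reconcile loop. It is an illustration of the idea rather than our actual code: the table schema and the reconcile callback are hypothetical, and the SQL is PostgreSQL-style. The key point is that claiming a task is a single conditional UPDATE, so concurrent workers cannot grab the same row.

```go
package orchestrator

import (
	"database/sql"
	"time"
)

const scanInterval = 10 * time.Second

// claimTask tries to grab one pending task whose lease is absent or expired.
// The conditional UPDATE is the lock: at most one worker can win the row.
func claimTask(db *sql.DB, workerID string) (taskID int64, ok bool) {
	row := db.QueryRow(`
		UPDATE tasks
		SET owner = $1, lease_until = now() + interval '2 minutes'
		WHERE id = (
			SELECT id FROM tasks
			WHERE state <> 'Done'
			  AND (lease_until IS NULL OR lease_until < now())
			ORDER BY id
			LIMIT 1
			FOR UPDATE SKIP LOCKED
		)
		RETURNING id`, workerID)
	if err := row.Scan(&taskID); err != nil {
		return 0, false // nothing claimable right now
	}
	return taskID, true
}

// renewLease extends the lease on a task this worker still owns; a real
// lease manager would call this on a timer for every task in its pool.
func renewLease(db *sql.DB, workerID string, taskID int64) {
	db.Exec(`UPDATE tasks
		SET lease_until = now() + interval '2 minutes'
		WHERE id = $1 AND owner = $2`, taskID, workerID)
}

// worker periodically scans for tasks and advances each one a single step.
// reconcile returns false when a dependency is not ready yet, in which case
// the task is simply picked up again on a later round: no busy waiting.
func worker(db *sql.DB, workerID string, reconcile func(taskID int64) bool) {
	for range time.Tick(scanInterval) {
		id, ok := claimTask(db, workerID)
		if !ok {
			continue
		}
		renewLease(db, workerID, id)
		if done := reconcile(id); done {
			db.Exec(`UPDATE tasks SET state = 'Done' WHERE id = $1`, id)
		}
	}
}
```

If a node dies or restarts, its leases simply expire and other nodes pick the tasks up through the same claimTask query, which matches the recovery behavior described above.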
3. Workflow design

The core of workflow design is to solve dependency problems, which come in two forms: the status of the preceding resource does not meet expectations, or the status of the resource itself does not meet expectations. Assume first that each resource is either available or unavailable, and that an available resource never jumps back to unavailable. The simplest case is then a linear task, as shown in the figure; and since the orchestration of some sub-resources can run in parallel, the whole orchestration process can be treated as a directed acyclic graph (DAG) of tasks.

(Figure: Resource linear orchestration structure)

But the world is not black and white, and neither is the state of a resource. "Directed acyclic" turns out to be wishful thinking; a directed graph with cycles is what matches how the real world operates, and a simple workflow can no longer cover such a process. The only way forward is to abstract the workflow further and design a finite state machine (FSM) that meets the requirements. A finite state machine may sound abstract, but everyone has seen the state transitions of an ECS instance; the figure below shows the state-transition model of an ECS instance.

(Figure: ECS instance state transition model)

Combining this with our actual business needs, I designed the cluster state-transition model shown below. The model simplifies the transition logic: there is only one steady state, Running, while the other three states (Rolling, Deleting, and Error) are intermediate states. A resource in an intermediate state keeps trying to migrate toward the steady state based on its current condition, and each state transition executes the relevant operations according to a fixed workflow.

(Figure: Cluster state transition model)

At this point, the overall architecture and design ideas of the service were essentially settled.
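A model this small can be expressed directly in code. The sketch below encodes the four states and a transition table; the exact edge set is an illustrative assumption on my part, the figure above being the authoritative definition.

```go
package orchestrator

import "fmt"

// State models the cluster state-transition diagram: Running is the only
// steady state; the other three are intermediate states that keep trying
// to migrate toward it.
type State string

const (
	Running  State = "Running"
	Rolling  State = "Rolling"
	Deleting State = "Deleting"
	Error    State = "Error"
)

// transitions enumerates the legal moves. The edges shown here are an
// assumption for illustration, not the exact production edge set.
var transitions = map[State][]State{
	Running:  {Rolling, Deleting},          // user-triggered upgrade or delete
	Rolling:  {Running, Error},             // upgrade converges or fails
	Deleting: {Error},                      // success removes the record itself
	Error:    {Running, Rolling, Deleting}, // retry toward the steady state
}

// Transition validates a requested state change against the table.
func Transition(cur, next State) (State, error) {
	for _, s := range transitions[cur] {
		if s == next {
			return next, nil
		}
	}
	return cur, fmt.Errorf("illegal transition: %s -> %s", cur, next)
}
```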
3. Core Competitiveness

The shortage of resources (ECS) was becoming increasingly serious, and coarse-grained scaling and upgrade functions could no longer satisfy customers. Resource pooling, auto scaling, and rolling upgrades were put on the agenda and became powerful tools for sharpening the product's competitiveness.

1. Resource pooling

Resource pooling simply means reserving a certain amount of resources in advance for emergencies; obviously its precondition is the big-account mode. To developers, "thread pool" is a familiar term while "resource pool" feels remote, but a resource pool solves the same kind of problem: the high time cost of creating and deleting resources, plus uncontrollable inventory. A further assumption behind pooling is that the pooled resources are used frequently and can be recycled, which means their specifications and configurations must be relatively simple.

Since the creation cycle of computing resources is long and often plagued by inventory shortages, and since the product expected its business to expand, we designed the resource pooling model shown in the figure, abstracting over multiple kinds of computing resources and providing one set of processing logic that can handle heterogeneous resources.

(Figure: Resource pooling model)

Resource pooling greatly shortens the wait for resource creation and solves the problem of insufficient inventory. It also lets the upper-layer services that consume the resources shed their complex state-transition logic: the resource states exposed to them can be simplified to Available and Unknown, so whatever they obtain is ready to use. One issue, however, must be considered: whether the creation of ECS instances is constrained by user-provided resources (for example, the VSwitch provided by the user pins the availability zone of the ECS instances).

2. Automatic scaling

The biggest attraction of cloud computing is cost reduction, and for resources the biggest benefit is paying as you go. Almost every online service has peaks and valleys, and automatic scaling solves the cost-control problem: it adds ECS instances to guarantee computing power when the customer's business grows, and removes them to save money when the business recedes, as shown in the figure.

(Figure: Automatic scaling diagram)

My design for automatic scaling first triggers a scheduled task for each time slice, then applies the scaling policy configured for that period. The policy has two parts. One part is the maximum and minimum ECS scale, which bounds how far the cluster size may float within the period. The other part is the monitoring metrics, the tolerance, and the step rules, which supply the basis and the standard for scaling. The metrics are the interesting part: besides the CPU and memory utilization collected by Cloud Monitor, you can also mark each ECS instance as idle or busy and compute the proportion of working nodes. Once a metric leaves the tolerance range, a scale-out or scale-in event is triggered according to the step size.
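A minimal sketch of that decision rule follows, under the assumption that some collector elsewhere produces a utilization number (CPU, memory, or the busy-node proportion) in [0, 1]; all field names are illustrative.

```go
package orchestrator

// ScalingPolicy is the per-time-slice policy described above.
// The names are illustrative, not a real product API.
type ScalingPolicy struct {
	MinScale  int     // lower bound on cluster size in this time slice
	MaxScale  int     // upper bound on cluster size in this time slice
	LowWater  float64 // tolerance band: below this, scale in
	HighWater float64 // tolerance band: above this, scale out
	Step      int     // how many instances to add or remove per event
}

// Desired returns the target cluster size given the current size and a
// utilization metric (e.g. busy-node proportion) in [0, 1].
func (p ScalingPolicy) Desired(current int, utilization float64) int {
	target := current
	switch {
	case utilization > p.HighWater:
		target = current + p.Step // above tolerance: scale out one step
	case utilization < p.LowWater:
		target = current - p.Step // below tolerance: scale in one step
	}
	if target > p.MaxScale {
		target = p.MaxScale
	}
	if target < p.MinScale {
		target = p.MinScale
	}
	return target
}
```

Because each evaluation moves the cluster by at most one step and clamps the result to the per-period bounds, repeated out-of-tolerance readings converge gradually instead of oscillating.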
3. Rolling upgrade

Changes to a customer's service architecture often involve complex rebuild logic, and service quality inevitably suffers while the rebuild is underway. Upgrading and downgrading gracefully and smoothly has become a hard requirement from many customers, and rolling upgrades are exactly the way to upgrade or downgrade without interrupting service and in a controllable manner.

(Figure: Rolling upgrade diagram)

A simplified rolling upgrade flow is shown in the figure above. The core of a rolling upgrade is to run the upgrade as a grayscale: bring up standby resources in a fixed proportion, wait until they can serve successfully, then take the corresponding number of old resources offline. After several rounds, every resource has been updated to the latest expectation, and redundancy has bought an upgrade with no service interruption.

4. Observability

Service observability is bound to become one of the core competitive strengths of cloud services. It has two parts: observability for users and observability for developers. To this day I still remember the fear of being dominated by customer calls in the middle of the night, the helplessness of digging through massive logs, and the confusion of having no idea where to start after a customer complaint.

1. User-oriented

Yes, I hope that when users report problems to us, the information they provide is useful and can even point directly at the cause. For users, being able to query through the API which stage the orchestration has reached, and the status of each resource in that stage, greatly improves the experience. For this, I analyzed the system's processing flow and designed a "stage-event-status" runtime status collector. Concretely, it splits the business process into multiple processing stages, organizes the events (resources and their statuses) that each stage depends on, and gives a structured definition of every status an event can take (especially the abnormal ones). A typical example is shown in the code sample below.
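Code example: cluster-dimension status collection. The sketch below shows the shape of such a collector's output; the stage names, event names, and status codes are illustrative placeholders, not the production definitions.

```go
package orchestrator

// Status is the structured definition of an event's state; abnormal
// states carry a machine-readable code plus a human-readable reason.
type Status struct {
	Code   string `json:"code"`   // e.g. "Available", "Creating", "Failed.QuotaExceeded"
	Reason string `json:"reason"` // detail for abnormal statuses
}

// Event is one resource (and its status) that a stage depends on.
type Event struct {
	Resource   string `json:"resource"` // e.g. "ECS", "ENI", "SecurityGroup"
	ResourceId string `json:"resourceId"`
	Status     Status `json:"status"`
}

// Stage is one processing stage of the business flow.
type Stage struct {
	Name   string  `json:"name"`  // e.g. "AllocateCompute", "BindUserVPC"
	State  string  `json:"state"` // Pending / Running / Succeeded / Failed
	Events []Event `json:"events"`
}

// ClusterStatus is what the query API returns for one cluster: the stages
// in order, so a user (or support engineer) can see exactly where the
// orchestration stands and which resource is blocking it.
type ClusterStatus struct {
	ClusterId string  `json:"clusterId"`
	Stages    []Stage `json:"stages"`
}
```

Returned through the query API, a structure like this lets a user report "stage BindUserVPC failed on eni-xxx with Failed.QuotaExceeded" instead of "cluster creation doesn't work".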
2. For developers

For developers, observability means monitoring plus logging: monitoring shows the running state of the system, and logging supports troubleshooting and diagnosis. The product monitors and aggregates data along four dimensions: the infrastructure, the container service, the service itself, and the customer's business. The specific components are shown in the figure.

(Figure: Monitoring and alarm systems at all levels)

The infrastructure layer relies mainly on Cloud Monitor to track CPU, memory, and other utilization; the container layer relies mainly on Prometheus to monitor the K8S cluster where the service is deployed. For the service itself, we wired Trace into every operation stage for fault location. For the hardest dimension, the customer's business, we collect usage through SLS, aggregate the data by UserId and ProjectId, and organize Prometheus dashboards so that any single user's usage can be analyzed quickly. On top of the monitoring, we connected Cloud Monitor alarms, Prometheus alarms, and SLS alarms, set different alarm priorities for the system and the business, and prepared a range of emergency response plans.

5. Others

From knowing nothing to independently owning the design and development of the resource orchestration service, Alibaba Cloud gave me a valuable platform to learn on.