The technical support behind the 11.11 promotion: in-depth analysis and practical cases of SLA and SLO

Background

It's 11.11 promotion season again. Recently, many teams have been emailing their upstream and downstream counterparts to confirm SLAs. Are you still unclear on service quality concepts such as SLA and SLO? This article shares the theory and practical experience of SLO-based alarm management: how to set SLOs, how to bring alarm floods under control, and how to optimize service performance and reliability based on SLO indicators.

Question (1)

When I first came across the concept of SLA (Service Level Agreement), the promised response time was 200 milliseconds. However, the TP99 (the time within which 99% of requests complete) of the service interface exceeded 100 milliseconds, and the upstream timeout was configured as 2000 milliseconds. What was the connection between these numbers? I was a little confused. Only later, through work, did I gradually figure out the concept of SLA.

Question (2)

At the beginning of the year, there was a question that kept bothering me. For example, the availability rate of an API interface in the system I was responsible for was 99.99%. In the department, there were N systems and M interfaces, and the quarterly availability rate of the department was 99.98%. How was this number calculated? What were the statistical rules? I asked XXX to help me solve my doubts. Thank you very much!

Question (3)

For example, I purchased 100 cloud hosts from XXX Cloud. During the 5 minutes from 10:00 to 10:05, 10 hosts failed, resulting in the API's external availability rate being only 90% (the total number of requests during these 5 minutes was 10,000, and the number of failed requests was 1,000). If this 5-minute failure occurs once a day for 30 days a month, what is the availability rate?

With these questions in mind, I studied the service quality indicators: SLI (Service Level Indicator), SLO (Service Level Objective) and SLA (Service Level Agreement). If you are also interested in the questions above and want to find the answers, read on. The following are my research results for your reference. If anything is wrong, please correct me.

1. Service Quality Terminology

If you don’t understand the importance of the various behaviors within the system you are responsible for, and cannot measure whether these behaviors are correct, you will not be able to operate and maintain the system correctly, let alone operate and maintain it stably and reliably. Whether it is external services or internal APIs, you need to set a service quality goal for users and work hard to achieve this goal.

In this process, we need to define some service quality indicators (SLIs), service quality objectives (SLOs), service quality agreements (SLAs) and the expected values and response plans of these indicators based on historical experience, subjective judgment and understanding of the service.

1) Service Level Indicator (SLI)

Definition: A specific quantitative indicator of the service quality of a service

Common SLIs include:

Availability (the percentage of requests that are successfully responded to)
99th-percentile request latency (TP99 request processing time)
Throughput (requests per second)
Durability (how long data can be retained intact, commonly used in data storage systems)

2) Service Level Objective (SLO)

Definition: The target value or target range of a service's SLI.

The definition of SLO is SLI < target value, or lower limit of range < SLI < upper limit of range.
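As a minimal sketch of how an SLI is compared against an SLO target, the snippet below computes availability and TP99 from a list of request records and checks them against targets. The record layout, the nearest-rank percentile method, the sample targets, and the names `Request`, `availability`, `tp99` and `meets_slo` are all assumptions made for illustration; this is not how UMP computes its indicators.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def availability(requests):
    """Percentage of requests that succeeded."""
    return 100.0 * sum(r.ok for r in requests) / len(requests)


def tp99(requests):
    """99th-percentile latency using the nearest-rank method."""
    latencies = sorted(r.latency_ms for r in requests)
    rank = math.ceil(0.99 * len(latencies))  # 1-based rank
    return latencies[rank - 1]


def meets_slo(requests, min_availability=99.95, max_tp99_ms=200.0):
    """SLO check: availability at or above its lower limit, TP99 below its upper limit."""
    return availability(requests) >= min_availability and tp99(requests) < max_tp99_ms


# 1,000 requests: 999 fast successes and one slow failure (made-up data).
sample = [Request(latency_ms=20.0, ok=True) for _ in range(999)]
sample.append(Request(latency_ms=900.0, ok=False))
print(availability(sample), tp99(sample), meets_slo(sample))
```

In this sample the latency target is met, but a single failure in 1,000 requests is already enough to miss a 99.95% availability target.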

The benefits of setting service level objectives (SLOs) are mainly reflected in the following aspects:

1. For clients, SLO provides a predictable quality of service, which makes the client's system design simpler and more stable. Customers can plan and adjust their business processes based on SLO to reduce the impact of uncertainty.

2. For service providers, the benefits of SLO include:

Predictable service quality: SLO can help service providers clarify service quality standards and goals, so as to better manage and optimize services.
Better cost/benefit trade-off: By setting reasonable SLOs, service providers can find a balance between resource input and service quality, achieving more efficient resource utilization and cost control.
Better risk control: When resources are limited or failures occur, SLO can help service providers better control risks and avoid the impact of service quality degradation on the business.
Faster response and corrective measures in case of failure: SLO can be used as a reference standard to help service providers detect problems faster and take corrective measures when failures occur, thereby reducing failure recovery time.

Choosing a suitable SLO is a very complex process. The difficulty is that it may not be possible to determine a specific value. For example, the queries per second (QPS) of an external API is determined by its users, so we cannot set an SLO for that indicator. However, we can specify that the request TP99 must be less than 20ms. Setting this goal encourages developers to optimize code performance.

3) Service Level Agreement (SLA)

Definition: An explicit or implicit agreement between a service and its users. It describes the consequences of meeting or not meeting the SLO. These consequences can be financial (compensation) or other.

A simple way to differentiate between an SLO and an SLA is to ask, “What are the consequences if the SLO is not met?” If there are no clearly defined consequences, then we are definitely talking about an SLO, not an SLA.

Google search service is a typical service that does not have a public SLA.

4) Cloud Service Level Agreement (CSLA)

See Chapter 3 for details

picture

2. SLO Practice Case

We can use SLO practice cases to guide our stability work.

1) Service classification & core interface classification

Grading rules

Applications are divided into levels 0-3 based on business impact
Interfaces within an application are further classified as level 0 or level 1. For historical reasons, a level 0/1 system may expose N interfaces of which M are non-core, and a level 2 application may contain N level 0 interfaces. Pay attention to the robustness of core interfaces: whether availability governance has been completed, whether the interface can be degraded, and whether the rate-limit value is configured reasonably.

Grading purpose

• Based on service levels, core services are required to comply with corresponding standards (such as code review, online release, change process, and promotion management).

• Alarms are graded based on service levels.

• Different levels of stability require different SLOs, such as the ability to perform circuit breaker downgrades.

Precautions

1. Inconsistent application classification across the full link: for example, a single link contains level 0 applications, level 1 applications, and many level 3 applications.

Solution: Push downstream to update application levels

As shown in the figure below, the tool's scan found level 3 applications among the downstream dependencies of level 0/1 applications.

2. Inconsistent interface classification: for example, for a historical interface, Department A believes that the interface is not important from its own perspective, but the upstream department has a strong dependency on it. Sometimes the importance of the interface is not known until an online failure occurs.

Solution: Link tracking is needed to see the impact from the user's perspective (user shopping, logistics front-line operations), but this is sometimes difficult, especially for historical interfaces where the link is too long.

3. Inaccurate interface availability: for example, internal exceptions are swallowed by a catch block and never surface at the outermost layer of the interface, so the service can fail for half an hour while UMP still reports 100% availability.

Solution: Conduct code availability management to ensure that the API availability is real.
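To make the swallowed-exception problem concrete, here is a small sketch. The `metrics` counter and the `monitored` decorator are hypothetical stand-ins for a monitoring client (they are not the UMP API), and `load_stock` is an invented downstream call that always fails.

```python
from collections import Counter

metrics = Counter()  # hypothetical stand-in for a monitoring client


def monitored(func):
    """Count each call as a success or a failure based on raised exceptions."""
    def wrapper(*args, **kwargs):
        try:
            result = func(*args, **kwargs)
        except Exception:
            metrics[func.__name__ + ".fail"] += 1
            raise
        metrics[func.__name__ + ".ok"] += 1
        return result
    return wrapper


def load_stock(sku):
    """Invented downstream call that is currently broken."""
    raise ConnectionError("downstream unavailable")


@monitored
def query_swallowing(sku):
    try:
        return load_stock(sku)
    except Exception:
        return None  # failure is hidden: the monitor records a success


@monitored
def query_propagating(sku):
    return load_stock(sku)  # exception reaches the monitor and counts as a failure


query_swallowing("sku-1")
try:
    query_propagating("sku-1")
except ConnectionError:
    pass
print(dict(metrics))
```

Both functions hit the same broken dependency, but only the one that lets the exception reach the monitoring layer is counted as a failure, which is the gap that availability governance needs to close.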

2) SLI Practical Application

Select APIs that represent the core functions of the business and rank them according to the business grading model above. Generally, only L0 and L1 level businesses and their interfaces are measured. So what indicators should be defined? Why define these indicators? Are these indicators the most important for the service?

For example, UMP monitoring indicators include:

1) Method performance: TP50, TP90, TP99, TP999, TP9999, MIN, MAX, AVG

2) Method availability

3) Number of method calls

Do we need all these monitoring indicators as SLIs? Definitely not. Only by understanding what users really need from the system can we decide which indicators are useful. Too many indicators drown out the important ones, and too few cause important behaviors to be overlooked. Generally speaking, 3-5 representative indicators are enough to evaluate and track the health of the system.

Common service SLIs are divided into the following categories

picture

3) SLO Practical Application

We should start with what users care about most, rather than what can be monitored and measured currently (of course, UMP currently has many dimensions for monitoring and measurement, and some indicators can be used as SLIs). The first step in developing appropriate SLOs is to discuss what the SLOs should be and what they should cover. SLOs set a target reliability level for the customers of the service.

For a clearer definition, SLOs should specify how they are measured and the conditions under which they are valid. For example: 99% (aggregate of requests in a 1 minute timeframe) of JSF-API calls will complete within 200ms (including all backend servers)

picture

Select target SLO strategy recommendations

1. Don't pursue perfection: don't choose a target from the current operating status alone; start from the global perspective instead. You can set a loose target at first (provided users are satisfied) and gradually tighten it. For example, suppose the API's current TP99 is 8ms and the external SLO is set at a TP99 of 10ms. If later changes to the interface's internal logic or requirements push TP99 to 20ms, the SLO will no longer be met.

2. Leave a safe zone: use a stricter SLO internally and a looser SLO externally. This leaves room to respond to changing demand, and the buffer keeps internal misses from directly affecting users.

Your first attempt at SLIs and SLOs may not be correct. The most important goal is to take appropriate measurements and establish a feedback loop so that you can make improvements. As SLAs change, the system architecture must evolve with them. For example, if the previous SLA required a TP99 of 1000ms and the data was stored in MySQL, tightening the SLA to 20ms makes a MySQL-only architecture unsuitable, and it needs to evolve toward something like a Redis cache.

4) SLA practical application

R&D needs to work this out together with business and product colleagues. Currently, internal agreements are reached through email, while agreements with external cloud vendors are reached through contracts (which include compensation clauses for failing to meet the SLO).

3. Cloud Service Agreement (optional)

If you are not interested in this chapter, you can skip it.

1) National Standard: Basic Requirements for Information Technology Cloud Computing Service Level Agreement

picture

2) JD Cloud Host Service Level Agreement (SLA)

picture

3) Alibaba Cloud Server ECS Service Level Agreement

picture

4) AWS & Google Cloud Product Service Agreement

picture

4. API Gateway Service Level Agreement

The following is a definition of the availability of API gateway services from different vendors.

1) JD Cloud API Gateway Service Level Agreement (SLA)

picture

2) Alibaba Cloud API Gateway Service Level Agreement (SLA)

picture

3) Amazon Cloud API Gateway Service SLA

picture

5. SLO-based Alert Governance Practice

SLO is a key factor in reliability decision making. Applications receive numerous alarm notifications every day (CPU, memory, disk, availability, MAX, TP99, TP999, and so on), which generate a lot of noise. Which of these alarms actually affect availability and require SRE and R&D attention? Answering that question is one of the core values of SLO.

The goal of alert setting is to generate actionable alerts for important events based on SLO.

Basis for alarm setting: Based on service quality indicators (SLI) and error budget, an alarm notification of a major event is issued for each event that consumes a large amount of error budget.
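The error budget is the share of failures the SLO tolerates, that is, 1 minus the SLO. As a rough sketch of the idea, the snippet below converts an availability SLO into an allowed number of failed requests and flags an event that consumes a large part of it. The function names and the 10% paging threshold are assumptions chosen for illustration, not values from our configuration.

```python
def error_budget(slo_percent, total_requests):
    """Number of failed requests the SLO tolerates in the window."""
    return (1 - slo_percent / 100.0) * total_requests


def budget_consumed(failed_requests, slo_percent, total_requests):
    """Fraction of the window's error budget already used (1.0 = exhausted)."""
    return failed_requests / error_budget(slo_percent, total_requests)


def should_page(failed_requests, slo_percent, total_requests, threshold=0.1):
    """Raise a major-event alarm when one event eats a large share of the budget."""
    return budget_consumed(failed_requests, slo_percent, total_requests) >= threshold


# A 99.95% availability SLO over 10,000,000 requests tolerates about 5,000 failures.
print(error_budget(99.95, 10_000_000))
print(should_page(1_000, 99.95, 10_000_000))  # one event with 1,000 failures
print(should_page(100, 99.95, 10_000_000))    # one event with 100 failures
```

Alarms tied to budget consumption stay quiet for small blips and fire for events that genuinely threaten the SLO.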

The alarm information for the SLO configuration set above is as follows:

1) API alert configuration

We can set reasonable thresholds and rules to filter out unnecessary alarm information, thereby avoiding the interference of alarm noise on the development and operation team and allowing them to focus more on real problems.

• Use UMP to achieve 3 golden monitoring indicators (availability, call volume, TP99)

When configuring the alarm mechanism, we can comprehensively consider factors such as availability, TP99, and call volume for evaluation. The comprehensive evaluation of these indicators can help us understand the system operation more comprehensively, so as to discover potential problems in time and take corresponding measures.

It is recommended to start strict when configuring alarms, that is, tighten first and then loosen, gradually adjusting to the best state. This ensures that problems are discovered early and major failures are avoided. As the system stabilizes, the alarm thresholds can be relaxed appropriately according to actual conditions to improve system availability and efficiency.

Note that alarms need to be adjusted and optimized for the specific business scenario and system characteristics. Different systems have different risk points and bottlenecks, so the alarm strategy should be formulated from actual conditions to ensure the stability and reliability of the system.

For example, SLO: minute-level TP99 is 200ms, but your daily TP99 is 80ms. Based on the daily interface behavior, the critical alarm must be configured to be <= 200ms, and the warning can be configured to be 100ms or 150ms, etc., and continuously adjusted to the optimal state.

Critical alarm
• Notification channels: Dongdong, email, instant message (JingME), voice
• Availability (minute level): availability < 99.95% for 3 consecutive checks triggers an alarm; at most one alarm is sent within 3 minutes.
• Performance (minute level): TP99 >= 200.0ms for 3 consecutive checks triggers an alarm; at most one alarm is sent within 3 minutes.

Warning alarm
• Notification channels: Dongdong, email, instant message
• Availability (minute level): availability < 99.99% for 3 consecutive checks triggers an alarm; at most one alarm is sent within 30 minutes.
• Performance (minute level): TP99 >= 100.0ms for 3 consecutive checks triggers an alarm; at most one alarm is sent within 30 minutes.
• Number of calls: a total of more than 2,000,000 method calls in 1 minute for 3 consecutive checks triggers an alarm; at most one alarm is sent within 3 minutes.
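The "3 consecutive times, at most once within N minutes" rule can be sketched as a small state machine. This is only an illustration of the rule's logic under my reading of it; the class name `AlarmRule` and its parameters are made up, and this is not how UMP implements alarms.

```python
class AlarmRule:
    """Fire after N consecutive breaches, then stay silent for a suppression window."""

    def __init__(self, breached, consecutive=3, suppress_minutes=3):
        self.breached = breached            # predicate on the per-minute value
        self.consecutive = consecutive
        self.suppress_minutes = suppress_minutes
        self.streak = 0
        self.last_fired = None

    def observe(self, minute, value):
        """Feed one per-minute data point; return True if an alarm is sent."""
        self.streak = self.streak + 1 if self.breached(value) else 0
        if self.streak < self.consecutive:
            return False
        if self.last_fired is not None and minute - self.last_fired < self.suppress_minutes:
            return False
        self.last_fired = minute
        return True


# Critical rule from above: TP99 >= 200.0ms, 3 consecutive times, once within 3 minutes.
critical_tp99 = AlarmRule(lambda tp99_ms: tp99_ms >= 200.0, consecutive=3, suppress_minutes=3)
series = [80, 210, 220, 230, 240, 250, 260, 90]  # made-up per-minute TP99 values
fired = [m for m, v in enumerate(series) if critical_tp99.observe(m, v)]
print(fired)
```

The consecutive-breach requirement filters out one-off spikes, and the suppression window keeps a sustained incident from flooding the on-call channel.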

Note: Configuring a call-volume threshold for the critical (voice) alarm is generally meaningless: if the interface's availability is fine and TP99 is within expectations, a high call volume does not matter. Configuring call volume for the warning alarm, however, has an advantage. For example, if you have configured JSF rate limiting, hitting the limit will itself trigger an alarm, so you can add a call-volume threshold in UMP or Pfinder, such as 80% of the rate-limit value, to be alerted earlier. This way, you can find the risks caused by rate limiting in advance. Call volume also shows the traffic growth trend.

2) MQ alarm configuration

You need to configure corresponding alarms and emergency plans based on MQ latency (how long it takes for data to go from being produced to being consumed).

picture

• Producer T(1) monitoring: producer to MQ time monitoring.

• Consumer T(2.1) monitoring: monitored through MQ backlog alarm configuration.

• Consumer T(2.2) processing logic monitoring: configure UMP alarms for the processing logic.

Normal scenario: T(3) > T(1) + T(2.1) + T(2.2) where T(3) includes the time required to read the final data.

The configuration of the backlog alarm needs to be combined with the above time consumption formula and determined through stress testing.

Assume that with a backlog of 20,000 messages, the existing service capacity can process them within N milliseconds and still meet the condition [ T(3) > T(1) + T(2.1) + T(2.2) ]. The alarm thresholds can then be set below the 20,000-message backlog, such as a warning at 10,000 and a critical alarm at 15,000, to reserve processing time.

For example, suppose [ T(3) = 2000ms ], the daily maximum of [ T(1) ] is 20ms, and with 20,000 messages in the backlog [ T(2.1) = 980ms ] and [ T(2.2) = 1000ms ]. Then T(1) + T(2.1) + T(2.2) = 2000ms, which already reaches T(3), so a backlog of 20,000 is the limit and the alarms must fire well before it.
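The time-consumption formula and the threshold choice can be checked with a few lines of code. This is a sketch with made-up function names; the 50% and 75% ratios simply reproduce the 10,000 and 15,000 thresholds mentioned above and should be tuned through stress testing.

```python
def end_to_end_ms(t1_ms, t21_ms, t22_ms):
    """T(1) + T(2.1) + T(2.2): produce, wait in the backlog, process."""
    return t1_ms + t21_ms + t22_ms


def within_budget(t3_ms, t1_ms, t21_ms, t22_ms):
    """Normal scenario: T(3) > T(1) + T(2.1) + T(2.2)."""
    return t3_ms > end_to_end_ms(t1_ms, t21_ms, t22_ms)


def backlog_thresholds(max_backlog, warning_ratio=0.5, critical_ratio=0.75):
    """Place alarm thresholds below the stress-tested backlog limit."""
    return int(max_backlog * warning_ratio), int(max_backlog * critical_ratio)


print(end_to_end_ms(20, 980, 1000))        # total time at a 20,000-message backlog
print(within_budget(2000, 20, 980, 1000))  # is the T(3) budget still respected?
print(backlog_thresholds(20_000))          # (warning, critical) backlog sizes
```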

3) Scheduled task alarm configuration

Scheduled tasks differ from regular JSF-API interfaces or MQ in that they run at the times set by their scheduling rules. When UMP monitors a scheduled task, the most important thing is to determine the monitoring period. Only with a correctly configured monitoring period can UMP verify that the task ran within the expected time window; if the task fails to run within that window, the alarm mechanism is triggered automatically so the problem can be detected and resolved in time.

For example, if task xxx runs at 1:00 every day, we need to monitor whether it actually ran at 1:00.
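The idea of a monitoring period can be sketched as a check on the task's last run time. Everything here is an assumption for illustration: the function name `missed_run`, the 10-minute grace period, and the example dates are arbitrary, and UMP's actual configuration works differently.

```python
from datetime import datetime, timedelta


def missed_run(last_run, now, scheduled_hour=1, grace=timedelta(minutes=10)):
    """Return True if today's run is overdue: deadline passed and no run since the scheduled time."""
    scheduled = now.replace(hour=scheduled_hour, minute=0, second=0, microsecond=0)
    if now < scheduled + grace:
        return False  # today's deadline has not passed yet
    return last_run is None or last_run < scheduled


now = datetime(2024, 11, 11, 1, 30)                       # arbitrary example time
print(missed_run(datetime(2024, 11, 11, 1, 0, 5), now))   # task ran today
print(missed_run(datetime(2024, 11, 10, 1, 0, 5), now))   # task last ran yesterday
```

If the monitoring period were configured too wide (for example a whole day), yesterday's run would mask today's missed run, which is why the period matters most.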

picture

6. Questions and Answers

After reading this article, I believe you already know how to answer the 3 questions at the beginning of the article.

1) What is the relationship between SLA, TP99, and timeout?

1. TP99 needs little explanation: just look at the interface's UMP monitoring point. For example, the TP99 of the interface in this picture is 10-15ms.

picture

Because the upstream is the golden link before order placement, the performance requirements are relatively high. The SLA (actually an SLO) we promise externally is a minute-level TP99 of 20ms, with a 6ms buffer reserved.
2. Timeout setting: Set the timeout to the TP99 including network delay (or TP999, TP9999, or MAX for businesses with high requirements) plus a safety buffer chosen from experience or from monitoring data on daily behavior. For example, if the TP99 of the downstream interface is 200ms, you might set the timeout to 300ms to 400ms.
3. Retry count setting:

◦ The downstream interface must support idempotency before retrying

◦ Retry strategy: Determine the number of retries based on the business scenario and the stability of the downstream interface. Generally speaking, the number of retries should be small (1-3 times) to avoid increasing the burden on the system. Also consider whether the retries would cause you to miss your own external SLA, especially when the downstream is timing out badly.

◦ Exponential backoff: Using an exponential backoff strategy, the interval between each retry gradually increases (for example, wait 1ms for the first time, 2ms for the second time, and 4ms for the third time).

4. Implementation and monitoring: Monitor actual retry and timeout situations, and regularly evaluate the impact of retry and timeout strategies on system performance. Adjust strategies to suit actual conditions.

Please note that setting timeouts and retries is a decision process that requires balancing multiple factors. It needs to consider aspects such as system stability, response time, resource usage, and user experience. In addition, any retry strategy should avoid situations that may lead to cascading failures or resource exhaustion.
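The retry points above (idempotent calls only, a small retry count, exponential backoff, and not breaking your own SLA) can be sketched as follows. The function name `call_with_retry`, the 200ms budget, and the `flaky` downstream are assumptions for illustration; a real JSF client would have its own timeout and retry configuration.

```python
import time


def call_with_retry(func, max_retries=3, base_delay_s=0.001, sla_budget_s=0.2, sleep=time.sleep):
    """Call an idempotent function, retrying with exponential backoff.

    Stops early when the next wait would exceed the caller's own SLA budget.
    """
    start = time.monotonic()
    attempt = 0
    while True:
        try:
            return func()
        except TimeoutError:
            if attempt >= max_retries:
                raise
            delay = base_delay_s * (2 ** attempt)  # 1ms, 2ms, 4ms, ...
            if time.monotonic() - start + delay > sla_budget_s:
                raise  # retrying would break our own SLA
            sleep(delay)
            attempt += 1


calls = {"n": 0}
waits = []


def flaky():
    """Hypothetical downstream call: times out twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("downstream timed out")
    return "ok"


result = call_with_retry(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

Passing `sleep` as a parameter is only a convenience so the example runs instantly; the budget check is what ties the retry policy back to the SLA promised upstream.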

2) How is the team system availability calculated?

Question 2: The availability rate of an API interface in the system is 99.99%. In a department, there are N systems and M interfaces. The quarterly availability rate of the department is 99.98%. How is this number calculated?

Suppose a golden-process production system suffers a service failure that prevents the production business from operating. If the failure duration is A and the business impact ratio is B, then the effective unavailable duration is C = A × B.

Please refer to the calculation formula in question 3

For example, if the golden link fails for 150 minutes and affects 10% of the services, the system unavailability time is equivalent to 15 minutes.

Then the monthly system availability rate = (1 - total fault duration / total statistical period duration) × 100% = (1 - 15 / (30 × 24 × 60)) × 100% ≈ 99.965%
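The same arithmetic as a short script, so the numbers can be re-run for other fault durations and impact ratios. The function names are my own; the 30-day period matches the example above.

```python
def effective_downtime_minutes(fault_minutes, impact_ratio):
    """C = A * B: fault duration weighted by the share of business affected."""
    return fault_minutes * impact_ratio


def availability_percent(downtime_minutes, period_days=30):
    """(1 - downtime / period) * 100 for a period measured in days."""
    period_minutes = period_days * 24 * 60
    return (1 - downtime_minutes / period_minutes) * 100


downtime = effective_downtime_minutes(150, 0.10)
print(downtime, round(availability_percent(downtime), 3))
```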

3) Cloud host & gateway API availability

Question 3: Suppose you purchased 100 cloud hosts from XXX Cloud (covered by the cloud host SLA), and 10 of them failed during the 5 minutes from 10:00 to 10:05, resulting in an API availability rate of 90% (the total number of requests within those 5 minutes was 10,000, and the number of failed requests was 1,000). If this 5-minute failure occurs once a day for 30 days in a month, what is the availability rate for users?

The question can be answered from two dimensions:

From the perspective of the cloud service level agreement for the underlying infrastructure (this is basically the same across cloud vendors).
From the perspective of the user-facing API (different vendors define different formulas).
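For the user-facing API dimension, one simple request-based calculation is sketched below. It is not any vendor's contractual formula, and it rests on an assumption the question does not state: that traffic is constant at 10,000 requests per 5 minutes for the whole month.

```python
def request_availability(total_requests, failed_requests):
    """Share of successful requests, as a percentage."""
    return (1 - failed_requests / total_requests) * 100


WINDOWS_PER_DAY = 24 * 60 // 5   # number of five-minute windows in a day
REQUESTS_PER_WINDOW = 10_000     # assumed constant all day (not given in the question)
FAILED_PER_INCIDENT = 1_000      # one incident per day
DAYS = 30

total = WINDOWS_PER_DAY * REQUESTS_PER_WINDOW * DAYS
failed = FAILED_PER_INCIDENT * DAYS

print(request_availability(10_000, 1_000))             # inside the failing window
print(round(request_availability(total, failed), 3))   # across the whole month
```

Under this assumption the monthly result matches the time-based calculation in question 2: 30 incidents of 5 minutes at 10% impact are equivalent to 15 minutes of full unavailability.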

No product is, or should be, 100% reliable, because for users there is no essential difference between 99.999% and 100% availability. So what is the right level? Factors to consider when setting reliability goals:

What level of service reliability is required for users to be satisfied?
If reliability is not enough, are there other alternatives?
Does the reliability of a service affect the user's usage patterns of the service?
Is it directly related to income?
Is the service for consumers or businesses?

7. Technical indicators & business indicators

As mentioned above, an SLA defines technical indicators such as service availability and performance. So what problems do business indicators solve? They cover what technical indicators (availability, TP99) cannot see: the correctness and integrity of data.

The technical indicators in the SLA and the data accuracy tracked by business monitoring are usually interrelated. If the technical indicators show the service is unavailable, the business indicators will certainly be affected. The reverse does not hold: abnormal business indicators do not necessarily mean the technical indicators are failing. For example, if the availability of a system falls below the threshold defined in the SLA, the normal operation of the business process may be affected, resulting in data errors or loss. Therefore, to ensure business continuity and data accuracy, SLA and business monitoring usually need to be considered and managed together.

8. "Start with the end in mind" SLA guides preparations for the 11.11 promotion

SLA can be used as a powerful tool to guide the preparations for the 11.11 promotion. The details are as follows:

1. Clarify service goals: Upstream and downstream parties should make their SLA commitments explicit (service performance TP99, availability, peak QPS, and so on). These goals should match the business needs of the 11.11 promotion, such as system stability and fast response under peak traffic.

2. Develop a contingency plan: Based on the requirements of the SLA, develop a detailed contingency plan, including resource allocation, system optimization, and how to quickly recover from disasters. For example, if the SLA requires the system to have 99.99% availability during peak hours, then the contingency plan needs to consider how to achieve this goal, your fault tolerance plan, downgrade plan, and emergency plan, etc.

3. Military exercise full-link stress testing and capacity planning: In order to ensure that SLA requirements are met during the promotion period, performance stress testing, military exercise full-link stress testing and capacity planning are required. Through stress testing of high-traffic scenarios, the performance and stability of the system are verified, and resource configuration and system optimization strategies are adjusted according to the test results.

4. Monitoring and alert system: Establish a comprehensive monitoring and alert system to track the performance and health of the service in real time. If a metric exceeds the SLA threshold, the system should automatically trigger an alert to notify the relevant team to take action.

5. Priority management: During the promotion period, set the priority of the issues according to the importance and impact scope of the SLA to ensure that the most critical services are given priority.

6. Teamwork and communication: SLA requires close collaboration and efficient communication between teams. Establish a cross-departmental collaboration mechanism, clarify the responsibilities and SLA goals of each team, and ensure that all teams work towards the same goal.

7. Continuous improvement: After the 11.11 promotion is over, review the entire process, analyze the SLA achievement, and find room for improvement. Apply these experiences and lessons to the preparation for the next year and continuously improve service quality.

Appendix: 1) SLO document example

Service Overview

Common ones include API and MQ messages

picture

Instructions and Warnings

• Request metrics are measured at the load balancer. This metric may not accurately measure situations where user requests do not reach the load balancer.

• We only consider HTTP 5XX status or the Code error code message in the agreed API Response as error codes; everything else is considered success.