Slow SQL Management Sharing


1. Why Do This?

1 What is Slow SQL?

Slow SQL refers to MySQL slow queries: statements whose execution time exceeds the long_query_time threshold.

MySQL maintains several kinds of logs: the binary log (binlog), the relay log, the redo log, the undo log, and so on. For slow queries there is also the slow query log (slowlog), which records statements whose response time exceeds the threshold.

Do not be misled by the name: the slow query log records not only SELECT statements but also DML statements such as INSERT and UPDATE whose execution time exceeds long_query_time.

```sql
-- Check whether the slow query log is enabled
SHOW VARIABLES LIKE "slow_query_log%";
-- View the slow query threshold, in seconds
SHOW VARIABLES LIKE "long_query_time";
```

For the AliSQL-X-Cluster (XDB) we use, the slow query log is enabled by default and long_query_time is set to 1 second.

2 Why do slow queries cause failures?

Genuinely slow SQL is usually accompanied by large row scans, temporary-file sorts, or frequent disk flushes, all of which drive up disk I/O. As I/O saturates, previously normal SQL also turns into slow SQL, and execution timeouts spread across the application.

After last year's Double 11, in response to the problems exposed on the technical side, Cainiao's CTO line launched a number of special governance projects, each sponsored by a CTO-D. The large team I was in was responsible for the slow SQL governance project.

2. To What Extent?

1 How do we measure the severity of an application's slow SQL?

Micro average

```
              sum(slow SQL executions of the Aone application)
micro avg = ----------------------------------------------------
               sum(all SQL executions of the Aone application)
```
We believe that the larger the value, the greater the impact; the smaller the value, the smaller the impact may be.

At one extreme, every SQL statement the application executes is slow and the value is 1; at the other extreme, no statement is slow and the value is 0.

However, this indicator discriminates poorly: when SQL QPS is very high and most queries are not slow, the occasional slow SQL is drowned out.

Another question: are occasional slow SQL statements really slow? We have seen many statements recorded in the slow query log that are not slow under normal conditions; they were pushed over the threshold by other slow SQL on the same instance, MySQL disk jitter, or a poor optimizer choice.
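The micro average can be sketched from per-template execution counts. This is illustrative Python with made-up numbers (the `stats` structure is my assumption, not the actual reporting pipeline), showing how one high-QPS healthy query drowns out a mostly-slow one:

```python
def micro_average(stats):
    """Micro average: total slow executions / total executions.
    `stats` maps SQL template -> (executions, slow_executions)."""
    total = sum(execs for execs, _ in stats.values())
    slow = sum(s for _, s in stats.values())
    return slow / total if total else 0.0

# Made-up numbers: a high-QPS healthy query masks a mostly-slow one.
stats = {
    "SELECT ... WHERE id = ?": (100000, 3),
    "SELECT ... ORDER BY gmt_modified": (50, 40),
}
print(micro_average(stats))  # tiny value despite a badly behaving query
```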

Macro average

```
            sum(slow SQL 1 executions)           sum(slow SQL n executions)
            --------------------------  +  …  +  --------------------------
              sum(SQL 1 executions)                sum(SQL n executions)
macro avg = ---------------------------------------------------------------
                                       n
```

This algorithm assumes that each captured slow SQL template has been executed a meaningful number of times, which reduces the impact of "false" slow SQL.

However, when an application's QPS is very low, i.e. it executes very few SQL statements per day, a single false slow SQL badly skews the statistic.
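The macro average and its low-QPS weakness can be sketched the same way (illustrative Python; the numbers are made up):

```python
def macro_average(stats):
    """Macro average: each captured slow SQL template contributes its own
    slow-execution ratio; the ratios are then averaged over n templates.
    `stats` maps slow SQL template -> (executions, slow_executions)."""
    if not stats:
        return 0.0
    ratios = [slow / execs for execs, slow in stats.values()]
    return sum(ratios) / len(ratios)

# Made-up numbers: on a low-QPS template, one 'false' slow hit out of
# two executions yields a 0.5 ratio that dominates the average.
stats = {
    "high-qps template": (10000, 100),  # ratio 0.01
    "low-qps template": (2, 1),         # ratio 0.50
}
print(macro_average(stats))
```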

Execution times

```
            sum(slow SQL executions of the Aone application in the past week)
daily avg = -----------------------------------------------------------------
                                           7
```
Counting the average number of slow SQL executions per day over the past week eliminates the false-slow-SQL problem the macro average suffers from.

Number of slow SQL templates

The above dimensions are all time-bounded. To track the historical handling of slow SQL, we also introduced the number of distinct global slow SQL templates:

count(distinct(aone application slow SQL template) )
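A slow SQL "template" here is the statement with its literal values masked out, so variants of one query count once. A rough sketch of such normalization (illustrative Python; production fingerprinting tools such as pt-query-digest are far more thorough):

```python
import re

def sql_template(sql: str) -> str:
    """Mask literals so variants of one query map to a single slow
    SQL template. Rough sketch only; real fingerprinting handles
    many more cases (comments, hex literals, VALUES lists, ...)."""
    t = sql.strip().lower()
    t = re.sub(r"'[^']*'", "?", t)                # string literals -> ?
    t = re.sub(r"\b\d+\b", "?", t)                # numeric literals -> ?
    t = re.sub(r"\bin\s*\([^)]*\)", "in (?)", t)  # collapse IN lists
    t = re.sub(r"\s+", " ", t)                    # normalize whitespace
    return t

queries = [
    "SELECT * FROM orders WHERE user_id = 42",
    "SELECT * FROM orders WHERE user_id = 43",
    "SELECT * FROM orders WHERE user_id IN (1, 2, 3)",
]
templates = {sql_template(q) for q in queries}
print(len(templates))  # 2 distinct templates
```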

2 Objectives

Core applications: eliminate all slow SQL
Ordinary applications: reduce the micro-average indicator by 50%

3 CTO Report

Based on a weighted average of the above indicators, applications are ranked from worst to best within each CTO-D, and the worst three are highlighted in a weekly report.
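As an illustration of folding the indicators into one ranking, here is a hypothetical scoring sketch. The weights and the min-max normalization are my assumptions for illustration, not the formula actually used in the report:

```python
# Hypothetical weighted score: weights and min-max normalization are
# assumptions, not the project's actual reporting formula.
def normalize(values):
    """Min-max normalize one metric across all applications."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def rank_apps(apps, weights):
    """apps: {name: (micro_avg, macro_avg, daily_slow_execs, template_count)}
    Returns application names ranked worst-first by weighted score."""
    names = list(apps)
    metrics = list(zip(*apps.values()))        # one tuple per metric
    norm = [normalize(list(m)) for m in metrics]
    scores = {
        name: sum(w * norm[j][i] for j, w in enumerate(weights))
        for i, name in enumerate(names)
    }
    return sorted(names, key=scores.get, reverse=True)

apps = {
    "app-a": (0.001, 0.20, 500, 12),
    "app-b": (0.0001, 0.05, 20, 2),
    "app-c": (0.01, 0.40, 900, 30),
}
print(rank_apps(apps, (0.4, 0.3, 0.2, 0.1)))  # worst application first
```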

3. Why Me?

I guess it is related to my background: coming from C/C++, I was responsible for designing and implementing a company-level multi-site active-active architecture at my previous company, so I understand MySQL relatively well.

It may also be that I had no stake in the outcome: my small team's business had just started and had no slow SQL of its own, so I could step into every business line.

4. Supporting Actions

1 The Group's MySQL Conventions

Excerpt from the index convention:

[Mandatory] Do not join more than three tables. The data types of joined fields must be exactly the same; in multi-table queries, make sure the join fields are indexed.

Note: Even when joining two tables, you must pay attention to table indexes and SQL performance.

[Mandatory] When creating an index on a varchar field, the index length must be specified; it is unnecessary to index the whole field. Choose the length based on the actual discrimination of the text.

Note: index length and discrimination are in tension. For string data, an index of length 20 usually achieves discrimination of 90% or more, which can be checked with count(distinct left(column, index_length)) / count(*).
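The same discrimination check can be sketched offline on a sample of column values. Illustrative Python; `prefix_selectivity` and the sample data are hypothetical, the real check is the SQL expression above:

```python
def prefix_selectivity(values, length):
    """Fraction of distinct values kept when only the first `length`
    characters are indexed (1.0 means the prefix is as selective as
    the full column). Hypothetical helper over sampled values."""
    if not values:
        return 0.0
    return len({v[:length] for v in values}) / len(values)

# Made-up sample column values.
addresses = [
    "zhejiang-hangzhou-xihu-001",
    "zhejiang-hangzhou-xihu-002",
    "zhejiang-ningbo-yinzhou-001",
    "jiangsu-nanjing-gulou-001",
]
print(prefix_selectivity(addresses, 8))   # short prefix: 0.5
print(prefix_selectivity(addresses, 26))  # longer prefix: 1.0
```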

[Mandatory] Left-fuzzy (LIKE '%x') and full-fuzzy (LIKE '%x%') matching are strictly prohibited in page queries; if such search is needed, use a search engine instead.

Note: index files follow the leftmost-prefix matching rule of the B-tree; if the leading part of the value is not fixed, the index cannot be used.

[Recommended] Guard against implicit conversions caused by mismatched field types; they can render indexes unusable.

[Reference] Avoid these extremes when creating indexes:

1) Better too many than too few

Believing that every query needs its own index.

2) Being stingy with indexes

Believing that indexes consume space and seriously slow down updates and inserts.

3) Resisting unique indexes

Believing that uniqueness should always be enforced at the application layer via "check first, then insert".

2 DB change standards

DDL changes must control the rollout speed, with attention to grayscale release and concurrency control, and must be released within the specified change window.

5. Optimization Examples I Took Part In

1 Uneven data distribution

1) Unreasonable database and table division

The business data is split into 8 databases with 16 tables each, yet the table spaces show the data concentrated in just two tables per database. The sharding strategy was flawed and the business growth was overestimated; I have reservations about this.

2) Improper indexing

The single table had a composite index idx_logistics_corp_id_special_id, but even so its discrimination was too low. Experiments and business feedback showed that the field combination (logistics_corp_id, transport_type_id) has very high discrimination, and the business also queries by transport_type_id alone.

2 Indexing issues

```sql
SELECT COUNT(0) AS `tmp_count`
FROM (
    SELECT
        `table_holder`.`user_id`,
        `table_holder`.`sc_item_id`,
        SUM(CASE `table_holder`.`inventory_type` WHEN 1 THEN `table_holder`.`quantity` ELSE 0 END) AS `saleable_quantity`,
        SUM(CASE `table_holder`.`inventory_type` WHEN 1 THEN `table_holder`.`lock_quantity` ELSE 0 END) AS `saleable_lock_quantity`,
        SUM(CASE `table_holder`.`inventory_type` WHEN 401 THEN `table_holder`.`quantity` ELSE 0 END) AS `transfer_on_way_quantity`,
        `table_holder`.`store_code`,
        MAX(`table_holder`.`gmt_modified`) AS `gmt_modified`
    FROM `table_holder`
    WHERE (`table_holder`.`is_deleted` = 0)
      AND (`table_holder`.`quantity` > 0)
      AND `table_holder`.`user_id` IN (3405569954)
      AND `table_holder`.`store_code` IN ('ZJJHBHYTJJ0001', '...more than 1000')
    GROUP BY `table_holder`.`user_id`, `table_holder`.`sc_item_id`
    ORDER BY `table_holder`.`user_id` ASC, `table_holder`.`sc_item_id` ASC
) `a`;
```

The table in this case has an index on store_code, so at first it seemed there was nothing to optimize. The execution plan, however, showed MySQL choosing a full table scan. Experimenting further, we found that once the IN list exceeded roughly 200 values, the optimizer stopped using the index on that field (in MySQL this cutoff is presumably related to eq_range_index_dive_limit, which defaults to 200 in recent versions).

Finally, after pulling the relevant queries from the recent period and combining them with the business's data distribution, we found that an index on (is_deleted, quantity) solved the problem.

To confirm which index columns the execution plan actually uses, check key_len; the calculation rules (MySQL >= 5.6.4) are:

```
char(10),    NULL allowed:  10 * (charset bytes: utf8mb4=4, utf8=3, gbk=2, latin1=1) + 1 (NULL)
char(10),    NOT NULL:      10 * (charset bytes)
varchar(10), NULL allowed:  10 * (charset bytes) + 1 (NULL) + 2 (variable-length field)
varchar(10), NOT NULL:      10 * (charset bytes) + 2 (variable-length field)
int,         NULL allowed:  4 + 1 (NULL)
int,         NOT NULL:      4
timestamp,   NULL allowed:  4 + 1 (NULL)
timestamp,   NOT NULL:      4
datetime,    NULL allowed:  5 + 1 (NULL)
datetime,    NOT NULL:      5
```
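These rules can be folded into a small calculator. A minimal sketch in Python, mirroring the listed formulas (the function name and signature are my own):

```python
# Illustrative key_len calculator for EXPLAIN output (MySQL >= 5.6.4).
CHARSET_BYTES = {"utf8mb4": 4, "utf8": 3, "gbk": 2, "latin1": 1}
FIXED_BYTES = {"int": 4, "timestamp": 4, "datetime": 5}

def key_len(col_type, nullable, length=0, charset="utf8mb4"):
    """Bytes contributed to EXPLAIN's key_len by one indexed column."""
    if col_type in ("char", "varchar"):
        n = length * CHARSET_BYTES[charset]
        if col_type == "varchar":
            n += 2              # length bytes of a variable-length field
    else:
        n = FIXED_BYTES[col_type]
    if nullable:
        n += 1                  # one byte for the NULL flag
    return n

print(key_len("varchar", True, 10, "utf8mb4"))  # 10*4 + 1 + 2 = 43
```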

3 Affected by other instances on the same host

Even though the index was used, it still required scanning 20 million rows:

The index fields are highly discriminative:

At the same time, regular SQL becomes a slow query:

DB data disk access status:

Checking the other instances on the same physical machine, we found one database with many slow SQL statements that needed sorting around the time of the problem; its temporary files had just written 2 GB:

Multiple MySQL instance leader nodes were deployed on the same physical machine. Although Docker isolates CPU, memory, and similar resources, buffer I/O isolation is not yet achieved.

4 What cannot be solved

Slow queries encountered daily can often be solved by summarizing and analyzing high-frequency queries and combining them with the business to derive appropriate indexes, but this is not a panacea.

For example, there may be more and more indexes, even like this:

In some scenarios, such as supporting combined queries of multiple fields without mandatory items, it is obviously unreasonable to support all of them through indexes.

In query scenarios, making the most discriminating fields required is a good habit; when there are many query combinations, consider a storage or search engine with better search support.

6. Daily Governance

With each CTO-D line's deep involvement, all indicators improved markedly over the baseline: core applications cleared their slow queries, the most impactful slow SQL was resolved, and my team's ranking moved from the worst three to the best three.

Slow SQL governance has become routine: slow SQL work orders are pushed weekly, owners take them over and close them, and a habit of regularly clearing the backlog has formed. The slow SQL governance project has been commended by name multiple times.

7. Conclusion

This is a belated summary. Looking back, I think the process of formulating strategy, analyzing problems, and solving them is worth sharing.
