How to Cut an Oxen - Illustrated MySQL 8.0 Optimizer Query Parsing

How to Cut an Oxen - Illustrated MySQL 8.0 Optimizer Query Parsing

[[423739]]

1. Background and Architecture

We all know that we can use programs to dynamically implement the logic required by our applications, so that we can get the results we need when the programs are executed. Then the database is an application that quickly obtains data by inputting SQL strings. Of course, assuming there is no system application such as a database, how can it be implemented with a program? We may find that no matter how the data is stored or whether the data is accessed concurrently, we still need to constantly modify the program to handle different requests for data from different applications. For example, in the field of big data, we usually obtain data through the API of non-relational databases. However, although this method is easy to get started, it is extremely difficult to maintain and is not very versatile. Even if you continue to design the software architecture or abstract reconstruction, you still need to constantly change the application. This is why non-relational databases turn back to embrace database SQL optimizers.

The SQL optimizer is essentially an implementation of a highly abstract data interface. With this design, customers can use the more general and easy-to-understand SQL language to operate and process data without having to pay attention to and abstract their own data interfaces, which greatly liberates customer applications.

This article will use graphical explanations to explain how the MySQL 8.0 SQL optimizer turns a simple string (SQL) into an execution sequence that the database executor can understand, and finally returns the data to the customer. A powerful optimizer does not require customers to pay attention to how to write SQL better to obtain the required data faster, so the optimizer will definitely make some equivalent changes to the original SQL. In the article "Detailed Explanation of the Latest Architecture of the Server Layer in MySQL 8.0", we focused on the overall introduction of the latest version of MySQL on the server layer parser, optimizer, and executor, including a detailed display of some code structures and changes, and used the simple_joins function to show how the MySQL optimizer simplifies the optimization of nested Joins in logical transformation. In this article, we will take you step by step into the magical optimizer details, and understand in detail how each step of the optimizer optimization part changes the final execution of a SQL.

This article is based on the latest MySQL version 8.0.25. Since the optimizer conversion part is quite long, we divide it into two articles. The first part introduces the parsing conversion process based on the basic structure of Setup and Resolve. The second part introduces the complex conversion process of more complex subqueries, partition tables and joins. The outline is as follows:

Setup and Resolve

  1. setup_tables : Set up table leaves in the query block based on list of tables.
  2. resolve_placeholder_tables/merge_derived/setup_table_function/setup_materialized_derived : Resolve derived table, view or table function references in query block.
  3. setup_natural_join_row_types : Compute and store the row types of the top-most NATURAL/USING joins.
  4. setup_wild : Expand all '*' in list of expressions with the matching column references.
  5. setup_base_ref_items : Set query_block's base_ref_items.
  6. setup_fields : Check that all given fields exists and fill struct with current data.
  7. setup_conds : Resolve WHERE condition and join conditions.
  8. setup_group : Resolve and set up the GROUP BY list.
  9. m_having_cond- > fix_fields : Setup the HAVING clause.
  10. resolve_rollup : Resolve items in SELECT list and ORDER BY list for rollup processing.
  11. resolve_rollup_item : Resolve an item (and its tree) for rollup processing by replacing items matching grouped expressions with Item_rollup_group_items and updating properties (m_nullable, PROP_ROLLUP_FIELD). Also check any GROUPING function for incorrect column.
  12. setup_order : Set up the ORDER BY clause.
  13. resolve_limits : Resolve OFFSET and LIMIT clauses.
  14. Window::setup_windows1: Set up windows after setup_order() and before setup_order_final().
  15. setup_order_final: Do final setup of ORDER BY clause, after the query block is fully resolved.
  16. setup_ftfuncs : Setup full-text functions after resolving HAVING.
  17. resolve_rollup_wfs : Replace group by field references inside window functions with references in the presence of ROLLUP.

2. Detailed conversion process

The entire framework of the conversion is a top-down or bottom-up process from Query_expression to Query_block calling the prepare function (sql/sql_resolver.cc) according to the requirements of different conversion rules.

1 Pass null to the inner table list of join (propagate_nullability)

Prepare starts with processing nullable tables, which means that a table may contain rows that are all null. According to the JOIN relationship (top_join_list), null rows can be propagated. If a table is determined to be nullable, some optimizations will be degraded, such as the access method cannot be EQ_REF, and outer join cannot be optimized to inner join.

2 Parsing the leave_tables (setup_tables) of the setup query block

  1. SELECT t1.c1FROM t1, ( SELECT t2.c1 FROM t2, ( SELECT t3.c1 FROM t3 UNION            SELECT t4.c1 FROM t4) AS t3a) AS t2a;

Before setup_table is called, leaf_tables of each Query_block is 0.

The function is to build leaf_tables, including base tables and derived tables, for subsequent optimization. setup_tables will not be called recursively, but only solve the tables of this layer and count the number of derived tables of this layer. However, it will then call resolve_placeholder_tables()->resolve_derived()->derived(Query_expression)::prepare->Query_block::prepare to specifically recursively process the Query_expression corresponding to the derived table.

Next, we will continue to look at the resolve_placeholder_tables function for derived table processing according to the call order of prepare.

3. Resolve query block Derived Table, View, Table function (resolve_placeholder_tables)

This function is used to process derived tables, views, and table functions. If the table has been merged, or if it is called using transform_grouped_to_derived() and the materialized table method has been decided, it will be ignored.

The function of resolve_derived() has been introduced before. Now let’s focus on the merge_derived() function. merge_derived changes the Query_expression/Query_block framework structure and merges the derived table or view into the query block.

merge_derived processes and merges Derived table

1) Prerequisites for merge_derived transformation

Whether the outer query block allows merge (allow_merge_derived) The outer query block is nullptr The subquery of the outer query expression is nullptr, and the derived table is the first-level subquery The outer query block of the outer layer can have allow_merge_derived=true, or if the outer query block does not include the outer layer, is it SELECT/SET
Whether the outer lex can support merge (lex->can_use_merged()+lex->can_no_use_merged())
Whether the derived table has been marked as requiring materialization, for example, the method for creating a view is CREATE ALGORITHM=TEMPTABLE VIEW (derived_table->algorithm == VIEW_ALGORITHM_TEMPTABLE)
The query expression unit where the entire dervived table is located cannot be (Query_expression::is_mergeable()): The Union query contains aggregation, HAVING, DISTINCT, WINDOWS or LIMIT without any table list
HINT or optimizer_switch does not prohibit derived_merge
Heuristic suggestion merge (derived_query_expressionmerge_heuristic()) is not supported if the derived table contains a subquery SELECT list that depends on its own columns. It is not supported if the dependant subquery needs to be executed multiple times.
If the query block in the derived table contains SEMI/ANTI-JOIN and STRAIGHT_JOIN is specified, it is not supported. If the leaf table count of the derived table to be merged and the existing query block is approximately MAX_TABLES, it is not supported.
2) The transformation process of merge_derived transformation

Use the derived_table->nested_join structure to assist in handling OUTER JOIN situations.
Merge the tables in the derived table into the NESTED_JOIN structure (derived_table->merge_underlying_tables())
Connect all tables in the derived table to the table_list list of the parent query, and delete the derived table from the parent query.
All related data structures of the parent query are recalculated (leaf_table_count, derived_table_count, table_func_count, materialized_derived_table_count, has_sj_nests, has_aj_nests, partitioned_table_count, cond_count, between_count, select_n_having_items)
Propagate settings to the parent query OPTION_SCHEMA_TABLE (add_base_options()) and if it is the inner table of the outer query JOIN, propagate settings nullable attributes (propagate_nullability())
Merge the derived table’s where conditions into the outer query (merge_where())
Create a reference to the derived table column to be retrieved (create_field_translation())
Remove the derived table structure from the parent query (exclude_level())
Merge the renaming of columns or tables in the derived table into the parent query (fix_tables_after_pullout()/repoint_contexts_of_join_nests())
Because the tables included in the derived table have been merged into the parent query, the tables in TABLE_LIST need to be relocated (remap_tables())
After merging the derived table into the parent query, you need to modify all references to all columns in the derived table in the original derived table (fix_tables_after_pullout())
If the derived table contains an ORDER BY clause, the derived table will retain the ORDER BY and be merged into the parent query if the following conditions are met. Otherwise, the ORDER BY will be ignored: If the parent query allows sorting and it happens that only the derived table is not a UNION and can have a WHERE condition, but cannot have a group by or the aggregate function itself is not ordered, the process is simplified to:

merge_derived diagram process

It seems that the official derived merge is still not perfect and cannot perform a bottom-up recursive merge.
Included opt trace:

  1. trace_derived.add_utf8_table(derived_table) .add( "select#" , derived_query_block->select_number) .add( "merged" , true );trace_derived.add_alnum( "transformations_to_derived_table" , "removed_ordering" );

This optimization can be controlled by set optimizer_switch="derived_merge=on/off".

setup_materialized_derived Set up the materialized Derived Table

For the remaining derived tables that cannot be processed by the merge algorithm, they will be processed by the materialize method. However, at this time, only some preprocessing such as variable settings is performed, and the actual materialization execution is performed in the executor stage.

setup_materialized_derived_tmp_table(): Sets up a temporary table containing all row data of the materialized Derived Table.
check_materialized_derived_query_blocks(): Sets the query block structure belonging to the current Derived Table.

  1. trace_derived.add_utf8_table( this ) .add( "select#" , derived->first_query_block()->select_number) .add( "materialized" , true );

setup_table_function handles table functions

If there is a table function in the query block, the whole process will be processed twice. The first time, the table of the table function will be skipped, and the second time, the above logic will be executed again on the table of the table function. The consideration here should be to resolve the external environment (relative to the table function) first, because it is possible that the function parameters will depend on the external derived table.

  1. trace_derived.add_utf8_table( this ) .add_utf8( "function_name" , func_name, func_name_len) .add( "materialized" , true );

4 Expand the wildcard of SELECT * into specific fields (setup_wild)

5. Set up base_ref_items at the query_block level (setup_base_ref_items)

base_ref_items records the location of all items, so that other items in the query block can refer to them, or directly refer to them through Item_ref and its Item_ref subclasses, such as subquery references (Item_view_ref), aggregate function references (Item_aggregate_ref), outer query column references (Item_outer_ref), and subquery reference assistance for NULL values ​​(Item_ref_null_helper).

Let's take an example to illustrate a more complex Item_outer_ref:

6 Fix_fields() and column permission check for select_fields (setup_fields)

The following figure shows a more complex fixed field process with subqueries. Some fields are associated with tables, and some need to add corresponding Item_xxx_ref references.

7 Parsing and fixed_fields WHERE conditions and Join conditions (setup_conds)

If there is nested_join in setup_join_cond, setup_join_cond will be recursively called for parsing and setting. Here we also introduce the function of simplify_const_condition function. If a const Item that can be deleted is found, Item_func_true/Item_func_false will be used to replace the entire condition, as shown in the figure.

8 Parsing and setting ROLLUP statements (resolve_rollup)

In a database query statement, adding the WITH ROLLUP statement after the GROUP BY expression allows you to perform analysis and statistics on data at different levels through a single query statement.

  1. SELECT   YEAR , country, product, SUM (profit) AS profitFROM salesGROUP BY   YEAR , country, product WITH   ROLLUP ;+ ------+---------+------------+--------+| year | country | product | profit |+------+---------+------------+--------+| 2000 | Finland | Computer | 1500 || 2000 | Finland | Phone | 100 || 2000 | Finland | NULL | 1600 || 2000 | India | | 1200 || 2000 | India | NULL | 1350 || 2000 | USA | Calculator | 75 || 2000 | USA | Computer | 1500 || 2000 | USA | NULL | 1575 || 2000 | Finland | NULL | 10 || 2001 | USA | Calculator | 50 || 2001 | USA | Computer | 2700 || 2001 | USA | TV | 250 || 2001 | USA | NULL | 3000 || 2001 | |+------+---------+------------+--------+ is equivalent to the following query: SELECT *FROM (SELECT YEAR, country, product, SUM(profit) AS profit FROM sales GROUP BY YEAR, country, product UNION ALL SELECT YEAR, country, NULL, SUM(profit) AS profit FROM sales GROUP BY YEAR, country UNION ALL SELECT YEAR, NULL, NULL, SUM(profit) AS profit FROM sales GROUP BY YEAR UNION ALL SELECT NULL, NULL, NULL, SUM(profit) AS profit FROM sales) AS sum_tableORDER BY YEAR, country, product;+------+---------+------------+--------+| YEAR | country | product | profit |+------+----------+------------+--------+| NULL | 2000 | Finland | Computer | 1500 || 2000 | Finland | Phone | 100 || 2000 | India | NULL | 1350 || 2000 | India | Calculator | 150 || 2000 | India | 75 || 2000 | USA | Computer | 1500 || 2001 | NULL | NULL | 3010 || 2001 | Finland | NULL | 10 || 2001 | Finland | Phone | 10 || 2001 | USA | NULL | 3000 || 2001 | USA | || 2001 | USA | TV | 250 |+------+---------+------------+--------+  

Sorting has the problem of NULL, so the hierarchical aggregation is very difficult to achieve. In addition, the SQL complexity changes when the group column changes. ROLLUP can achieve the effect very easily. Let's see what kind of conversion rollup does in the parsing process to achieve unexpected results.

9 Parsing and setting GROUP BY/ORDER BY statements (setup_group/setup_order)

One of the functions, find_order_in_list(), tries to find a column that can be mapped in the select fields. Otherwise, the current column must be added to the final projected all fields, and fix_fields is also performed.

m_having_cond->fix_fields: fix_fields for the having condition.
resolve_limits : Processes OFFSET and LIMIT clauses (Items of offset_limit and select_limit).
setup_ftfuncs: If there is a full-text function, perform fix_fields on the related Item.

remove_redundant_subquery_clause : For the expression of Table Subquery, usually IN/ANY/ALL/EXISTS/etc, if there is no aggregate function and Having clause, you can usually consider removing unnecessary ORDER/DISTINCT/GROUP BY. This function supports three types: REMOVE_ORDER | REMOVE_DISTINCT | REMOVE_GROUP. If it is a subquery of SINGLEROW_SUBS, only consider removing REMOVE_ORDER.

  1. select c1 from t1 where t1.c2 in ( select   distinct c1 from t2 group   by c1, c2 order   by c1); is transformed into => select c1 from t1 where t1.c2 in ( select c1 from t2);

Determine whether unnecessary DISTINCT statements can be deleted. The deletion condition is that all GROUP BY columns are in the SELECT list and there are no ROLLUP and Window functions.

  1. is_grouped() && hidden_group_field_count == 0 && olap == UNSPECIFIED_OLAP_TYPE

For example, scenario:

SELECT DISTINCT c1, max(c2) from t1 group by c1;

10 Parse and set Window function (Window::setup_windows1)

<<:  Number One "Thousand Faces" - What you see may not be me

>>:  Summary of the State Council Information Office press conference, involving 5G, chips, etc.

Recommend

6G brings an innovative revolution, but also hides many crises and challenges

Although 6G will not be implemented until 2030, t...

Exclusive reveal! How 5G can help secure large-scale events

The wonderful opening ceremony of the 2022 Beijin...

Powered by EMUI 9.1, Huawei Enjoy 10S brings users a brand new smart experience

In the era of information explosion, consumers ar...

Analysis of Python's new string format vulnerability

Preface This article conducts an in-depth analysi...

A quick overview of 5G industry developments in March 2021

After the rapid development in 2020, 2021 is a cr...

Interview blitz: Is TCP reliable? Why?

Author | Lei Ge Source | Java interview questions...

How to configure floating routing?

[[416937]] Experimental requirements ISP-1 and IS...

2018 Yunnan-Huawei Software Industry Summit was held on December 20

The 2018 Yunnan-Huawei Software Industry Summit w...