This article is first published in the medium MPP plan. If you are a medium user, please follow me in medium. Thank you very much.
Learn how to optimize SQL queries for better performance
Thanks to Moore's Law, computer performance has greatly improved, along with advancements in databases and various anti-pattern designs advocated by microservices. As a result, we now have fewer opportunities to write complex SQL queries. The industry (yes, even Google) has started advocating against specialized SQL optimization, as the resources saved do not outweigh the cost of employee salaries. However, as engineers, we should strive for technical excellence to become rocket scientists in our field.
In this article, I will introduce eight common SQL slow query statements and explain how to optimize their performance. I hope this will be helpful to you.
LIMIT Statement
Pagination is one of the most commonly used scenarios, but it is also prone to problems. For example, for the simple statement below, a typical solution suggested by DBAs is to add a composite index on the type
, name
, and create_time
fields. This way, the conditions and sorting can effectively utilize the index, resulting in a significant performance improvement.
SELECT *
FROM operation
WHERE type = 'SQLStats'
AND name = 'SlowLog'
ORDER BY create_time
LIMIT 1000, 10;
Okay, this might solve the problem for over 90% of DBAs. However, when the LIMIT clause becomes "LIMIT 1000000, 10", programmers still complain, "Why is it slow when I'm only fetching 10 records?" You see, the database doesn't know where the 1,000,000th record starts, so even with an index, it still needs to calculate from the beginning. In most cases, this performance issue is caused by lazy programming.
In scenarios such as frontend data browsing or exporting large data in batches, you can use the maximum value of the previous page as a parameter for querying. The SQL can be redesigned as follows:
SELECT *
FROM operation
WHERE type = 'SQLStats'
AND name = 'SlowLog'
AND create_time > '2017-03-16 14:00:00'
ORDER BY create_time
LIMIT 10;
With this new design, the query time remains constant and does not change with the increasing data volume.
Implicit Conversion
Another common mistake in SQL statements is when the types of query variables and field definitions do not match. Take the following statement as an example:
mysql> explain extended SELECT *
> FROM my_balance b
> WHERE b.bpn = 14000000123
> AND b.isverified IS NULL ;
mysql> show warnings;
| Warning | 1739 | Cannot use ref access on index 'bpn' due to type or collation conversion on field 'bpn'
In this case, the field bpn
is defined as varchar(20)
, and MySQL's strategy is to convert the string to a number before comparing. This causes the function to be applied to the table field, rendering the index ineffective.
Such cases may be caused by parameters automatically filled in by the application framework, rather than the programmer's intention. Nowadays, application frameworks are often complex, and while they provide convenience, they can also create pitfalls.
Join Updates and Deletions
Although MySQL 5.6 introduced materialization, it only optimizes SELECT statements. For UPDATE or DELETE statements, you need to manually rewrite them using JOIN.
For example, consider the following UPDATE statement. MySQL actually performs a loop/nested subquery (DEPENDENT SUBQUERY), and you can imagine the execution time.
UPDATE operation o
SET status = 'applying'
WHERE o.id IN (SELECT id
FROM (SELECT o.id,
o.status
FROM operation o
WHERE o.group = 123
AND o.status NOT IN ( 'done' )
ORDER BY o.parent,
o.id
LIMIT 1) t);
The execution plan is as follows:
+----+--------------------+-------+-------+---------------+---------+---------+-------+------+-----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+-------+---------------+---------+---------+-------+------+-----------------------------------------------------+
| 1 | PRIMARY | o | index | | PRIMARY | 8 | | 24 | Using where; Using temporary |
| 2 | DEPENDENT SUBQUERY | | | | | | | | Impossible WHERE noticed after reading const tables |
| 3 | DERIVED | o | ref | idx_2,idx_5 | idx_5 | 8 | const | 1 | Using where; Using filesort |
+----+--------------------+-------+-------+---------------+---------+---------+-------+------+-----------------------------------------------------+
After rewriting it as a JOIN, the subquery's select type changes from DEPENDENT SUBQUERY to DERIVED, significantly speeding up the execution time from 7 seconds to 2 milliseconds.
UPDATE operation o
JOIN (SELECT o.id,
o.status
FROM operation o
WHERE o.group = 123
AND o.status NOT IN ( 'done' )
ORDER BY o.parent,
o.id
LIMIT 1) t
ON o.id = t.id
SET status = 'applying';
The simplified execution plan is as follows:
+----+-------------+-------+------+---------------+-------+---------+-------+------+-----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+-------+---------+-------+------+-----------------------------------------------------+
| 1 | PRIMARY | | | | | | | | Impossible WHERE noticed after reading const tables |
| 2 | DERIVED | o | ref | idx_2,idx_5 | idx_5 | 8 | const | 1 | Using where; Using filesort |
+----+-------------+-------+------+---------------+-------+---------+-------+------+-----------------------------------------------------+
Mixed Sorting
MySQL cannot utilize indexes for mixed sorting. However, in certain scenarios, there are still opportunities to improve performance using special methods.
SELECT *
FROM my_order o
INNER JOIN my_appraise a ON a.orderid = o.id
ORDER BY a.is_reply ASC,
a.appraise_time DESC
LIMIT 0, 20;
The execution plan shows a full table scan:
+----+-------------+-------+--------+-------------+---------+---------+---------------+---------+-+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
+----+-------------+-------+--------+-------------+---------+---------+---------------+---------+-+
| 1 | SIMPLE | a | ALL | idx_orderid | NULL | NULL | NULL | 1967647 | Using filesort |
| 1 | SIMPLE | o | eq_ref | PRIMARY | PRIMARY | 122 | a.orderid | 1 | NULL |
+----+-------------+-------+--------+---------+---------+---------+-----------------+---------+-+
Since is_reply
only has two states, 0 and 1, we can rewrite it as follows, reducing the execution time from 1.58 seconds to 2 milliseconds:
SELECT *
FROM ((SELECT *
FROM my_order o
INNER JOIN my_appraise a
ON a.orderid = o.id
AND is_reply = 0
ORDER BY appraise_time DESC
LIMIT 0, 20)
UNION ALL
(SELECT *
FROM my_order o
INNER JOIN my_appraise a
ON a.orderid = o.id
AND is_reply = 1
ORDER BY appraise_time DESC
LIMIT 0, 20)) t
ORDER BY is_reply ASC,
appraisetime DESC
LIMIT 20;
EXISTS Statement
When dealing with EXISTS clauses, MySQL still uses nested subqueries for execution. Take the following SQL statement as an example:
SELECT *
FROM my_neighbor n
LEFT JOIN my_neighbor_apply sra
ON n.id = sra.neighbor_id
AND sra.user_id = 'xxx'
WHERE n.topic_status < 4
AND EXISTS(SELECT 1
FROM message_info m
WHERE n.id = m.neighbor_id
AND m.inuser = 'xxx')
AND n.topic_type <> 5;
+----+--------------------+-------+------+-----+------------------------------------------+---------+-------+---------+ -----+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
+----+--------------------+-------+------+ -----+------------------------------------------+---------+-------+---------+ -----+
| 1 | PRIMARY | n | ALL | | NULL | NULL | NULL | 1086041 | Using where |
| 1 | PRIMARY | sra | ref | | idx_user_id | 123 | const | 1 | Using where |
| 2 | DEPENDENT SUBQUERY | m | ref | | idx_message_info | 122 | const | 1 | Using index condition; Using where |
+----+--------------------+-------+------+ -----+------------------------------------------+---------+-------+---------+ -----+
By removing the EXISTS clause and changing it to a JOIN, we can avoid nested subqueries and reduce the execution time from 1.93 seconds to 1 millisecond.
SELECT *
FROM my_neighbor n
INNER JOIN message_info m
ON n.id = m.neighbor_id
AND m.inuser = 'xxx'
LEFT JOIN my_neighbor_apply sra
ON n.id = sra.neighbor_id
AND sra.user_id = 'xxx'
WHERE n.topic_status < 4
AND n.topic_type <> 5;
The new execution plan is as follows:
+----+-------------+-------+--------+ -----+------------------------------------------+---------+ -----+------+ -----+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+ -----+------------------------------------------+---------+ -----+------+ -----+
| 1 | SIMPLE | m | ref | | idx_message_info | 122 | const | 1 | Using index condition |
| 1 | SIMPLE | n | eq_ref | | PRIMARY | 122 | ighbor_id | 1 | Using where |
| 1 | SIMPLE | sra | ref | | idx_user_id | 123 | const | 1 | Using where |
+----+-------------+-------+--------+ -----+------------------------------------------+---------+ -----+------+ -----+
Condition Pushdown
There are cases where external query conditions cannot be pushed down to complex views or subqueries:
- Aggregated subqueries
- Subqueries with LIMIT
- UNION or UNION ALL subqueries
- Subqueries in output fields
Consider the following statement, where the condition affects the aggregated subquery:
SELECT *
FROM (SELECT target,
Count(*)
FROM operation
GROUP BY target) t
WHERE target = 'rm-xxxx';
+----+-------------+------------+-------+---------------+-------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+---------------+-------------+---------+-------+------+-------------+
| 1 | PRIMARY | n | ALL | NULL | NULL | NULL | NULL | 1086041 | Using where |
| 1 | PRIMARY | sra | ref | NULL | idx_user_id | 123 | const | 1 | Using where |
| 2 | DEPENDENT SUBQUERY | m | ref | NULL | idx_message_info | 122 | const | 1 | Using index condition; Using where |
+----+-------------+------------+-------+---------------+-------------+---------+-------+------+-------------+
By removing the EXISTS clause and changing it to a JOIN, we can avoid nested subqueries and reduce the execution time from 1.93 seconds to 1 millisecond.
SELECT *
FROM my_neighbor n
INNER JOIN message_info m
ON n.id = m.neighbor_id
AND m.inuser = 'xxx'
LEFT JOIN my_neighbor_apply sra
ON n.id = sra.neighbor_id
AND sra.user_id = 'xxx'
WHERE n.topic_status < 4
AND n.topic_type <> 5;
The new execution plan is as follows:
+----+-------------+-------+--------+ -----+------------------------------------------+---------+ -----+------+ -----+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+ -----+------------------------------------------+---------+ -----+------+ -----+
| 1 | SIMPLE | m | ref | | idx_message_info | 122 | const | 1 | Using index condition |
| 1 | SIMPLE | n | eq_ref | | PRIMARY | 122 | ighbor_id | 1 | Using where |
| 1 | SIMPLE | sra | ref | | idx_user_id | 123 | const | 1 | Using where |
+----+-------------+-------+--------+ -----+------------------------------------------+---------+ -----+------+ -----+
Narrowing the Scope in Advance
Let's take a look at the following partially optimized example (main table in the left join acts as a primary query condition):
SELECT a.*,
c.allocated
FROM (
SELECT resourceid
FROM my_distribute d
WHERE isdelete = 0
AND cusmanagercode = '1234567'
ORDER BY salecode limit 20) a
LEFT JOIN
(
SELECT resourcesid, sum(ifnull(allocation, 0) * 12345) allocated
FROM my_resources
GROUP BY resourcesid) c
ON a.resourceid = c.resourcesid;
Does this statement still have other issues? It is clear that subquery c is an aggregate query on the entire table, which can cause performance degradation when dealing with a large number of tables.
In fact, for subquery c, the left join result set only cares about the data that can be matched with the primary table's resourceid
. Therefore, we can rewrite the statement as follows, reducing the execution time from 2 seconds to 2 milliseconds:
SELECT a.*,
c.allocated
FROM (
SELECT resourceid
FROM my_distribute d
WHERE isdelete = 0
AND cusmanagercode = '1234567'
ORDER BY salecode limit 20) a
LEFT JOIN
(
SELECT resourcesid, sum(ifnull(allocation, 0) * 12345) allocated
FROM my_resources r,
(
SELECT resourceid
FROM my_distribute d
WHERE isdelete = 0
AND cusmanagercode = '1234567'
ORDER BY salecode limit 20) a
WHERE r.resourcesid = a.resourcesid
GROUP BY resourcesid) c
ON a.resourceid = c.resourcesid;
However, the subquery a
appears multiple times in our SQL statement. This approach not only incurs additional costs but also makes the statement more complex. We can simplify it using the WITH statement:
WITH a AS
(
SELECT resourceid
FROM my_distribute d
WHERE isdelete = 0
AND cusmanagercode = '1234567'
ORDER BY salecode limit 20)
SELECT a.*,
c.allocated
FROM a
LEFT JOIN
(
SELECT resourcesid, sum(ifnull(allocation, 0) * 12345) allocated
FROM my_resources r,
a
WHERE r.resourcesid = a.resourcesid
GROUP BY resourcesid) c
ON a.resourceid = c.resourcesid;
Conclusion
The database compiler generates execution plans that determine how SQL statements are actually executed. However, compilers can only do their best to serve, and no database compiler is perfect. The scenarios mentioned above also exist in other databases. Understanding the characteristics of the database compiler allows us to work around its limitations and write high-performance SQL statements.
When designing data models and writing SQL statements, it is important to bring algorithmic thinking or awareness into the process. Developing the habit of using the WITH statement when writing complex SQL statements can simplify them and reduce the burden on the database.
Finally, here is the execution order of SQL statements:
FROM
<left_table>
ON
<join_condition>
<join_type>
JOIN
<right_table>
WHERE
<where_condition>
GROUP BY
<group_by_list>
HAVING
<having_condition>
SELECT
DISTINCT
<select_list>
ORDER BY
<order_by_condition>
LIMIT
<limit_number>