Many big data calculations are implemented in SQL. When running slowly, we have to optimize SQL, but we often encounter situations that we can do nothing about it.

For example, there are three statements in the stored procedure, which are roughly like this, and execute very slowly:

select a,b,sum(x) from T group by a,b where …;
select c,d,max(y) from T group by c,d where …;
select a,c,avg(y),min(z) from T group by a,c where …;

T is a huge table with hundreds of millions of rows. It needs to be grouped by three methods, and the grouped result sets are not large.

The grouping operation needs to traverse the data table. These three SQL statements will traverse the huge table three times. It takes a long time to traverse hundreds of millions of rows of data once, not to mention three times.

In this grouping operation, the CPU calculation time is almost negligible relative to the time of traversing the hard disk. If we can calculate multiple group aggregations in one traversal, although the amount of CPU calculation is not reduced, it can greatly reduce the amount of data read from the hard disk and double the speed.

If SQL could support syntax like this:

from T    
       select a,b,sum(x) group by a,b where …            -- the first grouping in the traversal
       select c,d,max(y) group by c,d where …            -- the second grouping in the traversal
       select a,c,avg(y),min(z) group by a,c where …;  -- the third grouping in the traversal

It would be able to return multiple result sets in one traversal, and the performance can be greatly improved.

Unfortunately, SQL does not have this syntax and cannot code like this. We can only use an alternative method, that is, use group a,b,c,d to calculate a more detailed grouping result set first, but first save it into a temporary table before we can further calculate the target results with SQL. The SQL statements are rough as follows:

create table T_temp as select a,b,c,d,
       sum(case when … then x else 0 end) sumx,
       max(case when … then y else null end) maxy,
       sum(case when … then y else 0 end) sumy,
       count(case when … then 1 else null end) county,
       min(case when … then z else null end) minz
group by a,b,c,d;
select a,b,sum(sumx) from T_temp group by a,b where …;
select c,d,max(maxy) from T_temp group by c,d where …;
select a,c,sum(sumy)/sum(county),min(minz) from T_temp group by a,c where …;

In this way, we only need to traverse once, but we have to transfer different where conditions to the previous case when, the code is much more complex and the amount of calculation will be increased. Moreover, when calculating the temporary table, the number of grouping fields becomes large, and the result set may be large. The temporary table is traversed many times, and the calculation performance is not good. Large result set grouping calculation needs hard disk buffer, and its performance is also very poor.

We can also use the database cursor of the stored procedure to fetch the data one by one, but we have to implement the actions of where and group by ourselves. It's too cumbersome to code, and the performance of the database cursor traversing the data will only be worse!

We can do nothing about it!

TopN operation will also encounter this helpless situation. For example, top5 written in Oracle SQL is rough as follows:

select * from (select x from T order by x desc) where rownum<=5

There are 1 billion pieces of data in table T. As can be seen from the SQL statement, the way to get the top five is to sort all the data and then get the first five, and the remaining sorting results are useless! The cost of large sorting is very high. The amount of data is too large to be loaded into memory. There will be multiple hard disk data buffering, and the computing performance will be very poor!

It is not difficult to avoid large sorting. Keep a small set of 5 records in memory. When traversing the data, save the top 5 calculated data in this small set. If the new data is larger than the current fifth, insert it and discard the current fifth. If it is smaller than the current fifth, no action will be taken. In this way, we only need to traverse 1 billion pieces of data once, and the memory occupation is very small, and the computing performance will be greatly improved.

In essence, this algorithm regards TopN as the same aggregate operation as sum and count, but returns a set rather than a single value. If the SQL could be written like this: select top (x, 5) from T, it would have been able to avoid large sorting.

Unfortunately, SQL does not have an explicit set data type. Aggregate functions can only return a single value and cannot write such statements!

However, fortunately, the TopN of the whole set is relatively simple. Although the SQL is written like that, the database can usually do some optimization in practice, and the above method is adopted to avoid large sorting. As a result, Oracle is not slow to calculate that SQL statement.

However, if the situation of TopN is complex, the optimization engine usually doesn't work when it is used in subqueries or mixed with join. For example, to calculate the TopN of each group after grouping, it is a little difficult to write it in SQL. The SQL of Oracle is written as follows:

select * from
       (select y,x,row_number() over (partition by y order by x desc) rn from T)
where rn<=5

In this case, the database optimization engine will faint and will no longer use the above method of understanding TopN as an aggregate operation. It can only do the big sorting, and the operation speed drops sharply!

If only the SQL statement could be written as follows:

select y,top(x,5) from T group by y

Considering top as an aggregate function like sum, it would have been not only easier to read, but also easy to calculate at high speed.

Unfortunately, No.

We still can do nothing about it!

Join calculation is also very common. Take the filtering calculation after the order table is associated with multiple tables as an example. The SQL is basically like this:

select o.oid,o.orderdate,o.amount
       from orders o
              left join city ci on o.cityid = ci.cityid
              left join shipper sh on o.shid=sh.shid
              left join employee e on o.eid=e.eid
              left join supplier su on o.suid=su.suid
       where ci.state='New York'
              and e.title = 'manager'
              and ...

There are tens of millions of data in the order table, and the data in city, shipper, employee, supplier and other tables are not large. The filter condition fields may come from these tables, and the parameters are given from the front end and will change dynamically.

Generally, SQL uses the hash join algorithm to implement these associations. The hash values will be calculated and compared. Only one join can be resolved at a time, and the same action will have to be performed n times if there are n joins. After each join, the intermediate results need to be kept for the next round. The calculation process is complex, the data will be traversed many times, and the calculation performance is poor.

Usually, these associated tables are small and can be read into memory first. If each associated field in the order table is serialized in advance, for example, convert the employee id field value to the sequence number of the corresponding employee table record, when calculating, we can use the employee id field value (that is, the sequence number of employee table) to directly get the record at the corresponding position of the employee table in memory. The performance is much faster than hash join, and we only need to traverse the order table once, and the speed will be greatly improved!

That is, the SQL should be written as follows:

select o.oid,o.orderdate,o.amount
       from orders o
              left join city c on o.cid = c.#       
              left join shipper sh on o.shid=sh.#     
              left join employee e on o.eid=e.#      
              left join supplier su on o.suid=su.#    
       where ci.state='New York'
              and e.title = 'manager'
              and ...

Unfortunately, SQL uses the concept of unordered sets. Even if these ids have been serialized, the database can't take advantage of this feature. It can't use the mechanism of rapid sequence number positioning on these unordered sets of the corresponding associated tables. It can only use index search. Moreover, the database doesn't know that the ids have been serialized, and it still calculates hash values and makes comparisons, and the performance is still very poor!

Although there are good methods, they cannot be implemented. And we can still do nothing about it!

There are also highly concurrent account queries. This operation is very simple:

select id,amt,tdate,… from T
              where id='10100'
              and tdate>= to_date('2021-01-10', 'yyyy-MM-dd')
              and tdate<to_date('2021-01-25', 'yyyy-MM-dd')
              and …

In the hundreds of millions of historical data in the T table, quickly find several to thousands of details of an account. It is not complicated to code with SQL. The difficulty is that the response time should reach the second level or even faster in case of large concurrency. In order to improve the query response speed, the ID field of the T table is generally indexed:

create index index_T_1 on T(id)

In the database, the speed of using the index to find a single account is very fast, but it will be significantly slower in the case of large concurrency. The reason is also the theoretical unordered basis of SQL mentioned above. The total amount of data is huge and cannot be totally read into memory, and the database cannot ensure that the data of the same account is physically stored continuously. The hard disk has the smallest reading unit. When reading discontinuous data, many irrelevant contents will be fetched, and the query will be slow. If each query under high concurrency is a little slower, the overall performance will be very poor. Who dares to let users wait for more than ten seconds at a time when user experience is so very important?!

An easy way to think of is to sort hundreds of millions of data according to accounts in advance and ensure the continuous storage of data of the same account. In this way, almost all of the data blocks read out from the hard disk during query are target values, and the performance will be greatly improved.

However, the relational database using SQL system does not have this awareness and will not force the physical order of data storage! This problem is not caused by SQL syntax, but is related to the theoretical basis of SQL. It is still impossible to implement these algorithms in a relational database.

Now, what can we do? Can we do anything about it?

We can no longer use SQL and relational databases. We need to use other computing engines.

Based on the innovative theoretical basis, the open-source esProc SPL supports more data types and operations and can describe the new algorithms in the above scenarios. To code with simple and convenient SPL can greatly improve the computing performance in a short time!

The code examples of the above tasks written in SPL are as follows:

l Calculate multiple groupings in one traversal

l Calculate top5 by aggregation method

Top5 of the total set (multithreaded parallel computing)

Top5 of each group (multithreaded parallel computing)

l Join：

System initialization

Query

l High concurrency account query：

Data preprocessing and orderly storage

Account query

In addition to these simple examples, SPL can also implement more high-performance algorithms, such as orderly merging for the association between orders and details, pre-association technology for multi-layer dimension table association in multidimensional analysis, bit storage technology for the statistics of thousands of tags, Boolean set technology to speed up the query of multiple enumeration values filtering conditions, timing grouping technology for complex funnel analysis and so on.

Friends who are having a headache for SQL performance optimization, come and discuss with us:

The impasse of SQL performance optimizing