When it comes to database, it always makes people think that it is primarily for storage since its name has a part “base”. But in fact, it is not the case, database can achieve two important functions: calculation and transaction, which are what we often call OLAP and OLTP. The storage of database is intended for these two functions, and just serving as a storing role is not the objective of database.
As we know, SQL is currently the mainstream database language. So, is it convenient to do such two things in SQL?
The transaction function is mainly to solve the consistency of data during writing and reading. Although it is hard to achieve, its interface is very simple for applications, and the code for manipulating the reading and writing of database is also very simple. If it is assumed that the current logical storage scheme of relational database was reasonable (that is, using the data tables and records to store data. Whether it is reasonable or not is another complicated issue, which will not be discussed in detail here), then it would not be a big problem for SQL to describe the transaction function, because there is no need to describe complex action, and the complexity is already solved in the database.
As for the calculation function, however, the situation will be different.
The calculation we are talking about here is a broader concept. It is not just simple addition and subtraction, the search and association can all be regarded as some calculation.
So here comes a question, what kind of computing system is good?
Two characteristics needed: easy in writing, fast in running.
Easy in writing is easy to understand, which is to allow programmers to write code quickly so that more work can be done per unit of time; while for fast in running, it is easier to understand since we definitely hope to get the calculation results in a shorter time.
Actually, the Q in SQL represents query. The original purpose of inventing SQL is to query (i.e., calculation), which is the main goal of SQL. However, it is hard to say that SQL is very competent when describing computing tasks.
Why SQL is not competent
Let’s start with easy in writing.
The code written in SQL is very much like English, and some queries can be read and written in English (there are so many examples on the Internet, so we won’t give examples here). This should be regarded as satisfying the requirement of easy in writing.
Wait a minute! The code written in SQL we see in textbooks often has only two or three lines, it is indeed simple, but what if we try to solve some slightly more complicated problems?
Here is an example that is actually not very complicated: calculate the maximum consecutive days that a stock keeps rising. Write it in SQL is like this:
select max (consecutive_day)
from (select count(*) (consecutive_day
from (select sum(rise_mark) over(order by trade_date) days_no_gain
from (select trade_date,
case when closing_price>lag(closing_price) over(order by trade_date)
then 0 else 1 END rise_mark
from stock_price ) )
group by days_no_gain)
The working principle of this statement won’t be explained here, it's a little confusing anyway. You can try it yourself.
This is a recruitment test of Raqsoft company, with a pass rate of less than 20%; Because it is too difficult, it is later changed to another testing way: ask the candidate to explain what the written SQL statement is, but unfortunately, the pass rate is still not high.
What does it reveal? It reveals that the situation is slightly complicated, and SQL becomes difficult to both understand and write!
Let's look at the issue of fast in running, and take the simple task that is often used as an example: take the top 10 out of 100 million pieces of data. This task is not complicated to write in SQL:
SELECT TOP 10 x FROM T ORDER BY x DESC
However, the execution logic corresponding to this statement is to perform the big sorting for all the data first, and then take the top 10, and discard the remaining data. As we all know that sorting is a very slow action, and will traverse the data many times. If the amount of data is too large to be loaded into memory, it also needs to buffer the data in external storage, resulting in a further sharp decrease in performance. If the logic embodied in this statement is strictly followed, the operation will not run fast anyway. Fortunately, many programmers know that this operation does not need the big sorting, nor the external storage to buffer since it can be done by traversing only once and only occupying a little memory space, it means a higher performance algorithm exists. Regrettably, such algorithm can't be implemented in SQL. We can only hope the database optimizer is smart enough to convert this SQL statement to a high-performance algorithm to execute, but the database optimizer may not be reliable when the situation is complicated.
It seems that SQL is not doing well in both aspects. Although these two examples are not very complicated, SQL does not perform well in either example. In reality, the difficult-to-write and slow running situation abounds in SQL codes with thousands of lines.
Then, why these two aspects cannot be well achieved in SQL?
To answer this question, we need to analyze what exactly the implementation of calculation with program code does.
Essentially, the process of programming is the process of translating problem-solving idea into a precise formal language executable by the computer. For example, just like solving an applied problem by a primary school student, the student also needs to list an expression relating to four basic arithmetic operations after analyzing the problem and coming up with a solution. Likewise, for the calculation with the program, not only does the solution need to be come up with, but it also needs to translate the solution into actions that can be understood and executed by the computer.
For the formal language used to describe calculation method, its core lies in the algebraic system adopted. To put it simply, the so-called algebraic system includes two key elements: data types and corresponding operation rules. For instance, the key elements of arithmetic we learned in primary school is the integer and the operations including the addition, subtraction, multiplication and division. Once we get both key elements, we can write the operation we want with the symbols stipulated in the algebraic system to something, i.e., the code, and then the computer can execute.
If an algebraic system is not well designed, causing the provided data types and operations to be inconvenient, it will make it very difficult to describe the algorithm. In this case, a strange phenomenon will occur: the difficulty of translating the solution into the code is far more than solving the problem itself.
For example, we learned to use Arabic numerals for daily calculation in our childhood, and using such numerals is very convenient to do addition, subtraction, multiplication and division, and hence everyone naturally believes that numerical operation should be like this. Are all numerical operations so convenient? Not necessarily! It is estimated that many people know there is another numeral called Roman numeral. Do you know how to add, subtract, multiply and divide with Roman numerals? And how did the ancient Romans go to the streets for shopping?
The reason why coding is difficult is largely due to algebra.
Let's look at the reason for not running fast.
Software cannot change the performance of hardware; the speed of CPU and hard disk depends on their own configuration. However, we can design an algorithm of low complexity, that is, an algorithm with a smaller amount of calculation, so that the computer executes less actions, and thus the running speed will be faster naturally. Yet, just working out the algorithm is not enough, we also need to program the algorithm in some formal language, otherwise the computer won't execute. Moreover, it needs to be relatively simple to code. If the code in a certain formal language is very long, it will be very troublesome and no one will use such formal language. Therefore, for the program, easy in writing and fast in running are actually the same problem, behind which is the algebra adopted by the formal language. If the algebra is not good, it will make it difficult or even impossible to implement high-performance algorithm, as a result, there is no way to run fast. As mentioned above, our desired algorithm of occupying a little memory space and traversing only once cannot be implemented in SQL. Consequently, if you want it to run fast, you can only place hope on the optimizer.
Let's make another analogy:
Students who have gone to primary school probably know the story of Gauss calculating 1+2+3+…+100. Ordinary students adopted the most primitive method, which was to add 100 times step by step, while little Gauss was very smart, he found that 1+100=101, 2+99=101,…,50+51=101, from which he multiplied 50 by 101, and hence quickly figured out the result and then headed home for lunch.
After hearing this story, we all felt that Gauss was so clever that he thought of such an ingenious solution, which is simple and fast. Yes, that’s right, but it is easy to overlook one point: in the days of Gauss, multiplication already existed in the human arithmetic system (also an algebra)! As mentioned earlier, since we learned four arithmetic operations in our childhood, and hence we would take it for granted that multiplication should be used. But it is not, actually! Multiplication was invented after addition. If multiplication had not yet been invented in the days of Gauss, he wouldn’t have found a way to solve this problem quickly no matter how clever Gauss was.
At present, the mainstream database is the relational database, and the reason why it is called this way is because its mathematical basis is called relational algebra. SQL is exactly a formal language developed from the theory of relational algebra.
Now we can answer why SQL is not competent in both aspects we expect. The problem lies in relational algebra, and the relational algebra is just like an arithmetic system with only addition and no multiplication. Therefore, it is inevitable that many things cannot be done well.
Relational algebra has been invented for fifty years. The difference between the application requirements and hardware environments of fifty years ago and today is very huge. Continuing to apply the theory of fifty years ago to solve today's problems, does it sound too outdated? However, this is the reality. Due to the large number of existing users and the lack of mature new technologies, SQL, based on relational algebra, is still the most important database language today. Although some improvements have been made in recent decades, the foundation has not changed. In the face of contemporary complex requirements and hardware environments, it is reasonable that SQL is incompetent.
And, unfortunately, this problem is at the theoretical level, and it won't help no matter how optimized it is in practice, it can only be improved in a limited way, not eradicated. Regrettably, most database developers do not think of this level, or, in order to take care of the compatibility of existing users, they do not intend to think about this level. As a result, the mainstream database industry has been going around in circles in this limited space.
Why SPL is competent
Now then, how to make the calculation Easier in Writing and Faster in Running?
Invent new algebra! An algebra with “multiplication”, and then design a new language based on the new algebra.
This is where SPL comes from. Its theoretical basis is no longer the relational algebra, but something called discrete dataset. The formal language designed based on this new algebra is named SPL (structured process language).
Innovations against the shortcomings of SQL have been made to SPL (more precisely, innovations against various deficiencies of relational algebra have been made to the discrete dataset). SPL redefines and extends many operations of structured data, specifically, it adds the discreteness, enhances ordered computation, implements a thorough set orientation, supports object references, and advocates stepwise operation.
Recoding the previous problems in SPL will give you a direct feeling.
Calculate the maximum consecutive days that a stock keeps rising:
stock_price.sort(trade_date).group@i(closing_price<closing_price[-1]).max(~.len())
Although the calculation idea is the same as the previous SQL, it is much easier to express and no longer confusing, because of the introduction of ordering characteristic.
Take the top 10 out of 100 million pieces of data:
T.groups(;top(-10,x))
SPL has richer set data types, it is easy to describe the efficient algorithm that implements simple aggregation on a single traversal, without involving big sorting action.
Due to space limitations, we will not introduce SPL (discrete dataset) in an all-round way here, but will list some differential improvements of SPL (discrete dataset) against SQL (relational algebra):
Discrete records
The records in the discrete dataset are a basic data type that can exist independently of the data table. The data table is a set constituted by records, and the records that make up a certain data table can also be used to make up other data tables. For example, the filtering operation is to use the records that meet the conditions in original data table to make up a new data table, in this way, it has more advantages in both space occupation and operation performance.
The relational algebra has no computable data type to represent the record. A single record is actually a data table with only one row, and records in different data tables must not be same. For example, during the filtering operation, new records will be duplicated to form a new data table, it will result in an increase in the costs of space and time.
In particular, because there are discrete records, the discrete dataset allows record’s field value to be a certain record, which makes it easier to implement foreign key join.
Ordering characteristic
Relational algebra is designed based on unordered sets, and the set members do not have the concept of sequence number. Moreover, it doesn’t provide the mechanism of positioning calculation and adjacent reference. In practice, SQL has made some partial improvements, allowing the modern SQL to easily do some ordered operations.
On the contrary, the sets in discrete dataset are ordered, and all set members have the concept of sequence number and can be accessed with sequence number. Moreover, the discrete dataset defines the positioning operation so as to return the sequence number of members in the set. Also, the discrete dataset provides symbols to implement adjacent reference in set operation, and supports the calculation according to the position of a certain sequence number in the set.
Ordered operation is very common, but it has always been a difficult for SQL. The implementation of ordered operation in SQL is still very cumbersome even with window functions available. SPL has greatly improved this situation, which can be illustrated by the previous example of stock rising.
Discreteness and set orientation
The relational algebra defines rich set operations, that is, it can take the set as a whole to participate in operations such as aggregation and grouping. This is where SQL is more convenient than high-level programming languages like Java.
However, the relational algebra has very poor discreteness and no discrete records, while high-level programming languages such as Java have no problem in this regard.
As for the discrete dataset, it is equivalent to combining discreteness with set orientation, which means that it has not only the set data type and related operations, but also the set members that separate out of the set to do independent operation or form other sets. Therefore, it can be said that SPL integrates the advantages of both SQL and Java.
Ordered operation is a typical scenario that combines discreteness with set orientation. The concept of order is meaningful only for a set and meaningless for a single member, which reflects the set orientation; the ordered operation needs to calculate a certain member and its adjacent members, which requires discreteness.
Only with the support of discreteness can we obtain more thorough set orientation, and solve problems like ordered operation.
In short, the discrete dataset is an algebraic system with both discreteness and set orientation, while relational algebra has only set orientation.
Understanding of grouping
The original intention of the grouping operation is to split a large set into several subsets according to some rules. In relational algebra, since there is no data type that can represent the set of sets, it has to do the aggregation operation after grouping.
Conversely, the discrete dataset allows the set of sets, it can represent reasonable grouping operation result. The grouping operation and the aggregation operation after grouping are split into two-step independent operations. In this way, more complex operation can be performed on the grouped subsets.
In relational algebra, there is only one kind of equivalence grouping, that is, the sets are divided according to the grouping key value. The equivalence grouping is a complete division.
For the discrete dataset, however, it thinks that any method of splitting a large set is a grouping operation. In addition to the conventional equivalence grouping, it also provides the ordered grouping combined with ordering characteristic, as well as the aligned grouping that may get incomplete division result.
Understanding of aggregation
There is no explicit set data type in relational algebra. The results of aggregation calculation are all a single value, so does the aggregation operation after grouping, only including SUM, COUNT, MAX, MIN etc. Particularly, the relational algebra cannot regard TOPN operation as an aggregation. The TOPN operation performed on the whole set can only take the first N items following the sorting when outputting the result set. However, for the grouped subsets, it is difficult to implement TOPN, in this case, it needs to change idea and work out the sequence number to achieve.
The discrete dataset advocates the universal set, and the aggregation operation result is not necessarily a single value, but may still be a set. In discrete dataset, the TOPN operation has the same status as SUM and COUNT, etc., that is, it can be performed on the whole set or on the grouped subsets.
After SPL regards TOPN as the aggregation operation, it can also avoid the sorting of all data in practice, hereby obtaining high performance. However, the TOPN in SQL is always accompanied by ORDER BY action. In theory, it can only be implemented by big sorting, in this case, you need to place hope on the optimization of database in practice.
High performance supported by ordering characteristic
The discrete dataset places special emphasis on ordered set, and can implement many high-performance algorithms using ordered characteristic. This cannot be implemented by relational algebra based on unordered sets, and you can only hope for optimization in practice.
The following are some low-complexity operations that can be implemented using ordering characteristic:
1) The data table is ordered by the primary key, which is equivalent to having an index naturally. Filtering of key fields can often be quickly located to reduce the traversal amount in external storage. When fetching the data randomly by key value, the binary search can also be used for positioning; the index information can be reused in case of data-fetching by multiple key values at the same time.
2) Usually, the grouping operation is implemented with the HASH algorithm. If we are sure that the data are ordered by the grouping key value, we only need to do the adjacent comparison, hereby avoiding the calculation of HASH value, in this way, there will be no HASH conflict problem, and it is very easy to perform parallel computing.
3) The data table is ordered by the key, the merge algorithm with higher performance can be used for the alignment join between two large tables. In this way, we only need to traverse the data once, without the need to buffer the data, and thus it makes the memory occupation less. Yet, not only is the conventional HASH value partitioning method relatively complicated, requiring more memory and data buffering in external storage, but it also may cause secondary HASH and re-buffering due to improper hash function.
4) The join of large table as foreign key table. When the fact table is small, the foreign key table can be used to order, from which the data corresponding to the associated key value are quickly taken out to achieve join, and there is no need to do HASH partitioning. When the fact table is also large, we can divide the foreign key table into multiple logical segments by quantile, and then partition the fact table by logical segment. In this way, only one table needs to be partitioned, and the secondary partitioning, which may occur during HASH partitioning, will not occur in the process of partitioning, and thus the computational complexity greatly decreases.
Items 3) and 4) above exploit the modification of the join operation in discrete dataset. If we continue to use the definition in relational algebra (which may produce many-to-many), it is difficult to implement such low-complexity algorithms.
In addition to theoretical differences, SPL has many engineering-level advantages such as: easier to write parallel computing code, large memory pre-association to improve foreign key join performance, unique column storage mechanism to support arbitrary segmentation and parallel computing, etc.
In the era of big data, we are often interested in high performance computation. Here are some big data algorithms implemented in SPL:
- Performance optimization skill: Multi-purpose traversa
- Performance optimization skill: TopN
- Performance optimization skill: Pre-Joining
- Performance optimization skill: Numberizing Foreign Key
- Performance optimization skill: Attached Table
- Performance optimization skill: One-side Partition
- ... And some high performance cases:
- Open-source SPL Speeds up Query on Detail Table of Group Insurance by 2000+ Times
- Open-source SPL improves bank’s self-service analysis from 5-concurrency to 100-concurrency
- Open-source SPL speeds up intersection calculation of customer groups in bank user profile by 200+ times
- Open-source SPL turns pre-association of query on bank mobile account into real-time association
- Open-source SPL speeds up batch operating of bank loan agreements by 10+ times
- Open-source SPL optimizes batch operating of insurance company from 2 hours to 17 minutes