It is arguable that, in recent decades, data has dethroned oil as the world's most valuable resource. However, data only reaches its full potential when it is properly processed, that is, extracted, stored, and analyzed in an efficient and productive fashion. To extract all the knowledge your data holds, it is therefore crucial to have effective data processing pipelines.
Throughout this blog post, we will cover the fundamentals that will enable you to build efficient data processing mechanisms, emphasizing analytical solutions in particular. To kick things off, let’s look at the two main data processing systems.
OLAP vs. OLTP
OLTP, Online Transaction Processing, is the more traditional processing system. It manages transaction-oriented applications and is characterized by a large number of short, atomic database operations, such as inserts, updates, and deletes, of the kind found in everyday applications. Common examples include online banking and e-commerce applications.
OLAP, Online Analytical Processing, manages historical or archival data. OLAP systems are typically used for analytical purposes: extracting insights and knowledge from bulk data merged from multiple sources. Unlike OLTP, they run a relatively small number of transactions, each consisting mostly of bulk reads and writes. Data warehouses are the typical infrastructure for these systems.
| OLTP | OLAP |
| --- | --- |
| Low volume of data | High volume of data |
| High volume of transactions | Low volume of transactions |
| Typically normalized data | Denormalized data |
| ACID compliance | Not necessarily ACID-compliant |
| Require high availability | Don't usually require high availability |
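To make the contrast above concrete, here is a minimal sketch of the two workload shapes. The `orders` table, its columns, and the connection settings are assumptions for illustration only, not tied to any specific application:

```python
# A minimal sketch contrasting OLTP- and OLAP-style access patterns.
# The "orders" table, its columns, and the connection string are
# assumptions for illustration only.
import psycopg2

conn = psycopg2.connect("dbname=shop user=app password=secret host=localhost")

# OLTP: many short, atomic operations, each touching a handful of rows.
with conn, conn.cursor() as cur:
    cur.execute(
        "UPDATE orders SET status = %s WHERE order_id = %s",
        ("shipped", 42),
    )

# OLAP: few queries, each scanning large amounts of historical data.
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT date_trunc('month', created_at) AS month,
               SUM(total_amount)               AS revenue
        FROM orders
        GROUP BY month
        ORDER BY month
        """
    )
    monthly_revenue = cur.fetchall()
```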
Hopefully, by now you are able to distinguish both data processing systems easily. Both OLTP and OLAP systems have been around for quite some time, but recently, with the boom of data mining and machine learning techniques, the demand for OLAP systems has increased. Choosing a suitable technological infrastructure to host either system is also a crucial step to ensure your system or application is delivering the best performance for your needs.
Row vs. Columnar Databases
You might have never heard of the terms row and columnar associated with databases, but you have most certainly come across them before. Row-oriented databases are your typical transactional ones, able to handle a large number of transactions, whereas column-oriented databases usually have fewer transactions, with a higher volume of data. By now, you probably have already guessed which type of database is more suitable for each processing system. Row-oriented DBs are commonly used in OLTP systems, whereas columnar ones are more suitable for OLAP. Some examples include:
| Row-oriented | Columnar |
| --- | --- |
| MySQL | Amazon Redshift |
| PostgreSQL | MariaDB ColumnStore |
| Oracle | ClickHouse |
| ... | ... |
But what makes them different internally?
The key difference between the two is the way data is physically stored on disk. Common HDDs are organized in blocks and rely on expensive head-seek operations, so sequential reads and writes tend to be much faster than random accesses.
Row-oriented databases store each row's values together in the same block whenever possible. Columnar databases store each column's values together, in consecutive blocks.
But what does this mean in practice?
Amazon Redshift provides a simple and concise explanation of the differences between the two. The figure below illustrates row-wise database storage, where each row is stored in a sequential disk block. Picking the right block size is important for performance, since a block size that is too small or too large results in inefficient use of disk space.
The figure below portrays columnar database storage, where each disk block stores the values of a single column for multiple rows. In this example, columnar storage requires one-third of the disk I/O operations to read the same columns' values, compared to a row-wise database.
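As a rough, engine-agnostic illustration of the layout difference, here is a toy in-memory sketch (not how any particular database implements its storage):

```python
# Toy illustration of row-oriented vs. column-oriented layout.
rows = [
    (1, "Alice", 29),
    (2, "Bob", 41),
    (3, "Carol", 35),
]

# Row-oriented: each record's values sit together, so fetching one full
# record is cheap, but scanning a single column touches every block.
row_layout = [value for record in rows for value in record]
# -> [1, 'Alice', 29, 2, 'Bob', 41, 3, 'Carol', 35]

# Column-oriented: all values of a column sit together, so an analytical
# scan of "age" only reads that column's blocks.
column_layout = {
    "id": [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "age": [r[2] for r in rows],
}
# -> {'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Carol'], 'age': [29, 41, 35]}
```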
Performance Example
Let’s look at a simple example to fully understand the differences between both databases.
Consider a database storing a total of 100GB of data, with 100 million rows and 100 columns (1GB per column). For simplification purposes, consider that the database administrator is a rookie and hasn’t implemented any indexes, partitioning, or any other optimization process on the database. With this in mind, for the analytical query:
What is the average age of males?
We would have these possibilities:
- Row-wise DB -> has to read all the data in the database: 100GB to read
- Columnar DB -> has to read only the age and gender columns: 2GB to read
This is obviously an extreme example — hardly any database will completely lack indexes or other optimization processes, but the goal is to give you an overview of columnar databases’ true potential for analytical purposes.
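The back-of-the-envelope arithmetic behind those numbers, using the simplified 1GB-per-column assumption from the scenario above:

```python
# Rough I/O estimate for the example above (simplified: 100 columns of
# 1 GB each, no indexes or partitioning).
total_columns = 100
gb_per_column = 1
columns_needed = {"age", "gender"}

row_wise_read = total_columns * gb_per_column        # whole table: 100 GB
columnar_read = len(columns_needed) * gb_per_column  # two columns: 2 GB

print(f"Row-wise scan: ~{row_wise_read} GB")
print(f"Columnar scan: ~{columnar_read} GB")
```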
Row and Columnar Databases Wrap-up
Now that you have an overview of row-oriented and columnar databases, as well as the main differences between them, highlighting their advantages and disadvantages shouldn’t be too hard:
| Row-oriented DB: for OLTP | Columnar DB: for OLAP |
| --- | --- |
| Fast at handling many small read, write, update, and delete operations | Slow at handling many small read, write, update, and delete operations |
| Bulk operations (read and write) aren't particularly fast | Fast at bulk operations (mostly reads and writes) |
| ACID compliance | No ACID compliance |
| Typically inefficient data compression | Improved data compression: each column is compressed individually and holds a single data type (see the sketch after this table) |
| High storage size (indexes, etc.) | Low storage size |
| Typically requires multiple indexes, depending on the queries | Relies on a single index per column (self-indexing) |
| Easy to add a single row (one insert operation) | Hard to add a single row (a multi-column insert operation) |
| Hard to add a single column (a multi-row insert operation) | Easy to add a single column (one insert operation) |
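To make the compression row concrete, here is a toy sketch: compressing each column on its own tends to work better than compressing the same values interleaved row by row, because each column holds a single, often repetitive, data type. zlib stands in for real columnar codecs here, and the exact numbers will vary:

```python
# Toy comparison: compressing interleaved rows vs. individual columns.
# zlib is only a stand-in for real columnar codecs (LZ4, ZSTD, RLE, ...).
import random
import zlib

random.seed(0)
n = 100_000
event_types = [random.choice(["click", "view", "purchase"]) for _ in range(n)]
user_ids = [str(random.randint(1, 1_000_000)) for _ in range(n)]

# Row-wise: values from different columns are interleaved.
row_wise = "\n".join(f"{u},{e}" for u, e in zip(user_ids, event_types)).encode()

# Columnar: each column is stored and compressed on its own.
col_events = "\n".join(event_types).encode()
col_users = "\n".join(user_ids).encode()

print("row-wise bytes:", len(zlib.compress(row_wise)))
print("columnar bytes:", len(zlib.compress(col_events)) + len(zlib.compress(col_users)))
```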
Benchmarks
Everyone loves theoretical concepts (well, at least let's suppose so), but why not put them into practice? With this in mind, we set up an experiment to compare how different technologies perform in a real-world scenario, covering both OLTP and OLAP systems. The technologies selected were:
OLTP
- MongoDB: a NoSQL database widely used across many applications. It excels at fast, dynamic transactions, made possible by its schema-free, document-oriented design.
- PostgreSQL: a free and open-source database, highly flexible and extensible. Its uses range from small applications to data warehouses.
- Citus: a "worry-free Postgres, built to scale out". It is a PostgreSQL extension that distributes data and queries across several nodes.
OLAP
- cstore_fdw: an open-source Postgres extension, developed by Citus Data, that adds columnar storage to the otherwise row-oriented Postgres database.
- ClickHouse: a recent, open-source columnar database developed by Yandex. It claims to deliver real-time analytical reports using a SQL-like syntax.
Query Performance
To conduct the benchmark, we used a database with approximately 135 million records of web event logs, distributed across 4 different tables. All the technologies were used on-premise, on a local machine with an i5-9600k processor clocked at 3.70 GHz coupled with 32GB of RAM.
We wrote a set of queries that are typically considered analytical (i.e., they focus mostly on columns rather than rows). To keep the process as homogeneous as possible, we flushed the respective storage engine's cache before each query and restarted the machine each time we switched technology. Additionally, for the row-wise databases, we built the most efficient indexes we could for the queries in question; for the columnar stores, we built no indexes beyond those created by default when populating the databases. The results are displayed in the table below.
| Query (time in s) | MongoDB | PSQL | Citus | PSQL cstore_fdw | ClickHouse |
| --- | --- | --- | --- | --- | --- |
| Q1: Total events | 1 | 92 | 46 | 3 | < 1 |
| Q2: Total events of type ‘A’ | 207 | 94 | 46 | 3 | < 1 |
| Q3: Daily events distribution | 462 | 87 | 51 | 18 | 3 |
| Q4: Events distribution by operation | 357 | 96 | 15 | 8 | < 1 |
| Q5: Top 10 events distribution by operation | 288 | 91 | 26 | 5 | < 1 |
| Q6: # of events containing images | 266 | 99 | 97 | 216 | 4 |
| Q7: # of events containing scripts | 270 | 276 | 143 | 456 | 14 |
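For reference, here is a rough sketch of how queries in the spirit of Q1 and Q4 could be issued against ClickHouse from Python. The `events` table, its columns, and the connection details are assumptions for illustration, not the actual benchmark schema:

```python
# Illustrative only: queries in the spirit of Q1 and Q4, issued through
# the clickhouse-driver package. Table and column names are assumed.
from clickhouse_driver import Client

client = Client(host="localhost")

# Q1-style: total number of events.
total_events = client.execute("SELECT count() FROM events")[0][0]

# Q4-style: events distribution by operation.
by_operation = client.execute(
    """
    SELECT operation, count() AS events
    FROM events
    GROUP BY operation
    ORDER BY events DESC
    """
)

print(total_events)
print(by_operation)
```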
Looking at the results, it is no surprise that the columnar databases (PostgreSQL with cstore_fdw and ClickHouse) have considerably shorter running times than the other technologies. However, cstore_fdw is under-optimized for queries that join tables (e.g., a left join) and perform a text search, as shown by the running times for Q6 and Q7.
ClickHouse, on the other hand, outperforms every other technology by far, especially when it comes to cstore_fdw's weak spots: joins combined with text search. While ClickHouse's support for joining tables is limited, this can be worked around with subqueries. The image below displays a comparative view of all the technologies and their performance on the analytical queries (lower is better).
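As an illustration of that subquery workaround, a join-style filter (in the spirit of Q6) can often be rewritten with `IN` over a subquery. Again, the tables and columns here are assumed for the example:

```python
# Illustrative only: expressing a join-style filter as a subquery, which
# suited ClickHouse's limited join support. Schema names are assumed.
from clickhouse_driver import Client

client = Client(host="localhost")

# Intent: count events whose page contains an image.
events_with_images = client.execute(
    """
    SELECT count()
    FROM events
    WHERE page_id IN (
        SELECT page_id
        FROM pages
        WHERE content LIKE '%<img%'
    )
    """
)[0][0]

print(events_with_images)
```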
Storage Size
As you might recall from the comparison table in the wrap-up section, columnar storage compresses data far better than row-wise storage. This is essentially because each column is compressed on its own and holds a single data type. Moreover, as mentioned in the previous section, these databases have no indexes besides the one created by default for the (primary) key. These features allow for highly efficient use of storage space, as shown in the graph below: ClickHouse takes up 10GB, followed by the raw data itself at 76GB and, finally, PostgreSQL at 78GB, nearly eight times as much as ClickHouse.
Note that for PostgreSQL (and other row-wise databases), besides the space the data itself occupies, indexes also contribute substantially to the storage footprint.
Final Remarks
Columnar databases are remarkable at what they do: processing massive amounts of data in a matter of seconds or less. There are plenty of examples out there (Redshift, BigQuery, Snowflake, MonetDB, Greenplum, MemSQL, ClickHouse), all offering excellent performance for analytical workloads.
However, their underlying architecture introduces a considerable caveat: data ingestion. They perform poorly on mixed workloads that require real-time, high-throughput writes. In other words, they can't match the speed at which a transactional database ingests new data in real time.
Combining fast ingestion with fast querying is the holy grail of data processing systems. Achieving an optimal mixed workload is probably not possible at this point, since there is no single do-it-all technology that excels at both. Running a separate database for each purpose is the typical way to go, but it largely limits the performance of the system as a whole.
Data processing is a complex but incredibly interesting technical design area. We firmly believe that scalable real-time high-throughput analytical databases are possible, but they don’t exist.
Yet.
Originally published on the Jscrambler Blog.