The Data Processing Holy Grail? Row vs. Columnar Databases

J. Guilherme Routar de Sousa - Dec 9 '19 - Dev Community

It is arguable that, in recent decades, data has become the #1 resource in the world, dethroning oil as the most valuable asset. However, it only reaches its full potential if properly processed, that is, extracted, stored, and analyzed in a dynamic and productive fashion. Thus, in order to extract all the knowledge your data holds, it is crucial to have effective data processing pipelines.

Throughout this blog post, we will cover the fundamentals that will enable you to build efficient data processing mechanisms, emphasizing analytical solutions in particular. To kick things off, let’s look at the two main data processing systems.

OLAP vs. OLTP

OLTP, Online Transaction Processing, is the most traditional processing system. It manages transaction-oriented applications and is characterized by a large number of short, atomic database operations, such as inserts, updates, and deletes, which are quite common in day-to-day applications. Common examples include online banking and e-commerce applications.
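To make this concrete, here is a minimal sketch of a typical OLTP unit of work: a short, atomic transaction touching a handful of rows. The accounts and transfers tables are hypothetical, purely for illustration.

```sql
-- A short, atomic OLTP transaction (hypothetical schema).
BEGIN;
UPDATE accounts SET balance = balance - 50 WHERE account_id = 42;
UPDATE accounts SET balance = balance + 50 WHERE account_id = 77;
INSERT INTO transfers (from_account, to_account, amount)
VALUES (42, 77, 50);
COMMIT;
```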

OLAP, Online Analytical Processing, manages historical or archival data. It is characterized by a relatively low volume of transactions and is typically used for analytical purposes — to extract insights and knowledge from bulk data merged from multiple sources. Unlike OLTP, the goal of OLAP systems is to run a small number of transactions, each consisting mostly of bulk reads and writes. Data warehouses are the typical infrastructure for these systems.
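By contrast, a typical OLAP query is a bulk read that scans and aggregates a large, often denormalized table. A rough sketch, with a hypothetical sales_history table:

```sql
-- A typical OLAP query: scan and aggregate a large table (hypothetical schema).
SELECT product_category,
       sum(revenue)            AS total_revenue,
       count(DISTINCT user_id) AS buyers
FROM sales_history
WHERE sale_date >= '2019-01-01'
GROUP BY product_category
ORDER BY total_revenue DESC;
```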

| OLTP | OLAP |
| --- | --- |
| Low volume of data | High volume of data |
| High volume of transactions | Low volume of transactions |
| Typically normalized data | Denormalized data |
| ACID compliance | Not necessarily ACID-compliant |
| Requires high availability | Doesn't usually require high availability |
(Figure: OLAP schematics)

Hopefully, by now you are able to distinguish both data processing systems easily. Both OLTP and OLAP systems have been around for quite some time, but recently, with the boom of data mining and machine learning techniques, the demand for OLAP systems has increased. Choosing a suitable technological infrastructure to host either system is also a crucial step to ensure your system or application is delivering the best performance for your needs.

Row vs. Columnar Databases

You might have never heard of the terms row and columnar associated with databases, but you have most certainly come across them before. Row-oriented databases are your typical transactional ones, able to handle a large number of transactions, whereas column-oriented databases usually have fewer transactions, with a higher volume of data. By now, you probably have already guessed which type of database is more suitable for each processing system. Row-oriented DBs are commonly used in OLTP systems, whereas columnar ones are more suitable for OLAP. Some examples include:

| Row-oriented | Columnar |
| --- | --- |
| MySQL | Amazon Redshift |
| PostgreSQL | MariaDB |
| Oracle | ClickHouse |
| ... | ... |

But what makes them different internally?

The key difference between the two lies in how data is physically laid out on disk. Common HDDs are organized in blocks and rely on expensive head seek operations, so sequential reads and writes tend to be much faster than random accesses.

Row-oriented databases store whole rows in the same block whenever possible, whereas columnar databases store the values of each column together, in consecutive blocks.

But what does this mean in practice?

Amazon Redshift provides a simple and concise explanation of the differences between the two storage models. The figure below illustrates row-wise database storage, where each row is stored in a sequential disk block. Picking the ideal block size is important for performance, since a block size that is too small or too large results in inefficient use of disk space.

(Figure: row-wise database storage, from the Amazon Redshift documentation)

The figure below portrays columnar database storage, where each disk block stores the values of a single column for multiple rows. In this example, columnar storage requires one-third of the disk I/O operations to extract the columns' values, compared to the row-wise database.

(Figure: columnar database storage, from the Amazon Redshift documentation)

Performance Example

Let’s look at a simple example to fully understand the differences between both databases.

Consider a database storing a total of 100GB of data, with 100 million rows and 100 columns (1GB per column). For simplification purposes, consider that the database administrator is a rookie and hasn’t implemented any indexes, partitioning, or any other optimization process on the database. With this in mind, for the analytical query:

What is the average age of males?

We would have these possibilities:

  • Row-wise DB -> has to read all the data in the database — 100GB to read
  • Columnar DB -> has to read only the age and gender columns — 2GB to read
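Written as SQL, the query above might look like the following; the people table and its columns are hypothetical, just to mirror the scenario described above.

```sql
-- Only two of the table's 100 columns are actually needed.
-- A columnar engine reads just the age and gender columns (~2GB),
-- while an unindexed row store has to scan every block (~100GB).
SELECT avg(age)
FROM people
WHERE gender = 'male';
```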

This is obviously an extreme example — hardly any database will completely lack indexes or other optimization processes, but the goal is to give you an overview of columnar databases’ true potential for analytical purposes.

Row and Columnar Databases Wrap-up

Now that you have an overview of row-oriented and columnar databases, as well as the main differences between them, highlighting their advantages and disadvantages shouldn’t be too hard:

| Row-oriented DB: for OLTP | Columnar DB: for OLAP |
| --- | --- |
| Fast at performing many small read, write, update, and delete operations | Slow when forced to perform many small read, write, update, and delete operations |
| Bulk operations (read and write) aren't particularly fast | Fast at bulk operations (mostly reads and writes) |
| ACID compliance | No ACID compliance |
| Typically inefficient data compression | Improved data compression — columns are compressed individually and each holds a single data type |
| High storage size (indexes, etc.) | Low storage size |
| Typically requires multiple indexes, depending on queries | Relies on a single index per column (self-indexing) |
| Easy to add a single row (one insert operation) | Hard to add a single row (multi-column insert operation) |
| Hard to add a single column (multi-row insert operation) | Easy to add a single column (one insert operation) |
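To illustrate the self-indexing and compression points above, here is a hedged sketch of a ClickHouse table definition; the events table and its columns are hypothetical. In MergeTree tables, the ORDER BY clause defines a sparse primary index over the physical sort order, and each column is stored and compressed in its own files, so no forest of secondary indexes is needed.

```sql
-- Hypothetical ClickHouse table: the ORDER BY key doubles as a sparse
-- primary index, and each column is compressed independently on disk.
CREATE TABLE events
(
    event_id   UInt64,
    event_type LowCardinality(String),
    created_at DateTime
)
ENGINE = MergeTree
ORDER BY (event_type, created_at);
```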

Benchmarks

Everyone loves theoretical concepts (well, at least let's suppose so), but why not put them into practice? With this in mind, we set up an experiment to compare row and columnar databases in a real-world scenario, using several technologies covering both OLTP and OLAP systems. The technologies selected were:

OLTP

  • MongoDB — a NoSQL database, widely used across many applications. It excels at delivering fast, dynamic transactions thanks to its schema-free, document-oriented design.
  • PostgreSQL — a free and open-source database, highly flexible and extensible. Its uses range from small applications to data warehouses.
  • Citus — “worry-free Postgres, built to scale out”. It is a PostgreSQL extension that distributes data and queries across several nodes (a minimal usage sketch follows this list).
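As a rough sketch of how Citus distributes a table (assuming a cluster is already provisioned; the events table and its distribution column are hypothetical):

```sql
-- Create a regular Postgres table, then shard it across Citus worker
-- nodes by a distribution column.
CREATE TABLE events (
    user_id    bigint,
    event_type text,
    created_at timestamp
);

SELECT create_distributed_table('events', 'user_id');
```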

OLAP

  • Cstore_fdw — an open-source Postgres extension, developed by Citus Data, that adds columnar storage to the otherwise row-oriented Postgres database (a setup sketch follows this list).
  • ClickHouse — a recent, open-source columnar database developed by Yandex. It claims to be capable of delivering real-time analytical data reports using a SQL-like syntax.
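For context, a minimal cstore_fdw setup looks roughly like this: the extension exposes columnar storage through a foreign data wrapper. The events_columnar table and its columns are hypothetical.

```sql
-- Hypothetical cstore_fdw setup: columnar storage via a foreign table.
CREATE EXTENSION cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

CREATE FOREIGN TABLE events_columnar (
    event_id   bigint,
    event_type text,
    created_at timestamp
)
SERVER cstore_server
OPTIONS (compression 'pglz');
```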

Queries Performance

To conduct the benchmark, we used a database with approximately 135 million records of web event logs, distributed across 4 different tables. All the technologies ran on-premises, on a local machine with an i5-9600K processor clocked at 3.70 GHz and 32GB of RAM.

We wrote a set of queries that are typically considered analytical (i.e. they focus mostly on columns rather than rows). To make the process as homogeneous as possible, we flushed the respective storage engine's cache before each query and restarted the machine each time we switched technology. Additionally, for the row-wise databases, we built the most efficient indexes we could for the queries in question; for the columnar stores, we built no indexes beyond the ones created by default when populating the databases. The results are displayed in the table below.
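The exact queries aren't included in this post, but to give a feel for their shape, here is a hedged sketch of what Q3 (daily events distribution) might look like in PostgreSQL; the events table and its column names are assumptions.

```sql
-- Illustrative only: daily events distribution over a hypothetical table.
SELECT date_trunc('day', created_at) AS day,
       count(*)                      AS events
FROM events
GROUP BY day
ORDER BY day;
```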

| Queries (s) | MongoDB | PSQL | Citus | PSQL cstore_fdw | ClickHouse |
| --- | --- | --- | --- | --- | --- |
| Q1: Total events | 1 | 92 | 46 | 3 | < 1 |
| Q2: Total events of type ‘A’ | 207 | 94 | 46 | 3 | < 1 |
| Q3: Daily events distribution | 462 | 87 | 51 | 18 | 3 |
| Q4: Events distribution by operation | 357 | 96 | 15 | 8 | < 1 |
| Q5: Top 10 events distribution by operation | 288 | 91 | 26 | 5 | < 1 |
| Q6: # of events containing images | 266 | 99 | 97 | 216 | 4 |
| Q7: # of events containing scripts | 270 | 276 | 143 | 456 | 14 |

Looking at the data above, it is no surprise that the columnar databases (i.e. PostgreSQL with cstore_fdw and ClickHouse) have considerably shorter running times than the other technologies. However, cstore_fdw is under-optimized for queries that join tables (e.g. a left join) and perform a text search, as shown by the running times for queries Q6 and Q7.

ClickHouse, on the other hand, outperforms every other technology by far, especially on cstore_fdw's weak spots: joins combined with text search. While ClickHouse doesn't inherently support joining tables, this can be worked around with subqueries, as sketched after the figure below. The image below displays a comparative view of all the technologies and their performance on the analytical queries (lower is better).

(Figure: data processing queries performance across all technologies; lower is better)
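Regarding the subquery workaround mentioned above, here is a hedged sketch of the idea, using hypothetical events and resources tables: instead of joining, filter one table by an IN subquery over the other.

```sql
-- Rough equivalent of a join + filter, expressed with an IN subquery.
SELECT count(*)
FROM events
WHERE event_id IN
(
    SELECT event_id
    FROM resources
    WHERE resource_type = 'image'
);
```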

Storage Size

As you might recall from our first comparison table, columnar storage compresses data better than row-wise databases do. This happens essentially because each column, compressed as a single file, holds a single data type. Besides, as mentioned in the previous section, these databases have no indexes other than the one created by default for the (primary) key. These features allow for highly efficient use of storage space, as shown in the graph below: ClickHouse takes 10GB of space, followed by the raw data itself at 76GB and, finally, PostgreSQL at 78GB of used storage, which is roughly 7.8 times ClickHouse's footprint.

(Figure: storage size comparison)

Note that for PostgreSQL (and other row-wise databases), besides the space occupied by the data itself, indexes also contribute substantially to the increase in storage space.
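If you want to reproduce this kind of measurement, both systems expose their on-disk footprint; the events table name below is hypothetical. In PostgreSQL, pg_total_relation_size includes indexes and TOAST data, which is exactly where the row-store overhead shows up; in ClickHouse, the system.parts table reports the compressed size of each table's active parts.

```sql
-- PostgreSQL: total on-disk size of a table, including its indexes.
SELECT pg_size_pretty(pg_total_relation_size('events'));

-- ClickHouse: compressed on-disk size per table, from parts metadata.
SELECT table,
       formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
FROM system.parts
WHERE active
GROUP BY table;
```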

Final Remarks

Columnar databases are rather incredible at what they do ⁠— processing massive amounts of data in a matter of seconds or even less. There are many examples out there — Redshift, BigQuery, Snowflake, MonetDB, Greenplum, MemSQL, ClickHouse — and all offer optimal performance for your analytical needs.

However, their underlying architecture introduces a considerable caveat: data ingestion. They perform poorly on mixed workloads that require real-time, high-throughput writes. In other words, they can't match a transactional database's real-time ingestion performance, where data can be inserted into the system very quickly, row by row.
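In practice, this usually means feeding columnar stores with large batches rather than a stream of single-row inserts. A hedged sketch, with a hypothetical table and values:

```sql
-- Columnar engines generally prefer large, batched inserts over many
-- small single-row transactions.
INSERT INTO events (event_id, event_type, created_at) VALUES
    (1001, 'click',    '2019-12-09 10:00:00'),
    (1002, 'pageview', '2019-12-09 10:00:01'),
    (1003, 'click',    '2019-12-09 10:00:02');
-- In practice: thousands of rows per batch, not one INSERT per row.
```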

Combining fast ingestion and querying is the holy grail of data processing systems. Achieving an optimal mixed workload is probably not possible at this point since there is no single do-it-all technology that excels at both. Having a database for each purpose is the typical way to go, but it largely limits the performance of the whole system.

Data processing is a complex but incredibly interesting technical design area. We firmly believe that scalable real-time high-throughput analytical databases are possible, but they don’t exist.

Yet.


Originally published on the Jscrambler Blog.
