How to deal with large database table

Sameh Muhammed - Oct 11 '22 - - Dev Community

Introduction

How DBMS stores and retrieves our data ?

DBMS stores data on disk blocks or pages (keep in consideration differences between HDD and SSD), as each block or page contains a portion of tables' data, HDD, consists of platters each platter consists of track each trach contains blocks and sector/pages, data stored on pages, in order to retrieve data a rotation on platters and moving of read/write head is mandatory but costive operation.

Image description

On other hand, SSD consists of grids and blocks/pages which no mechanical movements need to retrieve our data, which makes SSD way faster from HDD, but still it costive operation to do.

Image description

So, DBMS in order to serve your queries like below query

SELECT * FROM EMPLOYEES

DBMS knows where its' start location of EMPLOYEES table so it goes to this location and start pulling disk blocks and and loading it into memory and extract EMPLOYEES records, and this the most expensive operation you need to optimize and avoid or reduce it as can as possible.

Image description

How to deal with large table data ?

Off course sequential scan for large data table is insufficient option, so there are multiple tactics you can apply to avoid sequential scan.

Indexing

database index is a data structure built on top of some table values and keep track of where exactly the row that contains this value (page no. and row no. with other metadata), this data structure can be stored in memory if it's fit there or if it large it stored in disk.

Image description

How index helps us to reduce disk scan ?

Instead of sequential scan the whole table to search for specific value or row, index data structure is sorted so searching is efficient in it and when find required value you now know exactly where this value residence on disk.

Image description

Bitmap index (PostgreSQL only)

Default search using index that when a value is matched it goes to disk and fetch desired page and extract the record then return to index and continue search operation, in case of multiple page fetching this might be inefficient , so in order to enhance this operation it builds bitmap as an array of bits representing page numbers as the index(position) of array whenever it find matched page it go to bitmap and put value 1 in it's index and so on, then it go to disk and fetch all desired pages at once.

Image description

Partitioning

Partitioning is the database process where very large tables are divided into multiple smaller parts. By splitting a large table into smaller, individual tables, queries that access only a fraction of the data can run faster because there is less data to scan. The main goal of partitioning is to aid in maintenance of large tables and to reduce the overall response time to read and load data for particular SQL operation.

There are two types of partitioning

Vertical

No alt text provided for this image
which you partition the table by separating the columns of the table.

Image description

Horizontal

No alt text provided for this image
which you partition the table by separating the rows of the table.

Image description

Sharding

Sharding is similar to partitioning but the difference is that sharding is done across multiple database servers or nodes whereas partitioning is done on same database server.

When get most benefit from sharding or partitioning ?

partitioning and sharding is done up on columns and values, so choosing right column or value to shard or partition data on is crucial to get most benefit.

Sharding or partitioning your data based on UUID or random value may just complex your searching capabilities as now data distributed randomly and in this case you may end searching on multiple database instances/servers without any value from sharding/partitioning.

Putting all together

For example you may shard your USERS data based on their country so each country has its' own database instance whenever you are looking for specific users' data you can resolve its' location and forward the request to right database instance/server, and beside sharding you can partition your data based on creation date so that group data that created on each week/month in separate group to subset and slice searching data, doing so you can built index on top of username/id or any field that you most query to minimize scanning and disk input/output operations as much as you can.

Image description

DBMS is smart!

DBMS perform execution plan before it goes to execute query to trace how it can execute this query in most efficient way.

Image description

So DBMS can make decisions based on table data and resources it has, even for full table scan DBMS execute multiple workers/threads to scan the table if needed and merge the results, or even smarter it may decides to scan full table even if there is index in case of your query will return most of the table data so it decides it's more efficient to go to table directly instead on scan index then get results from table and this is based on some statistics done behind the scene by DBMS.

Resources and References

most of content of this article based on Hussein Nasser "Fundamentals of database engineering" course and his YouTube channel.

and as a second and most detailed reference is Designing Data-Intensive Applications book.

IF YOU LIKED THE POST, THEN YOU CAN BUY ME A COFFEE, THANKS IN ADVANCE.

Buy Me A Coffee

. . . . . . . . . . . . . . . . . . . . . . . . . . . .