Understanding PostgreSQL Scan Types

Sandip Mahato - Sep 2 - - Dev Community

Table Content

  1. Little Prior Knowledge about the Indexes and Rows(How search works)
  2. What is a scan?
  3. Let's Prepare some data first
  4. Type of scans
  5. It's not about the number of rows every time

Little Prior Knowledge about the Indexes and Rows(How search works)

Not gonna go deep into what are and what is the purpose of an index. This section focuses on how Rows and Index stored and how searches perform

Image description

In RDS Data Store in pages and a page is fixed size memory blob, In Postgres, the size of the page is generally 8 kB

So generally in the Relation databases Indexes are stored in a B-Tree Data structure and the table's actual data rows are stored in the heap in the form of pages.
B-Tree is structured and ordered in nature which is why searching in indexes is much-much faster.

So when the database finds a match in the index and it's required some data from its respective row then the database will jump to the heap and fetch the required data.

select name from employees where id = 150
Suppose Here id field has an index but the name don't have so name is in the heap memory

What is a scan?

A scan in PostgreSQL refers to the process the database uses to read data from a table.

Let's see what the type of scans are and which scan is used in which condition in Postgres.

Let's Prepare some data first

I am generating 1000000 (1 Million) rows for employees table

create table employees(
id serial primary key,
name varchar(50),
age int);
Enter fullscreen mode Exit fullscreen mode

OUTPUT:

CREATE TABLE
Enter fullscreen mode Exit fullscreen mode
insert into employees(name, age)
select substring(md5(random()::text), 0, 25), ceil(random()*100) from
 generate_series(0, 1000000);
Enter fullscreen mode Exit fullscreen mode

OUTPUT:

INSERT 0 1000001
Enter fullscreen mode Exit fullscreen mode

Take a look at the table

select * from employees limit 10;
Enter fullscreen mode Exit fullscreen mode

OUTPUT:

 id |           name           | age
----+--------------------------+-----
  1 | b8e9596fc5ee85d06bca1b30 |   9
  2 | e8f397f7e2fa26335ce83bbc |  65
  3 | fbf1ccec82216fb6a419763f |  39
  4 | fb5de3587efadfc8103c6625 |  88
  5 | c77706805753a79200c0c475 |  57
  6 | 35de97e48db3af0c4b93c576 |   2
  7 | 70e151aa42b8f29d0212ab6f |  16
  8 | b976c00d30278a6851450e4c |  32
  9 | 409d9feb5b3d87376714e8cc |  19
 10 | 6f907e21d7f7f15ab3dce736 |  84
(10 rows)

Enter fullscreen mode Exit fullscreen mode

Have a look at employees table structure as well

\d employees
Enter fullscreen mode Exit fullscreen mode

OUTPUT:

 Column |         Type          | Collation | Nullable |                Default
--------+-----------------------+-----------+----------+-----------------
 id     | integer               |           | not null | nextval('employees_id_seq'::regclass)
 name   | character varying(50) |           |          |
 age    | integer               |           |          |
Indexes:
    "employees_pkey" PRIMARY KEY, btree (id)

Enter fullscreen mode Exit fullscreen mode

Types of scans

  1. Index Scan
  2. Index Only Scan
  3. Sequential Scan
  4. Bitmap Scan

Index Scan

When you have to choose very selective rows or if you are selecting many rows via range query on Primary Key Index and also the index is present for the searching parameter obviously then only Index scan is performed

explain analyze select name from employees where id < 100;
Enter fullscreen mode Exit fullscreen mode

OUTPUT:

                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Index Scan using employees_pkey on employees  (cost=0.42..10.14 rows=98 width=4) (actual time=0.005..0.017 rows=99 loops=1)
   Index Cond: (id < 100)
 Planning Time: 0.094 ms
 Execution Time: 0.030 ms

Enter fullscreen mode Exit fullscreen mode

Here I am selecting approx 100 rows which is not even 1% of 1 million rows. So here Index Scan is done.

Note: Here for name value I have to go to the heap whenever I find a match in id index

Index Only Scan

Postgres will do an Index Only Scan when the database doesn't have to into the heap all the sufficient data is present in the heap it self.

Let's understand with this with a simple but stupid example

explain analyze select id from employees where id=100;
Enter fullscreen mode Exit fullscreen mode

OUTPUT

                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using employees_pkey on employees  (cost=0.42..4.44 rows=1 width=4) (actual time=0.338..0.340 rows=1 loops=1)
   Index Cond: (id = 100)
   Heap Fetches: 0
 Planning Time: 0.252 ms
 Execution Time: 0.523 ms

Enter fullscreen mode Exit fullscreen mode

Here database doesn't have to go into the heap for anything. So this is Index Only Scan
A good example can be performed using Index Included Column

Sequential Scan

Postgress will choose Sequential search in a worst-case scenario maybe. Because here he has to traverse the heap which is very slow. This can happen in multiple scenarios

  1. When you are selecting too many rows
  2. When indexing is not present for the search parameter

Here I am taking an example of the first scenario

explain analyze select name from employees where id<600000;
Enter fullscreen mode Exit fullscreen mode

OUTPUT

                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Seq Scan on employees  (cost=0.00..20834.01 rows=600472 width=25) (actual time=0.016..92.301 rows=599999 loops=1)
   Filter: (id < 600000)
   Rows Removed by Filter: 400002
 Planning Time: 0.081 ms
 Execution Time: 109.809 ms

Enter fullscreen mode Exit fullscreen mode

Here you can see I am selecting more than half a million rows around 60% of the total rows.
So even if the id has indexing it's not scanning the index because switching between B-Tree to Heap frequently is very Time Consuming.

Bitmap Scan

It's is special type of scan performed when you have a fair amount of rows not too many and not too less. Scanning through the index is efficient. And The rows are far away from each other, like one row is on page 1 second row is on page 10, which means not in sequence.

During a bitmap scan, the database creates a bitmap where each bit corresponds to a specific row or a specific block (page) or Tuple ID in the table.
Imagine Bitmap as an array of 0's and 1's

Here I am gonna show two scenarios for understanding when and why Postgres will choose Bitmap and when not.

 create index age_index on employees(age) include(id);
Enter fullscreen mode Exit fullscreen mode

OUTPUT

CREATE INDEX
Enter fullscreen mode Exit fullscreen mode

For the Bitmap example, I created an age_index.

explain analyze select * from employees where age=60;
Enter fullscreen mode Exit fullscreen mode

OUTPUT

explain analyze select * from employees where age=60;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on employees  (cost=195.48..9092.74 rows=10200 width=33) (actual time=1.425..6.772 rows=10037 loops=1)
   Recheck Cond: (age = 60)
   Heap Blocks: exact=5869
   ->  Bitmap Index Scan on age_index  (cost=0.00..192.93 rows=10200 width=0) (actual time=0.756..0.757 rows=10037 loops=1)
         Index Cond: (age = 60)
 Planning Time: 0.051 ms
 Execution Time: 7.060 ms

Enter fullscreen mode Exit fullscreen mode

As you can see in the output we are fetching 10037 rows which is nearly 1% of 1 Million rows.

It's not about the number of rows every time

But again it's not only about the number of rows every time let's take another example.

explain analyze select * from employees where id < 500000;
Enter fullscreen mode Exit fullscreen mode
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using employees_pkey on employees  (cost=0.42..18323.98 rows=497632 width=33) (actual time=0.027..90.433 rows=499999 loops=1)
   Index Cond: (id < 500000)
 Planning Time: 0.081 ms
 Execution Time: 101.570 ms
Enter fullscreen mode Exit fullscreen mode

Here you can see we are fetching half a million rows which is 50% of the total rows but the database uses the Index scan instead of Sequence scan or Bitmap scan.

So The Question is WHYYY?

So here we are searching on employees_pkey which is our primary_key index.

The id column is the primary key and primary key indexes are typically very efficient for range queries. In this case, the index on id allows PostgreSQL to quickly locate and fetch rows where id < 500000.

With nearly half a million rows, the index scan is efficient because it directly leverages the sorted nature of the index on id. This avoids scanning the whole table and focuses on the range of values.

The cost of scanning through an index and fetching rows sequentially is generally less than scanning the table itself for such a large number of rows.

In a nutshell, On primary key index range queries are super fast.

Thank you for reading. This is my first blog post so if you have any suggestions on how I can improve please suggest.

.