Table Content
- Little Prior Knowledge about the Indexes and Rows(How search works)
- What is a scan?
- Let's Prepare some data first
- Type of scans
- It's not about the number of rows every time
Little Prior Knowledge about the Indexes and Rows(How search works)
Not gonna go deep into what are and what is the purpose of an index. This section focuses on how Rows and Index stored and how searches perform
In RDS Data Store in pages and a page is fixed size memory blob, In Postgres, the size of the page is generally 8 kB
So generally in the Relation databases Indexes are stored in a B-Tree Data structure and the table's actual data rows are stored in the heap in the form of pages.
B-Tree is structured and ordered in nature which is why searching in indexes is much-much faster.
So when the database finds a match in the index and it's required some data from its respective row then the database will jump to the heap and fetch the required data.
select name from employees where id = 150
Suppose Here id
field has an index but the name
don't have so name
is in the heap memory
What is a scan?
A scan in PostgreSQL refers to the process the database uses to read data from a table.
Let's see what the type of scans are and which scan is used in which condition in Postgres.
Let's Prepare some data first
I am generating 1000000 (1 Million) rows for employees table
create table employees(
id serial primary key,
name varchar(50),
age int);
OUTPUT:
CREATE TABLE
insert into employees(name, age)
select substring(md5(random()::text), 0, 25), ceil(random()*100) from
generate_series(0, 1000000);
OUTPUT:
INSERT 0 1000001
Take a look at the table
select * from employees limit 10;
OUTPUT:
id | name | age
----+--------------------------+-----
1 | b8e9596fc5ee85d06bca1b30 | 9
2 | e8f397f7e2fa26335ce83bbc | 65
3 | fbf1ccec82216fb6a419763f | 39
4 | fb5de3587efadfc8103c6625 | 88
5 | c77706805753a79200c0c475 | 57
6 | 35de97e48db3af0c4b93c576 | 2
7 | 70e151aa42b8f29d0212ab6f | 16
8 | b976c00d30278a6851450e4c | 32
9 | 409d9feb5b3d87376714e8cc | 19
10 | 6f907e21d7f7f15ab3dce736 | 84
(10 rows)
Have a look at employees table structure as well
\d employees
OUTPUT:
Column | Type | Collation | Nullable | Default
--------+-----------------------+-----------+----------+-----------------
id | integer | | not null | nextval('employees_id_seq'::regclass)
name | character varying(50) | | |
age | integer | | |
Indexes:
"employees_pkey" PRIMARY KEY, btree (id)
Types of scans
- Index Scan
- Index Only Scan
- Sequential Scan
- Bitmap Scan
Index Scan
When you have to choose very selective rows or if you are selecting many rows via range query on Primary Key Index and also the index is present for the searching parameter obviously then only Index scan is performed
explain analyze select name from employees where id < 100;
OUTPUT:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Index Scan using employees_pkey on employees (cost=0.42..10.14 rows=98 width=4) (actual time=0.005..0.017 rows=99 loops=1)
Index Cond: (id < 100)
Planning Time: 0.094 ms
Execution Time: 0.030 ms
Here I am selecting approx 100 rows which is not even 1% of 1 million rows. So here Index Scan is done.
Note: Here for name
value I have to go to the heap whenever I find a match in id
index
Index Only Scan
Postgres will do an Index Only Scan when the database doesn't have to into the heap all the sufficient data is present in the heap it self.
Let's understand with this with a simple but stupid example
explain analyze select id from employees where id=100;
OUTPUT
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Index Only Scan using employees_pkey on employees (cost=0.42..4.44 rows=1 width=4) (actual time=0.338..0.340 rows=1 loops=1)
Index Cond: (id = 100)
Heap Fetches: 0
Planning Time: 0.252 ms
Execution Time: 0.523 ms
Here database doesn't have to go into the heap for anything. So this is Index Only Scan
A good example can be performed using Index Included Column
Sequential Scan
Postgress will choose Sequential search in a worst-case scenario maybe. Because here he has to traverse the heap which is very slow. This can happen in multiple scenarios
- When you are selecting too many rows
- When indexing is not present for the search parameter
Here I am taking an example of the first scenario
explain analyze select name from employees where id<600000;
OUTPUT
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Seq Scan on employees (cost=0.00..20834.01 rows=600472 width=25) (actual time=0.016..92.301 rows=599999 loops=1)
Filter: (id < 600000)
Rows Removed by Filter: 400002
Planning Time: 0.081 ms
Execution Time: 109.809 ms
Here you can see I am selecting more than half a million rows around 60% of the total rows.
So even if the id
has indexing it's not scanning the index because switching between B-Tree to Heap frequently is very Time Consuming.
Bitmap Scan
It's is special type of scan performed when you have a fair amount of rows not too many and not too less. Scanning through the index is efficient. And The rows are far away from each other, like one row is on page 1 second row is on page 10, which means not in sequence.
During a bitmap scan, the database creates a bitmap where each bit corresponds to a specific row or a specific block (page) or Tuple ID in the table.
Imagine Bitmap as an array of 0's and 1's
Here I am gonna show two scenarios for understanding when and why Postgres will choose Bitmap and when not.
create index age_index on employees(age) include(id);
OUTPUT
CREATE INDEX
For the Bitmap example, I created an age_index
.
explain analyze select * from employees where age=60;
OUTPUT
explain analyze select * from employees where age=60;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on employees (cost=195.48..9092.74 rows=10200 width=33) (actual time=1.425..6.772 rows=10037 loops=1)
Recheck Cond: (age = 60)
Heap Blocks: exact=5869
-> Bitmap Index Scan on age_index (cost=0.00..192.93 rows=10200 width=0) (actual time=0.756..0.757 rows=10037 loops=1)
Index Cond: (age = 60)
Planning Time: 0.051 ms
Execution Time: 7.060 ms
As you can see in the output we are fetching 10037 rows which is nearly 1% of 1 Million rows.
It's not about the number of rows every time
But again it's not only about the number of rows every time let's take another example.
explain analyze select * from employees where id < 500000;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using employees_pkey on employees (cost=0.42..18323.98 rows=497632 width=33) (actual time=0.027..90.433 rows=499999 loops=1)
Index Cond: (id < 500000)
Planning Time: 0.081 ms
Execution Time: 101.570 ms
Here you can see we are fetching half a million rows which is 50% of the total rows but the database uses the Index scan instead of Sequence scan or Bitmap scan.
So The Question is WHYYY?
So here we are searching on employees_pkey
which is our primary_key index
.
The id column is the primary key and primary key indexes are typically very efficient for range queries. In this case, the index on id allows PostgreSQL to quickly locate and fetch rows where id < 500000
.
With nearly half a million rows, the index scan is efficient because it directly leverages the sorted nature of the index on id. This avoids scanning the whole table and focuses on the range of values.
The cost of scanning through an index and fetching rows sequentially is generally less than scanning the table itself for such a large number of rows.
In a nutshell, On primary key index range queries are super fast.
Thank you for reading. This is my first blog post so if you have any suggestions on how I can improve please suggest.