Recently I was asked to look into a case where an identical SQL executed on the same database in two different ways resulted in two quite different latencies: 133ms versus 750ms.
One was acceptable for the client, and the other one wasn't.
How can the same SQL yield two such different latencies? First let me explain the setting: the first latency figure, the fast one, was a plain SQL. The second one was the exact same SQL, however it was inside a PLpgSQL procedure, and using a refcursor. Both returned around 60.000 rows. The SQL statement in both cases did not have any variables, they were completely identical.
For the plain SQL, it's easy to diagnose the execution plan and the time it spent: just put explain analyze
in front of the SQL, execute it, and look at the statistics. However, for the SQL executed in PLpgSQL procedure, it is not possible to get the execution plan from inside the PLpgSQL procedure. The general way to deal with this is to use the plain SQL approach, by copying the SQL from the procedure and explain it. And that was the problem: once executed as plain SQL it exposed an acceptable latency, only when executed as PLpgSQL it exposed a different, not acceptable latency.
So, what can you do about it? It took me a few stabs at it before I found the issue.
We had an issue created for this, and a colleague found that if the order by from the statement was removed, it would function correct again.
I took an approach quite native to me, and ran linux perf
to record execution stacks with call-graphs (perf record -g -p <PID>
) to see what it was doing.
Many of the principal parts of an execution plan can be seen back in the C functions. One way of doing that is to look in the perf output (using perf report
) for the function ExecScan
, and look at what that function calls. This is the execution plan:
Sort (cost=14603.24..14764.65 rows=64562 width=20)
Sort Key: tile_row, tile_col, block_row, block_col
-> Bitmap Heap Scan on block_tbl b (cost=2139.33..9445.26 rows=64562 width=20)
Recheck Cond: ((world_key = 19301) AND (perimeter_num = 8007) AND (updated_ts = '1655419238118'::bigint))
Filter: ((updated_user_key = 0) AND (frozen_flag = 0))
-> Bitmap Index Scan on block_perimeter (cost=0.00..2123.18 rows=65108 width=0)
Index Cond: ((world_key = 19301) AND (perimeter_num = 8007) AND (updated_ts = '1655419238118'::bigint))
And if I allocate ExecScan, and look at it's surrounding calls, it looks like this:
- 27.44% ExecSort
- 12.93% tuplesort_gettupleslot
- 9.15% tuplesort_gettuple_common
- 4.42% mergereadnext
- 2.84% readtup_heap
1.58% __mcount_internal
- 0.95% getlen
0.63% __mcount_internal
0.63% __mcount_internal
- 3.47% tuplesort_heap_replace_top
- 2.84% comparetup_heap
1.58% __mcount_internal
1.26% __mcount_internal
2.52% __mcount_internal
0.95% _mcount
- 6.31% ExecScan
- 5.36% BitmapHeapNext
- 3.15% MultiExecBitmapIndexScan
index_getbitmap
- btgetbitmap
- 2.21% _bt_next
_bt_steppage
_bt_readnextpage
- _bt_readpage
- _bt_checkkeys
0.63% FunctionCall2Coll
- 0.95% tbm_add_tuples
0.63% tbm_get_pageentry
1.26% tbm_iterate
0.63% heap_hot_search_buffer
- 0.63% ExecInterpExpr
slot_getsomeattrs
slot_deform_tuple
There we got the sort first (ExecSort) and the scan second (ExecScan), which performs a function called MultiExecBitmapIndexScan. I can see the similarity of the functions and the executionplan.
Okay, now how about the perf report from the procedure? That looks like this:
- 13.55% ExecScan
- 10.57% IndexNext
- 9.76% index_getnext
- 9.49% index_fetch_heap
- 3.25% ReadBufferExtended
- ReadBuffer_common
- 1.08% BufTableLookup
0.54% __mcount_internal
0.54% hash_search_with_hash_value
0.81% PinBuffer
0.54% LWLockAcquire
1.90% heap_hot_search_buffer
1.63% heap_page_prune_opt
0.81% __mcount_internal
- 0.81% ReleaseAndReadBuffer
__mcount_internal
0.54% _mcount
0.54% LWLockAcquire
- 0.81% ExecStoreTuple
0.54% __mcount_internal
- 1.63% ExecInterpExpr
0.54% slot_getsomeattrs
This looks different. It's performing IndexNext instead of BitmapHeapNext, so it's definitely using a different execution plan. Also, I was not able to find the ExecSort function. So the SQL execution is not performing sorting.
At this point I could prove there was another execution plan by looking at the low level execution. My suspicion at this point was that the planner might optimise for PLpgSQL, and leave some work (such as maybe the sorting) to PLpgSQL, which might be why it's gone, and another plan was chosen. But PLpgSQL is not a core component of the postgres code, it's a loadable procedural language, which you can validate by looking at the extensions.
What I tried next was creating the SQL as a function, and then use the function as a table inside the procedure. Functions are known for being able to be a 'barrier' for the planner to optimise SQL, and are treated as a 'black box', meaning it's left untouched. And that actually worked! By creating a function that returns a table, I get near the SQL performance.
However: that would mean a lot of work to revert all the SQLs in the procedures to individual functions, and swap the SQL with that function. And of course it doesn't explain why the procedure did influence the SQL optimization.
Then the client came back to me and said that if he used the refcursor for obtaining one or a small number of rows, it was actually pretty quick...but that that wasn't useful to him, because in the way it's used, it needs all the (60.000) rows.
That got me thinking: wait a minute...what if it isn't the procedure that is influencing this, but the cursor that is created? And a cursor can be explained! Explaining a cursor works in the following way:
postgres=# begin;
BEGIN
postgres=# explain declare tt cursor for select b.tile_row, b.tile_col, b.block_row, b.block_col, b.block_value
postgres-# from block_tbl b
postgres-# where b.world_key = 19301
postgres-# and b.updated_ts = 1655419238118
postgres-# and b.updated_user_key = 0
postgres-# and b.frozen_flag = 0
postgres-# and b.perimeter_num = 8007
postgres-# order by b.tile_row, b.tile_col, b.block_row, b.block_col;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Index Scan using block_tile on block_tbl b (cost=0.42..40080.63 rows=64562 width=20)
Index Cond: ((world_key = 19301) AND (frozen_flag = 0))
Filter: ((updated_ts = '1655419238118'::bigint) AND (updated_user_key = 0) AND (perimeter_num = 8007))
Bingo! This shows a different execution plan than earlier, which used a sort and bitmap scans.
Okay, this proves another plan is captured based on the SQL used for a cursor. But why is this happening?
It turns out SQL based on cursors is automagically optimised for a percentage of first rows based on the parameter cursor_tuple_fraction
, which defaults to 0.1, alias 10%, whilst regular SQL execution is optimised for fetching 100%.
This is visible in the source code in parsenodes.h and planner.c. With cursors, or with the SPI_prepare_cursor function. The CURSOR_OPT_FAST_PLAN flag can be set, which then performs the first row optimization. Sadly, the different optimization goal is not shown, so it can't be seen from the execution plan.
Summary
The postgres planner optimises for getting 100% of the rows for execution. This is different for cursors, which are optimised for getting the first 10% of rows, which can lead to different execution plans and different latency if you do not want 10% of the rows as fast as possible, but 100% of the rows as fast as possible.