This is a technical investigation into memory allocation at the PostgreSQL and linux (memory allocation/malloc()) levels. Any comments are welcomed!
A postgres backend functions as a standalone operating system process that is forked from the postmaster. This is actually a very memory efficient operation on linux, because a lot of pages can be shared between the postmaster and the newly forked PostgreSQL process. However, as soon as the backend process starts executing code itself to bootstrap the process initialisation in general but for PostgreSQL in our case, it allocates and writes its own memory pages, which are unique to the process and remain paged in, and thus add to the RSS and are not shared.
The PostgreSQL processing memory areas are administered by PostgreSQL, which uses its own palloc() allocator, which calls malloc(), but provides some extra services, such as an administration of memory area's. This is very helpful, because this allows diagnosis of these memory area's.
This investigation is executed with a PostgreSQL server compiled from the source of version 11. I written a small PLpgSQL anonymous code block to do nothing else than filling an array, for the sake of allocating memory. This is the source code of that:
do $$
declare
array text[];
counter int:=1;
max_counter int:=5000;
begin
loop
array[counter]:=repeat('x',16384);
counter:=counter+1;
if counter = max_counter then
exit;
end if;
end loop;
perform pg_sleep(60);
end $$;
It really is quite simply looping over an array that gets filled with repeat('x',16384)
, and once it finished max_counter
loops, it calls pg_sleep(60)
to have the array being allocated for some time for analysis. (and pg_sleep() can easily be interrupted with ctrl-c)
When I create a postgres backend by logging in with psql, it occupies 18M, but proportionally (because of the shared pages) 7.9M:
# smem -k | grep -e [P]ID -e 6197
PID User Command Swap USS PSS RSS
6197 postgres postgres: postgres postgres 0 5.5M 7.9M 18.0M
When I run the anonymous PLpgSQL block, The size increases because of the allocations for the array:
# smem -k | grep -e [P]ID -e 6197
PID User Command Swap USS PSS RSS
6236 postgres postgres: postgres postgres 0 85.1M 87.6M 97.9M
The memory allocations for RSS and PSS increased by roughly 80M, which is not shocking: the array data needs to be stored.
We can see the the PostgreSQL memory allocations via gdb, the GNU debugger by calling print MemoryContextStats(TopMemoryContext)
. For version 11 this is the only way that I am aware that is reasonably convenient to get the PostgreSQL level memory details and sizes:
# gdb -p 6197
...
(gdb) c
Continuing.
^C
Program received signal SIGINT, Interrupt.
0x00007fe154a1d9eb in epoll_wait () from /lib64/libc.so.6
(gdb) print MemoryContextStats(TopMemoryContext)
$1 = void
(gdb) c
Continuing.
The command prints the memory details in the PostgreSQL log file:
TopMemoryContext: 67424 total in 5 blocks; 12256 free (7 chunks); 55168 used
...
TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
PortalContext: 1024 total in 1 blocks; 448 free (0 chunks); 576 used:
SPI Proc: 32768 total in 3 blocks; 11528 free (2 chunks); 21240 used
SPI TupTable: 8192 total in 1 blocks; 6544 free (0 chunks); 1648 used
PLpgSQL per-statement data: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
expanded array: 82299680 total in 5006 blocks; 32040 free (29 chunks); 82267640 used
...
Grand total: 83565536 bytes in 5233 blocks; 401536 free (148 chunks); 83164000 used
This shows the start of the MemoryContextStats() dump, where each memory context lists its own stats, so the TopMemoryContext does not include the array allocation.
A little further in the TopPortalContext the allocation for the array can be found in 'expanded array', which shows 82267640 bytes (78.5M)in use. This is all logical and reasonable.
When the call to sleep expires, the anonymous block ends. This releases any memory allocated by the PLpgSQL procedure and thus the array. This is completely logical, and to be sure we can perform the same call to MemoryContextStats() to validate PostgreSQL level allocations:
TopMemoryContext: 67424 total in 5 blocks; 13888 free (13 chunks); 53536 used
...
TopPortalContext: 8192 total in 1 blocks; 7936 free (1 chunks); 256 used
(no further sub allocations in TopPortalContext)
...
Grand total: 1067136 bytes in 161 blocks; 289296 free (113 chunks); 777840 used
The grand total shows 1067136 bytes, which is approximately 1M which is allocated to the backend.
However: if we look at the linux level allocation with smem again:
# smem -k | grep -e [P]ID -e 6236
PID User Command Swap USS PSS RSS
6236 postgres postgres: postgres postgres 0 66.4M 68.8M 79.2M
There still is much memory memory allocated to the process, which is approximately 60M, for which we know from the PostgreSQL memory dump is not allocated by PostgreSQL anymore. Where did it go?
Luckily, there is some help here too. PosgreSQL uses it's own memory allocation logic via palloc(), but that calls malloc() eventually. malloc has some diagnostics that can be called to see what is going on. One of these diagnostics is the function malloc_stats()
. This can be called when attaching with gdb, and call that function inside the process using gdb, which is the same principle as the PostgreSQL function call to MemoryContextStats():
(gdb) call (void) malloc_stats()
This will print an overview from malloc, which sits in between the operating system and PostgreSQL as user land executable:
Arena 0:
system bytes = 63832064
in use bytes = 917696
Total (incl. mmap):
system bytes = 71811072
in use bytes = 8896704
max mmap regions = 3
max mmap bytes = 7979008
This explains the hiatus between the operating system reporting 70M and the PostgreSQL level memory dump saying it released memory down to approximately 1M!
What we see is 'Arena 0', which is roughly put the administration of memory allocations of malloc() for this process, which has allocated from 'system' 63832064 bytes (60.9M), whilst actually in use (by PostgreSQL) is 917696 bytes (1M). What malloc() tries to do, is keep memory allocated to prevent having to deallocate and allocate over and over.
The question is how this works further down the line; I tried running some simple SQLs to see if this would trigger malloc() to release more memory, which it didn't do. My idea was that that the simple SQLs showed the execution didn't need all of that memory anymore.
There are descriptions that can be found on forums that say that the memory not in use is marked with madvise()
calls. I don't know if that can be seen on the linux level; and if it can, if that would mean the /proc/PID/smaps
statistic LazyFree
should be set indicating there is memory still allocated that can be freed by the OS? (I also did not find madvise() calls being executed by the PostgreSQL backend that performed the PLpgSQL code, so it does not seem likely to me this happens; I validated this by breaking on madvise())
Based on what linux tells me in the smaps file, it seems the pages being private paged in pages cannot be reused. But I hope someone can tell me if this indeed is the case, the backend needs to stop to truly release the memory for reuse, or if there is something that I have not seen yet? I am happy to add it to this investigation!
ps.
the smem
utility can be found in EPEL.
malloc_info:
There is another function that can be used inside a program that uses malloc (ptmalloc) to obtain the status of memory, which is the function malloc_info.
If you want to use this for PostgreSQL and trigger it to write to the PostgreSQL log file, perform calling the function in the following way:
(gdb) call (int) fopen("/tmp/malloc_info.txt", "wb")
$1 = 14729056
(gdb) call (int) malloc_info(0,$1)
$2 = 0
(gdb) call (int) fclose($1)
$3 = 0
This results in malloc level allocation administration results:
$ cat /tmp/malloc_info.txt
<malloc version="1">
<heap nr="0">
<sizes>
<size from="65" to="65" total="65" count="1"/>
<size from="209" to="209" total="209" count="1"/>
<size from="369" to="369" total="369" count="1"/>
<size from="785" to="785" total="785" count="1"/>
<unsorted from="1041" to="23595025" total="62780003" count="19"/>
</sizes>
<total type="fast" count="0" size="0"/>
<total type="rest" count="24" size="62914999"/>
<system type="current" size="63832064"/>
<system type="max" size="83521536"/>
<aspace type="total" size="63832064"/>
<aspace type="mprotect" size="63832064"/>
</heap>
<total type="fast" count="0" size="0"/>
<total type="rest" count="24" size="62914999"/>
<total type="mmap" count="3" size="7979008"/>
<system type="current" size="63832064"/>
<system type="max" size="83521536"/>
<aspace type="total" size="63832064"/>
<aspace type="mprotect" size="63832064"/>
</malloc>
Which does not seem to show memory area's actually in use, but rather the malloc administration of the chunks.
tcmalloc and yugabyte
If you are on yugabyte, the memory allocator is not ptmalloc, but tcmalloc, a different memory allocator.
This allocator also has a facility to show information about its memory administration, but requires different commands. It does require attaching with gdb to execute the command:
(gdb) print yb::TcMallocStats()
$4 = {<std::__1::__basic_string_common<true>> = {<No data fields>}, static __short_mask = 1, static __long_mask = 1,
...etc...
(gdb) printf "%s", $4.__r_.__value_.__l.__data_
------------------------------------------------
MALLOC: 20362064 ( 19.4 MiB) Bytes in use by application
MALLOC: + 6537216 ( 6.2 MiB) Bytes in page heap freelist
MALLOC: + 836400 ( 0.8 MiB) Bytes in central cache freelist
MALLOC: + 270848 ( 0.3 MiB) Bytes in transfer cache freelist
MALLOC: + 1419136 ( 1.4 MiB) Bytes in thread cache freelists
MALLOC: + 2621440 ( 2.5 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 32047104 ( 30.6 MiB) Actual memory used (physical + swap)
MALLOC: + 4718592 ( 4.5 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 36765696 ( 35.1 MiB) Virtual address space used
MALLOC:
MALLOC: 757 Spans in use
MALLOC: 10 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
------------------------------------------------
Total size of freelists for per-thread caches,
transfer cache, and central cache, by size class
------------------------------------------------
class 1 [ 8 bytes ] : 586 objs; 0.0 MiB; 0.0 cum MiB; 0.000 overhead MiB; 0.000 cum overhead MiB
...etc...