By Franck Pachot
In a previous post, I compiled PostgreSQL with GCC 7 and GCC 11 using the default options and checked that the ARM v8.2 features were used. However, it is not that simple:
- PostgreSQL ./configure defines gcc flags to produce a binary that is compatible with older versions of ARM. This means that it may not use the LSE feature introduced in ARM v8.1
- With this, the default compilation (-march=armv8-a) doesn't use LSE (Large System Extensions) for atomic instructions, because they were introduced in v8.1
- This means that we need to call ./configure with CFLAGS="-march=armv8-a+lse" or CFLAGS="-march=armv8.1-a" or CFLAGS="-march=armv8.2-a" to get a binary using LSE atomics
- When I compiled with GCC 11, without adding this CFLAGS, I had the LSE instructions in my binary because GCC 10 introduced the possibility to detect at run-time the support of LSE, and it is enabled by default
- When I compiled with GCC 7, I also had the LSE instructions, despite using the default configure, because I was running on Amazon Linux 2, where this run-time detection has been backported and enabled by default
- However, where this run-time detection has been backported, it is not enabled by default on all versions or distributions, and may then require -moutline-atomics to use LSE
So, the most important thing is to understand whether your binary will use LSE (which shows a great performance improvement on spinlocks and lightweight locks). It is also good to know whether LSE is inlined, or outlined by this run-time detection (which has a limited overhead but allows the binary to be compatible with all ARM versions).
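Before looking at the binary, it can help to confirm that the CPU itself advertises LSE. Here is a minimal sketch, relying on the fact that Linux on AArch64 reports LSE as the "atomics" flag in the Features line of /proc/cpuinfo. The function name has_lse and its optional file argument are mine, for illustration only.

```shell
# has_lse [FILE]: succeed if the "Features" line lists the "atomics" flag,
# which is how Linux on AArch64 reports LSE support.
# FILE defaults to /proc/cpuinfo; the argument only exists to ease testing.
has_lse() {
  grep -E '^Features[[:space:]]*:' "${1:-/proc/cpuinfo}" 2>/dev/null |
    grep -qw atomics
}

if has_lse; then
  echo "CPU reports LSE atomics"
else
  echo "no LSE reported (or not an AArch64 Linux host)"
fi
```

This only tells what the processor can do. Whether the binary takes advantage of it is a separate question, which is what the rest of this post checks.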
I have started a t4g.micro EC2 instance, which runs on the ARM v8.2 Neoverse N1, the Amazon Graviton2. I'm using it because we have 750 hours per month free until June 30th, 2021, and I try to keep my demos easy and free to reproduce. But of course this instance size is too small for running a database and compiling gcc.
[ec2-user@ip-172-31-83-114 postgresql-14devel]$ PS1="$(curl -s http://169.254.169.254/latest/meta-data/instance-type)# "
t4g.micro#
I've set my prompt to the instance type because I have a lot of terminals open and I like to see which instances are still running (especially the non-free ones).
I am running on Amazon Linux 2 with GCC 7
t4g.micro# cat /etc/system-release
Amazon Linux release 2 (Karoo)
t4g.micro# gcc --version
gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-12)
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
But here the run-time detection and outlined atomics have been backported and enabled by default.
I'll verify that by building the PostgreSQL binaries.
sudo yum install -y git gcc readline-devel bison-devel zlib-devel
curl -s https://ftp.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.gz | tar -zxvf -
cd postgresql-14devel
./configure --enable-debug
make clean
make
sudo make install
This downloads the PostgreSQL source (version 14, from the latest development snapshot) and compiles it with all default options.
t4g.micro# file /usr/local/pgsql/bin/postgres
/usr/local/pgsql/bin/postgres: ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 3.7.0, BuildID[sha1]=37c48090a793b8ee8b62e46e361ac4c24646732f, not stripped
I'm checking the load and store exclusives, the only ones available before ARM v8.1, and LSE which can replace them with atomic instructions starting from ARM v8.1
t4g.micro# objdump -d /usr/local/pgsql/bin/postgres | awk '/\t(ldxr|ldaxr|stxr|stlxr)/{print $3"\t(load and store exclusives)"}/\t(cas|casp|swp|ldadd|stadd|ldclr|stclr|ldeor|steor|ldset|stset|ldsmax|stsmax|ldsmin|stsmin|ldumax|stumax|ldumin|stumin)/{print $3"\t(large-system extensions)"}' | sort -k2 | uniq -c
2 casal (large-system extensions)
2 ldaddal (large-system extensions)
1 ldclral (large-system extensions)
1 ldsetal (large-system extensions)
1 swpa (large-system extensions)
7 ldaxr (load and store exclusives)
6 stlxr (load and store exclusives)
1 stxr (load and store exclusives)
I see both here. Instructions like CASAL (compare and swap) and SWPA (swap words) are LSE atomics. LDAXR and STXR are the instructions used before LSE.
Both are there because I compiled with a GCC version that builds a binary with run-time detection to use one or the other. I'll take one function that makes use of atomics. If you have a high rate of writes in your PostgreSQL databases, from multiple sessions, you may have seen contention on WAL generation, especially in case of CPU starvation, because XLogInsert->XLogInsertRecord->ReserveXLogInsertLocation uses GCC built-in atomics to protect exclusive access to the shared memory structure.
Let's see what the compiler has generated for this:
t4g.micro# gdb --batch /usr/local/pgsql/bin/postgres -ex 'disas ReserveXLogInsertLocation'
Dump of assembler code for function ReserveXLogInsertLocation:
0x000000000055ff18 <+0>: stp x29, x30, [sp, #-80]!
0x000000000055ff1c <+4>: mov x29, sp
0x000000000055ff20 <+8>: str w0, [sp, #44]
0x000000000055ff24 <+12>: str x1, [sp, #32]
0x000000000055ff28 <+16>: str x2, [sp, #24]
0x000000000055ff2c <+20>: str x3, [sp, #16]
0x000000000055ff30 <+24>: adrp x0, 0xde0000 <MultiXactMemberCtlData+56>
0x000000000055ff34 <+28>: add x0, x0, #0x550
0x000000000055ff38 <+32>: ldr x0, [x0]
0x000000000055ff3c <+36>: str x0, [sp, #72]
0x000000000055ff40 <+40>: ldr w0, [sp, #44]
0x000000000055ff44 <+44>: add w0, w0, #0x7
0x000000000055ff48 <+48>: and w0, w0, #0xfffffff8
0x000000000055ff4c <+52>: str w0, [sp, #44]
0x000000000055ff50 <+56>: ldr x0, [sp, #72]
0x000000000055ff54 <+60>: bl 0x55f9b0 <tas>
0x000000000055ff58 <+64>: cmp w0, #0x0
0x000000000055ff5c <+68>: b.eq 0x55ff80 <ReserveXLogInsertLocation+104> // b.none
...
This TAS is an atomic Test And Set function. There is a PostgreSQL wiki page about the atomics implementation (https://wiki.postgresql.org/wiki/Atomics) but it is not really up to date. It says that the ARM implementation of TAS uses the LDREX/STREX load and store instructions. Let's look at this in my binary:
t4g.micro# gdb --batch /usr/local/pgsql/bin/postgres -ex 'disas tas'
Dump of assembler code for function tas:
0x0000000000514784 <+0>: stp x29, x30, [sp, #-32]!
0x0000000000514788 <+4>: mov x29, sp
0x000000000051478c <+8>: str x0, [sp, #24]
0x0000000000514790 <+12>: ldr x1, [sp, #24]
0x0000000000514794 <+16>: mov w0, #0x1 // #1
0x0000000000514798 <+20>: bl 0xb4f080 <__aarch64_swp4_acq>
0x000000000051479c <+24>: ldp x29, x30, [sp], #32
0x00000000005147a0 <+28>: ret
End of assembler dump.
What is interesting here is the call to __aarch64_swp4_acq, which is one of the outlined atomics helpers.
t4g.micro# gdb --batch /usr/local/pgsql/bin/postgres -ex 'disas __aarch64_swp4_acq'
Dump of assembler code for function __aarch64_swp4_acq:
0x0000000000b4f080 <+0>: hint #0x22
0x0000000000b4f084 <+4>: adrp x16, 0xe11000 <hist_entries+129832>
0x0000000000b4f088 <+8>: ldrb w16, [x16, #1848]
0x0000000000b4f08c <+12>: cbz w16, 0xb4f098 <__aarch64_swp4_acq+24>
0x0000000000b4f090 <+16>: swpa w0, w0, [x1]
0x0000000000b4f094 <+20>: ret
0x0000000000b4f098 <+24>: mov w16, w0
0x0000000000b4f09c <+28>: ldaxr w0, [x1]
0x0000000000b4f0a0 <+32>: stxr w17, w16, [x1]
0x0000000000b4f0a4 <+36>: cbnz w17, 0xb4f09c <__aarch64_swp4_acq+28>
0x0000000000b4f0a8 <+40>: ret
End of assembler dump.
This is where we can see both LSE (SWPA) for ARM v8.1 or later and load/store exclusives (LDAXR, STXR) for previous versions. The latter is in a loop (CBNZ branches back to LDAXR) but the former is one atomic instruction. The decision is made by CBZ, which branches to one or the other.
I have installed GCC 11, where outlined atomics are the default on all Linux distributions (I did that on a larger instance).
CFLAGS="" ./configure && make clean && make && sudo make install
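To see which outlined helpers ended up in a binary without disassembling each of them, the symbol table is enough, since this binary is not stripped. Here is a small sketch, assuming the helpers follow the __aarch64_<operation><size>_<memory order> naming seen above. The function name list_outline_helpers is only illustrative.

```shell
# list_outline_helpers: read `nm` output on stdin and print the distinct
# outlined-atomics helper symbols (names such as __aarch64_swp4_acq).
list_outline_helpers() {
  awk '$NF ~ /^__aarch64_(cas|swp|ldadd|ldclr|ldeor|ldset)[0-9]+_/ { print $NF }' |
    sort -u
}

# Run it on the binary built above, if it is present on this host.
if [ -x /usr/local/pgsql/bin/postgres ] && command -v nm >/dev/null 2>&1; then
  nm /usr/local/pgsql/bin/postgres | list_outline_helpers
fi
```

An empty result suggests that no run-time detection was linked in, which leaves two possibilities: LSE is inlined, or only load/store exclusives are used.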
This shows the same as above: a __aarch64_swp4_acq function with run-time detection and two branches, LSE or loop on load/store.
However, if I explicitly compile for a version of ARM which has LSE atomics, there's no need for this run-time detection:
CFLAGS="-march=armv8.2-a" ./configure && make clean && make && sudo make install
c6gn.xlarge# gdb --batch /usr/local/pgsql/bin/postgres -ex 'disas tas'
Dump of assembler code for function tas:
0x00000000005146b4 <+0>: sub sp, sp, #0x10
0x00000000005146b8 <+4>: str x0, [sp, #8]
0x00000000005146bc <+8>: ldr x0, [sp, #8]
0x00000000005146c0 <+12>: mov w1, #0x1 // #1
0x00000000005146c4 <+16>: swpa w1, w1, [x0]
0x00000000005146c8 <+20>: mov w0, w1
0x00000000005146cc <+24>: add sp, sp, #0x10
0x00000000005146d0 <+28>: ret
End of assembler dump.
The TAS doesn't call a __aarch64_swp4_acq function and the SWPA is in the function (not outlined).
I have the same result with:
CFLAGS="-mcpu=neoverse-n1" ./configure && make clean && make && sudo make install
as GCC 11 knows that Neoverse N1 is ARM v8.2 and -mcpu defines -march and -mtune if not specified otherwise
Note, however, that:
CFLAGS="-mtune=neoverse-n1" ./configure && make clean && make && sudo make install
still generates __aarch64_swp4_acq to be compatible with pre v8.1 ARM (use -mcpu and not -mtune if you don't want them, or add -mno-outline-atomics to -mtune)
Note also that:
c6gn.xlarge# CFLAGS="-march=armv8-a+lse" ./configure && make clean && make && sudo make install
generates the LSE only (without runtime detection)
But:
c6gn.xlarge# CFLAGS="-march=armv8-a -mno-outline-atomics" ./configure && make clean && make && sudo make install
c6gn.xlarge# gdb --batch /usr/local/pgsql/bin/postgres -ex 'disas tas'
Dump of assembler code for function tas:
0x0000000000514730 <+0>: sub sp, sp, #0x10
0x0000000000514734 <+4>: str x0, [sp, #8]
0x0000000000514738 <+8>: ldr x0, [sp, #8]
0x000000000051473c <+12>: mov w1, #0x1 // #1
0x0000000000514740 <+16>: ldxr w2, [x0]
0x0000000000514744 <+20>: stxr w3, w1, [x0]
0x0000000000514748 <+24>: cbnz w3, 0x514740 <tas+16>
0x000000000051474c <+28>: dmb ish
0x0000000000514750 <+32>: mov w0, w2
0x0000000000514754 <+36>: add sp, sp, #0x10
0x0000000000514758 <+40>: ret
End of assembler dump.
generates only load store exclusives.
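The three variants of tas shown above (call to an outlined helper, inlined SWPA, LDXR/STXR loop) can be told apart with a small filter over the disassembly. This is only a heuristic sketch based on instruction mnemonics. The function name classify_atomics and the one-word verdicts are my own choices.

```shell
# classify_atomics: read a disassembly (gdb `disas` or `objdump -d`) on stdin
# and print one word describing how atomics are implemented in it:
#   outlined   - calls a __aarch64_* helper that picks LSE or exclusives at run time
#   lse        - LSE instructions inlined, no load/store exclusives
#   exclusives - load/store exclusives only
#   mixed      - both kinds present (for example a helper body, or a whole binary)
#   none       - no atomic instruction recognized
classify_atomics() {
  awk '
    /[ \t]bl[ \t].*<__aarch64_(cas|swp|ldadd|ldclr|ldeor|ldset)[0-9]+_/ { outlined++ }
    /[ \t](ldxr|ldaxr|stxr|stlxr)/                                      { excl++ }
    /[ \t](cas|swp|ldadd|ldclr|ldeor|ldset)/                            { lse++ }
    END {
      if (outlined)         print "outlined"
      else if (lse && excl) print "mixed"
      else if (lse)         print "lse"
      else if (excl)        print "exclusives"
      else                  print "none"
    }'
}
```

For example: gdb --batch /usr/local/pgsql/bin/postgres -ex 'disas tas' | classify_atomics. Fed with the disassembly of __aarch64_swp4_acq itself, it answers "mixed", because the helper contains both code paths.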
You can find many blog posts with benchmarks. But, from what we have seen, the optimizations of Graviton2 depend on many parameters: gcc version, compilation flags, host CPU usage, and of course how much spinlocks and other atomic instructions are a bottleneck for your workload. The most important thing to understand is where the optimization shows up. In order to avoid severe performance degradation when processes are in spinlock contention and scheduled out of the CPU, we need to provision the instance size, or the autoscaling threshold, to avoid those peaks of CPU starvation. With those ARM optimizations, a smaller size may be acceptable for short peaks. This contributes to lowering the cost when running on ARM.
So, if you are not sure whether outline atomics are the default, as with GCC versions before 10, it is better to add CFLAGS="-moutline-atomics" to benefit from LSE when available. A patch to add it by default was rejected by the PostgreSQL community: [PATCH] audo-detect and use -moutline-atomics compilation flag for aarch64.
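Older compilers without the backport do not know -moutline-atomics at all, so it can be worth probing for it before passing it to ./configure. Here is a minimal sketch: the function name cc_accepts_flag is hypothetical, and it only relies on the standard GCC options -fsyntax-only and -x c. Note that it tells whether the flag is accepted, not whether it is the default.

```shell
# cc_accepts_flag FLAG [CC]: succeed if the compiler CC (default: gcc)
# accepts FLAG when parsing a trivial C program. Nothing is written to disk.
cc_accepts_flag() {
  flag=$1
  cc=${2:-gcc}
  "$cc" "$flag" -fsyntax-only -x c - >/dev/null 2>&1 <<'EOF'
int main(void) { return 0; }
EOF
}

if cc_accepts_flag -moutline-atomics; then
  echo "this gcc accepts -moutline-atomics"
else
  echo "this gcc rejects -moutline-atomics (or gcc is not installed)"
fi
```

On a non-ARM host the flag is expected to be rejected, as it is specific to the AArch64 target.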
With recent GCC, the best is probably CFLAGS="-mcpu=neoverse-n1" for Graviton2. Or CFLAGS="-mtune=neoverse-n1 -moutline-atomics" if you want the binary to be compatible with non-LSE processors, because -mcpu also sets -march and therefore inlines LSE, as seen above. And above all, in case of doubt, don't guess: check the binaries.
All this is quite new; any feedback is welcome.