Performance and performance analysis have always carried a sense of magic. And that is wrong. If you read a performance analysis report and you don't understand where the figures come from, then the report is wrong. If you do understand where they come from, but don't know why they are there, the report is still wrong.
And maybe that is the magic of performance reports: it is actually hard to produce something that makes sense rather than an endless list of numbers. On the other hand, what makes sense depends heavily on the goal of the report: what is sensible to report for one goal might be wrong for another.
Recently I was asked to start looking into Amazon's ARM-based EC2 Graviton2 virtual machine offering.
Looking at a different hardware architecture is complicated. Since the early days, different vendors have implemented computer architectures (slightly) differently, have enabled support for their innovations in their operating systems at different times, and have provided libraries to make use of specific features and innovations. And then a software product still needs to pick up these innovations.
That is a lot of talk that might sound abstract. What it means is that if you take a fairly complicated piece of software whose components and libraries are likely to be optimised for a certain architecture, and then bluntly move it over to a new architecture, you will probably find that it does not perform well on the new architecture.
The danger is that the new architecture is then discarded as inferior or slow and not considered any further, while the comparison wasn't fair. Think about Intel SSE: these SIMD instructions apply the same operation to multiple values at the same time, which can lead to spectacular performance improvements. It is also understandable that if that work falls back to being done sequentially, it will take much more time. SSE is specific to x86, so SSE-optimised code paths are likely to have no ARM counterpart when something is just moved over without carefully going through the specific implementations and optimisations.
This means that when comparing architectures, the first thing to do is to compare identical, elementary computing properties on both the existing and the new architecture. Elementary computing properties in this sense are things like CPU speed, memory latencies, basic computing operations on numbers, context switching, etc. Moving over and testing much more complicated tasks has a higher probability of running into issues with optimisations.
The closest thing that I have found so far to do that is a benchmark framework called 'lmbench'. It is quite old, and it requires some work to get it running on EL8. If anyone knows of something more recent that can perform the same task, I hope to see a response.
How to install lmbench on Alma8
In order to install lmbench, perform the following tasks:
sudo yum -y install git
sudo yum -y groupinstall 'Development Tools'
sudo yum -y install libtirpc-devel
# this is a bit hacky, but otherwise it won't work
sudo ln -s /usr/include/tirpc/rpc/ /usr/include/rpc
sudo sed -i 's#<netconfig.h>#<tirpc/netconfig.h>#' /usr/include/rpc/types.h
sudo sed -i 's#<netconfig.h>#<tirpc/netconfig.h>#' /usr/include/rpc/clnt.h
sudo sed -i 's#<rpc/netdb.h>#<netdb.h>#' /usr/include/netdb.h
git clone https://github.com/intel/lmbench.git
cd lmbench/src
make results LDFLAGS=-ltirpc
ps: on CentOS 7, install the development tools with: sudo yum groups install 'Development Tools'
ps: on Alma8/AARCH64, edit scripts/results and change 'lmbench $CONFIG 2>../${RESULTS}' to 'lmbench $CONFIG 2>${RESULTS}'
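The AARCH64 change to scripts/results can also be scripted instead of edited by hand. This is a minimal sketch: the function name patch_results_script is my own invention, and it assumes the redirect appears in the script exactly as written above.

```shell
# Hypothetical helper (not part of lmbench): remove the "../" prefix from
# the stderr redirect in the given script, keeping a .bak copy of the original.
patch_results_script() {
    script="$1"
    # Rewrites  2>../${RESULTS}  to  2>${RESULTS}  and leaves all other lines alone.
    sed -i.bak 's#2>\.\./\${RESULTS}#2>${RESULTS}#' "$script"
}

# Usage, from the lmbench/src directory used above:
#   patch_results_script ../scripts/results
```

If the line in your checkout looks different, the sed expression simply matches nothing and the script is left unchanged.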
This will produce a number of compiler warnings (-Wimplicit-function-declaration), and then present the lmbench configuration questions.
The results are in lmbench/results. On x86_64, you can view them using make see. This doesn't work on AARCH64; there you have to run the script lmbench/scripts/getsummary directly and provide the filename of the results: ./lmbench/scripts/getsummary lmbench/results/RESULTSFILE.
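To avoid looking up the results filename by hand, something like the following could be used. It is only a sketch: both function names are hypothetical, and it assumes the results file you want is the most recently modified file below the results directory, which may not hold if other files in that directory were touched afterwards.

```shell
# Hypothetical helper: print the path of the newest regular file below a
# directory (default: results).
latest_results_file() {
    results_dir="${1:-results}"
    find "$results_dir" -type f -exec ls -t {} + 2>/dev/null | head -n 1
}

# Hypothetical helper: feed that file to getsummary.
# Assumes it is run from the lmbench checkout root.
summarise_latest() {
    file="$(latest_results_file "${1:-results}")"
    if [ -z "$file" ]; then
        echo "no results file found" >&2
        return 1
    fi
    ./scripts/getsummary "$file"
}
```

Run summarise_latest from the lmbench directory after make results has finished.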
Comment on the results
Currently I do not have a result that is worth sharing, because comparing a similar setup on different architectures requires as much of the setup as possible to be identical. The setups we currently use on Intel and AARCH64 have different kernels, which means the results are still an indication but should not be taken as truth, because there is room for things to differ outside of our view.
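One way to make such differences visible is to record a few basic properties of each machine next to its lmbench results. The sketch below uses only standard tools (uname and getconf); the function name describe_platform and the choice of properties are my own, and the list is certainly not complete.

```shell
# Hypothetical helper: print properties that should match, or at least be
# written down, before comparing results from two machines.
describe_platform() {
    echo "arch=$(uname -m)"                      # e.g. x86_64 or aarch64
    echo "kernel=$(uname -r)"                    # running kernel release
    echo "page_size_bytes=$(getconf PAGESIZE)"   # memory page size
    echo "cpus=$(getconf _NPROCESSORS_ONLN)"     # online CPUs
}

describe_platform
```

Saving this output together with each results file makes it easier to spot later that two runs were not taken on comparable setups.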
Still, the tests show Graviton2/ARM Neoverse N1 to be in the same ballpark as the C5 Intel (8275CL; Cascade Lake) for performance, mostly having lower overhead/latencies.
For our database, and this is probably true for PostgreSQL too, we found Intel SSE, atomic locking and the page size chosen for EL8 (64k on AARCH64, 4k on x86_64) to influence performance significantly, making x86_64 perform better at this point in time. My colleague Franck Pachot and others have found that PostgreSQL by default does not use these specific optimisations.
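Whether the hardware offers these features at all can be checked from /proc/cpuinfo. A minimal sketch, with a hypothetical function name: on x86_64 the kernel lists features on "flags" lines (for example sse4_2), on AARCH64 on "Features" lines, where asimd is the SIMD unit and atomics indicates the ARMv8.1 LSE atomic instructions. That LSE is what matters for the atomic locking mentioned above is my assumption. Note that this only shows what the CPU supports, not whether a given binary was built to use it.

```shell
# Hypothetical helper: succeed if the given CPU feature flag is listed.
# The optional second argument allows pointing at a saved copy of cpuinfo.
has_cpu_flag() {
    flag="$1"
    cpuinfo="${2:-/proc/cpuinfo}"
    # x86_64 uses "flags", AArch64 uses "Features"; match the flag as a whole word.
    grep -E '^(flags|Features)[[:space:]]*:' "$cpuinfo" | grep -qw -- "$flag"
}

# Usage:
#   has_cpu_flag sse4_2  && echo "SSE4.2 available"
#   has_cpu_flag atomics && echo "LSE atomics available"
```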
Our Yugabyte performance team found the page size increase in the EL8 AARCH64 builds. I am quite surprised by this choice: bigger pages mean fewer page table entries (PTEs) and higher TLB hit ratios, and thus better performance, yet in my mind ARM/AARCH64 is meant for smaller deployments, which seems contradictory. Another issue is that buffered reads are performed in page-size IOs. For both PostgreSQL and Yugabyte this means that any read IO not satisfied from cache will generate a 64kB IO, even if a smaller IO was requested. This can eat up IO bandwidth quite easily, while IO bandwidth in the cloud can be, and often is, limited.
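To put a number on that, take PostgreSQL's default 8kB block size as the request size. The sketch below is plain arithmetic under a simplifying assumption: every cache miss is rounded up to whole pages, the request is page aligned, and readahead is ignored. The function name is hypothetical.

```shell
# Hypothetical helper: bytes read from storage for one buffered read of
# $1 bytes when the page size is $2 bytes (request rounded up to whole pages).
buffered_io_bytes() {
    request=$1
    page=$2
    echo $(( (request + page - 1) / page * page ))
}

buffered_io_bytes 8192 4096     # prints 8192:  4k pages, no extra data read
buffered_io_bytes 8192 65536    # prints 65536: 64k pages, 8x the requested data
```

Under these assumptions a random 8kB read on a 64k page kernel moves eight times as much data as the same read on a 4k page kernel.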
This means we need to make changes so that the execution uses the same optimisations and behaves identically on both, and then test again to see how x86_64 and AARCH64 truly differ in performance.
This is also the reason for writing this blogpost: the Neoverse N1 CPU, the Graviton2 platform and the advancements being made in that area are exciting, and the low level performance measurements suggest that once the specific optimisations are implemented for Graviton2, it should be close in performance. However, to really see how the performance of the two relates, they must be tested once the same optimisations are implemented.