Investigate performance with Process Watch on AWS Graviton processors

Jason Andrews - Aug 19 - DEV Community

Process Watch displays the instruction mix of the processes running on your Linux system. It recently added support for the Arm architecture, and works well on AWS Graviton processors. It can be used to quickly identify application usage of floating point, NEON, SVE, and SVE2 vector instructions.

After analyzing a running system or application with Process Watch, you may be able to increase performance by recompiling applications, updating libraries, or finding other ways to make better use of vector instructions.

Read on to find out how to install and use Process Watch.

What software is required to build Process Watch?

The Process Watch source code is available on GitHub.

You will need various tools to build Process Watch from source:

  • CMake
  • Clang
  • LLVM
  • libelf

The instructions below are for an AWS Graviton-based EC2 instance running Ubuntu 24.04. You can modify them as needed for your Linux distribution.

To install the required tools run:

sudo apt-get update
sudo apt-get install libelf-dev cmake clang llvm llvm-dev python-is-python3 -y

Where is the Process Watch source code?

Use git to clone the Process Watch repository:

git clone --recursive https://github.com/intel/processwatch.git

Make sure to include the --recursive option to clone all submodules.

If you are curious, the submodules are:

Change to the repository directory:

cd processwatch

What is the best way to build Process Watch?

To build Process Watch, run the build script included in the repository:

./build.sh -b

You will see the following output:

Compiling dependencies...
  No system bpftool found! Compiling libbpf and bpftool...
  Compiling capstone...
Building the 'insn' BPF program:
  Gathering BTF information for this kernel...
  Compiling the BPF program...
  Stripping the object file...
  Generating the BPF skeleton header...
Linking the main Process Watch binary...

You now have the processwatch binary in the top-level directory of the repository.

For convenience, copy it to a common place included in your search path:

sudo cp ./processwatch /usr/local/bin

Do I need to run Process Watch as root?

You can run Process Watch as a non-root user, but this requires the modifications shown below, which decrease system security.

To enable non-root users to run Process Watch, run the three commands below. If you want to run Process Watch with sudo, you can skip these three commands and go to the next section.

sudo setcap CAP_PERFMON,CAP_BPF=+ep /usr/local/bin/processwatch
sudo sysctl -w kernel.perf_event_paranoid=-1
sudo sysctl kernel.unprivileged_bpf_disabled=0
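To check whether the two sysctl settings above are in effect, you can read them back from /proc/sys/kernel. The Python sketch below is one way to do that; the function name read_kernel_settings and the EXPECTED table are illustrative and are not part of Process Watch.

```python
from pathlib import Path

# Values set by the two sysctl commands above (illustrative table, not part of Process Watch).
EXPECTED = {
    "perf_event_paranoid": "-1",
    "unprivileged_bpf_disabled": "0",
}


def read_kernel_settings(base="/proc/sys/kernel"):
    """Return {setting name: current value}, or None where a value cannot be read."""
    values = {}
    for name in EXPECTED:
        try:
            values[name] = (Path(base) / name).read_text().strip()
        except OSError:
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_kernel_settings().items():
        status = "ok" if value == EXPECTED[name] else "differs"
        print(f"kernel.{name} = {value} ({status}, expected {EXPECTED[name]})")
```

Settings changed with sysctl -w do not persist across a reboot, so a check like this is most useful right before a profiling session.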

How do I run Process Watch?

Process Watch accepts a number of command-line arguments. You can view these by running:

sudo processwatch -h

The output is:

usage: processwatch [options]

options:
  -h          Displays this help message.
  -v          Displays the version.
  -i <int>    Prints results every <int> seconds.
  -n <num>    Prints results for <num> intervals.
  -c          Prints all results in CSV format to stdout.
  -p <pid>    Only profiles <pid>.
  -m          Displays instruction mnemonics, instead of categories.
  -s <samp>   Profiles instructions with a sampling period of <samp>.
  -f <filter> Can be used multiple times. Defines filters for columns. Defaults to 'FPARMv8', 'NEON', 'SVE' and 'SVE2'.
  -l          Prints all available categories, or mnemonics if -m is specified.
  -d          Prints only debug information.

Without any options Process Watch:

  • Prints results every two seconds
  • Prints results until it is killed (using Ctrl+C)
  • Prints all results in a table format on stdout
  • Profiles all running processes
  • Displays counts for the default filters, which are 'FPARMv8', 'NEON', 'SVE', and 'SVE2'
  • Sets the sample period to every 10000 events
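If you drive Process Watch from a script, it can help to make the options explicit instead of relying on the defaults listed above. The following is a minimal sketch that assembles a command line using only the flags shown in the help output; the helper name build_processwatch_args is hypothetical.

```python
import shlex


def build_processwatch_args(interval=None, count=None, pid=None,
                            sample_period=None, csv=False,
                            mnemonics=False, filters=()):
    """Build an argument list for processwatch from the options in its help text."""
    args = ["processwatch"]
    if interval is not None:
        args += ["-i", str(interval)]        # print results every <int> seconds
    if count is not None:
        args += ["-n", str(count)]           # number of intervals to print
    if pid is not None:
        args += ["-p", str(pid)]             # profile only this process
    if sample_period is not None:
        args += ["-s", str(sample_period)]   # sampling period
    if csv:
        args.append("-c")                    # CSV output on stdout
    if mnemonics:
        args.append("-m")                    # mnemonics instead of categories
    for name in filters:
        args += ["-f", name]                 # one -f per filter column
    return args


if __name__ == "__main__":
    # The documented defaults, written out explicitly.
    print(shlex.join(build_processwatch_args(interval=2, sample_period=10000)))
```

The resulting list can be passed to subprocess.run, prefixed with sudo if you did not enable non-root use.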

What does the Process Watch output look like?

You can run Process Watch with no arguments:

sudo processwatch

The output is similar to:

PID      NAME             FPARMv8  NEON     SVE      SVE2     %TOTAL   TOTAL
ALL      ALL              0.00     0.29     0.00     0.00     100.00   346
17400    processwatch     0.00     0.36     0.00     0.00     80.64    279
254      systemd-journal  0.00     0.00     0.00     0.00     13.01    45
542      irqbalance       0.00     0.00     0.00     0.00     2.60     09
544      rs:main Q:Reg    0.00     0.00     0.00     0.00     2.02     07
560      snapd            0.00     0.00     0.00     0.00     1.16     04
296      multipathd       0.00     0.00     0.00     0.00     0.58     02

PID      NAME             FPARMv8  NEON     SVE      SVE2     %TOTAL   TOTAL
ALL      ALL              3.57     12.86    0.00     0.00     100.00   140
17400    processwatch     3.73     13.43    0.00     0.00     95.71    134
4939     sshd             0.00     0.00     0.00     0.00     2.86     04
296      multipathd       0.00     0.00     0.00     0.00     0.71     01
560      snapd            0.00     0.00     0.00     0.00     0.71     01

PID      NAME             FPARMv8  NEON     SVE      SVE2     %TOTAL   TOTAL
ALL      ALL              1.18     5.12     0.00     0.00     100.00   254
17400    processwatch     1.19     5.16     0.00     0.00     99.21    252
6651     packagekitd      0.00     0.00     0.00     0.00     0.39     01
4939     sshd             0.00     0.00     0.00     0.00     0.39     01

New output is printed every two seconds, with each new set of samples appended below the previous one.

Use Ctrl+C to end processwatch.
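If you capture the table output to a file, a short script can turn it into data for further analysis. The sketch below assumes the layout shown in the sample output above (a header row starting with PID, then one row per process); the function name parse_intervals is illustrative, and the sample rows are copied from the first interval above.

```python
SAMPLE = """\
PID      NAME             FPARMv8  NEON     SVE      SVE2     %TOTAL   TOTAL
ALL      ALL              0.00     0.29     0.00     0.00     100.00   346
17400    processwatch     0.00     0.36     0.00     0.00     80.64    279
254      systemd-journal  0.00     0.00     0.00     0.00     13.01    45
544      rs:main Q:Reg    0.00     0.00     0.00     0.00     2.02     07
"""


def parse_intervals(text):
    """Split table output into intervals, each a list of row dictionaries."""
    intervals = []
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "PID":
            columns = fields[2:]          # every column after PID and NAME
            intervals.append([])
            continue
        if columns is None:
            continue                      # ignore anything before the first header
        # NAME can contain spaces, so take the numeric columns from the right.
        values = fields[-len(columns):]
        row = {"PID": fields[0], "NAME": " ".join(fields[1:-len(columns)])}
        for column, value in zip(columns, values):
            row[column] = int(value) if column == "TOTAL" else float(value)
        intervals[-1].append(row)
    return intervals


if __name__ == "__main__":
    rows = parse_intervals(SAMPLE)[0]
    all_total = rows[0]["TOTAL"]
    for row in rows[1:]:
        # %TOTAL is the share of all samples that belong to this process.
        share = 100 * row["TOTAL"] / all_total
        print(f'{row["NAME"]:<16} {row["%TOTAL"]:6.2f} (recomputed {share:6.2f})')
```

Recomputing the share from the TOTAL column is a quick way to confirm that %TOTAL is each process's fraction of the samples in the ALL row.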

How can I use Process Watch to identify applications which could run faster?

To see an example of how Process Watch can be used, use a text editor to save the Python code below into a file named zip.py.

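import gzip

size = 16384  # read the input in 16 KiB chunks

# Compress the file three times in a row
for _ in range(3):
    with open('largefile', 'rb') as f_in:
        with gzip.open('largefile.gz', 'wb') as f_out:
            while (data := f_in.read(size)):
                f_out.write(data)

# Both files are already closed by the with statements
print("Zip complete")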

The Python code reads a file named largefile and writes a compressed version as largefile.gz, repeating the compression three times.

To create the 1 GiB input file, filled with zeros, use the dd command:

dd if=/dev/zero of=largefile count=1M bs=1024

Next, use a text editor to save the script below in a file named run1.sh.

#!/bin/bash

python ./zip.py &
pid=$!
sudo processwatch -p $pid -s 1 -i 2 -f HasCRC -f HasNEON &
pid2=$!
sleep 30
kill $pid2

The script starts the Python code to compress largefile in the background and then attaches processwatch to that process ID to monitor CRC and NEON instructions. The -s 1 option sets the sampling period to 1 instead of the default of 10000 events, and -i 2 prints the output every 2 seconds.

The gzip format includes a CRC-32 checksum computed over all of the input data, so file compression runs best when the hardware CRC instructions are used. If no CRC instructions appear, there is an opportunity for performance improvement.
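To see why checksum work is part of every gzip run, you can inspect the end of a gzip stream. The sketch below relies on the gzip file format (RFC 1952), where the last 8 bytes of a member hold the CRC-32 of the uncompressed data and its length, both as little-endian 32-bit values. The payload is an arbitrary example.

```python
import gzip
import struct
import zlib

# Arbitrary example payload.
data = b"Process Watch on AWS Graviton " * 1000
compressed = gzip.compress(data)

# The gzip trailer is CRC-32 followed by ISIZE (input length modulo 2**32).
crc_in_trailer, size_in_trailer = struct.unpack("<II", compressed[-8:])

# The CRC stored in the file matches a CRC-32 computed over the whole input.
assert crc_in_trailer == zlib.crc32(data)
assert size_in_trailer == len(data) % 2**32

print(f"CRC-32 stored in the gzip trailer: {crc_in_trailer:#010x}")
```

Because the checksum covers every input byte, the CRC routine runs over the full 1 GiB of largefile on each pass of zip.py.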

Run the script using:

bash ./run1.sh

The Process Watch output will start to print on stdout.

The output is similar to:

PID      NAME             CRC      NEON     %TOTAL   TOTAL   
ALL      ALL              0.00     1.63     100.00   25466   
1226     python           0.00     1.63     100.00   25466  

PID      NAME             CRC      NEON     %TOTAL   TOTAL   
ALL      ALL              0.00     1.96     100.00   23224   
1226     python           0.00     1.96     100.00   23224   

Notice that no CRC instructions are shown. This is because the version of zlib supplied by the Linux distribution is not compiled with CRC instructions.

You can confirm this by running objdump to disassemble the library and look for crc32 instructions.

objdump -d /usr/lib/aarch64-linux-gnu/libz.so.1 | awk -F" " '{print $3}' | grep crc32 | wc -l

If the result is 0 then there are no crc32 instructions used in the library.
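The same count can be done in Python if you want to check several libraries from a script. This is a sketch that mirrors the awk and grep pipeline above by looking at the third whitespace-separated field of each disassembly line. The function name count_crc32 is illustrative, and the sample text uses made-up addresses and placeholder encodings rather than real objdump output.

```python
import os
import shutil
import subprocess

# Made-up disassembly lines in objdump's layout; the encodings are placeholders.
SAMPLE = (
    "0000000000002f00 <crc32_demo>:\n"
    "    2f00:\t00000000 \tcrc32x\tw0, w0, x1\n"
    "    2f04:\t00000000 \tcrc32b\tw0, w0, w2\n"
    "    2f08:\t00000000 \tret\n"
)


def count_crc32(disassembly):
    """Count lines whose third field (the mnemonic) contains 'crc32'."""
    count = 0
    for line in disassembly.splitlines():
        fields = line.split()
        if len(fields) >= 3 and "crc32" in fields[2]:
            count += 1
    return count


if __name__ == "__main__":
    print("sample:", count_crc32(SAMPLE))
    lib = "/usr/lib/aarch64-linux-gnu/libz.so.1"
    if shutil.which("objdump") and os.path.exists(lib):
        result = subprocess.run(["objdump", "-d", lib],
                                capture_output=True, text=True, check=True)
        print(lib, count_crc32(result.stdout))
```

Checking only the mnemonic field means a symbol name that happens to contain crc32, such as the label in the sample, is not counted.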

Is there a way to use CRC instructions for file compression?

If there are no crc32 instructions in zlib then you can use zlib-cloudflare to increase application performance.

To download, build, and install zlib-cloudflare, run:

git clone https://github.com/cloudflare/zlib.git
cd zlib && ./configure
make && sudo make install
cd ..

With the new library installed, use a text editor to save a new bash script with the code below to the file run2.sh:

#!/bin/bash

LD_PRELOAD=/usr/local/lib/libz.so python ./zip.py &
pid=$!
sudo processwatch -p $pid -s 1 -i 2 -f HasCRC -f HasNEON &
pid2=$!
sleep 10
kill $pid2

Notice that LD_PRELOAD loads the new libz.so ahead of the system zlib, and that the CRC and NEON percentages are again printed every 2 seconds.
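To confirm which zlib a Python process has actually loaded, you can list the libz mappings in /proc/self/maps. The following is a minimal sketch for Linux; the function name mapped_libz_paths is illustrative. If your Python build links zlib statically and no LD_PRELOAD is set, the list may be empty.

```python
import zlib  # imported so that any shared libz used by Python is loaded


def mapped_libz_paths(maps_file="/proc/self/maps"):
    """Return the sorted paths of libz shared objects mapped into this process."""
    paths = set()
    try:
        with open(maps_file) as maps:
            for line in maps:
                # Fields: address, perms, offset, dev, inode, optional pathname.
                fields = line.split(maxsplit=5)
                if len(fields) == 6 and "libz.so" in fields[5]:
                    paths.add(fields[5].strip())
    except OSError:
        pass  # no /proc on this system
    return sorted(paths)


if __name__ == "__main__":
    for path in mapped_libz_paths():
        print(path)
```

Run it once normally and once with LD_PRELOAD=/usr/local/lib/libz.so set. In the second case the /usr/local/lib copy should appear in the list, possibly alongside the system library.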

Run the new script using:

bash ./run2.sh

This time the CRC instructions are used, and the compression runs significantly faster.

The output is similar to:

PID      NAME             CRC      NEON     %TOTAL   TOTAL   
ALL      ALL              17.40    1.80     100.00   25246   
1251     python           17.40    1.80     100.00   25246   

PID      NAME             CRC      NEON     %TOTAL   TOTAL   
ALL      ALL              17.33    2.54     100.00   23556   
1251     python           17.33    2.54     100.00   23556   
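To put a number on the difference on your own instance, you can time a compression run with and without the preloaded library. The sketch below is a rough timing harness, not a rigorous benchmark; the function name time_gzip, the 8 MiB payload, and the file name bench.py used below are all arbitrary choices.

```python
import gzip
import time


def time_gzip(data, repeats=3):
    """Return the best wall-clock time in seconds to gzip-compress data."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        gzip.compress(data)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    payload = bytes(8 * 1024 * 1024)  # 8 MiB of zeros, similar to the dd-generated file
    print(f"best of 3: {time_gzip(payload):.3f} s")
```

Save it as bench.py, then compare python bench.py against LD_PRELOAD=/usr/local/lib/libz.so python bench.py. Results depend on the instance type and the data being compressed.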

What else can I do with Process Watch?

Besides CRC instructions, you can use Process Watch to look for NEON, SVE, and SVE2 instructions. These are all vector extensions in Arm processors used to increase application performance.

To see examples of NEON and SVE, refer to the Using Process Watch Arm Learning Path.
