Network performance optimization with Nvidia ConnectX on Proxmox

Serge Logvinov - Oct 27 - Dev Community

Introduction

To improve network performance for virtual machines, we can use SR-IOV and tune network settings. SR-IOV (Single Root I/O Virtualization) is a feature that lets a single PCI Express (PCIe) device directly connect to virtual machines. This improves communication speed and reduces delays by bypassing the hypervisor.

Modern network adapters support virtual functions (VFs). A single physical adapter can expose multiple virtual network adapters, and SR-IOV lets each of them be assigned directly to a virtual machine.

This guide shows how to enable SR-IOV and VFs to improve network performance in Proxmox with Nvidia ConnectX network adapters. We will use Mellanox ConnectX-6 Lx 25GbE NICs and Linux bonding (port aggregation) to get the best reliability and speed.

Using bonding interfaces with SR-IOV can be tricky. However, Mellanox supports handling bonding interfaces at the hardware level. This means all virtual adapters connect to network hardware switches linked to the bonded interfaces.
To virtual machines, it seems like they are using just one network adapter.

Requirements

I assume you have a Proxmox server already installed and running.

Check the network adapter. In this example, we use Mellanox Technologies MT2894 (ConnectX-6 Lx) 25GbE NICs.
One network adapter with two ports.

lspci -nn | grep Ethernet
81:00.0 Ethernet controller [0200]: Mellanox Technologies MT2894 Family [ConnectX-6 Lx] [15b3:101f]
81:00.1 Ethernet controller [0200]: Mellanox Technologies MT2894 Family [ConnectX-6 Lx] [15b3:101f]

Find the network interface name.

dmesg | grep 81:00.0 | grep renamed
mlx5_core 0000:81:00.0 enp129s0f0np0: renamed from eth2

We need to find the switchid of the network adapter enp129s0f0np0. In this example it is 3264160003bd70c4.
It is used later to identify the network adapter in the Open vSwitch configuration.

ip -d link show enp129s0f0np0
5: enp129s0f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether c4:70:bd:16:64:32 brd ff:ff:ff:ff:ff:ff promiscuity 0  allmulti 0 minmtu 68 maxmtu 9978 addrgenmode eui64 numtxqueues 768 numrxqueues 63 gso_max_size 65536 gso_max_segs 65535 tso_max_size 524280 tso_max_segs 65535 gro_max_size 65536 portname p0 switchid 3264160003bd70c4 parentbus pci parentdev 0000:81:00.0
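The switchid can also be extracted with a small helper instead of reading it off the long output line. This is a minimal sketch: the function name parse_switchid is made up for illustration, and it assumes the `switchid <hex>` field format shown in the output above.

```shell
#!/bin/sh
# parse_switchid is a hypothetical helper name, not part of any tool.
# It prints the switchid field from `ip -d link show <iface>` output on stdin.
parse_switchid() {
    sed -n 's/.* switchid \([0-9a-f]*\).*/\1/p'
}

# Usage on the Proxmox host (interface name taken from this guide):
#   ip -d link show enp129s0f0np0 | parse_switchid
# The same value is exposed in sysfs:
#   cat /sys/class/net/enp129s0f0np0/phys_switch_id
```

The sysfs attribute phys_switch_id is the same one the udev rules later in this guide match on.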

Enable SR-IOV

Enable SR-IOV in the BIOS.

Enter the BIOS and enable SR-IOV. The option may be in different locations depending on the motherboard manufacturer. Some of the options below apply only to AMD processors.

  • CPU Virtualization (SVM Mode) -> Enabled
  • Chipset -> NBIO Common Options -> IOMMU -> Enabled
  • Chipset -> NBIO Common Options -> ACS Enable -> Enabled
  • Chipset -> NBIO Common Options -> PCIe ARI Support -> Enabled

Check that the IOMMU, which SR-IOV passthrough depends on, is enabled.

dmesg | grep -i -e DMAR -e IOMMU
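The dmesg output can be hard to interpret, so a quick yes/no check may help. The sketch below counts the entries under /sys/kernel/iommu_groups. The function name count_iommu_groups and the optional directory argument are assumptions added for illustration and testing.

```shell
#!/bin/sh
# count_iommu_groups is a hypothetical helper name.
# It counts the IOMMU groups below a sysfs directory (default: the real one).
count_iommu_groups() {
    dir="${1:-/sys/kernel/iommu_groups}"
    [ -d "$dir" ] || { echo 0; return; }
    find "$dir" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

n=$(count_iommu_groups)
if [ "$n" -gt 0 ]; then
    echo "IOMMU is active: $n groups"
else
    echo "No IOMMU groups found, check the BIOS settings"
fi
```

A count of zero usually means the IOMMU is disabled in the BIOS or not enabled in the kernel.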

Check the IOMMU groups.

for g in $(find /sys/kernel/iommu_groups/* -maxdepth 0 -type d | sort -V); do
    echo "IOMMU Group ${g##*/}:"
    for d in $g/devices/*; do
        echo -e "\t$(lspci -nns ${d##*/})"
    done;
done;

Or use the Proxmox command.

pvesh get /nodes/`hostname -s`/hardware/pci --pci-class-blacklist ""

If the IOMMU is enabled, the result lists many IOMMU groups.

Configure Open vSwitch

Open vSwitch can offload bonding interfaces to the hardware, allowing the network adapter to manage aggregated traffic. It can also create a virtual switch on the network adapter, connecting virtual functions to the network.

Install Open vSwitch.

apt install openvswitch-switch ifupdown2 patch

Switch the network adapter to switchdev mode and set the number of virtual functions to 4.

vi /etc/udev/rules.d/70-persistent-net-vf.rules

# Ingress bond interface
KERNELS=="0000:81:00.0", DRIVERS=="mlx5_core", SUBSYSTEMS=="pci", ACTION=="add", ATTR{sriov_totalvfs}=="?*", RUN+="/usr/sbin/devlink dev eswitch set pci/0000:81:00.0 mode switchdev", ATTR{sriov_numvfs}="0"
KERNELS=="0000:81:00.1", DRIVERS=="mlx5_core", SUBSYSTEMS=="pci", ACTION=="add", ATTR{sriov_totalvfs}=="?*", RUN+="/usr/sbin/devlink dev eswitch set pci/0000:81:00.1 mode switchdev", ATTR{sriov_numvfs}="0"

# Set the number of virtual functions to 4
# phys_switch_id must match the switchid reported by `ip -d link show` for your adapter
SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="52ac120003bd70c4", ATTR{phys_port_name}=="p0", ATTR{device/sriov_totalvfs}=="?*", ATTR{device/sriov_numvfs}=="0", ATTR{device/sriov_numvfs}="4"
# Rename the virtual network adapter to ovs-sw1pf0vf0, ovs-sw1pf0vf1, ovs-sw1pf0vf2, ovs-sw1pf0vf3
SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="52ac120003bd70c4", ATTR{phys_port_name}!="p[0-9]*", ATTR{phys_port_name}!="", NAME="ovs-sw1$attr{phys_port_name}"

We need to patch Open vSwitch to enable hardware offload.
The patch below adds two lines to the ovs-ctl script.
After applying it, disable automatic updates of the openvswitch-switch package. Otherwise an update will revert the changes.

patch  -d/ -p0 --ignore-whitespace <<'EOF'
--- /usr/share/openvswitch/scripts/ovs-ctl.diff 2024-10-16 01:25:28.369482552 +0000
+++ /usr/share/openvswitch/scripts/ovs-ctl  2024-10-16 01:27:32.740490528 +0000
@@ -162,6 +162,8 @@
         # Initialize database settings.
         ovs_vsctl -- init -- set Open_vSwitch . db-version="$schemaver" \
             || return 1
+        ovs_vsctl -- set Open_vSwitch . other_config:hw-offload=true other_config:tc-policy=skip_sw ||:
+        ovs_vsctl -- set Open_vSwitch . other_config:lacp-fallback-ab=true ||:
         set_system_ids || return 1
         if test X"$DELETE_BRIDGES" = Xyes; then
             for bridge in `ovs_vsctl list-br`; do
EOF
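Because a package update would silently overwrite the script, it can be useful to check whether the patch is still in place. The following is a sketch only: the function name ovs_ctl_is_patched and the OVS_CTL override variable are made up for illustration. The apt-mark commands in the comments are one way to keep the package from being upgraded.

```shell
#!/bin/sh
# Path of the script patched above; OVS_CTL is an assumed override for testing.
OVS_CTL="${OVS_CTL:-/usr/share/openvswitch/scripts/ovs-ctl}"

# ovs_ctl_is_patched is a hypothetical helper name.
# It returns 0 if the hw-offload line added by the patch is present in file $1.
ovs_ctl_is_patched() {
    grep -q 'other_config:hw-offload=true' "$1"
}

if [ -f "$OVS_CTL" ]; then
    if ovs_ctl_is_patched "$OVS_CTL"; then
        echo "ovs-ctl is patched"
    else
        echo "ovs-ctl is NOT patched, re-apply the patch"
    fi
fi

# To keep apt from upgrading the package and overwriting the script:
#   apt-mark hold openvswitch-switch
#   apt-mark showhold
```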

After the changes, restart the server.
When the server is up, check the openvswitch configuration and offload status.

ovs-vsctl show
ovs-vsctl get Open_vSwitch . other_config

Make sure hw-offload="true" is set.

{hw-offload="true", lacp-fallback-ab="true", tc-policy=skip_sw}
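For scripts or monitoring, the same check can be automated. This is a minimal sketch that assumes the map format shown above. The function name hw_offload_enabled is hypothetical.

```shell
#!/bin/sh
# hw_offload_enabled is a hypothetical helper name.
# It succeeds if the other_config map read on stdin contains hw-offload="true".
hw_offload_enabled() {
    grep -q 'hw-offload="true"'
}

# Usage on the Proxmox host:
#   ovs-vsctl get Open_vSwitch . other_config | hw_offload_enabled \
#       && echo "hardware offload enabled" \
#       || echo "hardware offload disabled"
# A single key can also be queried directly:
#   ovs-vsctl get Open_vSwitch . other_config:hw-offload
```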

Also check that the network adapter is in switchdev mode and exposes the virtual functions.

ip -d link show enp129s0f0np0

The output should look like this.

4: enp129s0f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
    link/ether c4:70:bd:16:64:32 brd ff:ff:ff:ff:ff:ff promiscuity 1  allmulti 0 minmtu 68 maxmtu 9978
    openvswitch_slave addrgenmode none numtxqueues 768 numrxqueues 63 gso_max_size 65536 gso_max_segs 65535 tso_max_size 524280 tso_max_segs 65535 gro_max_size 65536 portname p0 switchid 52ac120003bd70c4 parentbus pci parentdev 0000:81:00.0
    vf 0     link/ether c4:70:ff:ff:ff:e0 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 1     link/ether c4:70:ff:ff:ff:e1 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 2     link/ether c4:70:ff:ff:ff:e2 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 3     link/ether c4:70:ff:ff:ff:e3 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
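To confirm that all four virtual functions were created without reading the output by eye, the `vf N` lines can be counted. A small sketch, assuming the output format shown above. The function name count_vfs is made up for illustration.

```shell
#!/bin/sh
# count_vfs is a hypothetical helper name.
# It counts the "vf N" lines in `ip -d link show <iface>` output read on stdin.
count_vfs() {
    grep -c '^[[:space:]]*vf [0-9]'
}

# Usage on the Proxmox host:
#   ip -d link show enp129s0f0np0 | count_vfs
# The configured number is also available in sysfs:
#   cat /sys/class/net/enp129s0f0np0/device/sriov_numvfs
```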

vf 0, vf 1, vf 2, vf 3 are the virtual functions.

Configure the bond interface

Add the network bond interface, where enp129s0f0np0 and enp129s0f1np1 are the physical network adapters.

  • vi /etc/network/interfaces
auto enp129s0f0np0
iface enp129s0f0np0 inet manual

auto enp129s0f1np1
iface enp129s0f1np1 inet manual

auto vmbr1
iface vmbr1 inet static
        ovs_type OVSBridge
        ovs_ports bond1
        ovs_mtu 9000
        address 192.168.1.2/24

auto bond1
iface bond1 inet manual
        ovs_type OVSBond
        ovs_bonds enp129s0f0np0 enp129s0f1np1
        ovs_bridge vmbr1
        ovs_mtu 9000
        ovs_options lacp=active bond_mode=balance-tcp

Reboot the server.

Add the virtual functions to the Open vSwitch

  • vi /etc/network/interfaces.d/ovs-sw1.conf
# Add VFs to the offloaded switch

auto ovs-sw1pf0vf0
iface ovs-sw1pf0vf0 inet manual
        ovs_type OVSPort
        ovs_bridge vmbr1
        ovs_mtu 9000

auto ovs-sw1pf0vf1
iface ovs-sw1pf0vf1 inet manual
        ovs_type OVSPort
        ovs_bridge vmbr1
        ovs_mtu 9000

auto ovs-sw1pf0vf2
iface ovs-sw1pf0vf2 inet manual
        ovs_type OVSPort
        ovs_bridge vmbr1
        ovs_mtu 9000

auto ovs-sw1pf0vf3
iface ovs-sw1pf0vf3 inet manual
        ovs_type OVSPort
        ovs_bridge vmbr1
        ovs_mtu 9000
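The four stanzas differ only in the VF index, so with more virtual functions it may be easier to generate the file. The sketch below prints stanzas in the same layout as above. The function name gen_vf_ports and its argument order are assumptions for illustration, and the ovs-sw1pf0vf prefix comes from the udev rename rule earlier in this guide.

```shell
#!/bin/sh
# gen_vf_ports is a hypothetical helper name.
# It prints one OVSPort stanza per VF port.
# $1 = bridge name, $2 = MTU, $3 = number of VFs
gen_vf_ports() {
    bridge="$1"; mtu="$2"; count="$3"
    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'auto ovs-sw1pf0vf%s\n' "$i"
        printf 'iface ovs-sw1pf0vf%s inet manual\n' "$i"
        printf '        ovs_type OVSPort\n'
        printf '        ovs_bridge %s\n' "$bridge"
        printf '        ovs_mtu %s\n\n' "$mtu"
        i=$((i + 1))
    done
}

# Print the stanzas for the four VFs used in this guide. Redirect the output
# to /etc/network/interfaces.d/ovs-sw1.conf once it looks right.
gen_vf_ports vmbr1 9000 4
```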

After the changes, restart the server to make sure they are applied.

Check the virtual functions:

ip -d link show | grep ovs-sw

The output should look like this.

14: ovs-sw1pf0vf0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
15: ovs-sw1pf0vf1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
16: ovs-sw1pf0vf2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
17: ovs-sw1pf0vf3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000

We now have the bonding interface bond1 and the virtual function ports ovs-sw1pf0vf0, ovs-sw1pf0vf1, ovs-sw1pf0vf2, and ovs-sw1pf0vf3 connected to the virtual switch vmbr1, which is offloaded to the network hardware. The virtual functions themselves can be attached to virtual machines as PCI devices 0000:81:00.2 - 0000:81:00.5.
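The PCI addresses of the virtual functions can be read from sysfs, where each VF appears as a virtfnN symlink below its physical function. This is a sketch under that assumption. The function name list_vf_pci is made up for illustration, and the PF address is the one from the lspci output earlier.

```shell
#!/bin/sh
# list_vf_pci is a hypothetical helper name.
# It prints the PCI address of every VF of a physical function.
# $1 = sysfs directory of the PF, e.g. /sys/bus/pci/devices/0000:81:00.0
list_vf_pci() {
    for link in "$1"/virtfn*; do
        # Skip the unexpanded glob when the PF has no VFs (or does not exist).
        [ -L "$link" ] || continue
        basename "$(readlink "$link")"
    done
}

# PF address taken from the lspci output earlier in this guide.
list_vf_pci /sys/bus/pci/devices/0000:81:00.0
```

On a host without this adapter the call prints nothing.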

Configure Proxmox

Let's create a resource mapping for the virtual functions.
Go to the Proxmox web interface, Datacenter -> Resource Mappings -> PCI Devices -> Add.

  • Name: network
  • Check all virtual functions starting from 0000:81:00.2 - 0000:81:00.5

The official documentation is available at https://pve.proxmox.com/pve-docs/pve-admin-guide.html#resource_mapping

Configure the virtual machine

I assume you already have a virtual machine created.
Add the PCI devices to the virtual machine.

Go to the Proxmox web interface, Nodes -> your node -> your virtual machine -> Hardware -> Add -> PCI Device.

  • Mapped Device: network
  • all options are default
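The same step can probably be done from the command line with qm. The sketch below only prints the command so it can be reviewed first. The function name qm_hostpci_cmd is made up, VM ID 100 is a placeholder, and the `mapping=` form of the hostpci option is assumed to be available in your Proxmox VE version.

```shell
#!/bin/sh
# qm_hostpci_cmd is a hypothetical helper name.
# It prints the qm command that attaches a mapped PCI device to a VM.
# $1 = VM ID, $2 = hostpci slot index, $3 = resource mapping name
qm_hostpci_cmd() {
    printf 'qm set %s --hostpci%s mapping=%s\n' "$1" "$2" "$3"
}

# VM ID 100 is a placeholder. "network" is the resource mapping name
# used in this guide.
qm_hostpci_cmd 100 0 network
```

Run the printed command on the Proxmox node once the VM ID and mapping name are correct.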

After the changes, start the virtual machine.

Check the Virtual Machine

We've launched the virtual machine with 16 vCPUs. Inside it, eth1 is the virtual function ovs-sw1pf0vf0. The guest runs Linux kernel 6.1.82.

# lscpu
...
Virtualization features:
  Virtualization:         AMD-V
  Hypervisor vendor:      KVM
  Virtualization type:    full
Caches (sum of all):
  L1d:                    1 MiB (16 instances)
  L1i:                    1 MiB (16 instances)
  L2:                     8 MiB (16 instances)
  L3:                     256 MiB (16 instances)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-15

Network interface:

# ip -d link show eth1
9: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether c4:70:ff:ff:ff:e0 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 9978 addrgenmode eui64 numtxqueues 96 numrxqueues 11 gso_max_size 65536 gso_max_segs 65535

Driver version:

# ethtool -i eth1
driver: mlx5_core
version: 6.1.82-talos
firmware-version: 26.36.1010 (MT_0000000547)
expansion-rom-version:
bus-info: 0000:00:10.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

Network settings:

# ethtool eth1
Settings for eth1:
    Supported ports: [ Backplane ]
    Supported link modes:   1000baseT/Full
                            10000baseT/Full
                            1000baseKX/Full
                            10000baseKR/Full
                            10000baseR_FEC
                            25000baseCR/Full
                            25000baseKR/Full
                            25000baseSR/Full
                            1000baseX/Full
                            10000baseCR/Full
                            10000baseSR/Full
                            10000baseLR/Full
                            10000baseER/Full
    Supported pause frame use: Symmetric
    Supports auto-negotiation: Yes
    Supported FEC modes: Not reported
    Advertised link modes:  1000baseT/Full
                            10000baseT/Full
                            1000baseKX/Full
                            10000baseKR/Full
                            10000baseR_FEC
                            25000baseCR/Full
                            25000baseKR/Full
                            25000baseSR/Full
                            1000baseX/Full
                            10000baseCR/Full
                            10000baseSR/Full
                            10000baseLR/Full
                            10000baseER/Full
    Advertised pause frame use: No
    Advertised auto-negotiation: Yes
    Advertised FEC modes: Not reported
    Speed: 25000Mb/s
    Duplex: Full
    Auto-negotiation: on
    Port: Direct Attach Copper
    PHYAD: 0
    Transceiver: internal
    Supports Wake-on: d
    Wake-on: d
        Current message level: 0x00000004 (4)
                               link
    Link detected: yes
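The key value in this output is the negotiated speed, which should be 25000Mb/s for the VF. A small sketch for extracting it, assuming the `Speed: <n>Mb/s` line format shown above. The function name link_speed_mbps is made up for illustration.

```shell
#!/bin/sh
# link_speed_mbps is a hypothetical helper name.
# It prints the link speed in Mb/s from `ethtool <iface>` output read on stdin.
link_speed_mbps() {
    sed -n 's/^[[:space:]]*Speed: \([0-9]*\)Mb\/s.*/\1/p'
}

# Usage inside the virtual machine:
#   ethtool eth1 | link_speed_mbps
# The same value is exposed in sysfs:
#   cat /sys/class/net/eth1/speed
```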

Network features:

# ethtool -k eth1 | grep " on"
rx-checksumming: on
tx-checksumming: on
    tx-checksum-ip-generic: on
scatter-gather: on
    tx-scatter-gather: on
tcp-segmentation-offload: on
    tx-tcp-segmentation: on
    tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-gso-partial: on
tx-udp-segmentation: on
tx-vlan-stag-hw-insert: on
rx-vlan-stag-filter: on [fixed]

Troubleshooting

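The following commands help to inspect the Open vSwitch configuration, the bond, and the offloaded flows.

# Show bridges, ports, and interfaces
ovs-vsctl show
# Show the offload-related settings
ovs-vsctl get Open_vSwitch . other_config
# Show the bond state and its member interfaces
ovs-appctl bond/show bond1
# Show the LACP negotiation state
ovs-appctl lacp/show bond1

# Dump the datapath flows with verbose output
ovs-dpctl dump-flows -m
# Dump only the flows offloaded to the hardware
ovs-appctl dpctl/dump-flows --names type=offloaded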
