User Guide

InfiniBand Configuration

Last updated: 2024-08-22 10:34:08

Scenarios

If the GPU cloud server you purchased includes an InfiniBand network card, follow this article to configure and deploy it.

Directions

Check Network

First, check whether the server has an InfiniBand controller.

$ lspci | grep -i infiniband
00:07.0 Infiniband controller: Mellanox Technologies MT2910 Family [ConnectX-7]
If the command returns a line like the one above, the InfiniBand (IB) network card has been allocated. If nothing is returned, the network card was not identified correctly or the allocation failed.
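
The check above can also be scripted. The following is a minimal sketch, not part of the official procedure: the function names has_ib_controller and report_ib_controller are hypothetical, and the functions read lspci output from standard input so they can be tried without IB hardware.

```shell
#!/bin/sh
# Reads `lspci` output on stdin; succeeds if an InfiniBand controller is listed.
has_ib_controller() {
    grep -qi 'infiniband controller'
}

# Prints a one-line verdict for the `lspci` output given on stdin.
# Returns 1 when no controller is found.
report_ib_controller() {
    if has_ib_controller; then
        echo "InfiniBand controller found"
    else
        echo "No InfiniBand controller found"
        return 1
    fi
}

# Example:
#   lspci | report_ib_controller
```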

Install Driver

  1. Download the Mellanox GPG key and add it to the APT keyrings.
$ wget -qO - http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox | sudo gpg --dearmor -o /usr/share/keyrings/GPG-KEY-Mellanox.gpg
  2. Set up the repository location under /etc/apt/sources.list.d/ by downloading the repository files and extracting them to /.
$ curl https://repo.download.nvidia.com/baseos/ubuntu/jammy/dgx-repo-files.tgz | sudo tar xzf - -C /
  3. Update the package index.
$ sudo apt update
  4. Install the nvidia-manage-ofed software package.
$ sudo apt install -y nvidia-manage-ofed
  5. Remove the installed OFED components.
$ sudo /usr/sbin/nvidia-manage-ofed.py -r ofed
  6. Add the Mellanox OFED components.
$ sudo /usr/sbin/nvidia-manage-ofed.py -i mofed
  7. After running the above commands, restart the system for the changes to take effect.
$ sudo reboot
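
After the restart, it may be useful to confirm that the driver is loaded before configuring the network. The sketch below is an assumption rather than a documented step: mlx5_core_loaded is a hypothetical helper that reads lsmod output from standard input, and ofed_info is assumed to be available from the installed OFED packages.

```shell
#!/bin/sh
# Reads `lsmod` output on stdin; succeeds if the mlx5_core kernel module is listed.
mlx5_core_loaded() {
    awk '$1 == "mlx5_core" { found = 1 } END { exit !found }'
}

# Example (after the reboot):
#   lsmod | mlx5_core_loaded && echo "mlx5_core is loaded"
#   ofed_info -s    # assumed to ship with the OFED packages; prints the OFED version
```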

Configure Network

  1. Check the status of the IB network.
$ ibstat
CA 'mlx5_0'
        CA type: MT4129
        Number of ports: 1
        Firmware version: 28.39.3004
        Hardware version: 0
        Node GUID: 0xa088c20300d6d136
        System image GUID: 0xa088c20300d6d136
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 400
                Base lid: 189
                LMC: 0
                SM lid: 1
                Capability mask: 0xa751e848
                Port GUID: 0xa088c20300d6d136
                Link layer: InfiniBand
If State is "Active" and Physical state is "LinkUp", the network card is up. If it is not in this state, try restarting the system and check again.
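
The state check can be automated by parsing the ibstat output. This is a minimal sketch under the assumption that the output follows the layout shown above; ib_port_ready is a hypothetical function name, and it reads the ibstat output from standard input.

```shell
#!/bin/sh
# Reads `ibstat` output on stdin; succeeds only if at least one port reports
# both "State: Active" and "Physical state: LinkUp".
ib_port_ready() {
    awk '
        /^[ \t]*Port [0-9]+:/      { state = ""; phys = "" }   # new port: reset
        /^[ \t]*State:/            { state = $2 }
        /^[ \t]*Physical state:/   { phys = $3 }
        state == "Active" && phys == "LinkUp" { ready = 1 }
        END { exit !ready }
    '
}

# Example:
#   ibstat | ib_port_ready && echo "IB port is active"
```
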
  2. Check the network port of the IB network.
$ ibdev2netdev
mlx5_0 port 1 ==> ibs7 (Down)

$ ibstatus mlx5_0
Infiniband device 'mlx5_0' port 1 status:
        default gid:     fe80:0000:0000:0000:a088:c203:00d6:d136
        base lid:        0xbd
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            400 Gb/sec (4X NDR)
        link_layer:      InfiniBand

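The output shows that the InfiniBand network interface is named ibs7. The (Down) marker refers to the network interface, which has not been configured or brought up yet; the IB port itself is ACTIVE. All subsequent configuration steps use this interface name.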
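
Because every later step depends on the interface name, it can help to extract it rather than copy it by hand. The following sketch assumes the ibdev2netdev line format shown above; ib_netdev_for is a hypothetical helper that reads that output from standard input.

```shell
#!/bin/sh
# Reads `ibdev2netdev` output on stdin and prints the network interface mapped
# to the given IB device (first argument) and port (second argument, default 1).
ib_netdev_for() {
    awk -v dev="$1" -v port="${2:-1}" \
        '$1 == dev && $2 == "port" && $3 == port { print $5; exit }'
}

# Example:
#   IB_IFACE=$(ibdev2netdev | ib_netdev_for mlx5_0)
#   echo "$IB_IFACE"
```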

  3. View the InfiniBand network segment allocation.

Open the instance information and go to the instance details page. The network segment configuration rules are shown at the "Device" prompt. You can refer to the screenshot.

For example, if the network card is assigned 100.0.n.4/24, choose n according to your requirements.

  4. Create an interface configuration.

Check the configuration file /etc/network/interfaces. If it doesn't exist, create it.

$ cd /etc/network/
$ cat interfaces
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet dhcp

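Check the configuration directory /etc/network/interfaces.d. If it doesn't exist, create it. Note that ifupdown reads files in this directory only if /etc/network/interfaces contains a source or source-directory line that points to it.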

$ cd /etc/network/
$ ll
total 40
drwxr-xr-x   7 root root  4096 Mar  8 09:37 ./
drwxr-xr-x 152 root root 12288 Feb 19 07:46 ../
drwxr-xr-x   2 root root  4096 Dec 10 22:00 if-down.d/
drwxr-xr-x   2 root root  4096 Dec 10 22:00 if-post-down.d/
drwxr-xr-x   2 root root  4096 Dec 10 22:00 if-pre-up.d/
drwxr-xr-x   2 root root  4096 Jan 31 02:53 if-up.d/
-rw-r--r--   1 root root   241 Jan 31 02:50 interfaces
drwxr-xr-x   2 root root  4096 Mar  8 09:40 interfaces.d/

Go to /etc/network/interfaces.d and create the file ifcfg-ibs7. If your interface name differs, replace ibs7 with it in both the file name and the file content, and check carefully that the two match.

With a network configuration of 100.0.n.4/24 and n = 5, set address to 100.0.5.4. The /24 prefix corresponds to netmask 255.255.255.0.

$ cd /etc/network/interfaces.d
$ sudo vim ifcfg-ibs7

auto ibs7
iface ibs7 inet static
	address 100.0.5.4
	netmask 255.255.255.0
	pre-up echo datagram > /sys/class/net/ibs7/mode || :
	pre-up /sbin/ifconfig ibs7 mtu 1500 || :
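
The address and netmask arithmetic above, and the resulting stanza, can be sketched as follows. This is an illustration only: the function names ib_address, prefix_to_netmask and render_ifupdown_stanza are hypothetical, and the 100.0.n.4 pattern is taken from the example in this article and may differ for your instance.

```shell
#!/bin/sh
# Prints the example address 100.0.<n>.4 for a given n.
ib_address() {
    echo "100.0.$1.4"
}

# Prints the dotted-decimal netmask for an IPv4 prefix length (0-32).
prefix_to_netmask() {
    prefix=$1
    mask=""
    for i in 1 2 3 4; do
        if [ "$prefix" -ge 8 ]; then
            octet=255
            prefix=$((prefix - 8))
        else
            # Partial octet: the top <prefix> bits are set.
            octet=$((256 - (1 << (8 - prefix))))
            prefix=0
        fi
        mask="${mask:+$mask.}$octet"
    done
    echo "$mask"
}

# Prints an ifupdown stanza for the interface, address and netmask given.
render_ifupdown_stanza() {
    iface=$1 address=$2 netmask=$3
    cat <<EOF
auto $iface
iface $iface inet static
	address $address
	netmask $netmask
	pre-up echo datagram > /sys/class/net/$iface/mode || :
	pre-up /sbin/ifconfig $iface mtu 1500 || :
EOF
}

# Example:
#   render_ifupdown_stanza ibs7 "$(ib_address 5)" "$(prefix_to_netmask 24)"
```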

Go to the /etc/systemd/network/ directory and create the file 10-ibs7.network. If your interface name differs, replace ibs7 with it in both the file name and the file content, and check carefully that the two match.

With a network configuration of 100.0.n.4/24 and n = 5, set Address=100.0.5.4/24.

$ cd /etc/systemd/network/
$ sudo vim 10-ibs7.network

[Match]
Name=ibs7

[Network]
Address=100.0.5.4/24

[Link]
MTUBytes=1500
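
Since the file name and the Name= value must agree, a small consistency check can catch a mismatch before the configuration is applied. This is a sketch assuming the NN-<interface>.network naming used above; network_unit_is_consistent is a hypothetical helper, not a systemd tool.

```shell
#!/bin/sh
# Succeeds if the first Name= value in a systemd .network file matches the
# interface name embedded in its file name (NN-<iface>.network).
network_unit_is_consistent() {
    file=$1
    base=$(basename "$file" .network)    # e.g. 10-ibs7
    expected=${base#*-}                  # strip the numeric prefix -> ibs7
    actual=$(sed -n 's/^Name=//p' "$file" | head -n 1)
    [ "$expected" = "$actual" ]
}

# Example:
#   network_unit_is_consistent /etc/systemd/network/10-ibs7.network \
#       && echo "file name and Name= match"
```
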
  5. Update the network configuration and enable the interface.
$ sudo netplan apply
$ sudo ifconfig ibs7 up
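  6. Check the IB network: confirm that ibs7 is UP with the configured address, then ping it. Pinging the local address only confirms that the address is assigned; to test the link itself, ping the IB address of another instance on the same network segment.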
$ ip -br a
lo               UNKNOWN        127.0.0.1/8 ::1/128 
eth0             UP             172.16.11.253/24 metric 100 fe80::222:10ff:fe94:6982/64 
ibs7             UP             100.0.5.4/24 fe80::a288:c203:d6:d136/64 

$ ping 100.0.5.4
PING 100.0.5.4 (100.0.5.4) 56(84) bytes of data.
64 bytes from 100.0.5.4: icmp_seq=1 ttl=64 time=0.129 ms
64 bytes from 100.0.5.4: icmp_seq=2 ttl=64 time=0.101 ms
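
The final check can be scripted by parsing the brief address listing. The sketch below assumes the `ip -br a` column layout shown above; iface_has_addr is a hypothetical helper that reads that output from standard input.

```shell
#!/bin/sh
# Reads `ip -br a` output on stdin; succeeds if the interface (first argument)
# is UP and carries the CIDR address (second argument).
iface_has_addr() {
    awk -v ifname="$1" -v cidr="$2" '
        $1 == ifname && $2 == "UP" {
            for (i = 3; i <= NF; i++) if ($i == cidr) found = 1
        }
        END { exit !found }
    '
}

# Example:
#   ip -br a | iface_has_addr ibs7 100.0.5.4/24 && echo "ibs7 is configured"
```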