
[Virtualization] SR-IOV Architecture

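SR-IOV stands for Single Root I/O Virtualization, a hardware-based virtualization solution that was standardized by the PCI-SIG in 2007 and promoted early on by Intel. In virtualization, CPU and memory were solved first, while I/O devices long lacked a good answer. Intel's VT-d (Virtualization Technology for Directed I/O) allows a PCIe device on the physical server to be handed directly to a virtual machine, which is what we usually call passthrough. The problem with passthrough is that a PCIe device can be given to only one virtual machine, leaving every other virtual machine without it. SR-IOV addresses this: one physical device can expose multiple virtual devices for virtual machines to use.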

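Basic Concepts

SR-IOV (Single Root I/O Virtualization) is a standard for sharing a PCIe device among virtual machines. By giving each virtual machine its own memory space, interrupts, and DMA streams, it lets data access bypass the VMM. Virtual functions can therefore share one physical device and perform I/O without CPU and hypervisor software overhead. SR-IOV is built on two kinds of PCIe functions: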

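  • PF (Physical Function): a full PCIe function, including the SR-IOV extended capability, which is used to configure and manage SR-IOV
    • With SR-IOV disabled, the host sees a single PF on each physical NIC
    • Each PF can have up to 64,000 VFs associated with it; the actual limit depends on the PCIe device's own configuration and on driver support
    • The PF creates VFs through registers that are designed with attributes dedicated to this purpose
  • VF (Virtual Function): a lightweight PCIe function. Each VF has its own PCI configuration space and may share the same physical resources with other VFs
    • Every VF is created and managed through a PF; once created, a VF can be assigned directly to a virtual machine or to an individual application
    • Each VF has a PCI memory space into which its register set is mapped. The VF device driver operates on that register set to enable the VF's functionality, and the VF appears as an actual PCI device
    • Once SR-IOV is enabled in the PF, the PCI configuration space of each VF can be accessed through the PF's bus, device, and function number (routing ID)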
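
The relationship between the PF's routing ID and the VF addresses can be made concrete with a little arithmetic. The SR-IOV capability of a PF carries a "First VF Offset" and a "VF Stride"; the routing ID of the N-th VF (counting from 1) is the PF routing ID plus the offset plus (N-1) times the stride. The sketch below is only an illustration: the function name vf_bdf is made up, the VF index is counted from 0 (as in the sysfs virtfnN links), and the offset 2 and stride 1 used in the usage note are assumptions that happen to match the ConnectX-5 listing later in this article, not values read from the device.

```shell
# vf_bdf PF_BDF FIRST_VF_OFFSET VF_STRIDE VF_INDEX
# Prints the PCI address (domain:bus:device.function) of a VF.
# VF_INDEX is 0-based, so it plays the role of (N-1) in the spec formula.
vf_bdf() {
    local pf offset stride index domain rest bus devfn dev fn rid
    pf=$1 offset=$2 stride=$3 index=$4

    # Split "dddd:bb:dd.f" into its hexadecimal fields.
    domain=${pf%%:*}; rest=${pf#*:}
    bus=${rest%%:*};  devfn=${rest#*:}
    dev=${devfn%.*};  fn=${devfn#*.}

    # 16-bit routing ID: bus in bits 15..8, device in 7..3, function in 2..0.
    rid=$(( (0x$bus << 8) | (0x$dev << 3) | 0x$fn ))

    # Routing ID of the requested VF.
    rid=$(( rid + offset + index * stride ))

    # Print it back in the device.function notation that lspci uses.
    printf '%s:%02x:%02x.%x\n' "$domain" \
        $(( (rid >> 8) & 0xff )) $(( (rid >> 3) & 0x1f )) $(( rid & 0x7 ))
}
```

For example, vf_bdf 0000:b3:00.0 2 1 10 prints 0000:b3:01.4: the VF address can spill over into the next "device" number because the routing ID is treated as one flat 16-bit value.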

Prerequisites

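  • The CPU must support an IOMMU (for example Intel VT-d or AMD's AMD-Vi; POWER8 processors support an IOMMU by default)
  • The firmware must support the IOMMU
  • The CPU root ports must support ACS or an ACS-equivalent capability
  • The PCIe device must support ACS or an ACS-equivalent capability
  • It is recommended that every PCIe switch between the root port and the PCIe device supports ACS. If a PCIe switch does not support ACS, all PCIe devices behind it end up in the same IOMMU group and can therefore only be assigned to a single virtual machine.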
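
Whether these prerequisites are actually met shows up in how the kernel groups devices. As a rough sketch (the function name list_iommu_groups is made up, and it assumes a Linux host that exposes IOMMU groups under /sys/kernel/iommu_groups), the following lists every device together with its group, so that devices sharing a group are easy to spot:

```shell
# list_iommu_groups [ROOT]
# Prints one "group <n>: <pci-address>" line per device.
# ROOT defaults to the usual sysfs location; it is a parameter only so the
# function can also be pointed at a copy of that tree.
list_iommu_groups() {
    local root dev group
    root=${1:-/sys/kernel/iommu_groups}
    for dev in "$root"/*/devices/*; do
        # Skip the unexpanded pattern when no groups exist (IOMMU disabled).
        [ -e "$dev" ] || continue
        group=${dev#"$root"/}     # strip the root prefix ...
        group=${group%%/*}        # ... and keep only the group number
        printf 'group %s: %s\n' "$group" "${dev##*/}"
    done
}
```

If the function prints nothing, the IOMMU is most likely disabled in the firmware or on the kernel command line. A VF that shares its group with unrelated devices usually points to a missing ACS capability somewhere on the path.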

Usage Examples

Check whether the NIC supports SR-IOV; the Capabilities entries in the output below include SR-IOV:

$ lspci -v | grep -A 20 Mellanox
0000:b3:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
Subsystem: Mellanox Technologies Device 0007
Physical Slot: 44
Flags: bus master, fast devsel, latency 0, IRQ 53, NUMA node 1
Memory at 3ef806000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at ee100000 [disabled] [size=1M]
Capabilities: [60] Express Endpoint, MSI 00
Capabilities: [48] Vital Product Data
Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
Capabilities: [c0] Vendor Specific Information: Len=18 <?>
Capabilities: [40] Power Management version 3
Capabilities: [100] Advanced Error Reporting
Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
Capabilities: [180] Single Root I/O Virtualization (SR-IOV)
Capabilities: [1c0] #19
Capabilities: [230] Access Control Services
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core

Create VFs, where num_vfs is the number of VFs to create. The maximum number of VFs the device supports can be read from /sys/class/net/<eth>/device/sriov_totalvfs.

$ echo ${num_vfs} > /sys/class/net/eth2/device/sriov_numvfs
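
A slightly more defensive variant of this step might look like the sketch below. The function name set_num_vfs is made up for illustration. It checks the request against sriov_totalvfs and writes 0 before the new value, since the kernel rejects a change from one non-zero VF count directly to another.

```shell
# set_num_vfs DEVICE_DIR COUNT
# DEVICE_DIR is the PF's sysfs device directory,
# e.g. /sys/class/net/eth2/device
set_num_vfs() {
    local devdir want total
    devdir=$1 want=$2

    if [ ! -r "$devdir/sriov_totalvfs" ]; then
        echo "no SR-IOV support found under $devdir" >&2
        return 1
    fi

    total=$(cat "$devdir/sriov_totalvfs")
    if [ "$want" -gt "$total" ]; then
        echo "requested $want VFs, but the device supports at most $total" >&2
        return 1
    fi

    # Remove any existing VFs first, then create the requested number.
    echo 0 > "$devdir/sriov_numvfs"
    echo "$want" > "$devdir/sriov_numvfs"
}
```

Run as root, set_num_vfs /sys/class/net/eth2/device 48 would correspond to the command above with num_vfs=48. Note that resetting to 0 destroys existing VFs, so this should not be run while VFs are still in use by virtual machines.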

lspci now lists the newly created Virtual Functions. The current environment has two RDMA NICs, and 48 VFs were created on one of them, eth2.

$ lspci | grep Mellanox
0000:b3:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
0000:b3:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
0000:b3:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
0000:b3:00.3 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
0000:b3:00.4 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
0000:b3:00.5 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
0000:b3:00.6 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
0000:b3:00.7 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
0000:b3:01.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
0000:b3:01.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
0000:b3:01.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
0000:b3:01.3 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
0000:b3:01.4 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
0000:b3:01.5 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
0000:b3:01.6 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
0000:b3:01.7 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
0000:b3:02.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
0000:b3:02.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
0000:b3:02.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
0000:b3:02.3 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
0000:b3:02.4 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
...

At this point, list the network interfaces (the output below is in the format printed by ip link show):

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 52:54:00:d9:2b:9b brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether bc:97:e1:16:d3:a5 brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
link/ether 0c:42:a1:73:5c:26 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 00:1f:73:5c:26:00, spoof checking off, link-state disable, trust off, query_rss off
vf 1 MAC 00:1f:73:5c:26:01, spoof checking off, link-state disable, trust off, query_rss off
vf 2 MAC 00:1f:73:5c:26:02, spoof checking off, link-state disable, trust off, query_rss off
vf 3 MAC 00:1f:73:5c:26:03, spoof checking off, link-state disable, trust off, query_rss off
vf 4 MAC 00:1f:73:5c:26:04, spoof checking off, link-state disable, trust off, query_rss off
vf 5 MAC 00:1f:73:5c:26:05, spoof checking off, link-state disable, trust off, query_rss off
vf 6 MAC 00:1f:73:5c:26:06, spoof checking off, link-state disable, trust off, query_rss off
vf 7 MAC 00:1f:73:5c:26:07, spoof checking off, link-state disable, trust off, query_rss off
vf 8 MAC 00:1f:73:5c:26:08, spoof checking off, link-state disable, trust off, query_rss off
vf 9 MAC 00:1f:73:5c:26:09, spoof checking off, link-state disable, trust off, query_rss off
vf 10 MAC 00:1f:73:5c:26:10, spoof checking off, link-state disable, trust off, query_rss off
vf 11 MAC 00:1f:73:5c:26:11, spoof checking off, link-state disable, trust off, query_rss off
vf 12 MAC 00:1f:73:5c:26:12, spoof checking off, link-state disable, trust off, query_rss off
vf 13 MAC 00:1f:73:5c:26:13, spoof checking off, link-state disable, trust off, query_rss off
vf 14 MAC 00:1f:73:5c:26:14, spoof checking off, link-state disable, trust off, query_rss off
vf 15 MAC 00:1f:73:5c:26:15, spoof checking off, link-state disable, trust off, query_rss off
vf 16 MAC 00:1f:73:5c:26:16, spoof checking off, link-state disable, trust off, query_rss off
vf 17 MAC 00:1f:73:5c:26:17, spoof checking off, link-state disable, trust off, query_rss off
vf 18 MAC 00:1f:73:5c:26:18, spoof checking off, link-state disable, trust off, query_rss off
vf 19 MAC 00:1f:73:5c:26:19, spoof checking off, link-state disable, trust off, query_rss off
vf 20 MAC 00:1f:73:5c:26:20, spoof checking off, link-state disable, trust off, query_rss off
vf 21 MAC 00:1f:73:5c:26:21, spoof checking off, link-state disable, trust off, query_rss off
vf 22 MAC 00:1f:73:5c:26:22, spoof checking off, link-state disable, trust off, query_rss off
vf 23 MAC 00:1f:73:5c:26:23, spoof checking off, link-state disable, trust off, query_rss off
vf 24 MAC 00:1f:73:5c:26:24, spoof checking off, link-state disable, trust off, query_rss off
vf 25 MAC 00:1f:73:5c:26:25, spoof checking off, link-state disable, trust off, query_rss off
vf 26 MAC 00:1f:73:5c:26:26, spoof checking off, link-state disable, trust off, query_rss off
vf 27 MAC 00:1f:73:5c:26:27, spoof checking off, link-state disable, trust off, query_rss off
vf 28 MAC 00:1f:73:5c:26:28, spoof checking off, link-state disable, trust off, query_rss off
vf 29 MAC 00:1f:73:5c:26:29, spoof checking off, link-state disable, trust off, query_rss off
vf 30 MAC 00:1f:73:5c:26:30, spoof checking off, link-state disable, trust off, query_rss off
vf 31 MAC 00:1f:73:5c:26:31, spoof checking off, link-state disable, trust off, query_rss off
vf 32 MAC 00:1f:73:5c:26:32, spoof checking off, link-state disable, trust off, query_rss off
vf 33 MAC 00:1f:73:5c:26:33, spoof checking off, link-state disable, trust off, query_rss off
vf 34 MAC 00:1f:73:5c:26:34, spoof checking off, link-state disable, trust off, query_rss off
vf 35 MAC 00:1f:73:5c:26:35, spoof checking off, link-state disable, trust off, query_rss off
vf 36 MAC 00:1f:73:5c:26:36, spoof checking off, link-state disable, trust off, query_rss off
vf 37 MAC 00:1f:73:5c:26:37, spoof checking off, link-state disable, trust off, query_rss off
vf 38 MAC 00:1f:73:5c:26:38, spoof checking off, link-state disable, trust off, query_rss off
vf 39 MAC 00:1f:73:5c:26:39, spoof checking off, link-state disable, trust off, query_rss off
vf 40 MAC 00:1f:73:5c:26:40, spoof checking off, link-state disable, trust off, query_rss off
vf 41 MAC 00:1f:73:5c:26:41, spoof checking off, link-state disable, trust off, query_rss off
vf 42 MAC 00:1f:73:5c:26:42, spoof checking off, link-state disable, trust off, query_rss off
vf 43 MAC 00:1f:73:5c:26:43, spoof checking off, link-state disable, trust off, query_rss off
vf 44 MAC 00:1f:73:5c:26:44, spoof checking off, link-state disable, trust off, query_rss off
vf 45 MAC 00:1f:73:5c:26:45, spoof checking off, link-state disable, trust off, query_rss off
vf 46 MAC 00:1f:73:5c:26:46, spoof checking off, link-state disable, trust off, query_rss off
vf 47 MAC 00:1f:73:5c:26:47, spoof checking off, link-state disable, trust off, query_rss off
5: eth3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
link/ether 0c:42:a1:73:5c:26 brd ff:ff:ff:ff:ff:ff
258060: dev258060: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 00:1f:73:5c:26:11 brd ff:ff:ff:ff:ff:ff
258061: dev258061: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 00:1f:73:5c:26:12 brd ff:ff:ff:ff:ff:ff
258062: eth8: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 00:1f:73:5c:26:13 brd ff:ff:ff:ff:ff:ff
258063: eth9: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 00:1f:73:5c:26:14 brd ff:ff:ff:ff:ff:ff
258064: eth41: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 00:1f:73:5c:26:15 brd ff:ff:ff:ff:ff:ff
258065: eth10: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 00:1f:73:5c:26:16 brd ff:ff:ff:ff:ff:ff
258066: eth11: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 00:1f:73:5c:26:17 brd ff:ff:ff:ff:ff:ff
258067: eth12: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 00:1f:73:5c:26:18 brd ff:ff:ff:ff:ff:ff
258068: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 00:1f:73:5c:26:19 brd ff:ff:ff:ff:ff:ff
258069: eth13: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
....
411: ens44f0_0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
link/ether be:de:6e:7b:37:b5 brd ff:ff:ff:ff:ff:ff
413: ens44f0_1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
link/ether 6a:3f:66:8c:e0:5c brd ff:ff:ff:ff:ff:ff
415: ens44f0_2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
link/ether de:c2:13:60:77:74 brd ff:ff:ff:ff:ff:ff
417: ens44f0_3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
link/ether 66:07:7e:ce:b8:8f brd ff:ff:ff:ff:ff:ff
418: ens44f0_4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
link/ether 66:cf:35:4a:53:54 brd ff:ff:ff:ff:ff:ff
419: ens44f0_5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
link/ether 32:50:06:18:3c:06 brd ff:ff:ff:ff:ff:ff
420: ens44f0_6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
link/ether 06:12:97:50:33:fa brd ff:ff:ff:ff:ff:ff
424: ens44f0_7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
link/ether 86:f4:65:79:cc:a7 brd ff:ff:ff:ff:ff:ff
...
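
The per-VF attributes shown on the "vf N" lines above (MAC, spoof checking, link-state, trust) are set from the PF side with ip link set. The following sketch only prints the commands instead of running them, so they can be reviewed first. The function name vf_config_cmds and the example values in the usage note are assumptions for illustration.

```shell
# vf_config_cmds PF VF_INDEX MAC VLAN_ID
# Prints (does not execute) the ip commands that configure one VF
# through its PF.
vf_config_cmds() {
    local pf vf mac vlan
    pf=$1 vf=$2 mac=$3 vlan=$4
    echo "ip link set dev $pf vf $vf mac $mac"       # fixed MAC for the VF
    echo "ip link set dev $pf vf $vf vlan $vlan"     # VLAN tag applied by the NIC
    echo "ip link set dev $pf vf $vf spoofchk on"    # drop frames with a forged source MAC
    echo "ip link set dev $pf vf $vf state auto"     # VF link state follows the PF link
}
```

For example, vf_config_cmds eth2 0 00:1f:73:5c:26:00 42 prints four commands for VF 0 of eth2; piping the output to sh as root applies them.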

The PCI addresses of the individual VFs of eth2 are as follows:

$ ls -l /sys/class/net/eth2/device/virtfn*
lrwxrwxrwx 1 root root 0 Mar 17 19:26 /sys/class/net/eth2/device/virtfn0 -> ../0000:b3:00.2
lrwxrwxrwx 1 root root 0 Mar 17 19:26 /sys/class/net/eth2/device/virtfn1 -> ../0000:b3:00.3
lrwxrwxrwx 1 root root 0 Mar 17 19:26 /sys/class/net/eth2/device/virtfn10 -> ../0000:b3:01.4
lrwxrwxrwx 1 root root 0 Mar 17 19:26 /sys/class/net/eth2/device/virtfn11 -> ../0000:b3:01.5
lrwxrwxrwx 1 root root 0 Mar 17 19:26 /sys/class/net/eth2/device/virtfn12 -> ../0000:b3:01.6

The PCI address leads to the network device name of a VF; here 0000:b3:00.2 corresponds to the device dev506:

$ ls -l /sys/class/net/ | grep 0000:b3:00.2
lrwxrwxrwx 1 root root 0 Mar 9 09:35 dev506 -> ../../devices/pci0000:ae/0000:ae:00.0/0000:af:00.0/0000:b0:0c.0/0000:b3:00.2/net/dev506
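
The two lookups above (VF index to PCI address, then PCI address to network device name) can be combined. The sketch below, with the made-up function name list_vf_netdevs, walks the virtfnN links of a PF and prints one line per VF. It assumes the usual sysfs layout, in which each VF's device directory contains a net/ subdirectory while a network driver is bound to it.

```shell
# list_vf_netdevs PF_DEVICE_DIR
# Prints "<vf-index> <pci-address> <netdev>" for every VF of the PF,
# sorted numerically by VF index. "-" means no netdev is present
# (for example because the VF is bound to vfio-pci or lives in a VM).
list_vf_netdevs() {
    local pfdev link idx addr name n
    pfdev=$1
    for link in "$pfdev"/virtfn*; do
        [ -e "$link" ] || continue
        idx=${link##*/virtfn}
        # virtfnN is a symlink such as ../0000:b3:00.2
        addr=$(basename "$(readlink "$link")")
        name=-
        for n in "$link"/net/*; do
            [ -e "$n" ] && name=${n##*/}
        done
        printf '%s %s %s\n' "$idx" "$addr" "$name"
    done | sort -n
}
```

Called as list_vf_netdevs /sys/class/net/eth2/device, this avoids the lexical ordering of the ls output, where virtfn10 is listed before virtfn2.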

Attaching a VF to a virtual machine with libvirt

$ cat >/tmp/interface.xml <<EOF
<interface type='hostdev' managed='yes'>
<source>
<address type='pci' domain='0' bus='11' slot='16' function='0'/>
</source>
</interface>
EOF
$ virsh attach-device MyGuest /tmp/interface.xml --live --config
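
The address in the XML above uses decimal values (bus='11', slot='16'), which is easy to get wrong when copying from lspci, whose output is hexadecimal. libvirt also accepts hexadecimal values with a 0x prefix, so one option is to generate the element straight from the PCI address of the VF. The helper name hostdev_interface_xml below is made up for this sketch, and it covers only the minimal interface definition shown above.

```shell
# hostdev_interface_xml PCI_ADDRESS
# PCI_ADDRESS has the form dddd:bb:ss.f, as printed by lspci -D.
hostdev_interface_xml() {
    local addr domain rest bus devfn slot fn
    addr=$1

    # Split the address into its hexadecimal fields.
    domain=${addr%%:*}; rest=${addr#*:}
    bus=${rest%%:*};    devfn=${rest#*:}
    slot=${devfn%.*};   fn=${devfn#*.}

    cat <<EOF
<interface type='hostdev' managed='yes'>
  <source>
    <address type='pci' domain='0x$domain' bus='0x$bus' slot='0x$slot' function='0x$fn'/>
  </source>
</interface>
EOF
}
```

For instance, hostdev_interface_xml 0000:b3:00.2 > /tmp/interface.xml produces a file that can be passed to virsh attach-device in the same way as the hand-written one.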

A MAC address and a VLAN can of course also be configured for the interface:

<interface type='hostdev' managed='yes'>
<source>
<address type='pci' domain='0' bus='11' slot='16' function='0'/>
</source>
<mac address='52:54:00:6d:90:02'/>
<vlan>
<tag id='42'/>
</vlan>
<virtualport type='802.1Qbh'>
<parameters profileid='finance'/>
</virtualport>
</interface>

Attaching a VF to a virtual machine with QEMU

/usr/bin/qemu-kvm -name vdisk -enable-kvm -m 512 -smp 2 \
-hda /mnt/nfs/vdisk.img \
-monitor stdio \
-vnc 0.0.0.0:0 \
-device pci-assign,host=0b:00.0
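
The pci-assign device used above belongs to the legacy KVM device-assignment mechanism. On newer kernels and QEMU releases, VFIO is the usual route, and the corresponding option is -device vfio-pci,host=<address>. Before QEMU can use the VF, it has to be bound to the vfio-pci driver on the host. The following is a sketch of that binding step through the generic sysfs driver_override interface. The function name vfio_bind is made up, and it assumes the vfio-pci module is already loaded (for example with modprobe vfio-pci).

```shell
# vfio_bind PCI_ADDRESS [SYSFS_ROOT]
# Rebinds one PCI device (e.g. a VF such as 0000:b3:00.2) to vfio-pci.
# SYSFS_ROOT defaults to /sys and exists only to make dry runs possible.
vfio_bind() {
    local bdf root dev
    bdf=$1
    root=${2:-/sys}
    dev=$root/bus/pci/devices/$bdf

    if [ ! -d "$dev" ]; then
        echo "no such PCI device: $bdf" >&2
        return 1
    fi

    # Tell the PCI core that only vfio-pci may claim this device.
    echo vfio-pci > "$dev/driver_override"

    # Detach the driver that currently owns the device, if any.
    if [ -e "$dev/driver" ]; then
        echo "$bdf" > "$dev/driver/unbind"
    fi

    # Trigger driver matching again so that vfio-pci picks the device up.
    echo "$bdf" > "$root/bus/pci/drivers_probe"
}
```

After running vfio_bind 0000:b3:00.2 as root, the last line of the QEMU command would become -device vfio-pci,host=b3:00.2 instead of the pci-assign form.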

On VMware ESXi, SR-IOV support follows the same concepts described above and requires roughly the following configuration:

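  • Enable SR-IOV for the NIC in the BIOS
  • Install the MFT vib tools on ESXi to manage and configure the NIC firmware
  • Enable SR-IOV in the NIC firmware and set the maximum number of VFs
  • Enable SR-IOV in the ESXi NIC driver and set the number of VFs; this requires an ESXi reboot
  • Create the corresponding vSwitch and attach the PF to it as the uplink NIC
  • Create the virtual machine, add a VF as an SR-IOV network adapter, and select the vSwitch that the PF is attached to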

Architecture Comparison

SR-IOV vs PCI pass-through

Architectural comparison (using a NIC as the example)

[Figure: kvm-performance-optimization-for-ubuntu]

[Figure: detailed comparison of Virtio and Pass-Through]

Image sources: SlideShare, "Kvm performance optimization for ubuntu"; "KVM 介绍(4):I/O 设备直接分配和 SR-IOV [KVM PCI/PCIe Pass-Through SR-IOV]"

SR-IOV vs DPDK

[Figures: sdn-fundamentals-for-nfv-openstack-and-containers-red-hat-summit]

Feature Summary

SR-IOV Advantages

  • Good performance
  • Lower host CPU consumption

Pros:

  • More Scalable than Direct Assign
  • Security through IOMMU and function isolation
  • Control Plane separation through PF/VF notion
  • High packet rate, Low CPU, Low latency thanks to Direct Pass through

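SR-IOV Disadvantages

  • A virtual machine that uses a VF cannot use advanced features such as memory overcommit, snapshots, or live migration
  • Configuration and management are complex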

  • Rigid: Composability issues

  • Control plane is pass through, puts pressure on Hardware resources
  • Parts of the PCIe config space are direct map from Hardware
  • Limited scalability (16 bit)
  • SR-IOV NIC forces switching features into the HW
  • All the Switching Features in the Hardware or nothing

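Advantages of SR-IOV over software-emulated I/O virtualization:

1. Lower I/O latency and CPU usage, with near-native I/O performance, because the virtual machine uses the VFs directly and no VMM trap handling is involved.

2. Better data security: each VF belongs to an IOMMU group, devices that share an IOMMU group cannot be assigned to different virtual machines, and each IOMMU group has its own isolated memory.

Advantages of SR-IOV over device assignment:

It removes the awkward limitation that one PCI device can serve only one virtual machine: with SR-IOV, multiple virtual machines share one PCI device, each holding its own VFs exclusively.

Disadvantages of SR-IOV:

A virtual machine that uses VFs cannot be live-migrated.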

References