<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Houmin</title>
  
  <subtitle>Yesterday You Said Tomorrow</subtitle>
  <link href="/atom.xml" rel="self"/>
  
  <link href="https://houmin.cc/"/>
  <updated>2022-11-09T15:13:45.394Z</updated>
  <id>https://houmin.cc/</id>
  
  <author>
    <name>Houmin</name>
    
  </author>
  
  <generator uri="https://hexo.io/">Hexo</generator>
  
  <entry>
    <title>Developing eBPF Programs in Go</title>
    <link href="https://houmin.cc/posts/adca5ae5/"/>
    <id>https://houmin.cc/posts/adca5ae5/</id>
    <published>2021-03-31T06:28:57.000Z</published>
    <updated>2022-11-09T15:13:45.394Z</updated>
    
    <content type="html"><![CDATA[<link rel="stylesheet" class="aplayer-secondary-style-marker" href="/assets/css/APlayer.min.css"><script src="/assets/js/APlayer.min.js" class="aplayer-secondary-script-marker"></script><script class="meting-secondary-script-marker" src="/assets/js/Meting.min.js"></script><p>The earlier post <a href="https://houmin.cc/posts/2c811c2c/">Introduction to eBPF</a> walked through developing and loading eBPF code from the kernel source tree. This post covers developing eBPF programs with Go and its supporting libraries; all of the code referenced here is available on my <a href="https://github.com/SimpCosm/godemo/tree/master/ebpf" target="_blank" rel="external nofollow noopener noreferrer">GitHub</a>.</p><a id="more"></a><h2 id="选择-eBPF-库"><a href="#选择-eBPF-库" class="headerlink" title="选择 eBPF 库"></a>Choosing an eBPF Library</h2><p>Choosing libraries and tools for interacting with eBPF can be confusing. The options include the Python-based <a href="https://github.com/iovisor/bcc" target="_blank" rel="external nofollow noopener noreferrer">BCC</a> framework, the C-based <a href="https://github.com/libbpf/libbpf" target="_blank" rel="external nofollow noopener noreferrer">libbpf</a>, and a range of Go-based libraries from <a href="https://github.com/dropbox/goebpf" target="_blank" rel="external nofollow noopener noreferrer">Dropbox</a>, <a href="https://github.com/cilium/ebpf" target="_blank" rel="external nofollow noopener noreferrer">Cilium</a>, <a href="https://github.com/aquasecurity/tracee/tree/main/libbpfgo" target="_blank" rel="external nofollow noopener noreferrer">Aqua</a>, and <a href="https://github.com/projectcalico/felix/tree/master/bpf" target="_blank" rel="external nofollow noopener noreferrer">Calico</a>.</p><p>In most cases, an eBPF library helps with two things:</p><ul><li><strong>Loading eBPF programs and maps</strong> into the kernel and performing <a href="https://kinvolk.io/blog/2018/10/exploring-bpf-elf-loaders-at-the-bpf-hackfest/#common-steps" target="_blank" rel="external nofollow noopener noreferrer">relocation</a>, which associates each eBPF program with the correct maps through their file descriptors.</li><li><strong>Interacting with eBPF maps</strong>, allowing standard CRUD operations on the key/value pairs stored in them.</li></ul><p>Some libraries can also help you attach an eBPF program to a specific <a href="https://ebpf.io/what-is-ebpf/#hook-overview" target="_blank" rel="external 
nofollow noopener noreferrer">hook</a>, although for networking use cases this can often be done just as easily with an existing netlink API library.</p><p>The choice of eBPF library remains confusing (see <a href="https://twitter.com/maurovasquezb/status/1146438190062063616" target="_blank" rel="external nofollow noopener noreferrer">[1]</a>, <a href="https://twitter.com/qeole/status/1364521385138282497" target="_blank" rel="external nofollow noopener noreferrer">[2]</a>). The reality is that each library has its own scope and limitations.</p><ul><li><a href="https://pkg.go.dev/github.com/projectcalico/felix@v3.8.9+incompatible/bpf" target="_blank" rel="external nofollow noopener noreferrer">Calico</a> implements a Go wrapper around the CLI commands provided by <a href="https://twitter.com/qeole/status/1101450782841466880" target="_blank" rel="external nofollow noopener noreferrer">bpftool</a> and iproute2.</li><li><a href="https://github.com/aquasecurity/tracee/tree/main/tracee-ebpf" target="_blank" rel="external nofollow noopener noreferrer">Aqua</a> implements a Go wrapper around the libbpf C library.</li><li><a href="https://github.com/dropbox/goebpf" target="_blank" rel="external nofollow noopener noreferrer">Dropbox</a> supports only a small set of programs, but has a very clean and convenient user API.</li><li>IO Visor's <a href="https://github.com/iovisor/gobpf" target="_blank" rel="external nofollow noopener noreferrer">gobpf</a> is the Go binding for the BCC framework and focuses more on tracing and profiling.</li><li><a href="https://github.com/cilium/ebpf" target="_blank" rel="external nofollow noopener noreferrer">Cilium and Cloudflare</a> maintain a <a href="https://linuxplumbersconf.org/event/4/contributions/449/attachments/239/529/A_pure_Go_eBPF_library.pdf" target="_blank" rel="external nofollow noopener noreferrer">library written in pure Go</a> (referred to below as "libbpf-go") that abstracts all eBPF system calls behind a native Go interface.</li></ul><p>According to <a href="https://www.ebpf.top/post/ebpf_go/" target="_blank" rel="external nofollow noopener noreferrer">Managing and Distributing eBPF Programs with Go</a>, <code>cilium/ebpf</code> is the more actively maintained project, so this post builds on it as well. <code>cilium/ebpf</code> is written in pure Go, which keeps dependencies to a minimum. It also ships the <code>bpf2go</code> tool, which compiles an eBPF program and embeds it in Go code, making delivery easier; combined with CO-RE, it becomes considerably more powerful.</p><h2 id="环境准备"><a href="#环境准备" 
class="headerlink" title="环境准备"></a>Environment Setup</h2><p>An eBPF project generally consists of two parts:</p><ol><li>An eBPF program written in C, compiled with <code>clang/llvm</code> into an <code>elf</code> file; this is the part that gets loaded into the kernel.</li><li>A Go program that loads and debugs the eBPF program; this is the user-space part, used to configure the eBPF program or read the data it produces.</li></ol><p>As a prerequisite, install the <code>clang/llvm</code> toolchain:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Install the LLVM toolchain; clang 9.0 or later is required</span></span><br><span class="line">$ sudo apt update -y</span><br><span class="line">$ sudo apt install -y llvm</span><br><span class="line">$ sudo apt install -y clang</span><br></pre></td></tr></table></figure><p>The code can be downloaded from my GitHub; the directory layout is as follows:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line">[root@VM-4-27-centos demo]<span class="comment"># tree</span></span><br><span class="line">.</span><br><span class="line">|-- bpf</span><br><span class="line">|   |-- headers</span><br><span class="line">|   |   |-- bpf_core_read.h</span><br><span class="line">|   |   |-- bpf_helper_defs.h</span><br><span class="line">|   |   |-- bpf_helpers.h</span><br><span class="line">|   |   |-- bpf_tracing.h</span><br><span class="line">|   |   |-- update.sh</span><br><span class="line">|   |   `-- vmlinux.h</span><br><span class="line">|   `-- 
kprobe.c</span><br><span class="line">|-- Dockerfile</span><br><span class="line">|-- go.mod</span><br><span class="line">|-- go.sum</span><br><span class="line">|-- main.go</span><br><span class="line">`-- Makefile</span><br></pre></td></tr></table></figure><h2 id="编程规范"><a href="#编程规范" class="headerlink" title="编程规范"></a>Programming Conventions</h2><h3 id="BPF-代码"><a href="#BPF-代码" class="headerlink" title="BPF 代码"></a>BPF Code</h3><h4 id="以-kprobe-为例"><a href="#以-kprobe-为例" class="headerlink" title="以 kprobe 为例"></a>A kprobe Example</h4><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// +build ignore</span></span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">"vmlinux.h"</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">"bpf_helpers.h"</span></span></span><br><span class="line"></span><br><span class="line"><span class="keyword">char</span> __license[] 
SEC(<span class="string">"license"</span>) = <span class="string">"Dual MIT/GPL"</span>;</span><br><span class="line"></span><br><span class="line"><span class="function">struct bpf_map_def <span class="title">SEC</span><span class="params">(<span class="string">"maps"</span>)</span> kprobe_map </span>= &#123;</span><br><span class="line">    .type = BPF_MAP_TYPE_ARRAY,</span><br><span class="line">    .key_size = <span class="keyword">sizeof</span>(u32),</span><br><span class="line">    .value_size = <span class="keyword">sizeof</span>(u64),</span><br><span class="line">    .max_entries = <span class="number">1</span>,</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line">SEC(<span class="string">"kprobe/sys_execve"</span>)</span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">kprobe_execve</span><span class="params">()</span> </span>&#123;</span><br><span class="line">    u32 key = <span class="number">0</span>;</span><br><span class="line">    u64 initval = <span class="number">1</span>, *valp;</span><br><span class="line"></span><br><span class="line">    valp = bpf_map_lookup_elem(&amp;kprobe_map, &amp;key);</span><br><span class="line">    <span class="keyword">if</span> (!valp) &#123;</span><br><span class="line">        bpf_map_update_elem(&amp;kprobe_map, &amp;key, &amp;initval, BPF_ANY);</span><br><span class="line">        <span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">    &#125;</span><br><span class="line">    __sync_fetch_and_add(valp, <span class="number">1</span>);</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h4 id="头文件"><a href="#头文件" class="headerlink" title="头文件"></a>Header Files</h4><h5 id="libbpf"><a href="#libbpf" class="headerlink" 
title="libbpf"></a>libbpf</h5><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#!/usr/bin/env bash</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Version of libbpf to fetch headers from</span></span><br><span class="line">LIBBPF_VERSION=0.5.0</span><br><span class="line"></span><br><span class="line"><span class="comment"># The headers we want</span></span><br><span class="line">prefix=libbpf-<span class="string">"<span class="variable">$LIBBPF_VERSION</span>"</span></span><br><span class="line">headers=(</span><br><span class="line">    <span class="string">"<span class="variable">$prefix</span>"</span>/src/bpf_core_read.h</span><br><span class="line">    <span class="string">"<span class="variable">$prefix</span>"</span>/src/bpf_helper_defs.h</span><br><span class="line">    <span class="string">"<span class="variable">$prefix</span>"</span>/src/bpf_helpers.h</span><br><span class="line">    <span class="string">"<span class="variable">$prefix</span>"</span>/src/bpf_tracing.h</span><br><span class="line">)</span><br><span class="line"></span><br><span class="line"><span class="comment"># Fetch libbpf release and extract the desired headers</span></span><br><span class="line">curl -sL <span class="string">"https://github.com/libbpf/libbpf/archive/refs/tags/v<span 
class="variable">$&#123;LIBBPF_VERSION&#125;</span>.tar.gz"</span> | \</span><br><span class="line">    tar -xz --xform=<span class="string">'s#.*/##'</span> <span class="string">"<span class="variable">$&#123;headers[@]&#125;</span>"</span></span><br></pre></td></tr></table></figure><h5 id="vmlinux-h"><a href="#vmlinux-h" class="headerlink" title="vmlinux.h"></a>vmlinux.h</h5><p><code>vmlinux.h</code> is a generated header. It contains every type definition used by the source of the running Linux kernel. Building the Linux kernel produces an artifact called <code>vmlinux</code>, an <a href="https://en.wikipedia.org/wiki/Executable_and_Linkable_Format" target="_blank" rel="external nofollow noopener noreferrer">ELF</a> binary containing the compiled, bootable kernel. Major Linux distributions usually ship the <code>vmlinux</code> file as well.</p><p>One of the features of bpftool, which lives in the kernel tree, is reading the <code>vmlinux</code> file and generating the corresponding <code>vmlinux.h</code> header. Because <code>vmlinux.h</code> contains every type definition used by the running kernel, the file is quite large.</p><p>The following command generates <code>vmlinux.h</code>:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">$ bpftool btf dump file /sys/kernel/btf/vmlinux format c &gt; vmlinux.h</span><br></pre></td></tr></table></figure><p>Including <code>vmlinux.h</code> means our program can use every data type definition the kernel uses, so when a BPF program reads kernel memory it can map that memory onto the corresponding struct and read it field by field.</p><p>For example, Linux represents a process with the <a href="https://elixir.bootlin.com/linux/latest/source/include/linux/sched.h#L649" target="_blank" rel="external nofollow noopener noreferrer">task_struct</a> structure. If a BPF program needs to inspect values in a <code>task_struct</code>, it first needs to know that structure's exact type definition.</p><p><img alt="vmlinux" data-src="https://www.grant.pizza/libbpf/vmlinux.png"></p><p>Because <code>vmlinux.h</code> is generated from the currently running kernel, a compiled eBPF program may break if you try to run it on another machine with a different kernel version. The main reason is that the definitions of these data types can change between versions of the Linux source.</p><p>However, the facilities provided by libbpf make "CO-RE" (Compile Once, Run Everywhere) possible. libbpf defines a number of macros (such as BPF_CORE_READ) that record which fields of the types defined in <code>vmlinux.h</code> the eBPF program tries to access. If an accessed field has moved within the structure as defined by the target kernel, the macros and helpers locate it automatically. For fields that may have disappeared, there is a corresponding helper, bpf_core_field_exists. As a result, we can take the <code>vmlinux.h</code> 
header generated from the current kernel, compile the eBPF program against it, and then run it on a different kernel (provided the target kernel is also built with BTF support).</p><h3 id="代码编译"><a href="#代码编译" class="headerlink" title="代码编译"></a>Compiling the Code</h3><h4 id="bpf2go"><a href="#bpf2go" class="headerlink" title="bpf2go"></a>bpf2go</h4><p>The <code>go:generate</code> annotation in the snippet below runs <code>bpf2go</code> to compile <code>kprobe.c</code> into two files, <code>bpfdemo_bpfeb.go</code> and <code>bpfdemo_bpfel.go</code>, for <code>bigendian</code> and <code>littleendian</code> platforms respectively.</p><p>The <code>BPFDemo</code> argument becomes the prefix of the identifiers used in <code>main.go</code>, for example <code>objs := BPFDemoObjects{}</code> and <code>LoadBPFDemoObjects(&amp;objs, nil)</code>.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// SPDX-License-Identifier: GPL-2.0-only</span></span><br><span class="line"><span class="comment">// Copyright (C) 2021 Authors of Nylon */</span></span><br><span class="line"></span><br><span class="line"><span class="comment">//go:generate sh -c "echo Generating for amd64"</span></span><br><span class="line"><span class="comment">//go:generate go run github.com/cilium/ebpf/cmd/bpf2go -cc clang BPFDemo ./bpf/kprobe.c -- -DOUTPUT_SKB -D__TARGET_ARCH_x86 -I./bpf/headers</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">package</span> main</span><br></pre></td></tr></table></figure><h4 id="Makefile"><a href="#Makefile" class="headerlink" title="Makefile"></a>Makefile</h4><figure class="highlight makefile"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span 
class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line">GO := go</span><br><span class="line">GO_BUILD = CGO_ENABLED=0 <span class="variable">$(GO)</span> build</span><br><span class="line">GO_GENERATE = <span class="variable">$(GO)</span> generate</span><br><span class="line">GO_TAGS ?=</span><br><span class="line">TARGET=BPFDemo</span><br><span class="line">BINDIR ?= /usr/local/bin</span><br><span class="line">VERSION=<span class="variable">$(<span class="built_in">shell</span> git describe --tags --always)</span></span><br><span class="line"></span><br><span class="line"><span class="variable">$(TARGET)</span>:</span><br><span class="line">&#9;<span class="variable">$(GO_GENERATE)</span></span><br><span class="line">&#9;<span class="variable">$(GO_BUILD)</span> <span class="variable">$(<span class="built_in">if</span> <span class="variable">$(GO_TAGS)</span>,-tags <span class="variable">$(GO_TAGS)</span>)</span> \</span><br><span class="line">-ldflags <span class="string">"-w -s \</span></span><br><span class="line"><span class="string">-X 'github.com/SimpCosm/godemo/ebpf/BPFDemo.Version=$&#123;VERSION&#125;'"</span></span><br><span class="line"></span><br><span class="line"><span class="section">clean:</span></span><br><span class="line">&#9;rm -f <span class="variable">$(TARGET)</span></span><br><span class="line">&#9;rm -f bpfdemo_bpf*</span><br><span class="line">&#9;rm -rf ./release</span><br></pre></td></tr></table></figure><p>Running the build produces the BPF bytecode files <code>bpfdemo_bpfeb.o</code> and <code>bpfdemo_bpfel.o</code>, along with the corresponding Go files:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span 
class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line">[root@VM-4-27-centos demo]<span class="comment"># make</span></span><br><span class="line">go generate</span><br><span class="line">Generating <span class="keyword">for</span> amd64</span><br><span class="line">Compiled /root/demo/bpfdemo_bpfel.o</span><br><span class="line">Stripped /root/demo/bpfdemo_bpfel.o</span><br><span class="line">Wrote /root/demo/bpfdemo_bpfel.go</span><br><span class="line">Compiled /root/demo/bpfdemo_bpfeb.o</span><br><span class="line">Stripped /root/demo/bpfdemo_bpfeb.o</span><br><span class="line">Wrote /root/demo/bpfdemo_bpfeb.go</span><br><span class="line">CGO_ENABLED=0 go build  \</span><br><span class="line">-ldflags <span class="string">"-w -s \</span></span><br><span class="line"><span class="string">-X 'github.com/SimpCosm/godemo/ebpf/BPFDemo.Version='"</span></span><br><span class="line">[root@VM-4-27-centos demo]<span class="comment"># ls</span></span><br><span class="line">Dockerfile  bpf               bpfdemo_bpfeb.o   bpfdemo_bpfel.o  go.mod  main.go        main_arm64.go</span><br><span class="line">Makefile    bpfdemo_bpfeb.go  bpfdemo_bpfel.go  demo             go.sum  main_amd64.go</span><br></pre></td></tr></table></figure><h3 id="加载代码"><a href="#加载代码" class="headerlink" title="加载代码"></a>Loading the Code</h3><p>In our Go code, the first step is to load the compiled eBPF code into the kernel by calling <code>LoadBPFDemoObjects</code>:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span 
class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// Load pre-compiled programs and maps into the kernel.</span></span><br><span class="line">objs := BPFDemoObjects&#123;&#125;</span><br><span class="line"><span class="keyword">if</span> err := LoadBPFDemoObjects(&amp;objs, <span class="literal">nil</span>); err != <span class="literal">nil</span> &#123;</span><br><span class="line">log.Fatalf(<span class="string">"loading objects: %v"</span>, err)</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">defer</span> objs.Close()</span><br></pre></td></tr></table></figure><p>Both <code>LoadBPFDemoObjects</code> and <code>BPFDemoObjects</code> come from the code generated by <code>bpf2go</code>.</p><p>Taking <code>bpfdemo_bpfeb.go</code> as an example, the generated file contains a number of helper functions and structs, including:</p><ul><li>BPFDemoObjects holds the BPF programs and BPF maps.</li><li>LoadBPFDemoObjects calls <code>LoadBPFDemo</code> to read the compiled ELF-format BPF code into memory, then calls <code>LoadAndAssign</code>, which issues the BPF system calls that load the program into the kernel.</li></ul><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span 
class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// BPFDemoMaps contains all maps after they have been loaded into the kernel.</span></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="comment">// It can be passed to LoadBPFDemoObjects or ebpf.CollectionSpec.LoadAndAssign.</span></span><br><span class="line"><span class="keyword">type</span> BPFDemoMaps <span class="keyword">struct</span> &#123;</span><br><span class="line">        KprobeMap *ebpf.Map <span class="string">`ebpf:"kprobe_map"`</span></span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// BPFDemoPrograms contains all programs after they have been loaded into the kernel.</span></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="comment">// It can be passed to LoadBPFDemoObjects or ebpf.CollectionSpec.LoadAndAssign.</span></span><br><span class="line"><span class="keyword">type</span> BPFDemoPrograms <span class="keyword">struct</span> &#123;</span><br><span class="line">        KprobeExecve *ebpf.Program <span class="string">`ebpf:"kprobe_execve"`</span></span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// BPFDemoObjects contains all objects after they have been loaded into the kernel.</span></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="comment">// It can be passed to LoadBPFDemoObjects or ebpf.CollectionSpec.LoadAndAssign.</span></span><br><span 
class="line"><span class="keyword">type</span> BPFDemoObjects <span class="keyword">struct</span> &#123;</span><br><span class="line">        BPFDemoPrograms</span><br><span class="line">        BPFDemoMaps</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// LoadBPFDemoObjects loads BPFDemo and converts it into a struct.</span></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="comment">// The following types are suitable as obj argument:</span></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="comment">//     *BPFDemoObjects</span></span><br><span class="line"><span class="comment">//     *BPFDemoPrograms</span></span><br><span class="line"><span class="comment">//     *BPFDemoMaps</span></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="comment">// See ebpf.CollectionSpec.LoadAndAssign documentation for details.</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">LoadBPFDemoObjects</span><span class="params">(obj <span class="keyword">interface</span>&#123;&#125;, opts *ebpf.CollectionOptions)</span> <span class="title">error</span></span> &#123;</span><br><span class="line">        spec, err := LoadBPFDemo()</span><br><span class="line">        <span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">                <span class="keyword">return</span> err</span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span class="line">        <span class="keyword">return</span> spec.LoadAndAssign(obj, opts)</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Looking at <code>LoadAndAssign</code> itself, we can see that it loads the BPF programs and BPF maps into the kernel:</p><figure class="highlight go"><table><tr><td 
class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// LoadAndAssign loads Maps and Programs into the kernel and assigns them</span></span><br><span class="line"><span class="comment">// to a struct.</span></span><br><span class="line"><span class="comment">//    struct &#123;</span></span><br><span class="line"><span class="comment">//        Foo     *ebpf.Program `ebpf:"xdp_foo"`</span></span><br><span class="line"><span class="comment">//        Bar     *ebpf.Map     `ebpf:"bar_map"`</span></span><br><span class="line"><span class="comment">//        Ignored int</span></span><br><span class="line"><span class="comment">//    &#125;</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(cs *CollectionSpec)</span> <span class="title">LoadAndAssign</span><span class="params">(to <span class="keyword">interface</span>&#123;&#125;, opts *CollectionOptions)</span> <span class="title">error</span></span> &#123;</span><br><span class="line">loader := 
newCollectionLoader(cs, opts)</span><br><span class="line"><span class="keyword">defer</span> loader.cleanup()</span><br><span class="line"></span><br><span class="line"><span class="comment">// Support assigning Programs and Maps, lazy-loading the required objects.</span></span><br><span class="line">assignedMaps := <span class="built_in">make</span>(<span class="keyword">map</span>[<span class="keyword">string</span>]<span class="keyword">bool</span>)</span><br><span class="line">getValue := <span class="function"><span class="keyword">func</span><span class="params">(typ reflect.Type, name <span class="keyword">string</span>)</span> <span class="params">(<span class="keyword">interface</span>&#123;&#125;, error)</span></span> &#123;</span><br><span class="line"><span class="keyword">switch</span> typ &#123;</span><br><span class="line"></span><br><span class="line"><span class="keyword">case</span> reflect.TypeOf((*Program)(<span class="literal">nil</span>)):</span><br><span class="line"><span class="keyword">return</span> loader.loadProgram(name)</span><br><span class="line"></span><br><span class="line"><span class="keyword">case</span> reflect.TypeOf((*Map)(<span class="literal">nil</span>)):</span><br><span class="line">assignedMaps[name] = <span class="literal">true</span></span><br><span class="line"><span class="keyword">return</span> loader.loadMap(name)</span><br><span class="line"></span><br><span class="line"><span class="keyword">default</span>:</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span>, fmt.Errorf(<span class="string">"unsupported type %s"</span>, typ)</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line">  <span class="comment">//...</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Here, <code>loadProgram</code> calls <code>newProgramWithOptions</code>, which, after handling BTF and various other details, eventually calls 
<code>sys.ProgLoad(attr)</code>:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">newProgramWithOptions</span><span class="params">(spec *ProgramSpec, opts ProgramOptions, handles *handleCache)</span> <span class="params">(*Program, error)</span></span> &#123;</span><br><span class="line">    <span class="comment">// ...</span></span><br><span class="line">    fd, err := sys.ProgLoad(attr)</span><br><span class="line">  </span><br><span class="line">    <span class="comment">// ...</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>This in turn invokes the BPF system call:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">ProgLoad</span><span class="params">(attr *ProgLoadAttr)</span> <span class="params">(*FD, error)</span></span> &#123;</span><br><span class="line">fd, err := BPF(BPF_PROG_LOAD, unsafe.Pointer(attr), unsafe.Sizeof(*attr))</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span>, err</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">return</span> NewFD(<span class="keyword">int</span>(fd))</span><br><span 
class="line">&#125;</span><br></pre></td></tr></table></figure><p>加载 map 也是类似，最终调用了 <code>sys.MapCreate</code></p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">MapCreate</span><span class="params">(attr *MapCreateAttr)</span> <span class="params">(*FD, error)</span></span> &#123;</span><br><span class="line">fd, err := BPF(BPF_MAP_CREATE, unsafe.Pointer(attr), unsafe.Sizeof(*attr))</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span>, err</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">return</span> NewFD(<span class="keyword">int</span>(fd))</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h3 id="Kprobe-处理"><a href="#Kprobe-处理" class="headerlink" title="Kprobe 处理"></a>Kprobe 处理</h3><p>kprobe可以对任何内核函数进行插桩，可以实时在生产环境中启用，不需要重启系统，也不需要以特殊方式重启内核。<br>现在有以下三种接口可以访问kprobes.</p><ul><li>kprobe API: 如 <code>register_kprobe()</code> 等，在 <a href="https://houmin.cc/posts/c28dc60d/">这篇文章中</a> 介绍了其用法</li><li>基于Frace的，通过 <code>/sys/kernel/debug/tracing/kprobe_events</code>: 通过向这个文件写入字符串，可以配置开启和停止kprobes，在 <a href="https://houmin.cc/posts/3d106760/">这篇文章中</a> 介绍了其用法</li><li><code>perf_event_open()</code>: 与 perf 工具所使用的一样，现在BPF跟踪工具也开始使用这些函数</li></ul><p>对应到 <code>main.go</code> 中，在 <code>LoadBPFDemoObjects</code>之后，我们还调用了 <code>link.Kprobe</code> 来</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span 
class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// Open a Kprobe at the entry point of the kernel function and attach the</span></span><br><span class="line"><span class="comment">// pre-compiled program. Each time the kernel function enters, the program</span></span><br><span class="line"><span class="comment">// will increment the execution counter by 1. The read loop below polls this</span></span><br><span class="line"><span class="comment">// map value once per second.</span></span><br><span class="line">kp, err := link.Kprobe(fn, objs.KprobeExecve)</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">log.Fatalf(<span class="string">"opening kprobe: %s"</span>, err)</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">defer</span> kp.Close()</span><br></pre></td></tr></table></figure><h4 id="创建-kprobe类型的-perf-event"><a href="#创建-kprobe类型的-perf-event" class="headerlink" title="创建 kprobe类型的 perf event"></a>Creating a kprobe-type perf event</h4><ul><li><code>symbol</code> is the kernel function to trace</li><li><code>prog</code> is the compiled eBPF program</li></ul><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span 
class="title">Kprobe</span><span class="params">(symbol <span class="keyword">string</span>, prog *ebpf.Program, opts *KprobeOptions)</span> <span class="params">(Link, error)</span></span> &#123;</span><br><span class="line">   k, err := kprobe(symbol, prog, opts, <span class="literal">false</span>)</span><br><span class="line">   <span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">      <span class="keyword">return</span> <span class="literal">nil</span>, err</span><br><span class="line">   &#125;</span><br><span class="line"></span><br><span class="line">   lnk, err := attachPerfEvent(k, prog)</span><br><span class="line">   <span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">      k.Close()</span><br><span class="line">      <span class="keyword">return</span> <span class="literal">nil</span>, err</span><br><span class="line">   &#125;</span><br><span class="line"></span><br><span class="line">   <span class="keyword">return</span> lnk, <span class="literal">nil</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>This creates a perf event of type <code>kprobe</code>; the location to trace is given by <code>symbol</code>:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span 
class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// kprobe opens a perf event on the given symbol and attaches prog to it.</span></span><br><span class="line"><span class="comment">// If ret is true, create a kretprobe.</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">kprobe</span><span class="params">(symbol <span class="keyword">string</span>, prog *ebpf.Program, opts *KprobeOptions, ret <span class="keyword">bool</span>)</span> <span class="params">(*perfEvent, error)</span></span> &#123;</span><br><span class="line">  <span class="comment">// ...</span></span><br><span class="line">  </span><br><span class="line">args := probeArgs&#123;</span><br><span class="line">pid:    perfAllThreads,</span><br><span class="line">symbol: platformPrefix(symbol),</span><br><span class="line">ret:    ret,</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// Use kprobe PMU if the kernel has it available.</span></span><br><span class="line">tp, err := pmuKprobe(args)</span><br><span class="line"><span class="keyword">if</span> err == <span class="literal">nil</span> &#123;</span><br><span class="line"><span class="keyword">return</span> tp, <span class="literal">nil</span></span><br><span class="line">&#125;</span><br><span class="line">  </span><br><span class="line">  <span class="comment">// ... 
</span></span><br><span class="line">  </span><br><span class="line"><span class="comment">// Use tracefs if kprobe PMU is missing.</span></span><br><span class="line">args.symbol = platformPrefix(symbol)</span><br><span class="line">tp, err = tracefsKprobe(args)</span><br><span class="line">  <span class="comment">// ...</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> tp, <span class="literal">nil</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Finally, <code>PerfEventOpen</code> is called to open the perf event; see the <a href="https://man7.org/linux/man-pages/man2/perf_event_open.2.html" target="_blank" rel="external nofollow noopener noreferrer">man page</a> for this system call:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br></pre></td><td class="code"><pre><span 
class="line"><span class="comment">// pmuProbe opens a perf event based on a Performance Monitoring Unit.</span></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="comment">// Requires at least a 4.17 kernel.</span></span><br><span class="line"><span class="comment">// e12f03d7031a "perf/core: Implement the 'perf_kprobe' PMU"</span></span><br><span class="line"><span class="comment">// 33ea4b24277b "perf/core: Implement the 'perf_uprobe' PMU"</span></span><br><span class="line"><span class="comment">//</span></span><br><span class="line"><span class="comment">// Returns ErrNotSupported if the kernel doesn't support perf_[k,u]probe PMU</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">pmuProbe</span><span class="params">(typ probeType, args probeArgs)</span> <span class="params">(*perfEvent, error)</span></span> &#123;</span><br><span class="line">  <span class="comment">// ...</span></span><br><span class="line"><span class="keyword">switch</span> typ &#123;</span><br><span class="line"><span class="keyword">case</span> kprobeType:</span><br><span class="line"><span class="comment">// Create a pointer to a NUL-terminated string for the kernel.</span></span><br><span class="line">sp, err = unsafeStringPtr(args.symbol)</span><br><span class="line"></span><br><span class="line">attr = unix.PerfEventAttr&#123;</span><br><span class="line">Type:   <span class="keyword">uint32</span>(et),          <span class="comment">// PMU event type read from sysfs</span></span><br><span class="line">Ext1:   <span class="keyword">uint64</span>(<span class="keyword">uintptr</span>(sp)), <span class="comment">// Kernel symbol to trace</span></span><br><span class="line">Config: config,              <span class="comment">// Retprobe flag</span></span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">case</span> uprobeType:</span><br><span 
class="line">    <span class="comment">// ...</span></span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">rawFd, err := unix.PerfEventOpen(&amp;attr, args.pid, <span class="number">0</span>, <span class="number">-1</span>, unix.PERF_FLAG_FD_CLOEXEC)</span><br><span class="line">fd, err := sys.NewFD(rawFd)</span><br><span class="line">  </span><br><span class="line">  <span class="comment">// ...</span></span><br><span class="line"><span class="comment">// Kernel has perf_[k,u]probe PMU available, initialize perf event.</span></span><br><span class="line"><span class="keyword">return</span> &amp;perfEvent&#123;</span><br><span class="line">typ:    typ.PerfEventType(args.ret),</span><br><span class="line">name:   args.symbol,</span><br><span class="line">pmuID:  et,</span><br><span class="line">cookie: args.cookie,</span><br><span class="line">fd:     fd,</span><br><span class="line">&#125;, <span class="literal">nil</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h4 id="挂载-eBPF-程序到-perf-event"><a href="#挂载-eBPF-程序到-perf-event" class="headerlink" title="挂载 eBPF 程序到 perf event"></a>Attaching the eBPF program to the perf event</h4><p>The BPF program is attached to the kprobe event through ioctl calls on the perf event file descriptor:</p><ul><li><code>PERF_EVENT_IOC_SET_BPF</code> attaches a BPF program to the kprobe event; the third ioctl argument is the file descriptor of the BPF program, as returned by the bpf system call.</li><li><code>PERF_EVENT_IOC_ENABLE</code> enables the event.</li></ul><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">ioctl(perf_event_fd, PERF_EVENT_IOC_SET_BPF, bpf_prog_fd);</span><br><span class="line">ioctl(perf_event_fd, PERF_EVENT_IOC_ENABLE, <span class="number">0</span>);</span><br></pre></td></tr></table></figure><p><code>attachPerfEvent</code> attaches the BPF program to the perf event, using a BPF link when the kernel supports it and otherwise falling back to the ioctl calls above:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// attach the given eBPF prog to the perf event stored in pe.</span></span><br><span class="line"><span class="comment">// pe must contain a valid perf event fd.</span></span><br><span class="line"><span class="comment">// prog's type must match the program type stored in pe.</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">attachPerfEvent</span><span class="params">(pe *perfEvent, prog *ebpf.Program)</span> <span class="params">(Link, error)</span></span> &#123;</span><br><span class="line"><span class="keyword">if</span> prog == <span class="literal">nil</span> &#123;</span><br><span class="line"><span 
class="keyword">return</span> <span class="literal">nil</span>, errors.New(<span class="string">"cannot attach a nil program"</span>)</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">if</span> prog.FD() &lt; <span class="number">0</span> &#123;</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span>, fmt.Errorf(<span class="string">"invalid program: %w"</span>, sys.ErrClosedFd)</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">switch</span> pe.typ &#123;</span><br><span class="line"><span class="keyword">case</span> kprobeEvent, kretprobeEvent, uprobeEvent, uretprobeEvent:</span><br><span class="line"><span class="keyword">if</span> t := prog.Type(); t != ebpf.Kprobe &#123;</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span>, fmt.Errorf(<span class="string">"invalid program type (expected %s): %s"</span>, ebpf.Kprobe, t)</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">case</span> tracepointEvent:</span><br><span class="line"><span class="keyword">if</span> t := prog.Type(); t != ebpf.TracePoint &#123;</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span>, fmt.Errorf(<span class="string">"invalid program type (expected %s): %s"</span>, ebpf.TracePoint, t)</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">default</span>:</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span>, fmt.Errorf(<span class="string">"unknown perf event type: %d"</span>, pe.typ)</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> err := haveBPFLinkPerfEvent(); err == <span class="literal">nil</span> &#123;</span><br><span class="line">lnk, err := attachPerfEventLink(pe, 
prog)</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span>, err</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">return</span> lnk, <span class="literal">nil</span></span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">lnk, err := attachPerfEventIoctl(pe, prog)</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span>, err</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> lnk, <span class="literal">nil</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Attaching the BPF program via ioctl:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">attachPerfEventIoctl</span><span class="params">(pe *perfEvent, prog *ebpf.Program)</span> <span class="params">(*perfEventIoctl, 
error)</span></span> &#123;</span><br><span class="line"><span class="keyword">if</span> pe.cookie != <span class="number">0</span> &#123;</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span>, fmt.Errorf(<span class="string">"cookies are not supported: %w"</span>, ErrNotSupported)</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// Assign the eBPF program to the perf event.</span></span><br><span class="line">err := unix.IoctlSetInt(pe.fd.Int(), unix.PERF_EVENT_IOC_SET_BPF, prog.FD())</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span>, fmt.Errorf(<span class="string">"setting perf event bpf program: %w"</span>, err)</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// PERF_EVENT_IOC_ENABLE and _DISABLE ignore their given values.</span></span><br><span class="line"><span class="keyword">if</span> err := unix.IoctlSetInt(pe.fd.Int(), unix.PERF_EVENT_IOC_ENABLE, <span class="number">0</span>); err != <span class="literal">nil</span> &#123;</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span>, fmt.Errorf(<span class="string">"enable perf event: %s"</span>, err)</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">pi := &amp;perfEventIoctl&#123;pe&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// Close the perf event when its reference is lost to avoid leaking system resources.</span></span><br><span class="line">runtime.SetFinalizer(pi, (*perfEventIoctl).Close)</span><br><span class="line"><span class="keyword">return</span> pi, <span class="literal">nil</span></span><br><span 
class="line">&#125;</span><br></pre></td></tr></table></figure><h3 id="查看-Map-信息"><a href="#查看-Map-信息" class="headerlink" title="查看 Map 信息"></a>查看 Map 信息</h3><p>定期查看 eBPF map 的更新：</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// Read loop reporting the total amount of times the kernel</span></span><br><span class="line"><span class="comment">// function was entered, once per second.</span></span><br><span class="line">ticker := time.NewTicker(<span class="number">1</span> * time.Second)</span><br><span class="line"></span><br><span class="line">log.Println(<span class="string">"Waiting for events.."</span>)</span><br><span class="line"></span><br><span class="line"><span class="keyword">for</span> <span class="keyword">range</span> ticker.C &#123;</span><br><span class="line"><span class="keyword">var</span> value <span class="keyword">uint64</span></span><br><span class="line"><span class="keyword">if</span> err := objs.KprobeMap.Lookup(mapKey, &amp;value); err != <span class="literal">nil</span> &#123;</span><br><span class="line">log.Fatalf(<span class="string">"reading map: %v"</span>, err)</span><br><span class="line">&#125;</span><br><span class="line">log.Printf(<span class="string">"%s called %d times\n"</span>, fn, value)</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h3 id="容器镜像"><a href="#容器镜像" class="headerlink" title="容器镜像"></a>容器镜像</h3><figure class="highlight dockerfile"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">FROM</span> ubuntu:<span class="number">20.04</span></span><br><span class="line"><span class="keyword">RUN</span><span class="bash"> apt update -y -q</span></span><br><span class="line"><span class="keyword">RUN</span><span class="bash"> DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y -q curl build-essential ca-certificates</span></span><br><span class="line"><span class="keyword">RUN</span><span class="bash"> curl -s https://storage.googleapis.com/golang/go1.16.3.linux-amd64.tar.gz| tar -v -C /usr/<span class="built_in">local</span> -xz</span></span><br><span class="line"><span class="keyword">ENV</span> PATH $PATH:/usr/local/go/bin</span><br><span class="line"><span class="keyword">RUN</span><span class="bash"> apt install -y wget gnupg2</span></span><br><span class="line"><span class="keyword">RUN</span><span class="bash"> <span class="built_in">printf</span> <span class="string">"deb http://apt.llvm.org/xenial/ llvm-toolchain-xenial-12 main"</span> | tee /etc/apt/sources.list.d/llvm-toolchain-xenial-12.list</span></span><br><span class="line"><span class="keyword">RUN</span><span class="bash"> wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | apt-key add -</span></span><br><span class="line"><span class="keyword">RUN</span><span class="bash"> apt -y update</span></span><br><span class="line"><span class="keyword">RUN</span><span class="bash"> apt install -y llvm clang 
git</span></span><br><span class="line"><span class="keyword">WORKDIR</span><span class="bash"> /ebpf</span></span><br><span class="line"><span class="keyword">COPY</span><span class="bash"> . .</span></span><br><span class="line"><span class="keyword">RUN</span><span class="bash"> make</span></span><br><span class="line"><span class="keyword">RUN</span><span class="bash"> chmod a+x /ebpf</span></span><br><span class="line"><span class="keyword">ENTRYPOINT</span><span class="bash"> [<span class="string">"./ebpf"</span>]</span></span><br><span class="line"><span class="keyword">CMD</span><span class="bash"> [<span class="string">"./ebpf"</span>]</span></span><br></pre></td></tr></table></figure><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>References</h2><ul><li><a href="https://networkop.co.uk/post/2021-03-ebpf-intro/" target="_blank" rel="external nofollow noopener noreferrer">https://networkop.co.uk/post/2021-03-ebpf-intro/</a></li><li><a href="https://www.grant.pizza/blog/vmlinux-header/" target="_blank" rel="external nofollow noopener noreferrer">https://www.grant.pizza/blog/vmlinux-header/</a></li><li><a href="https://www.ebpf.top/post/ebpf_go_translation/" target="_blank" rel="external nofollow noopener noreferrer">https://www.ebpf.top/post/ebpf_go_translation/</a></li><li><a href="https://tinylab.org/bcc-overview/" target="_blank" rel="external nofollow noopener noreferrer">https://tinylab.org/bcc-overview/</a></li><li><a href="https://www.brendangregg.com/blog/2021-06-15/bpf-internals.html" target="_blank" rel="external nofollow noopener noreferrer">https://www.brendangregg.com/blog/2021-06-15/bpf-internals.html</a></li></ul>]]></content>
    
    <summary type="html">
    
      &lt;p&gt;The post &lt;a href=&quot;https://houmin.cc/posts/2c811c2c/&quot;&gt;Introduction to eBPF&lt;/a&gt; covered developing and loading eBPF code based on the kernel source tree. This post covers developing eBPF programs with Go and the corresponding libraries; all the code involved can be found on my &lt;a href=&quot;https://github.com/SimpCosm/godemo/tree/master/ebpf&quot; target=&quot;_blank&quot; rel=&quot;external nofollow noopener noreferrer&quot;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
    
    </summary>
    
    <content src="https://networkop.co.uk/img/xdp-xconnect.png" type="image" />
    
    
      <category term="术业专攻" scheme="https://houmin.cc/categories/%E6%9C%AF%E4%B8%9A%E4%B8%93%E6%94%BB/"/>
    
    
      <category term="Go" scheme="https://houmin.cc/tags/Go/"/>
    
      <category term="BPF" scheme="https://houmin.cc/tags/BPF/"/>
    
  </entry>
  
  <entry>
    <title>eBPF Map Operations</title>
    <link href="https://houmin.cc/posts/98a3c8ff/"/>
    <id>https://houmin.cc/posts/98a3c8ff/</id>
    <published>2021-03-28T06:25:32.000Z</published>
    <updated>2022-11-09T15:13:45.394Z</updated>
    
    <content type="html"><![CDATA[<link rel="stylesheet" class="aplayer-secondary-style-marker" href="/assets/css/APlayer.min.css"><script src="/assets/js/APlayer.min.js" class="aplayer-secondary-script-marker"></script><script class="meting-secondary-script-marker" src="/assets/js/Meting.min.js"></script><p>eBPF Maps are the bridge for exchanging data and passing information between user space and kernel space. They store data in the kernel as <code>key/value</code> pairs and can be accessed by any BPF program that knows about them. The kernel creates the BPF Map and returns the corresponding <strong>file descriptor</strong>, which a program running in user space can then use to access and operate on the map. eBPF Maps support many data structure types, which were briefly introduced in the <a href="https://houmin.cc/posts/2c811c2c/">previous post</a>; this post demonstrates how to use them through code examples, all of which can be found on my <a href="https://github.com/SimpCosm/cake" target="_blank" rel="external nofollow noopener noreferrer">GitHub</a>.</p><a id="more"></a><h2 id="创建BPF-Map"><a href="#创建BPF-Map" class="headerlink" title="创建BPF Map"></a>Creating a BPF Map</h2><p>BPF Maps were originally created by calling the <code>bpf</code> system call directly, passing <code>BPF_MAP_CREATE</code> as the first argument. 
This was covered in the <a href="https://houmin.cc/posts/2c811c2c/">previous post</a>, so it is not repeated in detail here.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">union</span> bpf_attr my_map_attr = &#123;</span><br><span class="line">  .map_type = BPF_MAP_TYPE_ARRAY,</span><br><span class="line">  .key_size = <span class="keyword">sizeof</span>(<span class="keyword">int</span>),</span><br><span class="line">  .value_size = <span class="keyword">sizeof</span>(<span class="keyword">int</span>),</span><br><span class="line">  .max_entries = <span class="number">1024</span>,</span><br><span class="line">  .map_flags = BPF_F_NO_PREALLOC,</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span 
class="keyword">int</span> fd = bpf(BPF_MAP_CREATE, &amp;my_map_attr, <span class="keyword">sizeof</span>(my_map_attr));</span><br></pre></td></tr></table></figure><p>Rather than calling the <code>bpf</code> system call directly, in practice BPF Maps are usually created with the <code>SEC(&quot;maps&quot;)</code> syntactic sugar, where declaring the map is enough to have it created:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="function">struct bpf_map_def <span class="title">SEC</span><span class="params">(<span class="string">"maps"</span>)</span> my_bpf_map </span>= &#123;</span><br><span class="line">  .type       = BPF_MAP_TYPE_HASH, </span><br><span class="line">  .key_size   = <span class="keyword">sizeof</span>(<span class="keyword">int</span>),</span><br><span class="line">  .value_size   = <span class="keyword">sizeof</span>(<span class="keyword">int</span>),</span><br><span class="line">  .max_entries = <span class="number">100</span>,</span><br><span class="line">  .map_flags   = BPF_F_NO_PREALLOC,</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><p>The key is <code>SEC(&quot;maps&quot;)</code>, an <code>ELF convention</code>. 
It works as follows:</p><ul><li>Declare the ELF section attribute <code>SEC(&quot;maps&quot;)</code></li><li><a href="https://elixir.bootlin.com/linux/v4.15/source/samples/bpf/bpf_load.c" target="_blank" rel="external nofollow noopener noreferrer"><code>bpf_load.c</code></a>, from the kernel source tree, scans all the sections defined in the object file, including the <code>SEC(&quot;maps&quot;)</code> section used to create BPF Maps. The <a href="https://elixir.bootlin.com/linux/v4.15/source/samples/bpf/bpf_load.h#L41" target="_blank" rel="external nofollow noopener noreferrer">relevant code</a> documents this:</li></ul><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 
class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// https://elixir.bootlin.com/linux/v4.15/source/samples/bpf/bpf_load.h#L41</span></span><br><span class="line"><span class="comment">/* parses elf file compiled by llvm .c-&gt;.o</span></span><br><span class="line"><span class="comment"> * . parses 'maps' section and creates maps via BPF syscall // this is the step in question</span></span><br><span class="line"><span class="comment"> * . parses 'license' section and passes it to syscall</span></span><br><span class="line"><span class="comment"> * . parses elf relocations for BPF maps and adjusts BPF_LD_IMM64 insns by</span></span><br><span class="line"><span class="comment"> *   storing map_fd into insn-&gt;imm and marking such insns as BPF_PSEUDO_MAP_FD</span></span><br><span class="line"><span class="comment"> * . 
loads eBPF programs via BPF syscall</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * One ELF file can contain multiple BPF programs which will be loaded</span></span><br><span class="line"><span class="comment"> * and their FDs stored stored in prog_fd array</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * returns zero on success</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">load_bpf_file</span><span class="params">(<span class="keyword">char</span> *path)</span></span>;</span><br></pre></td></tr></table></figure><ul><li>Once <a href="https://elixir.bootlin.com/linux/v4.15/source/samples/bpf/bpf_load.c" target="_blank" rel="external nofollow noopener noreferrer"><code>bpf_load.c</code></a> has found <code>SEC(&quot;maps&quot;)</code>, the BPF Map handling is done by the <a href="https://elixir.bootlin.com/linux/v4.15/source/samples/bpf/bpf_load.c#L212" target="_blank" rel="external nofollow noopener noreferrer"><code>load_maps</code></a> function, in which <a href="https://elixir.bootlin.com/linux/v4.15/source/tools/lib/bpf/bpf.c#L62" target="_blank" rel="external nofollow noopener noreferrer"><code>bpf_create_map_node()</code></a> and <a href="https://elixir.bootlin.com/linux/v4.15/source/tools/lib/bpf/bpf.c#L101" target="_blank" rel="external nofollow noopener noreferrer"><code>bpf_create_map_in_map_node()</code></a> are the key functions that create BPF Maps.</li><li>Both are defined in <a href="https://elixir.bootlin.com/linux/v4.15/source/tools/lib/bpf/bpf.c" target="_blank" rel="external nofollow noopener noreferrer">tools/lib/bpf/bpf.c</a> in the kernel source tree, and <a href="https://elixir.bootlin.com/linux/v4.15/source/tools/lib/bpf/bpf.c#L83" target="_blank" rel="external nofollow noopener 
noreferrer">this function</a> issues the system call with the <code>BPF_MAP_CREATE</code> command mentioned above.</li><li>Finally, when the program is compiled, <code>bpf_load.o</code> is added as a dependency and linked into the final executable, so that at runtime simply declaring <code>SEC(&quot;maps&quot;)</code> is enough to create the BPF Map.</li></ul><p>As the walkthrough above shows, even though this simplified form uses syntactic sugar, it still ends up making the bpf() system call.</p><h2 id="数据结构"><a href="#数据结构" class="headerlink" title="数据结构"></a>Data Structures</h2><p>This section introduces several common eBPF Map data structures, including their use cases and how to use them.</p><h3 id="Hash-Table"><a href="#Hash-Table" class="headerlink" title="Hash Table"></a>Hash Table</h3><p>For an eBPF Map of type <code>BPF_MAP_TYPE_HASH</code>, both the key and the value can be user-defined data structures. It is used as follows:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// define the struct for the key of bpf map</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">pair</span> &#123;</span></span><br><span class="line">  __u32 
src_ip;</span><br><span class="line">  __u32 dest_ip;</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">stats</span> &#123;</span></span><br><span class="line">  __u64 tx_cnt; <span class="comment">// count of sent requests</span></span><br><span class="line">  __u64 rx_cnt; <span class="comment">// count of received requests</span></span><br><span class="line">  __u64 tx_bytes; <span class="comment">// bytes sent</span></span><br><span class="line">  __u64 rx_bytes; <span class="comment">// bytes received</span></span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span class="function">struct bpf_map_def <span class="title">SEC</span><span class="params">(<span class="string">"maps"</span>)</span> tracker_map </span>= &#123;</span><br><span class="line">    .type = BPF_MAP_TYPE_HASH,</span><br><span class="line">    .key_size = <span class="keyword">sizeof</span>(struct pair),</span><br><span class="line">    .value_size = <span class="keyword">sizeof</span>(struct stats),</span><br><span class="line">    .max_entries = <span class="number">2048</span>,</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">stats</span> *<span class="title">stats</span>, <span class="title">newstats</span> = &#123;</span><span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>&#125;;</span><br><span class="line"></span><br><span class="line">stats = bpf_map_lookup_elem(&amp;tracker_map, pair);</span><br><span class="line"><span class="keyword">if</span> (stats)</span><br><span class="line">&#123;</span><br><span class="line">    stats-&gt;rx_cnt++;</span><br><span class="line">   
 stats-&gt;rx_bytes += bytes;</span><br><span class="line">&#125; <span class="keyword">else</span> &#123;</span><br><span class="line">    newstats.rx_cnt = <span class="number">1</span>;</span><br><span class="line">    newstats.rx_bytes = bytes;</span><br><span class="line">    bpf_map_update_elem(&amp;tracker_map, pair, &amp;newstats, BPF_NOEXIST);</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h3 id="Array"><a href="#Array" class="headerlink" title="Array"></a>Array</h3><p>An eBPF Map of type <code>BPF_MAP_TYPE_ARRAY</code> has the following properties:</p><ul><li>The key is the array index and must be exactly 4 bytes.</li><li>When the Array is created, all of its elements are <code>pre-allocated</code> and initialized to 0.</li><li><code>map_delete_elem()</code> returns <code>EINVAL</code>, because elements of an Array cannot be deleted.</li><li><code>map_update_elem()</code> updates elements in a <code>non-atomic</code> way and provides no protection against concurrent access.</li></ul><p>An eBPF Map of type <code>BPF_MAP_TYPE_ARRAY</code> is mainly used in the following two scenarios:</p><ul><li><p>Global variables: allocate an Array with a single element, with key = 0 and a value that is a collection of global variables.</p></li><li><blockquote><p>aggregation of tracing events into fixed set of buckets</p></blockquote></li></ul><p>The following shows how to use <code>BPF_MAP_TYPE_ARRAY</code> to hold global variables:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span 
class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">globals</span> &#123;</span></span><br><span class="line">    u64 lat_ave;</span><br><span class="line">    u64 lat_sum;</span><br><span class="line">    u64 missed;</span><br><span class="line">    u64 max_lat;</span><br><span class="line">    <span class="keyword">int</span> num_samples;</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span class="function">struct bpf_map_def <span class="title">SEC</span><span class="params">(<span class="string">"maps"</span>)</span> global_map </span>= &#123;</span><br><span class="line">    .type = BPF_MAP_TYPE_ARRAY,</span><br><span class="line">    .key_size = <span class="keyword">sizeof</span>(<span class="keyword">int</span>),</span><br><span class="line">    .value_size = <span class="keyword">sizeof</span>(struct globals),</span><br><span class="line">    .max_entries = <span class="number">1</span>,</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">bpf_prog</span><span class="params">(struct bpf_context *ctx)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">    ...</span><br><span class="line">    <span class="keyword">int</span> ind = <span class="number">0</span>;</span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">globals</span> *<span class="title">g</span> = <span class="title">bpf_map_lookup_elem</span>(&amp;<span class="title">global_map</span>, &amp;<span class="title">ind</span>);</span></span><br><span class="line">    <span class="keyword">if</span> (!g)</span><br><span 
class="line">            <span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">    <span class="keyword">if</span> (g-&gt;lat_ave == <span class="number">0</span>) &#123;</span><br><span class="line">            g-&gt;num_samples++;</span><br><span class="line">            g-&gt;lat_sum += delta;</span><br><span class="line">            <span class="keyword">if</span> (g-&gt;num_samples &gt;= <span class="number">100</span>) &#123;</span><br><span class="line">                    g-&gt;lat_ave = g-&gt;lat_sum / g-&gt;num_samples;</span><br><span class="line">    ...</span><br></pre></td></tr></table></figure><h3 id="Prog-Array"><a href="#Prog-Array" class="headerlink" title="Prog Array"></a>Prog Array</h3><p>An eBPF Map of type <code>BPF_MAP_TYPE_PROG_ARRAY</code> is mainly used for tail calls. Performing a tail call involves two steps:</p><ul><li>Set up a map of type <code>BPF_MAP_TYPE_PROG_ARRAY</code>, which user space can populate with key/value pairs.</li><li>Call the helper <code>bpf_tail_call()</code>, shown below, passing it the two arguments listed here in addition to the context. The kernel inlines this helper call into a special BPF instruction. Currently, such a program array is write-only from the user-space side.<ul><li>A reference to the program array.</li><li>The key used to look up the map.</li></ul></li></ul><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">long</span> <span class="title">bpf_tail_call</span><span class="params">(<span class="keyword">void</span> *ctx, struct bpf_map *prog_array_map, u32 index)</span></span></span><br></pre></td></tr></table></figure><p>The kernel looks up the related BPF program from the passed file descriptor and atomically replaces the program pointer at the given map slot. If no entry is found for the given key, the kernel falls through and continues executing the instructions that follow <code>bpf_tail_call()</code>.</p><p>Tail calls are a powerful feature. They make it possible to:</p><ul><li><strong>Parse network headers in a structured way through tail calls</strong></li><li><strong>Atomically add or replace functionality at runtime</strong>, that is, dynamically change the execution behavior of a BPF program</li></ul><p>An example of using <code>BPF_MAP_TYPE_PROG_ARRAY</code> can be found in <code>samples/bpf</code>:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 
class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> &#123;</span></span><br><span class="line">__uint(type, BPF_MAP_TYPE_PROG_ARRAY);</span><br><span class="line">__uint(key_size, <span class="keyword">sizeof</span>(u32));</span><br><span class="line">__uint(value_size, <span class="keyword">sizeof</span>(u32));</span><br><span class="line">__uint(max_entries, <span class="number">8</span>);</span><br><span class="line">&#125; <span class="function">jmp_table <span class="title">SEC</span><span class="params">(<span class="string">".maps"</span>)</span></span>;</span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="meta-keyword">define</span> PARSE_VLAN 1</span></span><br><span 
class="line"><span class="meta">#<span class="meta-keyword">define</span> PARSE_MPLS 2</span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">define</span> PARSE_IP 3</span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">define</span> PARSE_IPV6 4</span></span><br><span class="line"></span><br><span class="line"><span class="comment">/* Protocol dispatch routine. It tail-calls next BPF program depending</span></span><br><span class="line"><span class="comment"> * on eth proto. Note, we could have used ...</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> *   bpf_tail_call(skb, &amp;jmp_table, proto);</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * ... but it would need large prog_array and cannot be optimised given</span></span><br><span class="line"><span class="comment"> * the map key is not static.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="function"><span class="keyword">static</span> <span class="keyword">inline</span> <span class="keyword">void</span> <span class="title">parse_eth_proto</span><span class="params">(struct __sk_buff *skb, u32 proto)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line"><span class="keyword">switch</span> (proto) &#123;</span><br><span class="line"><span class="keyword">case</span> ETH_P_8021Q:</span><br><span class="line"><span class="keyword">case</span> ETH_P_8021AD:</span><br><span class="line">bpf_tail_call(skb, &amp;jmp_table, PARSE_VLAN);</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"><span class="keyword">case</span> ETH_P_MPLS_UC:</span><br><span class="line"><span class="keyword">case</span> ETH_P_MPLS_MC:</span><br><span class="line">bpf_tail_call(skb, 
&amp;jmp_table, PARSE_MPLS);</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"><span class="keyword">case</span> ETH_P_IP:</span><br><span class="line">bpf_tail_call(skb, &amp;jmp_table, PARSE_IP);</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line"><span class="keyword">case</span> ETH_P_IPV6:</span><br><span class="line">bpf_tail_call(skb, &amp;jmp_table, PARSE_IPV6);</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h3 id="Map-In-Map"><a href="#Map-In-Map" class="headerlink" title="Map In Map"></a>Map In Map</h3><p>eBPF provides two special Map types, <code>BPF_MAP_TYPE_ARRAY_OF_MAPS</code> and <code>BPF_MAP_TYPE_HASH_OF_MAPS</code>, which implement <code>map-in-map</code>: the value of each entry in the eBPF Map is itself a Map, as shown below:</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-04-08_ebpf-map-in-map.jpeg"></p><p>The difference between <code>BPF_MAP_TYPE_ARRAY_OF_MAPS</code> and <code>BPF_MAP_TYPE_HASH_OF_MAPS</code> is whether the <code>outer map</code> is an Array or a Hash Table.</p><h4 id="Create"><a href="#Create" class="headerlink" title="Create"></a>Create</h4><p>The regular eBPF Maps seen so far are created at <code>load time</code>. For <code>map-in-map</code>, we need to define an <code>outer map</code>, while each <code>inner map</code> is created by the user at <code>runtime</code> and inserted into the <code>outer map</code>. The <code>outer map</code> is defined as follows:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="function">struct bpf_map_def <span class="title">SEC</span><span class="params">(<span class="string">"maps"</span>)</span> outer_map </span>= &#123;</span><br><span class="line">    .type = 
BPF_MAP_TYPE_HASH_OF_MAPS,</span><br><span class="line">    .key_size = <span class="keyword">sizeof</span>(__u32),</span><br><span class="line">    .value_size = <span class="keyword">sizeof</span>(__u32), <span class="comment">// Must be u32 because it holds the inner map id</span></span><br><span class="line">    .max_entries = <span class="number">1</span>,</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><p>Note:</p><ul><li>The <code>value_size</code> of the <code>outer map</code> must be the size of a <code>__u32</code>, which is exactly the size of an <code>inner map id</code>.</li></ul><p>Although you do not need to define the <code>inner map</code> in the BPF C program, the <code>verifier</code> needs to know the definition of the <code>inner map</code> at <code>load time</code>. So, before calling <code>bpf_object__load</code>, you must create a <code>dummy inner map</code> and set its fd on the <code>outer map</code> by calling <code>bpf_map__set_inner_map_fd</code>. Note that the fd of the <code>dummy inner map</code> must be closed once the load has completed.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">const</span> <span class="keyword">char</span>* outer_map_name = <span class="string">"outer_map"</span>;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">bpf_map</span>* <span class="title">outer_map</span> = <span class="title">bpf_object__find_map_by_name</span>(<span class="title">obj</span>, <span class="title">outer_map_name</span>);</span></span><br><span class="line"><span class="keyword">int</span> inner_map_fd = bpf_create_map(</span><br><span class="line">    BPF_MAP_TYPE_HASH,  <span class="comment">// 
type</span></span><br><span class="line">    <span class="keyword">sizeof</span>(__u32),      <span class="comment">// key_size</span></span><br><span class="line">    <span class="keyword">sizeof</span>(__u32),      <span class="comment">// value_size</span></span><br><span class="line">    <span class="number">8</span>,                  <span class="comment">// max_entries</span></span><br><span class="line">    <span class="number">0</span>);                 <span class="comment">// flag</span></span><br><span class="line">bpf_map__set_inner_map_fd(outer_map, inner_map_fd);</span><br><span class="line">bpf_object__load(obj);</span><br><span class="line"><span class="built_in">close</span>(inner_map_fd); <span class="comment">// Important</span></span><br></pre></td></tr></table></figure><h4 id="Insert"><a href="#Insert" class="headerlink" title="Insert"></a>Insert</h4><h5 id="Insert-Into-Outer-Map"><a href="#Insert-Into-Outer-Map" class="headerlink" title="Insert Into Outer Map"></a>Insert Into Outer Map</h5><p>Inserting into the <code>outer map</code> takes the following steps:</p><ul><li>Create a new <code>inner map</code>.</li><li>Insert the fd of the newly created <code>inner map</code> into the <code>outer map</code> as the value.</li><li>Close the <code>inner map fd</code>.</li></ul><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">int</span> inner_map_fd = bpf_create_map_name(</span><br><span class="line">    BPF_MAP_TYPE_HASH,   <span class="comment">// type</span></span><br><span class="line">    <span class="string">"hechaol_inner_map"</span>, <span class="comment">// name</span></span><br><span class="line">    <span 
class="keyword">sizeof</span>(__u32),       <span class="comment">// key_size</span></span><br><span class="line">    <span class="keyword">sizeof</span>(__u32),       <span class="comment">// value_size</span></span><br><span class="line">    <span class="number">8</span>,                   <span class="comment">// max_entries</span></span><br><span class="line">    <span class="number">0</span>);                  <span class="comment">// flag</span></span><br><span class="line">__u32 outer_key = <span class="number">42</span>;</span><br><span class="line">bpf_map_update_elem(outer_map_fd, &amp;outer_key, &amp;inner_map_fd, <span class="number">0</span> <span class="comment">/* flag */</span>);</span><br><span class="line"><span class="built_in">close</span>(inner_map_fd); <span class="comment">// Important!</span></span><br></pre></td></tr></table></figure><p>Note:</p><ul><li>The value of each entry in the <code>outer map</code> is <code>the id of an inner map</code>, but the argument passed to the <code>bpf_map_update_elem</code> API is <code>the fd of the inner map</code>.</li><li>After the insertion you must close the <code>inner map fd</code> to avoid a memory leak.</li></ul><h5 id="Insert-Into-Inner-Map"><a href="#Insert-Into-Inner-Map" class="headerlink" title="Insert Into Inner Map"></a>Insert Into Inner Map</h5><p>As mentioned above, the value of each entry in the <code>outer map</code> is <code>the id of an inner map</code>, not <code>the fd of the inner map</code>. Even though the argument we pass to <code>bpf_map_update_elem</code> is the <code>inner map fd</code>, the value we get back from <code>bpf_map_lookup_elem</code> is the <code>inner map id</code>. To obtain the <code>inner map fd</code>, call <code>bpf_map_get_fd_by_id</code>. Once you have the <code>inner map fd</code>, you can operate on the <code>inner map</code> just as before.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span 
class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">const</span> __u32 outer_key = <span class="number">42</span>;</span><br><span class="line">__u32 inner_map_id;</span><br><span class="line">bpf_map_lookup_elem(outer_map_fd, &amp;outer_key, &amp;inner_map_id);</span><br><span class="line"><span class="keyword">int</span> inner_map_fd = bpf_map_get_fd_by_id(inner_map_id);</span><br><span class="line"><span class="keyword">const</span> __u32 inner_key = <span class="number">12</span>;</span><br><span class="line">__u32 inner_value;</span><br><span class="line">bpf_map_lookup_elem(inner_map_fd, &amp;inner_key, &amp;inner_value);</span><br><span class="line"><span class="comment">// ... Use inner_value;</span></span><br><span class="line"><span class="built_in">close</span>(inner_map_fd); <span class="comment">// Important!</span></span><br></pre></td></tr></table></figure><p>Note that every call to <code>bpf_map_get_fd_by_id</code> returns a new fd, which you must close after use to avoid a memory leak.</p><h4 id="Delete"><a href="#Delete" class="headerlink" title="Delete"></a>Delete</h4><p>Deleting an <code>inner map</code> works the same way as deleting from a regular Map: call <code>bpf_map_delete_elem</code> on the <code>outer map</code>:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">const</span> __u32 outer_key = <span class="number">42</span>;</span><br><span class="line">bpf_map_delete_elem(outer_map_fd, &amp;outer_key);</span><br></pre></td></tr></table></figure><h3 id="Perf-Event-Array"><a href="#Perf-Event-Array" class="headerlink" title="Perf Event Array"></a>Perf Event Array</h3><p>Sometimes we want an eBPF program to notify a user-space program that data is ready. Array and hash type eBPF maps do not fit this use case, and this is where <code>BPF_MAP_TYPE_PERF_EVENT_ARRAY</code> comes in. Unlike the ordinary hash and array types, it has no <code>bpf_map_lookup_elem()</code> method; instead, <code>bpf_perf_event_output()</code> is used to pass data to user space. Its <code>value_size</code> must be 
<code>sizeof(u32)</code>, which represents a perf_event file descriptor; <code>max_entries</code> is the number of perf_event file descriptors.</p><p>The relevant source code is as follows:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">msg</span> &#123;</span></span><br><span class="line">__s32 seq;</span><br><span class="line">__u64 cts;</span><br><span class="line">__u8 comm[MAX_LENGTH];</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span class="function">struct bpf_map_def <span class="title">SEC</span><span class="params">(<span class="string">"maps"</span>)</span> <span class="built_in">map</span> </span>= &#123;</span><br><span class="line">.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,</span><br><span class="line">.key_size = <span class="keyword">sizeof</span>(<span class="keyword">int</span>),</span><br><span class="line">.value_size = <span class="keyword">sizeof</span>(__u32),</span><br><span class="line">.max_entries = <span class="number">0</span>,</span><br><span class="line">&#125;;</span><br><span 
class="line"></span><br><span class="line">SEC(<span class="string">"kprobe/vfs_read"</span>)</span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">hello</span><span class="params">(struct pt_regs *ctx)</span> </span>&#123;</span><br><span class="line"><span class="keyword">unsigned</span> <span class="keyword">long</span> cts = bpf_ktime_get_ns();</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">msg</span> <span class="title">val</span> = &#123;</span><span class="number">0</span>&#125;;</span><br><span class="line"><span class="keyword">static</span> __u32 seq = <span class="number">0</span>;</span><br><span class="line"></span><br><span class="line">val.seq = seq = (seq + <span class="number">1</span>) % <span class="number">4294967295U</span>;</span><br><span class="line">val.cts = bpf_ktime_get_ns();</span><br><span class="line">bpf_get_current_comm(val.comm, <span class="keyword">sizeof</span>(val.comm));</span><br><span class="line"></span><br><span class="line">bpf_perf_event_output(ctx, &amp;<span class="built_in">map</span>, <span class="number">0</span>, &amp;val, <span class="keyword">sizeof</span>(val));</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><blockquote><p>Note:</p><ol><li>Here <code>seq</code> is the message sequence number.</li><li>If user space does not pass anything to the kernel side, <code>max_entries</code> in a PERF_EVENT_ARRAY map has no meaning. The data this map passes to user space is buffered in the perf ring buffer, while the map storage sized by <code>max_entries</code> holds perf_event file descriptors. If the user-space program does not pass perf_event file descriptors into the map, the value can be 0. A user-space program uses <code>bpf(BPF_MAP_UPDATE_ELEM)</code> to pass a file descriptor obtained from <code>sys_perf_event_open()</code> to the eBPF program, and the eBPF program then uses <code>bpf_perf_event_{read, read_value}()</code> to read the perf event behind that file descriptor. For related usage, see <a 
href="https://github.com/torvalds/linux/blob/v5.10/samples/bpf/tracex6_kern.c" target="_blank" rel="external nofollow noopener noreferrer">samples/bpf/tracex6_{user, kern}.c</a> in the Linux kernel tree.</li></ol></blockquote><p><a href="https://github.com/torvalds/linux/tree/v5.10/tools/lib/bpf" target="_blank" rel="external nofollow noopener noreferrer">libbpf</a> provides a ready-to-use user-space API for PERF_EVENT_ARRAY maps. It is built on top of epoll, and you only need to call <code>perf_buffer__new()</code> and <code>perf_buffer__poll()</code> to use it:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">static</span> <span class="keyword">void</span> <span class="title">print_bpf_output</span><span class="params">(<span class="keyword">void</span> *ctx, <span class="keyword">int</span> cpu, <span class="keyword">void</span> *data, __u32 <span class="built_in">size</span>)</span> </span>&#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">msg</span> *<span class="title">msg</span> = <span class="title">data</span>;</span></span><br><span class="line"></span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stdout</span>, <span 
class="string">"%.4f: @seq=%d @comm=%s\n"</span>,</span><br><span class="line"> (<span class="keyword">float</span>)msg-&gt;cts/<span class="number">1000000000u</span>l, msg-&gt;seq, msg-&gt;comm);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">main</span><span class="params">(<span class="keyword">int</span> argc, <span class="keyword">char</span> *argv[])</span> </span>&#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">perf_buffer_opts</span> <span class="title">pb_opts</span> = &#123;</span>&#125;;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">perf_buffer</span> *<span class="title">pb</span>;</span></span><br><span class="line">...</span><br><span class="line"></span><br><span class="line">pb_opts.sample_cb = print_bpf_output;</span><br><span class="line">pb = perf_buffer__new(map_fd, <span class="number">8</span>, &amp;pb_opts);</span><br><span class="line"></span><br><span class="line"><span class="keyword">while</span> (<span class="literal">true</span>) &#123;</span><br><span class="line">perf_buffer__poll(pb, <span class="number">1000</span>);</span><br><span class="line"><span class="keyword">if</span> (<span class="built_in">stop</span>)</span><br><span class="line"><span class="keyword">break</span>;</span><br><span class="line">&#125;</span><br><span class="line">...</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h2 id="实战入门"><a href="#实战入门" class="headerlink" title="实战入门"></a>Hands-on example</h2><p>With BPF maps we can now collect network packet information in kernel space, mainly the source and destination addresses, and display it in user space. The code has two parts:</p><ul><li>A program that runs in kernel space. It creates a custom BPF map, collects the target information, and stores it in that map.</li><li>A program that runs in user space. It reads the data from the BPF map created by the kernel-space program and prints it in a readable format, demonstrating how a BPF map passes data between the two.</li></ul><p>Note that this program is built and run in the BPF samples environment of the Linux kernel source tree. If you are not familiar with it, see <a 
href="https://houmin.cc/posts/2c811c2c/">the previous post</a>.</p><h3 id="内核空间"><a href="#内核空间" class="headerlink" title="内核空间"></a>Kernel space</h3><p>First, the sample code that runs in kernel space:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span 
class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">define</span> KBUILD_MODNAME <span class="meta-string">"foo"</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;uapi/linux/bpf.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;uapi/linux/if_ether.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span 
class="meta-string">&lt;uapi/linux/if_packet.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;uapi/linux/if_vlan.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;uapi/linux/ip.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;uapi/linux/in.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;uapi/linux/tcp.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;uapi/linux/udp.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">"bpf_helpers.h"</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">"bpf_endian.h"</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">"xdp_ip_tracker_common.h"</span></span></span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="meta-keyword">define</span> bpf_printk(fmt, ...)                       
\</span></span><br><span class="line">    (&#123;                                             \</span><br><span class="line">        <span class="keyword">char</span> ____fmt[] = fmt;                      \</span><br><span class="line">        bpf_trace_printk(____fmt, <span class="keyword">sizeof</span>(____fmt), \</span><br><span class="line">                         ##__VA_ARGS__);           \</span><br><span class="line">    &#125;)</span><br><span class="line"></span><br><span class="line"><span class="function">struct bpf_map_def <span class="title">SEC</span><span class="params">(<span class="string">"maps"</span>)</span> tracker_map </span>= &#123;</span><br><span class="line">    .type = BPF_MAP_TYPE_HASH,</span><br><span class="line">    .key_size = <span class="keyword">sizeof</span>(struct pair),</span><br><span class="line">    .value_size = <span class="keyword">sizeof</span>(struct stats),</span><br><span class="line">    .max_entries = <span class="number">2048</span>,</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">static</span> __always_inline <span class="keyword">bool</span> <span class="title">parse_and_track</span><span class="params">(<span class="keyword">bool</span> is_rx, <span class="keyword">void</span> *data_begin, <span class="keyword">void</span> *data_end, struct pair *pair)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">ethhdr</span> *<span class="title">eth</span> = <span class="title">data_begin</span>;</span></span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> ((<span class="keyword">void</span> *)(eth + <span class="number">1</span>) &gt; data_end)</span><br><span class="line">        <span class="keyword">return</span> <span 
class="literal">false</span>;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (eth-&gt;h_proto == bpf_htons(ETH_P_IP))</span><br><span class="line">    &#123;</span><br><span class="line">        <span class="class"><span class="keyword">struct</span> <span class="title">iphdr</span> *<span class="title">iph</span> = (<span class="title">struct</span> <span class="title">iphdr</span> *)(<span class="title">eth</span> + 1);</span></span><br><span class="line">        <span class="keyword">if</span> ((<span class="keyword">void</span> *)(iph + <span class="number">1</span>) &gt; data_end)</span><br><span class="line">            <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line"></span><br><span class="line">        pair-&gt;src_ip = is_rx ? iph-&gt;daddr : iph-&gt;saddr;</span><br><span class="line">        pair-&gt;dest_ip = is_rx ? iph-&gt;saddr : iph-&gt;daddr;</span><br><span class="line"></span><br><span class="line">        <span class="comment">// update the map for track</span></span><br><span class="line">        <span class="class"><span class="keyword">struct</span> <span class="title">stats</span> *<span class="title">stats</span>, <span class="title">newstats</span> = &#123;</span><span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>&#125;;</span><br><span class="line">        <span class="keyword">long</span> <span class="keyword">long</span> bytes = data_end - data_begin;</span><br><span class="line"></span><br><span class="line">        stats = bpf_map_lookup_elem(&amp;tracker_map, pair);</span><br><span class="line">        <span class="keyword">if</span> (stats)</span><br><span class="line">        &#123;</span><br><span class="line">            <span class="keyword">if</span> (is_rx)</span><br><span class="line">            &#123;</span><br><span class="line">                
stats-&gt;rx_cnt++;</span><br><span class="line">                stats-&gt;rx_bytes += bytes;</span><br><span class="line">            &#125;</span><br><span class="line">            <span class="keyword">else</span></span><br><span class="line">            &#123;</span><br><span class="line">                stats-&gt;tx_cnt++;</span><br><span class="line">                stats-&gt;tx_bytes += bytes;</span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span class="line">        <span class="keyword">else</span></span><br><span class="line">        &#123;</span><br><span class="line">            <span class="keyword">if</span> (is_rx)</span><br><span class="line">            &#123;</span><br><span class="line">                newstats.rx_cnt = <span class="number">1</span>;</span><br><span class="line">                newstats.rx_bytes = bytes;</span><br><span class="line">            &#125;</span><br><span class="line">            <span class="keyword">else</span></span><br><span class="line">            &#123;</span><br><span class="line">                newstats.tx_cnt = <span class="number">1</span>;</span><br><span class="line">                newstats.tx_bytes = bytes;</span><br><span class="line">            &#125;</span><br><span class="line">            bpf_map_update_elem(&amp;tracker_map, pair, &amp;newstats, BPF_NOEXIST);</span><br><span class="line">        &#125;</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">true</span>;</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">return</span> <span class="literal">false</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">SEC(<span class="string">"xdp_ip_tracker"</span>)</span><br><span class="line"><span class="keyword">int</span> _xdp_ip_tracker(struct xdp_md *ctx)</span><br><span class="line">&#123;</span><br><span 
class="line">    <span class="comment">// the struct to store the ip address as the keys of bpf map</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">pair</span> <span class="title">pair</span>;</span></span><br><span class="line"></span><br><span class="line">    bpf_printk(<span class="string">"starting xdp ip tracker...\n"</span>);</span><br><span class="line"></span><br><span class="line">    <span class="keyword">void</span> *data_end = (<span class="keyword">void</span> *)(<span class="keyword">long</span>)ctx-&gt;data_end;</span><br><span class="line">    <span class="keyword">void</span> *data = (<span class="keyword">void</span> *)(<span class="keyword">long</span>)ctx-&gt;data;</span><br><span class="line">    <span class="comment">// pass if the network packet is not ipv4</span></span><br><span class="line">    <span class="keyword">if</span> (!parse_and_track(<span class="literal">true</span>, data, data_end, &amp;pair))</span><br><span class="line">        <span class="keyword">return</span> XDP_PASS;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> XDP_DROP; <span class="comment">// NOTE: this drops every tracked IPv4 packet; return XDP_PASS here to only observe traffic</span></span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">char</span> _license[] SEC(<span class="string">"license"</span>) = <span class="string">"GPL"</span>;</span><br></pre></td></tr></table></figure><p>The key points of the kernel-space BPF program are:</p><ul><li><code>SEC(&quot;maps&quot;)</code> declares and creates a BPF map named <strong>tracker_map</strong> of type <code>BPF_MAP_TYPE_HASH</code>. Its key and value are both custom structs defined in the <code>xdp_ip_tracker_common.h</code> header, shown below:</li></ul><p><img alt="bpf-tracker-map" data-src="https://davidlovezoe.club/wordpress/wp-content/uploads/2020/08/DraggedImage-3-1024x770.png"></p><ul><li>The <code>parse_and_track</code> function parses and filters each packet. It combines the source and destination addresses into the BPF map key, and records the packet size in bytes together with a packet counter as the BPF map value.
For subsequent packets, if the generated key already exists the value is accumulated in place; otherwise a new key-value pair is inserted into the map. Elements are looked up with <code>bpf_map_lookup_elem()</code> and inserted with <code>bpf_map_update_elem()</code>.</li></ul><h3 id="用户空间"><a href="#用户空间" class="headerlink" title="用户空间"></a>User space</h3><p>Next, the sample code that runs in user space:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span 
class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span class="line">106</span><br><span class="line">107</span><br><span class="line">108</span><br><span class="line">109</span><br><span class="line">110</span><br><span class="line">111</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span 
class="meta-keyword">include</span> <span class="meta-string">&lt;linux/bpf.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;linux/if_link.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;assert.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;errno.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;signal.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;stdio.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;stdlib.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;string.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;sys/resource.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;arpa/inet.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;netinet/ether.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;unistd.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;time.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span 
class="meta-string">"bpf_load.h"</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;bpf/bpf.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">"bpf_util.h"</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">"xdp_ip_tracker_common.h"</span></span></span><br><span class="line"></span><br><span class="line"><span class="keyword">static</span> <span class="keyword">int</span> ifindex = <span class="number">6</span>; <span class="comment">// target network interface to attach, you can find it via `ip a`</span></span><br><span class="line"><span class="keyword">static</span> __u32 xdp_flags = <span class="number">0</span>;</span><br><span class="line"></span><br><span class="line"><span class="comment">// unlink the xdp program and exit</span></span><br><span class="line"><span class="function"><span class="keyword">static</span> <span class="keyword">void</span> <span class="title">int_exit</span><span class="params">(<span class="keyword">int</span> sig)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">    <span class="built_in">printf</span>(<span class="string">"stopping\n"</span>);</span><br><span class="line">    set_link_xdp_fd(ifindex, <span class="number">-1</span>, xdp_flags);</span><br><span class="line">    <span class="built_in">exit</span>(<span class="number">0</span>);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// An XDP program which track packets with IP address</span></span><br><span class="line"><span class="comment">// Usage: ./xdp_ip_tracker</span></span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">main</span><span 
class="params">(<span class="keyword">int</span> argc, <span class="keyword">char</span> **argv)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">    <span class="keyword">char</span> *filename = <span class="string">"xdp_ip_tracker_kern.o"</span>;</span><br><span class="line">    <span class="comment">// change limits</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">rlimit</span> <span class="title">r</span> = &#123;</span>RLIM_INFINITY, RLIM_INFINITY&#125;;</span><br><span class="line">    <span class="keyword">if</span> (setrlimit(RLIMIT_MEMLOCK, &amp;r))</span><br><span class="line">    &#123;</span><br><span class="line">        perror(<span class="string">"setrlimit(RLIMIT_MEMLOCK, RLIM_INFINITY)"</span>);</span><br><span class="line">        <span class="keyword">return</span> <span class="number">1</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// load the kernel bpf object file</span></span><br><span class="line">    <span class="keyword">if</span> (load_bpf_file(filename))</span><br><span class="line">    &#123;</span><br><span class="line">        <span class="built_in">printf</span>(<span class="string">"error - bpf_log_buf: %s"</span>, bpf_log_buf);</span><br><span class="line">        <span class="keyword">return</span> <span class="number">1</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// confirm the bpf prog fd is available</span></span><br><span class="line">    <span class="keyword">if</span> (!prog_fd[<span class="number">0</span>])</span><br><span class="line">    &#123;</span><br><span class="line">        <span class="built_in">printf</span>(<span class="string">"load_bpf_file: %s\n"</span>, strerror(errno));</span><br><span class="line">        
<span class="keyword">return</span> <span class="number">1</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// add signal handlers</span></span><br><span class="line">    signal(SIGINT, int_exit);</span><br><span class="line">    signal(SIGTERM, int_exit);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// link the xdp program to the network interface</span></span><br><span class="line">    <span class="keyword">if</span> (set_link_xdp_fd(ifindex, prog_fd[<span class="number">0</span>], xdp_flags) &lt; <span class="number">0</span>)</span><br><span class="line">    &#123;</span><br><span class="line">        <span class="built_in">printf</span>(<span class="string">"link set xdp fd failed\n"</span>);</span><br><span class="line">        <span class="keyword">return</span> <span class="number">1</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">int</span> result;</span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">pair</span> <span class="title">next_key</span>, <span class="title">lookup_key</span> = &#123;</span><span class="number">0</span>, <span class="number">0</span>&#125;;</span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">stats</span> <span class="title">value</span> = &#123;</span>&#125;;</span><br><span class="line">    <span class="keyword">while</span> (<span class="number">1</span>)</span><br><span class="line">    &#123;</span><br><span class="line">        sleep(<span class="number">2</span>);</span><br><span class="line">        <span class="comment">// retrieve the bpf map of statistics</span></span><br><span class="line">        <span class="keyword">while</span> (bpf_map_get_next_key(map_fd[<span class="number">0</span>], 
&amp;lookup_key, &amp;next_key) != <span class="number">-1</span>)</span><br><span class="line">        &#123;</span><br><span class="line">            <span class="comment">//printf("The local ip of next key in the map is: '%d'\n", next_key.src_ip);</span></span><br><span class="line">            <span class="comment">//printf("The remote ip of next key in the map is: '%d'\n", next_key.dest_ip);</span></span><br><span class="line">            <span class="class"><span class="keyword">struct</span> <span class="title">in_addr</span> <span class="title">local</span> = &#123;</span>next_key.src_ip&#125;;</span><br><span class="line">            <span class="class"><span class="keyword">struct</span> <span class="title">in_addr</span> <span class="title">remote</span> = &#123;</span>next_key.dest_ip&#125;;</span><br><span class="line">            <span class="built_in">printf</span>(<span class="string">"The local ip of next key in the map is: '%s'\n"</span>, inet_ntoa(local));</span><br><span class="line">            <span class="built_in">printf</span>(<span class="string">"The remote ip of next key in the map is: '%s'\n"</span>, inet_ntoa(remote));</span><br><span class="line">            </span><br><span class="line">            <span class="comment">// get the value via the key</span></span><br><span class="line">            <span class="comment">// <span class="doctag">TODO:</span> change to assert</span></span><br><span class="line">            <span class="comment">// assert(bpf_map_lookup_elem(map_fd[0], &amp;next_key, &amp;value) == 0)</span></span><br><span class="line">            result = bpf_map_lookup_elem(map_fd[<span class="number">0</span>], &amp;next_key, &amp;value);</span><br><span class="line">            <span class="keyword">if</span> (result == <span class="number">0</span>)</span><br><span class="line">            &#123;</span><br><span class="line">                <span class="comment">// print the value</span></span><br><span class="line">  
              <span class="built_in">printf</span>(<span class="string">"rx_cnt value read from the map: '%llu'\n"</span>, value.rx_cnt);</span><br><span class="line">                <span class="built_in">printf</span>(<span class="string">"rx_bytes value read from the map: '%llu'\n"</span>, value.rx_bytes);</span><br><span class="line">            &#125;</span><br><span class="line">            <span class="keyword">else</span></span><br><span class="line">            &#123;</span><br><span class="line">                <span class="built_in">printf</span>(<span class="string">"Failed to read value from the map: %d (%s)\n"</span>, result, strerror(errno));</span><br><span class="line">            &#125;</span><br><span class="line">            lookup_key = next_key;</span><br><span class="line">            <span class="built_in">printf</span>(<span class="string">"\n\n"</span>);</span><br><span class="line">        &#125;</span><br><span class="line">        <span class="built_in">printf</span>(<span class="string">"start a new loop...\n"</span>);</span><br><span class="line">        <span class="comment">// reset the lookup key for a fresh start</span></span><br><span class="line">        lookup_key.src_ip = <span class="number">0</span>;</span><br><span class="line">        lookup_key.dest_ip = <span class="number">0</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="built_in">printf</span>(<span class="string">"end\n"</span>);</span><br><span class="line">    <span class="comment">// unlink the xdp program</span></span><br><span class="line">    set_link_xdp_fd(ifindex, <span class="number">-1</span>, xdp_flags);</span><br><span class="line">    <span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><ul><li>The user-space code is structured like any ordinary C program, with <code>main</code> as its entry point. It first calls <a 
href="https://elixir.bootlin.com/linux/v4.15/source/samples/bpf/bpf_load.c#L606" target="_blank" rel="external nofollow noopener noreferrer"><code>load_bpf_file()</code></a> (which essentially issues the <code>bpf()</code> system call with the <code>BPF_PROG_LOAD</code> command) to load the <strong>.o</strong> file compiled from the kernel-space BPF program. Loading a BPF program programmatically is more flexible than the command-line tools used earlier and better suited to shipping a real product.</li><li>Once the BPF program is loaded, <code>set_link_xdp_fd()</code> attaches it to the target hook; as the function name suggests, this is the XDP network hook. Its two main arguments are:<ul><li><code>ifindex</code>: the index of the target network interface (see <code>ip a</code>). Here it is 6, which corresponds to the veth virtual network device of a Docker container;</li><li><code>prog_fd[0]</code>: the file descriptor returned when the BPF program was loaded into the kernel.</li></ul></li><li>Two seemingly magic variables, <code>prog_fd</code> and <code>map_fd</code>, deserve an explanation:<ul><li>Both are global variables defined in <a href="https://elixir.bootlin.com/linux/v4.15/source/samples/bpf/bpf_load.c#L38" target="_blank" rel="external nofollow noopener noreferrer"><code>bpf_load.c</code></a>;</li><li><code>prog_fd</code> is an array. When a kernel-space BPF program is loaded, its fd is <a href="https://elixir.bootlin.com/linux/v4.15/source/samples/bpf/bpf_load.c#L111" target="_blank" rel="external nofollow noopener noreferrer">appended to this array</a> as soon as it is created;</li><li><code>map_fd</code> is also an array. When the <a href="https://elixir.bootlin.com/linux/v4.15/source/samples/bpf/bpf_load.c#L212" target="_blank" rel="external nofollow noopener noreferrer"><code>load_maps()</code></a> function mentioned above runs, each fd returned by the map-creation system call is likewise <a href="https://elixir.bootlin.com/linux/v4.15/source/samples/bpf/bpf_load.c#L242" target="_blank" rel="external nofollow noopener noreferrer">appended to this array</a>. Programs under the bpf samples directory can therefore use these two variables directly as references to the BPF programs and BPF maps.</li></ul></li><li>Line 71 of the code starts an infinite loop that reads the target BPF map every 2 seconds. The read is driven by <code>bpf_map_get_next_key(map_fd[0], &amp;lookup_key, &amp;next_key)</code>: <code>map_fd[0]</code> is the target BPF map; <code>lookup_key</code> is the key to search for, supplied by the caller; and <code>next_key</code> is the key that follows it, filled in by the call. To iterate from the beginning, pass a key that is guaranteed not to exist as <code>lookup_key</code> (the all-zero pair in this code), and <code>next_key</code> is set to the first key in the BPF map.
Once a key is known, its value can be read. When <code>bpf_map_get_next_key()</code> returns -1, there is no further key to assign to <code>next_key</code> and the traversal is complete, so the function behaves much like an iterator.<br>With these two nested loops the program keeps traversing the BPF map and printing its contents, so information about newly arrived packets shows up promptly.</li></ul><p><img alt data-src="https://davidlovezoe.club/wordpress/wp-content/uploads/2020/08/DraggedImage-1-1.png"></p><p>One more piece of code may look unfamiliar:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">rlimit</span> <span class="title">r</span> = &#123;</span>RLIM_INFINITY, RLIM_INFINITY&#125;;</span><br><span class="line"><span class="keyword">if</span> (setrlimit(RLIMIT_MEMLOCK, &amp;r))</span><br><span class="line">&#123;</span><br><span class="line">   perror(<span class="string">"setrlimit(RLIMIT_MEMLOCK, RLIM_INFINITY)"</span>);</span><br><span class="line">   <span class="keyword">return</span> <span class="number">1</span>;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><ul><li><a href="https://elixir.bootlin.com/linux/v4.15/source/include/uapi/linux/resource.h#L43" target="_blank" rel="external nofollow noopener noreferrer"><code>rlimit</code></a> is a struct whose name stands for <strong>resource limit</strong>; it caps the resources a process may use.</li><li>The constant <code>RLIM_INFINITY</code> means <strong>unlimited</strong>, so the first line defines a limit with no upper bound, for both the soft and the hard value.</li><li>The second line calls <code>setrlimit()</code>. Its first argument names the resource, here <code>RLIMIT_MEMLOCK</code>, the amount of memory the process may lock into RAM; the second is the <strong>unlimited limit</strong> just defined. In other words, the call removes the cap on locked memory.</li><li>Why lift this limit? The default locked-memory limit is small and differs between systems, while the memory a BPF program needs also varies: a BPF map declared with a large <code>max_entries</code>, for example, consumes a lot of memory. On the kernels discussed here (before 5.11), BPF maps and programs are charged against <code>RLIMIT_MEMLOCK</code>, so the limit is raised to unlimited to make sure the BPF program can be loaded.</li></ul><h2 id="匿名-inode"><a href="#匿名-inode" class="headerlink" title="匿名 
inode"></a>Anonymous inode</h2><p>In the Unix/Linux world everything is a file, and BPF maps are no exception. As shown above, the data in a BPF map is accessed through a file descriptor (fd), so creating a BPF map follows the usual Linux file-creation path. The <code>BPF_MAP_CREATE</code> command of the bpf system call is implemented by <a href="https://elixir.bootlin.com/linux/v4.15/source/kernel/bpf/syscall.c#L383" target="_blank" rel="external nofollow noopener noreferrer"><code>map_create()</code></a>, the core function for creating a BPF map:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span 
class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">static</span> <span class="keyword">int</span> <span class="title">map_create</span><span class="params">(<span class="keyword">union</span> bpf_attr *attr)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">  <span class="keyword">int</span> numa_node = bpf_map_attr_numa_node(attr);</span><br><span class="line">  <span class="class"><span class="keyword">struct</span> <span class="title">bpf_map</span> *<span class="title">map</span>;</span></span><br><span class="line">  <span class="keyword">int</span> f_flags;</span><br><span class="line">  <span class="keyword">int</span> err;</span><br><span class="line"></span><br><span class="line">  err = CHECK_ATTR(BPF_MAP_CREATE);</span><br><span class="line">  <span class="keyword">if</span> (err)</span><br><span class="line">    <span class="keyword">return</span> -EINVAL;</span><br><span class="line"></span><br><span class="line">  f_flags = bpf_get_file_flag(attr-&gt;map_flags);</span><br><span class="line">  <span class="keyword">if</span> (f_flags &lt; <span class="number">0</span>)</span><br><span class="line">    <span class="keyword">return</span> f_flags;</span><br><span class="line"></span><br><span class="line">  <span class="keyword">if</span> (numa_node != 
NUMA_NO_NODE &amp;&amp;</span><br><span class="line">      ((<span class="keyword">unsigned</span> <span class="keyword">int</span>)numa_node &gt;= nr_node_ids ||</span><br><span class="line">       !node_online(numa_node)))</span><br><span class="line">    <span class="keyword">return</span> -EINVAL;</span><br><span class="line"></span><br><span class="line">  <span class="comment">/* find map type and init map: hashtable vs rbtree vs bloom vs ... */</span></span><br><span class="line">  <span class="built_in">map</span> = find_and_alloc_map(attr);</span><br><span class="line">  <span class="keyword">if</span> (IS_ERR(<span class="built_in">map</span>))</span><br><span class="line">    <span class="keyword">return</span> PTR_ERR(<span class="built_in">map</span>);</span><br><span class="line"></span><br><span class="line">  err = bpf_obj_name_cpy(<span class="built_in">map</span>-&gt;name, attr-&gt;map_name);</span><br><span class="line">  <span class="keyword">if</span> (err)</span><br><span class="line">    <span class="keyword">goto</span> free_map_nouncharge;</span><br><span class="line"></span><br><span class="line">  atomic_set(&amp;<span class="built_in">map</span>-&gt;refcnt, <span class="number">1</span>);</span><br><span class="line">  atomic_set(&amp;<span class="built_in">map</span>-&gt;usercnt, <span class="number">1</span>);</span><br><span class="line"></span><br><span class="line">  err = security_bpf_map_alloc(<span class="built_in">map</span>);</span><br><span class="line">  <span class="keyword">if</span> (err)</span><br><span class="line">    <span class="keyword">goto</span> free_map_nouncharge;</span><br><span class="line"></span><br><span class="line">  err = bpf_map_charge_memlock(<span class="built_in">map</span>);</span><br><span class="line">  <span class="keyword">if</span> (err)</span><br><span class="line">    <span class="keyword">goto</span> free_map_sec;</span><br><span class="line"></span><br><span class="line">  err = 
bpf_map_alloc_id(<span class="built_in">map</span>);</span><br><span class="line">  <span class="keyword">if</span> (err)</span><br><span class="line">    <span class="keyword">goto</span> free_map;</span><br><span class="line"></span><br><span class="line">  <span class="comment">// assign a fd for bpf map</span></span><br><span class="line">  err = bpf_map_new_fd(<span class="built_in">map</span>, f_flags);</span><br><span class="line">  <span class="keyword">if</span> (err &lt; <span class="number">0</span>) &#123;</span><br><span class="line">    <span class="comment">/* failed to allocate fd.</span></span><br><span class="line"><span class="comment">     * bpf_map_put() is needed because the above</span></span><br><span class="line"><span class="comment">     * bpf_map_alloc_id() has published the map</span></span><br><span class="line"><span class="comment">     * to the userspace and the userspace may</span></span><br><span class="line"><span class="comment">     * have refcnt-ed it through BPF_MAP_GET_FD_BY_ID.</span></span><br><span class="line"><span class="comment">     */</span></span><br><span class="line">    bpf_map_put(<span class="built_in">map</span>);</span><br><span class="line">    <span class="keyword">return</span> err;</span><br><span class="line">  &#125;</span><br><span class="line"></span><br><span class="line">  trace_bpf_map_create(<span class="built_in">map</span>, err);</span><br><span class="line">  <span class="keyword">return</span> err;</span><br><span class="line"></span><br><span class="line">free_map:</span><br><span class="line">  bpf_map_uncharge_memlock(<span class="built_in">map</span>);</span><br><span class="line">free_map_sec:</span><br><span class="line">  security_bpf_map_free(<span class="built_in">map</span>);</span><br><span class="line">free_map_nouncharge:</span><br><span class="line">  <span class="built_in">map</span>-&gt;ops-&gt;map_free(<span class="built_in">map</span>);</span><br><span class="line">  <span 
class="keyword">return</span> err;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Here <a href="https://elixir.bootlin.com/linux/v4.15/source/kernel/bpf/syscall.c#L327" target="_blank" rel="external nofollow noopener noreferrer"><code>bpf_map_new_fd()</code></a> is the function that allocates an fd for the BPF map. Its body is shown below, with the doc comment of <code>anon_inode_getfd()</code> quoted inline:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// https://elixir.bootlin.com/linux/v4.15/source/kernel/bpf/syscall.c#L327</span></span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">bpf_map_new_fd</span><span class="params">(struct bpf_map *<span class="built_in">map</span>, <span class="keyword">int</span> flags)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">  <span class="keyword">int</span> ret;</span><br><span class="line"></span><br><span class="line">  ret = security_bpf_map(<span class="built_in">map</span>, OPEN_FMODE(flags));</span><br><span class="line">  <span class="keyword">if</span> (ret 
&lt; <span class="number">0</span>)</span><br><span class="line">    <span class="keyword">return</span> ret;</span><br><span class="line"><span class="comment">/**</span></span><br><span class="line"><span class="comment"> * anon_inode_getfd - creates a new file instance by hooking it up to an</span></span><br><span class="line"><span class="comment"> *                    anonymous inode, and a dentry that describe the "class"</span></span><br><span class="line"><span class="comment"> *                    of the file</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * @name:    [in]    name of the "class" of the new file</span></span><br><span class="line"><span class="comment"> * @fops:    [in]    file operations for the new file</span></span><br><span class="line"><span class="comment"> * @priv:    [in]    private data for the new file (will be file's private_data)</span></span><br><span class="line"><span class="comment"> * @flags:   [in]    flags</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> * Creates a new file by hooking it on a single inode. This is useful for files</span></span><br><span class="line"><span class="comment"> * that do not need to have a full-fledged inode in order to operate correctly.</span></span><br><span class="line"><span class="comment"> * All the files created with anon_inode_getfd() will share a single inode,</span></span><br><span class="line"><span class="comment"> * hence saving memory and avoiding code duplication for the file/inode/dentry</span></span><br><span class="line"><span class="comment"> * setup.  
Returns new descriptor or an error code.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line">  <span class="keyword">return</span> anon_inode_getfd(<span class="string">"bpf-map"</span>, &amp;bpf_map_fops, <span class="built_in">map</span>,</span><br><span class="line">        flags | O_CLOEXEC);</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>The function worth noting is <code>anon_inode_getfd()</code>. It does not allocate an fd in the usual way: the file it creates is anonymous, backed by an inode that is not bound to any file on disk and exists only in memory. Once the fd is closed and nothing else still holds a reference to the map, the memory is released and the associated data, i.e. our BPF map, is deleted. Its doc comment is well written and worth reading for the details.</p><p>The <code>lsof</code> and <code>ls -l /proc/[pid]/fd</code> commands also show the BPF map as an <strong>anon_inode</strong> (a loaded BPF program is in fact of the same type):</p><p><img alt data-src="https://davidlovezoe.club/wordpress/wp-content/uploads/2020/08/DraggedImage-2-1.png"></p><h2 id="BPF-Map-调试"><a href="#BPF-Map-调试" class="headerlink" title="BPF Map 调试"></a>Debugging BPF maps</h2><p>To check whether any BPF maps are in use on the current system, use <a href="https://elixir.bootlin.com/linux/v4.15/source/tools/bpf/bpftool/Documentation/bpftool.rst" target="_blank" rel="external nofollow noopener noreferrer">bpftool</a>, the command-line tool that the BPF community strongly recommends. It is dedicated to inspecting BPF programs and BPF maps, and it can also perform some simple operations on them. The <a href="https://elixir.bootlin.com/linux/v4.15/source/tools/bpf/bpftool" target="_blank" rel="external nofollow noopener noreferrer">bpftool source</a> is maintained in the Linux kernel tree, so the binary is usually built locally with make, which is straightforward:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">cd</span> linux-source-code/tools</span><br><span class="line">make -C  bpf/bpftool/</span><br><span class="line"><span class="built_in">cd</span> bpf/bpftool/</span><br><span class="line"><span class="comment"># the output is a binary named as `bpftool`</span></span><br><span 
class="line">./bpftool [prog|map]</span><br></pre></td></tr></table></figure><p>Note that the bpftool code differs between kernel versions, and so do its features. In general, a bpftool built from a newer kernel has more features and still works on older kernels. I use a bpftool built from kernel 5.6.6, and it runs smoothly on a system with kernel 4.15.0.</p><p>Next comes a brief demonstration of inspecting BPF maps with bpftool, using two commands:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># command #1, list all the bpf map in the current node</span></span><br><span class="line"><span class="comment"># you can find map id, map type, map name, key type, value type, the number of max entry and memory allocation in the output</span></span><br><span class="line">&gt; bpftool map </span><br><span class="line">29: <span class="built_in">hash</span>  name tracker_map  flags 0x0</span><br><span class="line">  key 8B  value 32B  max_entries 2048  memlock 217088B</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="comment"># command #2, show the bpf map details including keys and value in hex-format</span></span><br><span class="line"><span class="comment"># the map id can be found in the output of command #1</span></span><br><span class="line"><span class="comment"># you can also find the element number</span></span><br><span 
class="line">&gt; bpftool map dump id [map id]</span><br><span class="line">key:</span><br><span class="line">c0 a8 3a 01 ac 11 00 02</span><br><span class="line">value:</span><br><span class="line">00 00 00 00 00 00 00 00  0a 00 00 00 00 00 00 00</span><br><span class="line">00 00 00 00 00 00 00 00  e4 02 00 00 00 00 00 00</span><br><span class="line">key:</span><br><span class="line">ac 11 00 01 ac 11 00 02</span><br><span class="line">value:</span><br><span class="line">00 00 00 00 00 00 00 00  07 00 00 00 00 00 00 00</span><br><span class="line">00 00 00 00 00 00 00 00  06 02 00 00 00 00 00 00</span><br><span class="line">Found 2 elements</span><br></pre></td></tr></table></figure><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>参考资料</h2><ul><li><a href="https://davidlovezoe.club/wordpress/archives/1044" target="_blank" rel="external nofollow noopener noreferrer">BPF数据传递的桥梁——BPF MAP</a></li><li><a href="https://lore.kernel.org/patchwork/patch/513670/" target="_blank" rel="external nofollow noopener noreferrer">Linux Kernel Patch, bpf: add hashtable type of eBPF maps, v3.19-rc1</a></li><li><a href="https://lore.kernel.org/patchwork/patch/513676/" target="_blank" rel="external nofollow noopener noreferrer">Linux Kernel Patch, bpf: add array type of eBPF maps, v3.19-rc1</a></li><li><a href="https://lore.kernel.org/patchwork/patch/944993/" target="_blank" rel="external nofollow noopener noreferrer">Linux Kernel Patch, bpf: allow bpf programs to tail-call other bpf programs, v4.2-rc1</a></li><li><a href="https://lore.kernel.org/patchwork/patch/582263/" target="_blank" rel="external nofollow noopener noreferrer">Linux Kernel Patch, bpf: Add new bpf map type to store the pointer to struct perf_event, v4.3-rc1</a></li><li><a href="https://gitlab.ic.unicamp.br/lkcamp/linux-staging/-/commit/824bd0ce6c7c43a9e1e210abf124958e54d88342" target="_blank" rel="external nofollow noopener noreferrer">Linux Kernel Patch, bpf: introduce BPF_MAP_TYPE_PERCPU_HASH map, 
v4.6-rc1</a></li><li><a href="https://patchwork.ozlabs.org/project/netdev/patch/1454395198-1796236-3-git-send-email-ast@fb.com/" target="_blank" rel="external nofollow noopener noreferrer">Linux Kernel Patch, bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map, v4.6-rc1</a></li><li><a href="https://www.spinics.net/lists/netdev/msg426191.html" target="_blank" rel="external nofollow noopener noreferrer">Linux Kernel Patch, bpf: Add hash of maps support</a></li><li><a href="https://hechao.li/2019/03/19/Use-Map-in-Map-in-BPF-programs-via-Libbpf/" target="_blank" rel="external nofollow noopener noreferrer">Use Map-in-Map in BPF programs via Libbpf</a></li></ul>]]></content>
    
    <summary type="html">
    
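      &lt;p&gt;eBPF maps are the bridge for exchanging data and passing information between user space and kernel space. They store data in the kernel as &lt;code&gt;key/value&lt;/code&gt; pairs and can be accessed by any BPF program that knows about them. A BPF map is created in kernel space and the corresponding &lt;strong&gt;file descriptor&lt;/strong&gt; is returned; a program running in user space can then access and manipulate the BPF map through that file descriptor. eBPF maps support many data structure types, which were briefly introduced in the &lt;a href=&quot;https://houmin.cc/posts/2c811c2c/&quot;&gt;previous post&lt;/a&gt;. This post demonstrates how to use them through code examples; all the code can be found on my &lt;a href=&quot;https://github.com/SimpCosm/cake&quot; target=&quot;_blank&quot; rel=&quot;external nofollow noopener noreferrer&quot;&gt;Github&lt;/a&gt;.&lt;/p&gt;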
    
    </summary>
    
    <content src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-31-ebpf-map.png" type="image" />
    
    
      <category term="术业专攻" scheme="https://houmin.cc/categories/%E6%9C%AF%E4%B8%9A%E4%B8%93%E6%94%BB/"/>
    
    
      <category term="BPF" scheme="https://houmin.cc/tags/BPF/"/>
    
      <category term="linux" scheme="https://houmin.cc/tags/linux/"/>
    
      <category term="map" scheme="https://houmin.cc/tags/map/"/>
    
  </entry>
  
  <entry>
    <title>Introduction to eBPF</title>
    <link href="https://houmin.cc/posts/2c811c2c/"/>
    <id>https://houmin.cc/posts/2c811c2c/</id>
    <published>2021-03-27T06:25:03.000Z</published>
    <updated>2022-11-09T15:13:45.392Z</updated>
    
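    <content type="html"><![CDATA[<link rel="stylesheet" class="aplayer-secondary-style-marker" href="/assets/css/APlayer.min.css"><script src="/assets/js/APlayer.min.js" class="aplayer-secondary-script-marker"></script><script class="meting-secondary-script-marker" src="/assets/js/Meting.min.js"></script><p>eBPF grew out of <a href="https://en.wikipedia.org/wiki/Berkeley_Packet_Filter" target="_blank" rel="external nofollow noopener noreferrer">BPF</a>. It is essentially an efficient and flexible virtual-machine-like component inside the kernel that executes bytecode safely at many kernel hook points. BPF was originally designed for efficient network packet filtering. After a redesign, eBPF is no longer confined to the network stack: it has become a top-level kernel subsystem and evolved into a general-purpose execution engine. Developers can build performance analysis tools, software-defined networking, security tooling, and many other applications on eBPF. This post covers the history of eBPF and sets up an eBPF environment for hands-on development; all the code in this post can be found on my <a href="https://github.com/" target="_blank" rel="external nofollow noopener noreferrer">Github</a>.</p><a id="more"></a><h2 id="技术背景"><a href="#技术背景" class="headerlink" title="技术背景"></a>Background</h2><h3 id="发展历史"><a href="#发展历史" class="headerlink" title="发展历史"></a>History</h3><p>BPF is a raw interface to the data link layer on Unix-like systems that allows raw link-layer packets to be sent and received. In 1992, Steven McCanne and Van Jacobson wrote a paper titled <a href="http://www.tcpdump.org/papers/bpf-usenix93.pdf" target="_blank" rel="external nofollow noopener noreferrer">The BSD Packet Filter: A New Architecture for User-level Packet Capture</a>. In it, the authors describe how they implemented network packet filtering in the Unix kernel; the new technique was up to 20 times faster than the state-of-the-art packet filters of the time.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-31_bpf.png"></p><p>BPF introduced two major innovations in packet filtering:</p><ul><li>A new virtual machine (VM) design that works efficiently on register-based CPUs</li><li>Per-application buffers that copy only the data relevant to the filtered packets instead of every packet in full, which minimizes the amount of data BPF has to process</li></ul><p>Thanks to these improvements, BPF was widely adopted as the packet filtering technology on Unix systems, and many Unix-derived kernels (including the Linux kernel) still use it today. tcpdump uses BPF as its underlying packet filter; adding <code>-d</code> to the command prints the low-level BPF instructions that a tcpdump filter expression compiles to.</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span 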
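class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line">$ tcpdump -d &#39;ip and tcp port 8080&#39;</span><br><span class="line">(000) ldh      [12]</span><br><span class="line">(001) jeq      #0x800           jt 2    jf 12</span><br><span class="line">(002) ldb      [23]</span><br><span class="line">(003) jeq      #0x6             jt 4    jf 12</span><br><span class="line">(004) ldh      [20]</span><br><span class="line">(005) jset     #0x1fff          jt 12   jf 6</span><br><span class="line">(006) ldxb     4*([14]&amp;0xf)</span><br><span class="line">(007) ldh      [x + 14]</span><br><span class="line">(008) jeq      #0x1f90          jt 11   jf 9</span><br><span class="line">(009) ldh      [x + 16]</span><br><span class="line">(010) jeq      #0x1f90          jt 11   jf 12</span><br><span class="line">(011) ret      #262144</span><br><span class="line">(012) ret      #0</span><br></pre></td></tr></table></figure><p>In early 2014, Alexei Starovoitov implemented eBPF (extended Berkeley Packet Filter). After the redesign, eBPF evolved into a general-purpose execution engine on which performance analysis tools, software-defined networking, and many other applications can be built. <strong>eBPF first appeared in kernel 3.18. Since then the original BPF has been called classic BPF, or cBPF for short, and it is now largely obsolete. Today the Linux kernel runs only eBPF; loaded cBPF bytecode is transparently translated into eBPF before it is executed</strong>.</p><h3 id="eBPF-与-cBPF"><a href="#eBPF-与-cBPF" class="headerlink" title="eBPF 与 cBPF"></a>eBPF vs. cBPF</h3><p>The new eBPF design is optimized for modern hardware, so code generated from the eBPF instruction set runs faster than the machine code produced by the old BPF interpreter. The extended version also increases the number of registers in the VM, from two 32-bit registers to ten 64-bit registers. With more and wider registers, developers can pass more information through function arguments and write more complex programs. Overall, these improvements make eBPF up to four times faster than the original BPF.</p><div class="table-container"><table><thead><tr><th>Dimension</th><th>cBPF</th><th>eBPF</th></tr></thead><tbody><tr><td>Kernel version</td><td>Linux 2.1.75 (1997)</td><td>Linux 3.18 (2014) [4.x for kprobe/uprobe/tracepoint/perf-event]</td></tr><tr><td>Number of registers</td><td>2: A, X</td><td>10: R0–R9, plus R10 as a read-only frame pointer<br> - R0: 
return value of kernel helper functions and exit value of the eBPF program<br> - R1 - R5: arguments passed from the eBPF program to kernel functions<br> - R6 - R9: callee-saved registers that kernel functions preserve<br> - R10: read-only stack frame pointer</td></tr><tr><td>Register width</td><td>32-bit</td><td>64-bit</td></tr><tr><td>Storage</td><td>16 memory slots: M[0–15]</td><td>512-byte stack, plus <code>map</code> storage of unlimited size</td></tr><tr><td>Restricted kernel calls</td><td>Very limited, JIT-specific</td><td>Limited, invoked through the bpf_call instruction</td></tr><tr><td>Target events</td><td>Packets, seccomp-BPF</td><td>Packets, kernel functions, user functions, tracepoints, PMCs, etc.</td></tr></tbody></table></div><p>In June 2014, <strong>eBPF was exposed to user space, which became a turning point for BPF technology</strong>. As Alexei put it in the notes of the patch submission, the patch demonstrates the potential of eBPF. Today eBPF is no longer confined to the network stack and has become a top-level kernel subsystem.</p><h3 id="eBPF-与内核模块"><a href="#eBPF-与内核模块" class="headerlink" title="eBPF 与内核模块"></a>eBPF vs. kernel modules</h3><p>By analogy with the evolution of the Web, eBPF relates to the kernel somewhat as JavaScript relates to the browser engine: it offers a new way to program the kernel, alongside modifying the kernel directly and writing kernel modules. The eBPF architecture emphasizes safety and stability. eBPF programs look a lot like kernel modules, but unlike kernel modules they do not require recompiling the kernel, and they are guaranteed to run to completion without crashing the system.</p><div class="table-container"><table><thead><tr><th>Dimension</th><th>Linux kernel module</th><th>eBPF</th></tr></thead><tbody><tr><td>kprobes/tracepoints</td><td>Supported</td><td>Supported</td></tr><tr><td><strong>Safety</strong></td><td>May introduce security vulnerabilities or cause a kernel panic</td><td>Checked by the verifier, which keeps the kernel safe</td></tr><tr><td>Kernel functions</td><td>Can call kernel functions</td><td>Only through BPF helper functions</td></tr><tr><td>Compilation</td><td>Requires compiling the kernel</td><td>No kernel compilation needed; including the header files is enough</td></tr><tr><td>Running</td><td>Runs only on the same kernel it was built for</td><td>BPF programs built on the stable ABI can be compiled once and run everywhere</td></tr><tr><td>Interaction with applications</td><td>Printing to logs or files</td><td>Through perf_event or map structures</td></tr><tr><td>Richness of data structures</td><td>Average</td><td>Rich</td></tr><tr><td><strong>Barrier to entry</strong></td><td>High</td><td>Low</td></tr><tr><td><strong>Upgrade</strong></td><td>Requires unloading and reloading, which may interrupt processing</td><td>Atomic replacement, with no interruption to processing</td></tr><tr><td>Built into the kernel</td><td>Depends</td><td>Built-in kernel support</td></tr></tbody></table></div><h3 id="eBPF-架构"><a href="#eBPF-架构" class="headerlink" title="eBPF 架构"></a>eBPF architecture</h3><p>eBPF consists of two parts, a user-space program and a kernel program:</p><ul><li>The user-space program loads the BPF bytecode into the kernel and, when needed, reads the statistics or event details that the kernel sends back</li><li>The BPF bytecode runs inside the kernel in response to specific events and, when needed, sends its results to user space through maps or perf-event events</li><li>The user-space program and the in-kernel BPF bytecode can communicate in both directions through map structures, which allows more flexible control over the BPF bytecode running in the kernel</li></ul><p>eBPF 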
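architecture overview:</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-31-ebpf.png"></p><p>The interaction between the user-space program and the BPF bytecode in the kernel proceeds as follows:</p><ol><li>Compile the BPF source code into BPF bytecode with LLVM or GCC</li><li>Load the bytecode into the kernel with a loader program</li><li>The kernel's verifier checks that the bytecode is safe to execute, so that it cannot damage the kernel; once the bytecode is confirmed safe, it is loaded into the corresponding kernel subsystem and executed</li><li>BPF bytecode running in the kernel can send data back to user space in two ways<ul><li><strong>maps</strong> can return summary statistics computed in the kernel (such as measured latencies or stack traces) to user space;</li><li><strong>perf-event</strong> streams events collected in the kernel to user space in real time, where the user-space program reads and analyzes them;</li></ul></li></ol><h3 id="eBPF-限制"><a href="#eBPF-限制" class="headerlink" title="eBPF 限制"></a>eBPF limitations</h3><p>eBPF is powerful, but to keep the kernel safe and responsive, the kernel imposes a number of restrictions on it. As the technology evolves, these restrictions are gradually being relaxed or given workarounds.</p><ul><li><p>eBPF programs cannot call arbitrary kernel functions; they are limited to the BPF helper functions that the kernel exposes, a list that keeps growing as the kernel evolves.</p></li><li><p>eBPF programs must not contain unreachable instructions, which prevents loading dead code and delaying program termination.</p></li><li><p>Loops in eBPF programs are bounded and must finish in finite time, mainly to prevent arbitrary loops inserted through kprobes from locking up the whole system. Workarounds include unrolling loops and adding helper functions for common looping needs. Linux 5.3 added support for bounded loops in BPF, which have a verifiable upper bound on their run time.</p></li><li><p>The eBPF stack size is limited to MAX_BPF_STACK, which is set to 512 bytes as of Linux 5.8; see <a href="https://github.com/torvalds/linux/blob/v5.8/include/linux/filter.h" target="_blank" rel="external nofollow noopener noreferrer">include/linux/filter.h</a>. The limit is felt most when storing several string buffers on the stack: a single char[256] buffer consumes half of it. There are currently no plans to raise the limit; the workaround is to use BPF map storage instead, which is effectively unlimited.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">/* BPF program can access up to 512 bytes of stack space. 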
*/</span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">define</span> MAX_BPF_STACK 512</span></span><br></pre></td></tr></table></figure></li><li><p>eBPF bytecode size was originally limited to 4096 instructions. As of Linux 5.8 this has been relaxed to 1 million instructions (BPF_COMPLEXITY_LIMIT_INSNS), see <a href="https://github.com/torvalds/linux/blob/v5.8/include/linux/bpf.h" target="_blank" rel="external nofollow noopener noreferrer">include/linux/bpf.h</a>; for unprivileged BPF programs the 4096-instruction limit (BPF_MAXINSNS) still applies. Newer versions of eBPF also support chained calls between multiple eBPF programs (tail calls); although passing information between them is subject to some restrictions, combining programs enables more powerful functionality.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">define</span> BPF_COMPLEXITY_LIMIT_INSNS      1000000 <span class="comment">/* yes. 1M insns */</span></span></span><br></pre></td></tr></table></figure></li></ul><h2 id="eBPF-实战"><a href="#eBPF-实战" class="headerlink" title="eBPF 实战"></a>eBPF in practice</h2><p>Before diving into eBPF features, let's <code>Get Hands Dirty</code> and get a concrete feel for what an eBPF program is and how to develop one. As the eBPF ecosystem evolves, more and more toolchains are available for developing eBPF programs, which later sections cover in detail:</p><ul><li>Developing with bcc: bcc supports eBPF development with a Python API on the front end and the eBPF program written in C on the back end. It is simple and easy to use, but its performance is relatively poor.</li><li>Developing with libbpf-bootstrap: libbpf-bootstrap provides a convenient scaffold</li><li>Developing against the kernel source tree: the barrier to entry is higher, but this approach stays closest to how eBPF works underneath, so it is the one used as the example here</li></ul><h3 id="内核源码编译"><a href="#内核源码编译" class="headerlink" title="内核源码编译"></a>Building the kernel source</h3><p>The environment is a Tencent Cloud CVM running Ubuntu 20.04 with kernel 5.4.0:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">$ uname -a</span><br><span class="line">Linux VM-1-3-ubuntu 5.4.0-42-generic <span class="comment">#46-Ubuntu SMP Fri Jul 10 00:24:02 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux</span></span><br></pre></td></tr></table></figure><p>First install the required dependencies:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 
class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">sudo apt install -y bison build-essential cmake flex git libedit-dev pkg-config libmnl-dev \</span><br><span class="line">   python zlib1g-dev libssl-dev libelf-dev libcap-dev libfl-dev llvm clang pkg-config \</span><br><span class="line">   gcc-multilib luajit libluajit-5.1-dev libncurses5-dev libclang-dev clang-tools</span><br></pre></td></tr></table></figure><p>Installing the kernel source through apt is generally recommended: it is simple and installs only the source for the current kernel, which is about 200 MB.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># apt-cache search linux-source</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># apt install linux-source-5.4.0</span></span><br></pre></td></tr></table></figure><p>The source is installed under the <code>/usr/src/</code> directory.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line">$ ls -hl</span><br><span class="line">total 4.0K</span><br><span class="line">drwxr-xr-x 4 root root 4.0K Nov  9 13:22 linux-source-5.4.0</span><br><span class="line">lrwxrwxrwx 1 root root   45 Oct 15 10:28 linux-source-5.4.0.tar.bz2 -&gt; linux-source-5.4.0/linux-source-5.4.0.tar.bz2</span><br><span class="line">$ tar -jxvf linux-source-5.4.0.tar.bz2</span><br><span class="line">$ <span class="built_in">cd</span> linux-source-5.4.0</span><br><span class="line"></span><br><span 
class="line">$ cp -v /boot/config-$(uname -r) .config <span class="comment"># or: make defconfig / make menuconfig</span></span><br><span class="line">$ make headers_install</span><br><span class="line">$ make modules_prepare</span><br><span class="line">$ make scripts     <span class="comment"># optional</span></span><br><span class="line">$ make M=samples/bpf  <span class="comment"># if configuration fails, fix it with make oldconfig &amp;&amp; make prepare</span></span><br></pre></td></tr></table></figure><p>After a successful build, a set of object files and binaries appears under <code>samples/bpf</code>.</p><h3 id="Hello-World"><a href="#Hello-World" class="headerlink" title="Hello World"></a>Hello World</h3><p>As mentioned earlier, an eBPF program usually consists of a kernel-space part and a user-space part. The <code>samples/bpf</code> directory contains many such programs: kernel-space programs end in <code>_kern.c</code> and user-space programs end in <code>_user.c</code>. Setting these more complex programs aside for now, let's write an eBPF Hello World by hand.</p><p>The kernel-side program <code>hello_kern.c</code>:</p><figure class="highlight c"><figcaption><span>hello_kern.c</span></figcaption><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;linux/bpf.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">"bpf_helpers.h"</span></span></span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="meta-keyword">define</span> SEC(NAME) __attribute__((section(NAME), used))</span></span><br><span class="line"></span><br><span class="line">SEC(<span 
class="string">"tracepoint/syscalls/sys_enter_execve"</span>)</span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">bpf_prog</span><span class="params">(<span class="keyword">void</span> *ctx)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">    <span class="keyword">char</span> msg[] = <span class="string">"Hello BPF from houmin!\n"</span>;</span><br><span class="line">    bpf_trace_printk(msg, <span class="keyword">sizeof</span>(msg));</span><br><span class="line">    <span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">char</span> _license[] SEC(<span class="string">"license"</span>) = <span class="string">"GPL"</span>;</span><br></pre></td></tr></table></figure><h4 id="函数入口"><a href="#函数入口" class="headerlink" title="函数入口"></a>Function Entry Point</h4><p>The code above differs from ordinary C programming in a few ways.</p><ol><li>The program's entry point is specified by a compiler section attribute, here <code>SEC(&quot;tracepoint/syscalls/sys_enter_execve&quot;)</code>.</li><li>The entry point no longer takes <code>argc, argv</code>; its parameters depend on the prog type. In our example the prog type is <code>BPF_PROG_TYPE_TRACEPOINT</code>, and its entry parameter is <code>void *ctx</code>.</li></ol><h4 id="头文件"><a href="#头文件" class="headerlink" title="头文件"></a>Header Files</h4><h5 id="include-lt-linux-bpf-h-gt"><a href="#include-lt-linux-bpf-h-gt" class="headerlink" title="#include &lt;linux/bpf.h&gt;"></a><code>#include &lt;linux/bpf.h&gt;</code></h5><p>This header comes from the kernel source headers. It is installed at <code>/usr/include/linux/bpf.h</code>.</p><p>It provides many of the symbols needed for BPF programming, for example:</p><ol><li>enum bpf_func_id defines the IDs of all kernel helper functions.</li><li>enum bpf_prog_type defines all the program types supported by the kernel.</li><li>struct __sk_buff is the interface through which BPF code accesses the kernel's struct sk_buff.</li></ol><p>And so on.</p><h5 id="include-“bpf-helpers-h”"><a href="#include-“bpf-helpers-h”" class="headerlink" title="#include “bpf_helpers.h”"></a>#include "bpf_helpers.h"</h5><p>It comes from libbpf 
and must be installed separately. We include this header because the program calls bpf_trace_printk(), which is a kernel helper function.</p><h4 id="程序解释"><a href="#程序解释" class="headerlink" title="程序解释"></a>Program Walkthrough</h4><p>Here is a brief walkthrough of the kernel-side <code>ebpf</code> program, which is very simple:</p><ul><li><code>bpf_trace_printk</code> is an eBPF helper function that prints messages to <code>trace_pipe</code> (/sys/kernel/debug/tracing/trace_pipe); <a href="https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#1-bpf_trace_printk" target="_blank" rel="external nofollow noopener noreferrer">see here for details</a></li><li>The code declares the <code>SEC</code> macro and defines a GPL license, because eBPF programs loaded into the kernel go through a license check, similar to kernel modules</li></ul><h4 id="加载-BPF-代码"><a href="#加载-BPF-代码" class="headerlink" title="加载 BPF 代码"></a>Loading the BPF Code</h4><p>The user-space program <code>hello_user.c</code>:</p><figure class="highlight c"><figcaption><span>hello_user.c</span></figcaption><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;stdio.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">"bpf_load.h"</span></span></span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">main</span><span class="params">(<span class="keyword">int</span> argc, <span class="keyword">char</span> **argv)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">    <span class="keyword">if</span>(load_bpf_file(<span 
class="string">"hello_kern.o"</span>) != <span class="number">0</span>)</span><br><span class="line">    &#123;</span><br><span class="line">        <span class="built_in">printf</span>(<span class="string">"The kernel didn't load BPF program\n"</span>);</span><br><span class="line">        <span class="keyword">return</span> <span class="number">-1</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    read_trace_pipe();</span><br><span class="line">    <span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>The user-space <code>ebpf</code> program works as follows:</p><ul><li><code>load_bpf_file</code> loads the compiled kernel-side ebpf object file into the kernel</li><li><a href="https://elixir.bootlin.com/linux/latest/source/tools/testing/selftests/bpf/trace_helpers.c#L120" target="_blank" rel="external nofollow noopener noreferrer"><code>read_trace_pipe</code></a> reads trace messages from <code>trace_pipe</code> and prints them to the console</li></ul><p>Edit the <code>Makefile</code> in the <code>samples/bpf</code> directory and add the following three lines in the corresponding places:</p><figure class="highlight makefile"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">hostprogs-y += hello</span><br><span class="line">hello-objs := bpf_load.o hello_user.o</span><br><span class="line">always += hello_kern.o</span><br></pre></td></tr></table></figure><p>Rebuild, and the compiled files appear:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">$ make M=samples/bpf</span><br><span class="line">$ ls -hl samples/bpf/hello*</span><br><span class="line">-rwxrwxr-x 1 ubuntu ubuntu 404K 
Mar 30 17:48 samples/bpf/hello</span><br><span class="line">-rw-rw-r-- 1 ubuntu ubuntu  317 Mar 30 17:47 samples/bpf/hello_kern.c</span><br><span class="line">-rw-rw-r-- 1 ubuntu ubuntu 3.8K Mar 30 17:48 samples/bpf/hello_kern.o</span><br><span class="line">-rw-rw-r-- 1 ubuntu ubuntu  246 Mar 30 17:47 samples/bpf/hello_user.c</span><br><span class="line">-rw-rw-r-- 1 ubuntu ubuntu 2.2K Mar 30 17:48 samples/bpf/hello_user.o</span><br></pre></td></tr></table></figure><p>Change into that directory and run the <code>hello</code> program. The output looks like this:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">$ sudo ./hello</span><br><span class="line">           &lt;...&gt;-102735 [001] ....  6733.481740: 0: Hello BPF from houmin!</span><br><span class="line"></span><br><span class="line">           &lt;...&gt;-102736 [000] ....  6733.482884: 0: Hello BPF from houmin!</span><br><span class="line"></span><br><span class="line">           &lt;...&gt;-102737 [002] ....  
6733.483074: 0: Hello BPF from houmin!</span><br></pre></td></tr></table></figure><h3 id="代码解读"><a href="#代码解读" class="headerlink" title="代码解读"></a>Code Walkthrough</h3><p>As mentioned above, <code>load_bpf_file</code> loads the eBPF bytecode compiled by LLVM into the kernel. How does that actually work?</p><ul><li>A quick search shows that <code>load_bpf_file</code> is also implemented under <code>samples/bpf</code>; see <a href="https://elixir.bootlin.com/linux/v5.4/source/samples/bpf/bpf_load.c#L659" target="_blank" rel="external nofollow noopener noreferrer"><code>bpf_load.c</code></a></li><li>Reading the <code>load_bpf_file</code> code shows that it mainly parses the eBPF bytecode in ELF format and then calls the <a href="https://elixir.bootlin.com/linux/v5.4/source/samples/bpf/bpf_load.c#L76" target="_blank" rel="external nofollow noopener noreferrer"><code>load_and_attach</code></a> function</li><li>Inside <code>load_and_attach</code>, we can see that it calls <code>bpf_load_program</code>, a function provided by libbpf.</li><li>The <code>license</code>, <code>kern_version</code> and similar arguments passed to <code>bpf_load_program</code> come from parsing the eBPF ELF file, while prog_type comes from the type specified by the SEC field in the bpf code.</li></ul><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">static</span> <span 
class="keyword">int</span> <span class="title">load_and_attach</span><span class="params">(<span class="keyword">const</span> <span class="keyword">char</span> *event, struct bpf_insn *prog, <span class="keyword">int</span> <span class="built_in">size</span>)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">  <span class="keyword">bool</span> is_socket = <span class="built_in">strncmp</span>(event, <span class="string">"socket"</span>, <span class="number">6</span>) == <span class="number">0</span>;</span><br><span class="line"><span class="keyword">bool</span> is_kprobe = <span class="built_in">strncmp</span>(event, <span class="string">"kprobe/"</span>, <span class="number">7</span>) == <span class="number">0</span>;</span><br><span class="line"><span class="keyword">bool</span> is_kretprobe = <span class="built_in">strncmp</span>(event, <span class="string">"kretprobe/"</span>, <span class="number">10</span>) == <span class="number">0</span>;</span><br><span class="line"><span class="keyword">bool</span> is_tracepoint = <span class="built_in">strncmp</span>(event, <span class="string">"tracepoint/"</span>, <span class="number">11</span>) == <span class="number">0</span>;</span><br><span class="line"><span class="keyword">bool</span> is_raw_tracepoint = <span class="built_in">strncmp</span>(event, <span class="string">"raw_tracepoint/"</span>, <span class="number">15</span>) == <span class="number">0</span>;</span><br><span class="line"><span class="keyword">bool</span> is_xdp = <span class="built_in">strncmp</span>(event, <span class="string">"xdp"</span>, <span class="number">3</span>) == <span class="number">0</span>;</span><br><span class="line"><span class="keyword">bool</span> is_perf_event = <span class="built_in">strncmp</span>(event, <span class="string">"perf_event"</span>, <span class="number">10</span>) == <span class="number">0</span>;</span><br><span class="line"><span 
class="keyword">bool</span> is_cgroup_skb = <span class="built_in">strncmp</span>(event, <span class="string">"cgroup/skb"</span>, <span class="number">10</span>) == <span class="number">0</span>;</span><br><span class="line"><span class="keyword">bool</span> is_cgroup_sk = <span class="built_in">strncmp</span>(event, <span class="string">"cgroup/sock"</span>, <span class="number">11</span>) == <span class="number">0</span>;</span><br><span class="line"><span class="keyword">bool</span> is_sockops = <span class="built_in">strncmp</span>(event, <span class="string">"sockops"</span>, <span class="number">7</span>) == <span class="number">0</span>;</span><br><span class="line"><span class="keyword">bool</span> is_sk_skb = <span class="built_in">strncmp</span>(event, <span class="string">"sk_skb"</span>, <span class="number">6</span>) == <span class="number">0</span>;</span><br><span class="line"><span class="keyword">bool</span> is_sk_msg = <span class="built_in">strncmp</span>(event, <span class="string">"sk_msg"</span>, <span class="number">6</span>) == <span class="number">0</span>;</span><br><span class="line">  </span><br><span class="line">  <span class="comment">//...</span></span><br><span class="line">  </span><br><span class="line">fd = bpf_load_program(prog_type, prog, insns_cnt, license, kern_version,</span><br><span class="line">      bpf_log_buf, BPF_LOG_BUF_SIZE);</span><br><span class="line"><span class="keyword">if</span> (fd &lt; <span class="number">0</span>) &#123;</span><br><span class="line"><span class="built_in">printf</span>(<span class="string">"bpf_load_program() err=%d\n%s"</span>, errno, bpf_log_buf);</span><br><span class="line"><span class="keyword">return</span> <span class="number">-1</span>;</span><br><span class="line">&#125;</span><br><span class="line">  <span class="comment">//...</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h2 id="eBPF-特性"><a href="#eBPF-特性" class="headerlink" title="eBPF 
特性"></a>eBPF Features</h2><h3 id="Hook-Overview"><a href="#Hook-Overview" class="headerlink" title="Hook Overview"></a>Hook Overview</h3><p>eBPF programs are event-driven: they run when the kernel or an application passes a certain hook point. These hook points are predefined and include system calls, function entry/exit, kernel <code>tracepoints</code>, network events, and so on.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-31_ebpf-syscall-hook.png"></p><p>If no hook point exists for a particular need, <code>kprobe</code> or <code>uprobe</code> can be used to attach eBPF programs almost anywhere in the kernel or in user programs.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-31_ebpf-hook-overview.png"></p><h3 id="Verification"><a href="#Verification" class="headerlink" title="Verification"></a>Verification</h3><blockquote><p>With great power there must also come great responsibility.</p></blockquote><p>Every eBPF program goes through <code>Verification</code> when it is loaded into the kernel, to ensure that the program is safe. The main checks are:</p><ul><li><p>The process loading the eBPF program must hold the required privileges. Unless the <code>unprivileged</code> feature is enabled on the node, only privileged processes can load eBPF programs</p><ul><li><p>The kernel provides a setting, <code>/proc/sys/kernel/unprivileged_bpf_disabled</code>, to prevent unprivileged users from using the <code>bpf(2)</code> system call. It can be changed with the <code>sysctl</code> command</p></li><li><p>One peculiarity is that this setting is deliberately designed as a <strong>one-time kill switch</strong>: once it is set to <code>1</code>, it cannot be changed back to <code>0</code> without rebooting the kernel</p></li><li><p>Once it is set to <code>1</code>, only processes with <code>CAP_SYS_ADMIN</code> in the initial namespace can call the <code>bpf(2)</code> system call. Cilium also sets this option to 1 after it starts:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">$ <span class="built_in">echo</span> 1 &gt; /proc/sys/kernel/unprivileged_bpf_disabled</span><br></pre></td></tr></table></figure></li></ul></li><li><p>The eBPF program must not crash or otherwise break the system</p></li><li><p>The eBPF program must not get stuck in an infinite loop; it must <code>runs to completion</code></p></li><li><p>The eBPF program must meet the system's size requirements; programs that are too large cannot be loaded into the kernel</p></li><li><p>The eBPF program must have bounded complexity: the <code>Verifier</code> evaluates all possible execution paths of the program, and this complexity analysis must finish in bounded time</p></li></ul><h3 id="JIT-Compilation"><a 
href="#JIT-Compilation" class="headerlink" title="JIT Compilation"></a>JIT Compilation</h3><p><code>Just-In-Time(JIT)</code> compilation translates generic eBPF bytecode into machine-specific instructions, which greatly speeds up the execution of BPF programs:</p><ul><li>Compared with the interpreter, it reduces the per-instruction cost. Instructions can often be mapped 1:1 to native instructions of the underlying architecture</li><li>It also reduces the size of the resulting executable image, which makes it friendlier to the CPU instruction cache</li><li>In particular, for CISC instruction sets (such as <code>x86</code>), the JITs are optimized to emit the shortest possible opcodes for a given instruction, shrinking the total space needed for the translated program</li></ul><p>The 64-bit <code>x86_64</code>, <code>arm64</code>, <code>ppc64</code>, <code>s390x</code>, <code>mips64</code> and <code>sparc64</code> architectures and the 32-bit <code>arm</code> and <code>x86_32</code> architectures all ship with an in-kernel eBPF JIT compiler. They are functionally equivalent and can be enabled as follows:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">$ <span class="built_in">echo</span> 1 &gt; /proc/sys/net/core/bpf_jit_enable</span><br></pre></td></tr></table></figure><p>The 32-bit <code>mips</code>, <code>ppc</code> and <code>sparc</code> architectures currently ship with a cBPF JIT compiler. Architectures that only have a cBPF JIT compiler, as well as those with no BPF JIT compiler at all, have to run eBPF programs through the <strong>in-kernel interpreter</strong>.</p><p>To find out which platforms support eBPF JIT, grep the kernel source for <code>HAVE_EBPF_JIT</code>:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">$ git grep HAVE_EBPF_JIT arch/</span><br><span class="line">arch/arm/Kconfig:       select HAVE_EBPF_JIT   <span class="keyword">if</span> !CPU_ENDIAN_BE32</span><br><span class="line">arch/arm64/Kconfig:     select HAVE_EBPF_JIT</span><br><span class="line">arch/powerpc/Kconfig:   select HAVE_EBPF_JIT   <span class="keyword">if</span> PPC64</span><br><span class="line">arch/mips/Kconfig:      select HAVE_EBPF_JIT   <span class="keyword">if</span> (64BIT &amp;&amp; 
!CPU_MICROMIPS)</span><br><span class="line">arch/s390/Kconfig:      select HAVE_EBPF_JIT   <span class="keyword">if</span> PACK_STACK &amp;&amp; HAVE_MARCH_Z196_FEATURES</span><br><span class="line">arch/sparc/Kconfig:     select HAVE_EBPF_JIT   <span class="keyword">if</span> SPARC64</span><br><span class="line">arch/x86/Kconfig:       select HAVE_EBPF_JIT   <span class="keyword">if</span> X86_64</span><br></pre></td></tr></table></figure><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-04-01_bpf-jit.png"></p><h3 id="Maps"><a href="#Maps" class="headerlink" title="Maps"></a>Maps</h3><p>A BPF map is an efficient <code>Key/Value store</code> that <strong>resides in kernel space</strong>. There are many map types, all implemented by the kernel; for implementation details see <a href="https://houmin.cc">this post of mine</a>.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-31-ebpf-map.png"></p><p>BPF maps are used in the following interaction scenarios:</p><ul><li>Between BPF programs and user-space programs: a BPF program stores its results in a map, and the user-space program accesses them through a file descriptor</li><li>Between BPF programs and kernel code: maps can also act as the intermediary when interacting with kernel code outside the BPF program</li><li>Between BPF programs: when BPF programs need global variables to interact but are not allowed to access global variables for safety reasons, a map can serve as the global variable</li><li>BPF tail calls: a tail call is a jump from one BPF program to another. The BPF program first looks up the target program through a map of type <code>BPF_MAP_TYPE_PROG_ARRAY</code>, then calls the <code>bpf_tail_call()</code> helper function to perform the tail call</li></ul><p>BPF programs that share a map do not have to be of the same program type; for example, a tracing program can share a map with a networking program. <strong>A single BPF program can currently access up to 64 different maps directly</strong>.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-04-01_bpf-map.png"></p><p>The currently available <strong>generic maps</strong> are:</p><ul><li><code>BPF_MAP_TYPE_HASH</code></li><li><code>BPF_MAP_TYPE_ARRAY</code></li><li><code>BPF_MAP_TYPE_PERCPU_HASH</code></li><li><code>BPF_MAP_TYPE_PERCPU_ARRAY</code></li><li><code>BPF_MAP_TYPE_LRU_HASH</code></li><li><code>BPF_MAP_TYPE_LRU_PERCPU_HASH</code></li><li><code>BPF_MAP_TYPE_LPM_TRIE</code></li></ul><p>All of these maps use the same set of BPF helper functions for lookup, update and delete operations, but each implements a different backend with its own semantics and performance characteristics. As multi-CPU architectures matured, BPF maps also gained <strong>per-cpu</strong> 
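types, such as <code>BPF_MAP_TYPE_PERCPU_HASH</code> and <code>BPF_MAP_TYPE_PERCPU_ARRAY</code>. With these map types, each CPU stores and sees its own copy of the map data, and the data belonging to different CPUs is isolated from each other. The benefit is that lookups and aggregation are more efficient, especially when the BPF program mainly collects time-series data such as traffic data or metrics.</p><p>The <strong>non-generic maps</strong> currently in the kernel are:</p><ul><li><code>BPF_MAP_TYPE_PROG_ARRAY</code>: an array map that holds other BPF programs</li><li><code>BPF_MAP_TYPE_PERF_EVENT_ARRAY</code></li><li><code>BPF_MAP_TYPE_CGROUP_ARRAY</code>: used to check cgroup2 membership information in an skb</li><li><code>BPF_MAP_TYPE_STACK_TRACE</code>: a map for storing stack traces</li><li><code>BPF_MAP_TYPE_ARRAY_OF_MAPS</code>: holds pointers to other maps, so that an entire map can be replaced atomically at runtime</li><li><code>BPF_MAP_TYPE_HASH_OF_MAPS</code>: holds pointers to other maps, so that an entire map can be replaced atomically at runtime</li></ul><h3 id="Helper-Calls"><a href="#Helper-Calls" class="headerlink" title="Helper Calls"></a>Helper Calls</h3><p>eBPF programs cannot call arbitrary kernel functions, since doing so would tie them to a specific kernel version. Instead, they can only call a set of <code>Helper functions</code> defined by the kernel. <code>Helper functions</code> let BPF query data from the kernel, or push data to the kernel, through a stable set of kernel-defined function calls. <strong>All BPF helper functions are part of the core kernel and cannot be extended or added through kernel modules</strong>. <strong>There are already dozens of BPF helper functions available, and the number keeps growing</strong>. The <code>Helper functions</code> currently supported by Linux are listed in <a href="https://man7.org/linux/man-pages/man7/bpf-helpers.7.html" target="_blank" rel="external nofollow noopener noreferrer">Linux Manual Page: bpf-helpers</a>.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-31_ebpf-helper-calls.png"></p><p><strong>Different types of BPF programs may have access to different helper functions</strong>, for example:</p><ul><li>Compared with BPF programs attached to the tc layer, BPF programs attached to sockets can only call a subset of the helper functions available to the former</li><li>The encapsulation and decapsulation helpers used for <code>lightweight tunneling</code> are only available to the lower tc layers, whereas the event output helpers used to push notifications to user space are available to both tc and XDP programs</li></ul><p><strong>All helper functions share a common, system-call-like function signature</strong>, defined as follows:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="function">u64 <span class="title">fn</span><span class="params">(u64 r1, u64 r2, u64 r3, u64 r4, u64 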
r5)</span></span></span><br></pre></td></tr></table></figure><p>The kernel abstracts helper functions into the macros <code>BPF_CALL_0()</code> through <code>BPF_CALL_5()</code>, which look similar to the corresponding system-call macros. The macro definitions are in <a href="https://elixir.bootlin.com/linux/v5.4/source/include/linux/filter.h#L479" target="_blank" rel="external nofollow noopener noreferrer">include/linux/filter.h</a>. Taking <a href="https://elixir.bootlin.com/linux/v5.4/source/kernel/bpf/helpers.c#L41" target="_blank" rel="external nofollow noopener noreferrer"><code>bpf_map_update_elem</code></a> as an example, it updates a map element by calling the callback function of the corresponding map:</p><figure class="highlight c"><figcaption><span>/kernel/bpf/helpers.c</span></figcaption><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line">BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, <span class="built_in">map</span>, <span class="keyword">void</span> *, key,</span><br><span class="line">           <span class="keyword">void</span> *, value, u64, flags)</span><br><span class="line">&#123;</span><br><span class="line">    WARN_ON_ONCE(!rcu_read_lock_held());</span><br><span class="line">    <span class="keyword">return</span> <span class="built_in">map</span>-&gt;ops-&gt;map_update_elem(<span class="built_in">map</span>, key, value, flags);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">const</span> <span class="class"><span class="keyword">struct</span> <span class="title">bpf_func_proto</span> <span 
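class="title">bpf_map_update_elem_proto</span> = &#123;</span></span><br><span class="line">    .func           = bpf_map_update_elem,</span><br><span class="line">    .gpl_only       = <span class="literal">false</span>,</span><br><span class="line">    .ret_type       = RET_INTEGER,</span><br><span class="line">    .arg1_type      = ARG_CONST_MAP_PTR,</span><br><span class="line">    .arg2_type      = ARG_PTR_TO_MAP_KEY,</span><br><span class="line">    .arg3_type      = ARG_PTR_TO_MAP_VALUE,</span><br><span class="line">    .arg4_type      = ARG_ANYTHING,</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><p>This approach has several advantages:</p><blockquote><p>cBPF overloaded its load instructions to fetch data at a seemingly impossible packet offset in order to invoke auxiliary helper functions, and every cBPF JIT had to implement support for that cBPF extension. In eBPF, the JIT compiler compiles each newly added helper function in a transparent and efficient way: it only needs to emit a call instruction, because the register mapping is arranged so that BPF register assignments already match the calling convention of the underlying architecture. This makes it very convenient to extend the core kernel with new helper functions. <strong>All BPF helper functions are part of the core kernel and cannot be extended or added through kernel modules</strong>.</p><p>The function signature mentioned above also lets the verifier perform type checks. The <code>struct bpf_func_proto</code> above holds <strong>everything the verifier needs to know about the helper function</strong>, so that the verifier can make sure the types the helper expects match the current contents of the BPF program's registers.</p><p>Argument types range widely, from values of any kind to values restricted to a specific type, such as a <code>pointer/size</code> argument pair for a BPF stack buffer that the helper reads from or writes to. In that case the verifier can also perform additional checks, for example whether the buffer has already been initialized.</p></blockquote><h3 id="Tail-Calls"><a href="#Tail-Calls" class="headerlink" title="Tail  Calls"></a>Tail Calls</h3><p>The tail call mechanism lets one BPF program call another BPF program without returning to the original program afterwards.</p><ul><li>Compared with an ordinary function call, this has minimal overhead, because it is <strong>implemented as a long jump that reuses the original stack frame</strong></li><li>BPF programs are verified independently, so to pass state along, either use a per-CPU map as a scratch buffer or, for tc programs, use some fields of the <code>skb</code> (such as <code>cb[]</code>)</li><li><strong>Only programs of the same type can be tail called</strong>, and they must also match in terms of JIT compilation: either JIT-compiled programs or interpreted programs can be invoked, but the two cannot be mixed</li></ul><p><img alt 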
data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-04-01_bpf_-tailcall.png"></p><h3 id="BPF-to-BPF-Calls"><a href="#BPF-to-BPF-Calls" class="headerlink" title="BPF to BPF Calls"></a>BPF to BPF Calls</h3><p>Besides BPF helper functions and BPF tail calls, the BPF core infrastructure also gained a newer feature: <code>BPF to BPF calls</code>. <strong>Before this feature was introduced into the kernel, a typical BPF C program had to treat all reusable code specially, for example by declaring it <code>always_inline</code> in a header file</strong>. When LLVM compiled and generated the BPF object file, all these functions were inlined and therefore duplicated many times in the resulting object file, bloating the code size:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;linux/bpf.h&gt;</span></span></span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="meta-keyword">ifndef</span> __section</span></span><br><span class="line"><span class="meta"># <span class="meta-keyword">define</span> __section(NAME)                  \</span></span><br><span class="line">   __attribute__((section(NAME), used))</span><br><span class="line"><span class="meta">#<span class="meta-keyword">endif</span></span></span><br><span class="line"></span><br><span class="line"><span class="meta">#<span 
class="meta-keyword">ifndef</span> __inline</span></span><br><span class="line"><span class="meta"># <span class="meta-keyword">define</span> __inline                         \</span></span><br><span class="line">   <span class="keyword">inline</span> __attribute__((always_inline))</span><br><span class="line"><span class="meta">#<span class="meta-keyword">endif</span></span></span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">static</span> __inline <span class="keyword">int</span> <span class="title">foo</span><span class="params">(<span class="keyword">void</span>)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">    <span class="keyword">return</span> XDP_DROP;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">__section(<span class="string">"prog"</span>)</span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">xdp_drop</span><span class="params">(struct xdp_md *ctx)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">    <span class="keyword">return</span> foo();</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">char</span> __license[] __section(<span class="string">"license"</span>) = <span class="string">"GPL"</span>;</span><br></pre></td></tr></table></figure><p>This was necessary because <strong>the BPF program loader, the verifier, the interpreter and the JITs all lacked support for function calls</strong>. Starting with <code>Linux 4.16</code> and <code>LLVM 6.0</code>, this restriction was lifted, and BPF programs no longer need to use <code>always_inline</code> everywhere. The code above can therefore be rewritten more naturally as:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span 
class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;linux/bpf.h&gt;</span></span></span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="meta-keyword">ifndef</span> __section</span></span><br><span class="line"><span class="meta"># <span class="meta-keyword">define</span> __section(NAME)                  \</span></span><br><span class="line">   __attribute__((section(NAME), used))</span><br><span class="line"><span class="meta">#<span class="meta-keyword">endif</span></span></span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">static</span> <span class="keyword">int</span> <span class="title">foo</span><span class="params">(<span class="keyword">void</span>)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">    <span class="keyword">return</span> XDP_DROP;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">__section(<span class="string">"prog"</span>)</span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">xdp_drop</span><span class="params">(struct xdp_md *ctx)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">    <span class="keyword">return</span> foo();</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">char</span> __license[] __section(<span 
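class="string">"license"</span>) = <span class="string">"GPL"</span>;</span><br></pre></td></tr></table></figure><p>BPF-to-BPF calls are an important performance optimization: they considerably reduce the size of the generated BPF code and are therefore <strong>friendlier to the CPU instruction cache (i-cache)</strong>.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-04-01_bpf-call.png"></p><p>The calling convention of BPF helper functions also applies to BPF-to-BPF calls:</p><ul><li><code>r1</code> - <code>r5</code> pass arguments to the callee, and the result is returned in <code>r0</code></li><li><code>r1</code> - <code>r5</code> are scratch registers, while <code>r6</code> - <code>r9</code> are preserved across calls as usual</li><li>The maximum nested call depth is <code>8</code></li><li>The caller can pass pointers (for example, to the caller's stack frame) down to the callee, but not the other way around</li></ul><p><strong>At the time the referenced documentation was written, BPF-to-BPF calls and BPF tail calls were incompatible</strong>, because the latter must reuse the current stack setup, whereas the former adds an extra stack frame and thus changes the layout that tail calls expect. (Newer kernels, starting with 5.10 on <code>x86_64</code>, relax this restriction.)</p><p>BPF JIT compilers emit a separate image for each function body and later fix up the function call addresses in the image in a final JIT pass. This approach has proven to require minimal changes to the JITs, since they can treat BPF-to-BPF calls like conventional BPF helper calls.</p><h3 id="Object-Pinning"><a href="#Object-Pinning" class="headerlink" title="Object Pinning"></a>Object Pinning</h3><p><strong>BPF maps and programs are kernel resources that can only be accessed through file descriptors, backed by anonymous inodes in the kernel.</strong> This brings several advantages:</p><ul><li>User-space applications can use most file-descriptor-related APIs</li><li>Passing file descriptors over Unix domain sockets works transparently, and so on</li></ul><p>At the same time, <strong>file descriptors are tied to the lifetime of a process, which makes operations such as map sharing cumbersome</strong> and adds complexity in certain scenarios.</p><blockquote><p>Take iproute2 as an example: tc or XDP sets up the environment, loads the program into the kernel, and eventually exits. After that, the maps can no longer be reached from user space, even though they would be useful, for example maps shared between the ingress and egress points of the data path (to count packets, bytes, PPS, and so on). Third-party applications may also want to monitor or update maps while the BPF program is running.</p></blockquote><p><strong>To solve this, the kernel implements a minimal kernel-space BPF file system to which BPF maps and BPF programs can be pinned</strong>; this process is called <code>object pinning</code>. The BPF file system is <strong>not a singleton</strong>: it supports multiple mount instances, hard links, soft links, and so on.</p><p>Accordingly, the BPF system call was extended with two new commands, as shown in the figure below:</p><ul><li><code>BPF_OBJ_PIN</code>: pins an object</li><li><code>BPF_OBJ_GET</code>: retrieves a pinned object</li></ul><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-04-01_bpf-fs.png"></p><h3 id="Hardening"><a href="#Hardening" class="headerlink" 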
title="Hardening"></a>Hardening</h3><h4 id="Protection-Execution-Protection"><a href="#Protection-Execution-Protection" class="headerlink" title="Protection Execution Protection"></a>Program Execution Protection</h4><p>To prevent the code from being corrupted, BPF locks <strong>the entire BPF interpreter image</strong> (<code>struct bpf_prog</code>) and <strong>the JIT-compiled image</strong> (<code>struct bpf_binary_header</code>) as read-only in the kernel for the lifetime of the program. Any data corruption at these locations (for example, due to a kernel bug) triggers a general protection fault and crashes the kernel instead of letting the corruption happen silently.</p><p>The architectures that support setting the image memory as read-only can be found with the following search:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">$ git grep ARCH_HAS_SET_MEMORY | grep select</span><br><span class="line">arch/arm/Kconfig:    select ARCH_HAS_SET_MEMORY</span><br><span class="line">arch/arm64/Kconfig:  select ARCH_HAS_SET_MEMORY</span><br><span class="line">arch/s390/Kconfig:   select ARCH_HAS_SET_MEMORY</span><br><span class="line">arch/x86/Kconfig:    select ARCH_HAS_SET_MEMORY</span><br></pre></td></tr></table></figure><p>The <code>CONFIG_ARCH_HAS_SET_MEMORY</code> option is not configurable, so a platform either has this protection built in or does not have it at all. Architectures that do not support it yet may do so in the future.</p><h4 id="Mitigation-Against-Spectre"><a href="#Mitigation-Against-Spectre" class="headerlink" title="Mitigation Against Spectre"></a>Mitigation Against Spectre</h4><p>To mitigate <a href="https://en.wikipedia.org/wiki/Spectre_(security_vulnerability)" target="_blank" rel="external nofollow noopener noreferrer">Spectre v2</a> attacks, the Linux kernel provides the <code>CONFIG_BPF_JIT_ALWAYS_ON</code> option. When it is enabled, the BPF interpreter is removed from the kernel entirely and the JIT compiler is always used:</p><ul><li>In a VM-based setting, the guest kernel can no longer reuse the host kernel's BPF interpreter when mounting an attack, which rules out certain related attacks</li><li>In container-based environments the option is optional, but if the JIT is enabled there anyway, the interpreter may as well be compiled out to reduce kernel complexity</li><li>Enabling it is generally recommended for JITs on mainstream architectures such as <code>x86_64</code> and <code>arm64</code></li></ul><p>Setting <code>/proc/sys/net/core/bpf_jit_harden</code> to <code>1</code> enables additional hardening steps for JIT compilation on behalf of unprivileged users. These steps cost a little performance, but they effectively reduce the potential attack surface when untrusted users operate on the system. The performance loss is still small compared with switching to the interpreter entirely. For the <code>x86_64</code> JIT compiler, indirect jumps for tail calls are implemented with a <code>retpoline</code> if <code>CONFIG_RETPOLINE</code> is set. At the time of writing, this option is enabled on most modern Linux distributions.</p><h4 id="Constant-Blinding"><a href="#Constant-Blinding" class="headerlink" title="Constant Blinding"></a>Constant Blinding</h4><p>Currently, enabling hardening <strong>blinds</strong> all user-provided 32-bit and 64-bit constants in a BPF program during JIT compilation, to defend against <strong>JIT spraying attacks</strong>, which inject native opcodes into the kernel as immediate values. Such attacks work because <strong>immediate values reside in executable kernel memory</strong>: if a kernel bug triggers a jump that lands at the start of an immediate value, the value is executed as native instructions.</p><p>JIT constants are blinded by randomizing the actual instruction. The instruction is rewritten so that an operation on an <strong>immediate-based operand</strong> becomes an operation on a <strong>register-based operand</strong>. The rewrite splits loading the value into two steps:</p><ol><li>Load a blinded immediate <code>rnd ^ imm</code> into a register</li><li>XOR that register with <code>rnd</code></li></ol><p>The original immediate <code>imm</code> then resides in the register and can be used for the real operation. Only the blinding of a load operation is described here; in practice all generic operations are blinded. Below is the JIT output of a program with hardening disabled:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line">$ echo 0 &gt; &#x2F;proc&#x2F;sys&#x2F;net&#x2F;core&#x2F;bpf_jit_harden</span><br><span class="line"></span><br><span class="line">  ffffffffa034f5e9 + &lt;x&gt;:</span><br><span class="line">  [...]</span><br><span class="line">  39:   mov    $0xa8909090,%eax</span><br><span class="line">  3e:   mov    $0xa8909090,%eax</span><br><span class="line">  43:   mov    $0xa8ff3148,%eax</span><br><span class="line">  48:   mov    
$0xa89081b4,%eax</span><br><span class="line">  4d:   mov    $0xa8900bb0,%eax</span><br><span class="line">  52:   mov    $0xa810e0c1,%eax</span><br><span class="line">  57:   mov    $0xa8908eb4,%eax</span><br><span class="line">  5c:   mov    $0xa89020b0,%eax</span><br><span class="line">  [...]</span><br></pre></td></tr></table></figure><p>加固打开之后，以上程序被某个非特权用户通过 BPF 加载的结果（这里已经进行了常量盲化）：</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line">$ echo 1 &gt; &#x2F;proc&#x2F;sys&#x2F;net&#x2F;core&#x2F;bpf_jit_harden</span><br><span class="line"></span><br><span class="line">  ffffffffa034f1e5 + &lt;x&gt;:</span><br><span class="line">  [...]</span><br><span class="line">  39:   mov    $0xe1192563,%r10d</span><br><span class="line">  3f:   xor    $0x4989b5f3,%r10d</span><br><span class="line">  46:   mov    %r10d,%eax</span><br><span class="line">  49:   mov    $0xb8296d93,%r10d</span><br><span class="line">  4f:   xor    $0x10b9fd03,%r10d</span><br><span class="line">  56:   mov    %r10d,%eax</span><br><span class="line">  59:   mov    $0x8c381146,%r10d</span><br><span class="line">  5f:   xor    $0x24c7200e,%r10d</span><br><span class="line">  66:   mov    %r10d,%eax</span><br><span 
class="line">  69:   mov    $0xeb2a830e,%r10d</span><br><span class="line">  6f:   xor    $0x43ba02ba,%r10d</span><br><span class="line">  76:   mov    %r10d,%eax</span><br><span class="line">  79:   mov    $0xd9730af,%r10d</span><br><span class="line">  7f:   xor    $0xa5073b1f,%r10d</span><br><span class="line">  86:   mov    %r10d,%eax</span><br><span class="line">  89:   mov    $0x9a45662b,%r10d</span><br><span class="line">  8f:   xor    $0x325586ea,%r10d</span><br><span class="line">  96:   mov    %r10d,%eax</span><br><span class="line">  [...]</span><br></pre></td></tr></table></figure><p>The two programs are semantically identical, but in the second one none of the original immediate values are visible in the disassembly. Hardening also disables any exposure of JIT kernel symbols (kallsyms), even to privileged users, so JIT image addresses no longer appear in <code>/proc/kallsyms</code>.</p><h3 id="Offloads"><a href="#Offloads" class="headerlink" title="Offloads"></a>Offloads</h3><p>BPF networking programs, in particular tc and XDP programs, have an offload interface to hardware in the kernel, so that BPF programs can be executed directly on the NIC.</p><p>Currently, the <code>nfp</code> driver from Netronome supports offloading BPF through a JIT compiler that translates BPF instructions into the instruction set implemented by the NIC. It also supports offloading BPF maps to the NIC, so offloaded BPF programs can perform map lookups, updates, and deletions.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-04-01_bpf-offload.png"></p><h2 id="eBPF-接口"><a href="#eBPF-接口" class="headerlink" title="eBPF 接口"></a>eBPF Interfaces</h2><h3 id="BPF-系统调用"><a href="#BPF-系统调用" class="headerlink" title="BPF 系统调用"></a>The BPF System Call</h3><p>eBPF provides the <a href="https://man7.org/linux/man-pages/man2/bpf.2.html" target="_blank" rel="external nofollow noopener noreferrer"><code>bpf()</code></a> system call for operating on BPF maps and programs. Its prototype is:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;linux/bpf.h&gt;</span></span></span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">bpf</span><span class="params">(<span 
class="keyword">int</span> cmd, <span class="keyword">union</span> bpf_attr *attr, <span class="keyword">unsigned</span> <span class="keyword">int</span> <span class="built_in">size</span>)</span></span>;</span><br></pre></td></tr></table></figure><p>The function takes three arguments:</p><ul><li><code>cmd</code> specifies the command that the bpf system call performs; each cmd is accompanied by an <code>attr</code> argument</li><li><code>attr</code>, a pointer to a <code>union bpf_attr</code>, passes data between the kernel and user space; its exact format depends on <code>cmd</code></li><li><code>size</code> is the size in bytes of the <code>union bpf_attr</code> object that <code>attr</code> points to</li></ul><p><code>cmd</code> can take the following values, which broadly fall into two groups, operations on eBPF maps and operations on eBPF programs:</p><ul><li><code>BPF_MAP_CREATE</code>: creates an <code>eBPF Map</code> and returns a file descriptor that refers to it</li><li><code>BPF_MAP_LOOKUP_ELEM</code>: looks up an element by key in a given map and returns its value</li><li><code>BPF_MAP_UPDATE_ELEM</code>: creates or updates an element (key/value pair) in a given map</li><li><code>BPF_MAP_DELETE_ELEM</code>: looks up an element by key in a given map and deletes it</li><li><code>BPF_MAP_GET_NEXT_KEY</code>: looks up an element by key in a given map and returns the key of the next element</li><li><code>BPF_PROG_LOAD</code>: verifies and loads an eBPF program, returning a file descriptor associated with it</li><li>…</li></ul><p>The layout of <code>union bpf_attr</code> is shown below; different fields are filled in depending on <code>cmd</code>.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span 
class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">union</span> bpf_attr &#123;</span><br><span class="line">  <span class="class"><span class="keyword">struct</span> &#123;</span>    <span class="comment">/* Used by BPF_MAP_CREATE */</span></span><br><span class="line">    __u32         map_type;</span><br><span class="line">    __u32         key_size;    <span class="comment">/* size of key in bytes */</span></span><br><span class="line">    __u32         value_size;  <span class="comment">/* size of value in bytes */</span></span><br><span class="line">    __u32         max_entries; <span class="comment">/* maximum number of entries in a map */</span></span><br><span class="line">  &#125;;</span><br><span class="line"></span><br><span class="line">  <span class="class"><span class="keyword">struct</span> &#123;</span>    <span class="comment">/* Used by BPF_MAP_*_ELEM and BPF_MAP_GET_NEXT_KEY commands */</span></span><br><span class="line">    __u32         map_fd;</span><br><span class="line">    __aligned_u64 key;</span><br><span class="line">    <span class="keyword">union</span> &#123;</span><br><span class="line">      __aligned_u64 value;</span><br><span class="line">      __aligned_u64 next_key;</span><br><span class="line">    &#125;;</span><br><span class="line">    __u64         flags;</span><br><span class="line">  &#125;;</span><br><span class="line"></span><br><span class="line">  <span class="class"><span class="keyword">struct</span> &#123;</span>    <span class="comment">/* Used by BPF_PROG_LOAD */</span></span><br><span class="line">    __u32         prog_type;</span><br><span class="line">    __u32         insn_cnt;</span><br><span class="line">    __aligned_u64 insns;      <span class="comment">/* 'const struct bpf_insn *' */</span></span><br><span class="line">    __aligned_u64 license;    <span 
class="comment">/* 'const char *' */</span></span><br><span class="line">    __u32         log_level;  <span class="comment">/* verbosity level of verifier */</span></span><br><span class="line">    __u32         log_size;   <span class="comment">/* size of user buffer */</span></span><br><span class="line">    __aligned_u64 log_buf;    <span class="comment">/* user supplied 'char *' buffer */</span></span><br><span class="line">    __u32         kern_version; <span class="comment">/* checked when prog_type=kprobe (since Linux 4.1) */</span></span><br><span class="line">  &#125;;</span><br><span class="line">&#125; __attribute__((aligned(<span class="number">8</span>)));</span><br></pre></td></tr></table></figure><h4 id="使用-eBPF-程序的命令"><a href="#使用-eBPF-程序的命令" class="headerlink" title="使用 eBPF 程序的命令"></a>Commands for eBPF Programs</h4><p>The <code>BPF_PROG_LOAD</code> command verifies and loads an eBPF program. The code below shows how <code>libbpf</code> implements <a href="https://elixir.bootlin.com/linux/v5.4/source/tools/lib/bpf/bpf.c#L316" target="_blank" rel="external nofollow noopener noreferrer"><code>bpf_load_program</code></a>: it fills in a <code>struct bpf_load_program_attr</code>, passes it to <code>bpf_load_program_xattr</code>, and ultimately invokes the <code>bpf</code> system call.</p><figure class="highlight c"><figcaption><span>/tools/lib/bpf/bpf.c</span></figcaption><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span 
class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">bpf_load_program</span><span class="params">(<span class="keyword">enum</span> bpf_prog_type type, <span class="keyword">const</span> struct bpf_insn *insns,</span></span></span><br><span class="line"><span class="function"><span class="params">     <span class="keyword">size_t</span> insns_cnt, <span class="keyword">const</span> <span class="keyword">char</span> *license,</span></span></span><br><span class="line"><span class="function"><span class="params">     __u32 kern_version, <span class="keyword">char</span> *log_buf,</span></span></span><br><span class="line"><span class="function"><span class="params">     <span class="keyword">size_t</span> log_buf_sz)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">bpf_load_program_attr</span> <span class="title">load_attr</span>;</span></span><br><span class="line"></span><br><span class="line"><span class="built_in">memset</span>(&amp;load_attr, <span class="number">0</span>, <span class="keyword">sizeof</span>(struct bpf_load_program_attr));</span><br><span class="line">load_attr.prog_type = type;</span><br><span class="line">load_attr.expected_attach_type = <span class="number">0</span>;</span><br><span 
class="line">load_attr.name = <span class="literal">NULL</span>;</span><br><span class="line">load_attr.insns = insns;</span><br><span class="line">load_attr.insns_cnt = insns_cnt;</span><br><span class="line">load_attr.license = license;</span><br><span class="line">load_attr.kern_version = kern_version;</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> bpf_load_program_xattr(&amp;load_attr, log_buf, log_buf_sz);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">bpf_load_program_xattr</span><span class="params">(<span class="keyword">const</span> struct bpf_load_program_attr *load_attr,</span></span></span><br><span class="line"><span class="function"><span class="params">   <span class="keyword">char</span> *log_buf, <span class="keyword">size_t</span> log_buf_sz)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">  <span class="comment">// ...</span></span><br><span class="line">  fd = sys_bpf_prog_load(&amp;attr, <span class="keyword">sizeof</span>(attr));</span><br><span class="line"><span class="keyword">if</span> (fd &gt;= <span class="number">0</span>)</span><br><span class="line"><span class="keyword">return</span> fd;</span><br><span class="line">  <span class="comment">// ...</span></span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">static</span> <span class="keyword">inline</span> <span class="keyword">int</span> <span class="title">sys_bpf_prog_load</span><span class="params">(<span class="keyword">union</span> bpf_attr *attr, <span class="keyword">unsigned</span> <span class="keyword">int</span> <span class="built_in">size</span>)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line"><span 
class="keyword">int</span> fd;</span><br><span class="line"></span><br><span class="line"><span class="keyword">do</span> &#123;</span><br><span class="line">fd = sys_bpf(BPF_PROG_LOAD, attr, <span class="built_in">size</span>);</span><br><span class="line">&#125; <span class="keyword">while</span> (fd &lt; <span class="number">0</span> &amp;&amp; errno == EAGAIN);</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> fd;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h4 id="使用-eBPF-Map-的命令"><a href="#使用-eBPF-Map-的命令" class="headerlink" title="使用 eBPF Map 的命令"></a>Commands for eBPF Maps</h4><p>As before, the <code>libbpf</code> implementation of <a href="https://elixir.bootlin.com/linux/v5.4/source/tools/lib/bpf/bpf.c#L123" target="_blank" rel="external nofollow noopener noreferrer"><code>bpf_create_map</code></a> shows that it also ends up invoking the bpf system call:</p><figure class="highlight c"><figcaption><span>/tools/lib/bpf/bpf.c</span></figcaption><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span 
class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">bpf_create_map</span><span class="params">(<span class="keyword">enum</span> bpf_map_type map_type, <span class="keyword">int</span> key_size,</span></span></span><br><span class="line"><span class="function"><span class="params">   <span class="keyword">int</span> value_size, <span class="keyword">int</span> max_entries, __u32 map_flags)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">bpf_create_map_attr</span> <span class="title">map_attr</span> = &#123;</span>&#125;;</span><br><span class="line"></span><br><span class="line">map_attr.map_type = map_type;</span><br><span class="line">map_attr.map_flags = map_flags;</span><br><span class="line">map_attr.key_size = key_size;</span><br><span class="line">map_attr.value_size = value_size;</span><br><span class="line">map_attr.max_entries = max_entries;</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> bpf_create_map_xattr(&amp;map_attr);</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">bpf_create_map_xattr</span><span class="params">(<span class="keyword">const</span> struct bpf_create_map_attr *create_attr)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line"><span class="keyword">union</span> bpf_attr attr;</span><br><span class="line"></span><br><span class="line"><span class="built_in">memset</span>(&amp;attr, <span 
class="string">'\0'</span>, <span class="keyword">sizeof</span>(attr));</span><br><span class="line"></span><br><span class="line">attr.map_type = create_attr-&gt;map_type;</span><br><span class="line">attr.key_size = create_attr-&gt;key_size;</span><br><span class="line">attr.value_size = create_attr-&gt;value_size;</span><br><span class="line">attr.max_entries = create_attr-&gt;max_entries;</span><br><span class="line">attr.map_flags = create_attr-&gt;map_flags;</span><br><span class="line"><span class="keyword">if</span> (create_attr-&gt;name)</span><br><span class="line"><span class="built_in">memcpy</span>(attr.map_name, create_attr-&gt;name,</span><br><span class="line">       <span class="built_in">min</span>(<span class="built_in">strlen</span>(create_attr-&gt;name), BPF_OBJ_NAME_LEN - <span class="number">1</span>));</span><br><span class="line">attr.numa_node = create_attr-&gt;numa_node;</span><br><span class="line">attr.btf_fd = create_attr-&gt;btf_fd;</span><br><span class="line">attr.btf_key_type_id = create_attr-&gt;btf_key_type_id;</span><br><span class="line">attr.btf_value_type_id = create_attr-&gt;btf_value_type_id;</span><br><span class="line">attr.map_ifindex = create_attr-&gt;map_ifindex;</span><br><span class="line">attr.inner_map_fd = create_attr-&gt;inner_map_fd;</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> sys_bpf(BPF_MAP_CREATE, &amp;attr, <span class="keyword">sizeof</span>(attr));</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>The <code>libbpf</code> implementation of <a href="https://elixir.bootlin.com/linux/v5.4/source/tools/lib/bpf/bpf.c#L371" target="_blank" rel="external nofollow noopener noreferrer"><code>bpf_map_lookup_elem</code></a>:</p><figure class="highlight c"><figcaption><span>/tools/lib/bpf/bpf.c</span></figcaption><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span 
class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">bpf_map_lookup_elem</span><span class="params">(<span class="keyword">int</span> fd, <span class="keyword">const</span> <span class="keyword">void</span> *key, <span class="keyword">void</span> *value)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line"><span class="keyword">union</span> bpf_attr attr;</span><br><span class="line"></span><br><span class="line"><span class="built_in">memset</span>(&amp;attr, <span class="number">0</span>, <span class="keyword">sizeof</span>(attr));</span><br><span class="line">attr.map_fd = fd;</span><br><span class="line">attr.key = ptr_to_u64(key);</span><br><span class="line">attr.value = ptr_to_u64(value);</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> sys_bpf(BPF_MAP_LOOKUP_ELEM, &amp;attr, <span class="keyword">sizeof</span>(attr));</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>The <code>libbpf</code> implementation of <a href="https://elixir.bootlin.com/linux/v5.4/source/tools/lib/bpf/bpf.c#L357" target="_blank" rel="external nofollow noopener noreferrer"><code>bpf_map_update_elem</code></a>:</p><figure class="highlight c"><figcaption><span>/tools/lib/bpf/bpf.c</span></figcaption><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span 
class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">bpf_map_update_elem</span><span class="params">(<span class="keyword">int</span> fd, <span class="keyword">const</span> <span class="keyword">void</span> *key, <span class="keyword">const</span> <span class="keyword">void</span> *value,</span></span></span><br><span class="line"><span class="function"><span class="params">__u64 flags)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line"><span class="keyword">union</span> bpf_attr attr;</span><br><span class="line"></span><br><span class="line"><span class="built_in">memset</span>(&amp;attr, <span class="number">0</span>, <span class="keyword">sizeof</span>(attr));</span><br><span class="line">attr.map_fd = fd;</span><br><span class="line">attr.key = ptr_to_u64(key);</span><br><span class="line">attr.value = ptr_to_u64(value);</span><br><span class="line">attr.flags = flags;</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> sys_bpf(BPF_MAP_UPDATE_ELEM, &amp;attr, <span class="keyword">sizeof</span>(attr));</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>The <code>libbpf</code> implementation of <a href="https://elixir.bootlin.com/linux/v5.4/source/tools/lib/bpf/bpf.c#L408" target="_blank" rel="external nofollow noopener noreferrer"><code>bpf_map_delete_elem</code></a>:</p><figure class="highlight c"><figcaption><span>/tools/lib/bpf/bpf.c</span></figcaption><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span 
class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">bpf_map_delete_elem</span><span class="params">(<span class="keyword">int</span> fd, <span class="keyword">const</span> <span class="keyword">void</span> *key)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line"><span class="keyword">union</span> bpf_attr attr;</span><br><span class="line"></span><br><span class="line"><span class="built_in">memset</span>(&amp;attr, <span class="number">0</span>, <span class="keyword">sizeof</span>(attr));</span><br><span class="line">attr.map_fd = fd;</span><br><span class="line">attr.key = ptr_to_u64(key);</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> sys_bpf(BPF_MAP_DELETE_ELEM, &amp;attr, <span class="keyword">sizeof</span>(attr));</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>The <code>libbpf</code> implementation of <a href="https://elixir.bootlin.com/linux/v5.4/source/tools/lib/bpf/bpf.c#L419" target="_blank" rel="external nofollow noopener noreferrer"><code>bpf_map_get_next_key</code></a>:</p><figure class="highlight c"><figcaption><span>/tools/lib/bpf/bpf.c</span></figcaption><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">bpf_map_get_next_key</span><span class="params">(<span class="keyword">int</span> fd, <span class="keyword">const</span> <span class="keyword">void</span> *key, <span 
class="keyword">void</span> *next_key)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line"><span class="keyword">union</span> bpf_attr attr;</span><br><span class="line"></span><br><span class="line"><span class="built_in">memset</span>(&amp;attr, <span class="number">0</span>, <span class="keyword">sizeof</span>(attr));</span><br><span class="line">attr.map_fd = fd;</span><br><span class="line">attr.key = ptr_to_u64(key);</span><br><span class="line">attr.next_key = ptr_to_u64(next_key);</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> sys_bpf(BPF_MAP_GET_NEXT_KEY, &amp;attr, <span class="keyword">sizeof</span>(attr));</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Note that these <code>libbpf</code> functions are not the same as the <code>helper functions</code> mentioned earlier. The <code>helper functions</code> currently supported by Linux are listed in <a href="https://man7.org/linux/man-pages/man7/bpf-helpers.7.html" target="_blank" rel="external nofollow noopener noreferrer">Linux Manual Page: bpf-helpers</a>. Taking <code>bpf_map_lookup_elem</code> as an example, an eBPF program calls the <code>helper function</code> of that name, whose parameters are as follows:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">msg</span> &#123;</span></span><br><span class="line">__s32 seq;</span><br><span class="line">__u64 cts;</span><br><span class="line">__u8 comm[MAX_LENGTH];</span><br><span 
class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span class="function">struct bpf_map_def <span class="title">SEC</span><span class="params">(<span class="string">"maps"</span>)</span> <span class="built_in">map</span> </span>= &#123;</span><br><span class="line">.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,</span><br><span class="line">.key_size = <span class="keyword">sizeof</span>(<span class="keyword">int</span>),</span><br><span class="line">.value_size = <span class="keyword">sizeof</span>(__u32),</span><br><span class="line">.max_entries = <span class="number">0</span>,</span><br><span class="line">&#125;;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">void</span> *<span class="title">bpf_map_lookup_elem</span><span class="params">(struct bpf_map *<span class="built_in">map</span>, <span class="keyword">const</span> <span class="keyword">void</span> *key)</span></span></span><br></pre></td></tr></table></figure><p>这里的第一个参数来自于 <code>SEC(&quot;.maps&quot;)</code> 语法糖创建的 <code>bpf_map</code>。</p><p>对于用户态程序，则其函数原型如下，其中通过 fd 来访问 eBPF map。</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">bpf_map_lookup_elem</span><span class="params">(<span class="keyword">int</span> fd, <span class="keyword">const</span> <span class="keyword">void</span> *key, <span class="keyword">void</span> *value)</span></span></span><br></pre></td></tr></table></figure><h3 id="BPF-程序类型"><a href="#BPF-程序类型" class="headerlink" title="BPF 程序类型"></a>BPF 程序类型</h3><p>函数<code>BPF_PROG_LOAD</code>加载的程序类型规定了四件事：</p><ul><li>程序可以附加在哪里</li><li>验证器允许调用内核中的哪些帮助函数</li><li>网络包的数据是否可以直接访问</li><li>作为第一个参数传递给程序的对象类型</li></ul><p>实际上，程序类型本质上定义了一个API。甚至还创建了新的程序类型，以区分允许调用的不同的函数列表（比如<code>BPF_PROG_TYPE_CGROUP_SKB</code> 对比 
<code>BPF_PROG_TYPE_SOCKET_FILTER</code>）。</p><p>bpf 程序会被hook到内核不同的hook点上。不同的hook点的入口参数，能力有所不同。因而定义了不同的 prog type。不同的prog type 的bpf程序能够调用的kernel function 集合也不一样。当bpf 程序加载到内核时，内核的verifier程序会根据bpf prog type，检查程序的入口参数，调用了哪些 helper function。</p><p>目前内核支持的eBPF程序类型列表如下所示：</p><ul><li><code>BPF_PROG_TYPE_SOCKET_FILTER</code>：一种网络数据包过滤器</li><li><code>BPF_PROG_TYPE_KPROBE</code>：确定kprobe是否应该触发</li><li><code>BPF_PROG_TYPE_SCHED_CLS</code>：一种网络流量控制分类器</li><li><code>BPF_PROG_TYPE_SCHED_ACT</code>：一种网络流量控制动作</li><li><code>BPF_PROG_TYPE_TRACEPOINT</code>：确定 tracepoint是否应该触发</li><li><code>BPF_PROG_TYPE_XDP</code>：从设备驱动程序接收路径运行的网络数据包过滤器</li><li><code>BPF_PROG_TYPE_PERF_EVENT</code>：确定是否应该触发perf事件处理程序</li><li><code>BPF_PROG_TYPE_CGROUP_SKB</code>：一种用于控制组的网络数据包过滤器</li><li><code>BPF_PROG_TYPE_CGROUP_SOCK</code>：一种由于控制组的网络包筛选器，它被允许修改套接字选项</li><li><code>BPF_PROG_TYPE_LWT_*</code>：用于轻量级隧道的网络数据包过滤器</li><li><code>BPF_PROG_TYPE_SOCK_OPS</code>：一个用于设置套接字参数的程序</li><li><code>BPF_PROG_TYPE_SK_SKB</code>：一个用于套接字之间转发数据包的网络包过滤器</li><li><code>BPF_PROG_CGROUP_DEVICE</code>：确定是否允许设备操作</li></ul><p>随着新程序类型的添加，内核开发人员同时发现也需要添加新的数据结构。</p><p>举个例子BPF_PROG_TYPE_SCHED_CLS bpf prog ， 能够访问哪些bpf helper function呢？让我们来看看源代码是如何实现的。</p><p>每一种prog type 会定义一个 <code>struct bpf_verifier_ops</code> 结构体。当 prog load 到内核时，内核会根据它的 type，调用相应结构体的get_func_proto 函数。</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">const</span> <span class="class"><span class="keyword">struct</span> <span class="title">bpf_verifier_ops</span> <span class="title">tc_cls_act_verifier_ops</span> = &#123;</span></span><br><span class="line">        .get_func_proto         = tc_cls_act_func_proto,</span><br><span class="line">.convert_ctx_access     = tc_cls_act_convert_ctx_access,</span><br><span 
class="line">&#125;;</span><br></pre></td></tr></table></figure><p>对于 BPF_PROG_TYPE_SCHED_CLS 类型的 BPF 代码，verifier 会调用 <code>tc_cls_act_func_proto</code> ，以检查程序调用的helper function 是否都是合法的。</p><h3 id="BPF-代码调用时机"><a href="#BPF-代码调用时机" class="headerlink" title="BPF 代码调用时机"></a>BPF 代码调用时机</h3><p>每一种 prog type 的调用时机都不同。</p><h4 id="BPF-PROG-TYPE-SCHED-CLS"><a href="#BPF-PROG-TYPE-SCHED-CLS" class="headerlink" title="BPF_PROG_TYPE_SCHED_CLS"></a>BPF_PROG_TYPE_SCHED_CLS</h4><p>BPF_PROG_TYPE_SCHED_CLS 的调用过程如下。</p><h5 id="Egress-方向"><a href="#Egress-方向" class="headerlink" title="Egress 方向"></a>Egress 方向</h5><p>egress 方向上，tcp/ip 协议栈运行之后，有一个hook点。这个hook点可以attach BPF_PROG_TYPE_SCHED_CLS type 的 egress 方向的bpf prog。 在这段bpf 代码执行之后，才会运行qos，tcpdump, xmit 到网卡driver的代码。在这段 bpf 代码中你可以修改报文里面的内容，地址等。修改之后，通过 tcpdump可以看到，因为tcpdump代码在此之后才执行。</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">static</span> <span class="keyword">int</span> __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)</span><br><span class="line"></span><br><span class="line">&#123;</span><br><span class="line">skb = sch_handle_egress(skb, &amp;rc, dev);</span><br><span class="line"><span class="comment">// enqueue tc qos</span></span><br><span class="line"><span class="comment">// dequeue tc qos</span></span><br><span class="line"><span class="comment">// dev_hard_start_xmit</span></span><br><span class="line"><span class="comment">// tcpdump works here! 
dev_queue_xmit_nit</span></span><br><span class="line"><span class="comment">// nic driver-&gt;ndo_start_xmit </span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h5 id="Ingress-方向"><a href="#Ingress-方向" class="headerlink" title="Ingress 方向"></a>Ingress direction</h5><p>On ingress there is a hook point after the tcpdump tap and before the packet is delivered to the TCP/IP stack. An ingress BPF program of type BPF_PROG_TYPE_SCHED_CLS can be attached to it. You can modify the packet here as well, but the modified result is not visible in tcpdump.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">static</span> <span class="keyword">int</span> __netif_receive_skb_core(struct sk_buff **pskb, <span class="keyword">bool</span> pfmemalloc,</span><br><span class="line">                                    struct packet_type **ppt_prev)</span><br><span class="line">&#123;</span><br><span class="line"><span class="comment">// generic xdp bpf hook</span></span><br><span class="line"><span class="comment">// tcpdump </span></span><br><span class="line"><span class="comment">// tc ingress hook</span></span><br><span class="line">skb = sch_handle_ingress(skb, &amp;pt_prev, &amp;ret, orig_dev, &amp;another);</span><br><span class="line"><span class="comment">// deliver to tcp/ip stack or bridge/ipvlan device</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h5 id="执行入口-cls-bpf-classify"><a href="#执行入口-cls-bpf-classify" class="headerlink" title="执行入口 cls_bpf_classify"></a>Entry point: cls_bpf_classify</h5><p>In both the egress and the ingress direction, the real entry point for executing the BPF instructions is cls_bpf_classify. It walks the linked list of BPF programs in tcf_proto and runs BPF_PROG_RUN(prog-&gt;filter, skb) on each one.</p><figure class="highlight 
c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">static</span> <span class="keyword">int</span> <span class="title">cls_bpf_classify</span><span class="params">(struct sk_buff *skb, <span class="keyword">const</span> struct tcf_proto *tp,</span></span></span><br><span class="line"><span class="function"><span class="params">                            struct tcf_result *res)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">cls_bpf_head</span> *<span class="title">head</span> = <span class="title">rcu_dereference_bh</span>(<span class="title">tp</span>-&gt;<span class="title">root</span>);</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">cls_bpf_prog</span> *<span class="title">prog</span>;</span></span><br><span class="line"></span><br><span class="line">list_for_each_entry_rcu(prog, &amp;head-&gt;plist, link) &#123;</span><br><span class="line">                <span class="keyword">int</span> filter_res;</span><br><span class="line"><span class="keyword">if</span> (tc_skip_sw(prog-&gt;gen_flags)) &#123;</span><br><span class="line">   
                     filter_res = prog-&gt;exts_integrated ? TC_ACT_UNSPEC : <span class="number">0</span>;</span><br><span class="line">                &#125; <span class="keyword">else</span> <span class="keyword">if</span> (at_ingress) &#123;</span><br><span class="line">                        <span class="comment">/* It is safe to push/pull even if skb_shared() */</span></span><br><span class="line">                        __skb_push(skb, skb-&gt;mac_len);</span><br><span class="line">                        bpf_compute_data_pointers(skb);</span><br><span class="line">                        filter_res = BPF_PROG_RUN(prog-&gt;filter, skb);</span><br><span class="line">                        __skb_pull(skb, skb-&gt;mac_len);</span><br><span class="line">                &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">                        bpf_compute_data_pointers(skb);</span><br><span class="line">                        filter_res = BPF_PROG_RUN(prog-&gt;filter, skb);</span><br><span class="line">                &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>BPF_PROG_RUN executes the JIT-compiled BPF instructions; if the kernel has no JIT support, it falls back to the interpreter, which executes the BPF bytecode.</p><p>The argument BPF_PROG_RUN passes to the BPF program is skb, of type <code>struct sk_buff</code>, defined in include/linux/skbuff.h.</p><p>For safety, however, BPF code cannot access <code>sk_buff</code> directly. It reaches struct sk_buff through <code>struct __sk_buff</code>. <code>__sk_buff</code> is a subset of <code>sk_buff</code> and serves as the interface sk_buff exposes to BPF programs. Accesses to <code>__sk_buff</code> in BPF code are translated by the verifier into accesses to the corresponding fields of sk_buff.</p><p>When the BPF program is loaded, the verifier calls the tc_cls_act_convert_ctx_access hook from the <code>tc_cls_act_verifier_ops</code> structure shown above. That hook rewrites the eBPF instructions so that accesses to <code>__sk_buff</code> become accesses to <code>struct sk_buff</code>.</p><h3 id="BPF-Attach-type"><a href="#BPF-Attach-type" class="headerlink" title="BPF Attach type"></a>BPF Attach type</h3><p>A BPF program of a given type can be attached at several different hook points in the kernel; these hook points are the different attach types.</p><p>The mapping between the two is defined in <a 
href="https://elixir.bootlin.com/linux/v5.17-rc8/source/kernel/bpf/syscall.c#L3137" target="_blank" rel="external nofollow noopener noreferrer">the following function</a>:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line">attach_type_to_prog_type(<span class="keyword">enum</span> bpf_attach_type attach_type)</span><br><span class="line">&#123;</span><br><span class="line">        <span class="keyword">switch</span> (attach_type) &#123;</span><br><span class="line">        <span class="keyword">case</span> BPF_CGROUP_INET_INGRESS:</span><br><span class="line">        <span class="keyword">case</span> BPF_CGROUP_INET_EGRESS:</span><br><span class="line">                <span class="keyword">return</span> BPF_PROG_TYPE_CGROUP_SKB;</span><br><span class="line">        <span class="keyword">case</span> BPF_CGROUP_INET_SOCK_CREATE:</span><br><span class="line">        <span class="keyword">case</span> BPF_CGROUP_INET_SOCK_RELEASE:</span><br><span class="line">        <span class="keyword">case</span> BPF_CGROUP_INET4_POST_BIND:</span><br><span class="line">        <span class="keyword">case</span> BPF_CGROUP_INET6_POST_BIND:</span><br><span class="line">                <span class="keyword">return</span> BPF_PROG_TYPE_CGROUP_SOCK;</span><br><span class="line">  .....</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>When a BPF program is attached to a specific hook point through the bpf() system call, the attach type has to be given in the call's arguments.</p><p>Interestingly, in the kernel version linked above a BPF_PROG_TYPE_SCHED_CLS program cannot be attached through the bpf() system call, because no attach type is defined for it. Attaching it
has to be done separately through the netlink interface, which is considerably more complicated.</p><h3 id="常用-prog-type-介绍"><a href="#常用-prog-type-介绍" class="headerlink" title="常用 prog type 介绍"></a>Commonly used prog types</h3><p>The kernel currently has about 30 prog types, each able to do somewhat different things. This section covers only the ones I have used in my day-to-day work.</p><p>The best way to understand a prog type is to:</p><ul><li>look it up in attach_type_to_prog_type to find its attach types,</li><li>search the kernel source for the places where those attach types are invoked,</li><li>and finally look at how its entry arguments and return value are handled; that is usually enough to understand what it does.</li></ul><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">include/uapi/linux/bpf.h</span><br><span class="line"></span><br><span class="line"><span class="keyword">enum</span> bpf_prog_type &#123;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h4 id="BPF-PROG-TYPE-SOCKET-FILTER"><a href="#BPF-PROG-TYPE-SOCKET-FILTER" class="headerlink" title="BPF_PROG_TYPE_SOCKET_FILTER"></a>BPF_PROG_TYPE_SOCKET_FILTER</h4><p>This was the first program type added to the kernel. When you attach a BPF program to a socket, you get access to every packet processed by that socket. Socket filtering does not let you modify the packets or change their destination; it only lets you observe them. Within the program you can read metadata such as the protocol type.</p><p>Taking TCP as an example, the call path is tcp_v4_rcv-&gt;tcp_filter-&gt;sk_filter_trim_cap, where the program filters or trims the packet. UDP and ICMP contain equivalent calls.</p><h4 id="BPF-PROG-TYPE-SOCK-OPS"><a href="#BPF-PROG-TYPE-SOCK-OPS" class="headerlink" title="BPF_PROG_TYPE_SOCK_OPS"></a>BPF_PROG_TYPE_SOCK_OPS</h4><p>A BPF hook invoked when TCP protocol events occur; 15 events are defined. The attach type for all of these events is BPF_CGROUP_SOCK_OPS. Each call site passes in a different enum value, for example:</p><ul><li>BPF_SOCK_OPS_TCP_CONNECT_CB is invoked right before an active TCP connect is initiated;</li><li>BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB is invoked when an actively opened connection becomes established (the passive counterpart is BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB).</li></ul><p>Main uses: TCP tuning, event statistics, and so on.</p><p>BPF_PROG_TYPE_SOCK_OPS
programs let you modify socket connection options as a packet moves through the various stages of the kernel networking stack. They are attached to cgroups, much like BPF_PROG_TYPE_CGROUP_SOCK and BPF_PROG_TYPE_CGROUP_SKB, but unlike those they can be invoked many times over the lifetime of a connection. Your BPF program receives an op argument that identifies the operation the kernel is about to perform on the socket connection, so you know at which point in the connection's lifecycle the program is being called. You can also read the IP addresses, ports, and so on, and you can modify the connection options to set timeouts and change the round-trip delay time for packets.</p><p>As an example, Facebook uses this to set a short retransmission timeout (RTO) for connections within the same datacenter. The RTO is how long TCP waits for an acknowledgement before retransmitting a segment. Round-trip times inside a single datacenter are very small, so Facebook considers the default timeout far too long there and shortens it with a BPF program.</p><h4 id="BPF-PROG-TYPE-CGROUP-SOCK-ADDR"><a href="#BPF-PROG-TYPE-CGROUP-SOCK-ADDR" class="headerlink" title="BPF_PROG_TYPE_CGROUP_SOCK_ADDR"></a>BPF_PROG_TYPE_CGROUP_SOCK_ADDR</h4><p>This type corresponds to many attach types. It is typically invoked on bind and connect, and receives the socket address.</p><p>Main use: for example, Cilium's ClusterIP implementation rewrites the destination IP address during an active connect using exactly this hook.</p><p>BPF_PROG_TYPE_CGROUP_SOCK_ADDR programs let you manipulate the IP addresses and port numbers used by user-space programs controlled by a specific cgroup. This is useful when a system has several IP addresses and you want a specific set of user-space programs to use the same IP address and port. Once those programs are placed in the same cgroup, the BPF program gives you the flexibility to manipulate their bindings, which ensures that all incoming and outgoing connections of those applications use the IP and port the BPF program provides.</p><h4 id="BPF-PROG-TYPE-SK-MSG"><a href="#BPF-PROG-TYPE-SK-MSG" class="headerlink" title="BPF_PROG_TYPE_SK_MSG"></a>BPF_PROG_TYPE_SK_MSG</h4><p>BPF_PROG_TYPE_SK_MSG programs let you control whether a message sent to a socket should be delivered. When the kernel creates a socket, it is stored in the socket map mentioned earlier. Once you attach a program to that socket map, every message sent to those sockets is filtered. The kernel copies the data before filtering, so you can read the message and return your verdict, for example SK_PASS or SK_DROP.</p><h4 id="BPF-PROG-TYPE-SK-SKB"><a href="#BPF-PROG-TYPE-SK-SKB" class="headerlink" title="BPF_PROG_TYPE_SK_SKB"></a>BPF_PROG_TYPE_SK_SKB</h4><p>Invocation point: runs on packets received by a socket stored in a socket map (the TCP sendmsg path is handled by BPF_PROG_TYPE_SK_MSG).</p><p>Main use: socket redirection.</p><p>BPF_PROG_TYPE_SK_SKB programs give you access to socket maps and socket redirects. A socket map lets you keep references to a number of sockets. With those references you can use the related helpers to redirect an incoming packet from one socket to another, which is very useful when doing load balancing with BPF: packets can be forwarded between sockets without leaving kernel space. Cilium and Facebook's Katran are cited as making extensive use of this program type for traffic control.</p><h4 id="BPF-PROG-TYPE-CGROUP-SOCKOPT"><a href="#BPF-PROG-TYPE-CGROUP-SOCKOPT" class="headerlink" 
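title="BPF_PROG_TYPE_CGROUP_SOCKOPT"></a>BPF_PROG_TYPE_CGROUP_SOCKOPT</h4><p>Invocation points: getsockopt, setsockopt</p><h4 id="BPF-PROG-TYPE-KPROBE"><a href="#BPF-PROG-TYPE-KPROBE" class="headerlink" title="BPF_PROG_TYPE_KPROBE"></a>BPF_PROG_TYPE_KPROBE</h4><p>Similar to ftrace kprobes: a hook at function entry or return, used mainly for debugging.</p><h4 id="BPF-PROG-TYPE-TRACEPOINT"><a href="#BPF-PROG-TYPE-TRACEPOINT" class="headerlink" title="BPF_PROG_TYPE_TRACEPOINT"></a>BPF_PROG_TYPE_TRACEPOINT</h4><p>Similar to ftrace tracepoints.</p><h4 id="BPF-PROG-TYPE-SCHED-CLS-1"><a href="#BPF-PROG-TYPE-SCHED-CLS-1" class="headerlink" title="BPF_PROG_TYPE_SCHED_CLS"></a>BPF_PROG_TYPE_SCHED_CLS</h4><p>See the example above.</p><h4 id="BPF-PROG-TYPE-XDP"><a href="#BPF-PROG-TYPE-XDP" class="headerlink" title="BPF_PROG_TYPE_XDP"></a>BPF_PROG_TYPE_XDP</h4><p>A hook point reached when the NIC driver receives a packet, before the sk_buff data structure has been created.</p><p>BPF_PROG_TYPE_XDP lets your BPF program run very early, as soon as a network packet arrives at the kernel. Such a program sees only a limited amount of information, because the kernel has not yet had time to process the packet. Because it runs so early, however, it gives you a much higher degree of control over how the packet is handled.</p><p>XDP defines several actions, for example:</p><ul><li>XDP_PASS means the packet is handed on to the next subsystem in the kernel</li><li>XDP_DROP means the kernel should drop the packet</li><li>XDP_TX means the packet is sent back out of the same network interface card (NIC) that received it</li></ul><h4 id="BPF-PROG-TYPE-CGROUP-SKB"><a href="#BPF-PROG-TYPE-CGROUP-SKB" class="headerlink" title="BPF_PROG_TYPE_CGROUP_SKB"></a>BPF_PROG_TYPE_CGROUP_SKB</h4><p>BPF_PROG_TYPE_CGROUP_SKB lets you filter the network traffic of an entire cgroup. With this program type you can apply controls before traffic reaches the programs in the cgroup: any packet the kernel tries to deliver to any process in the cgroup passes through one of these filters. Likewise, you can decide what happens when a process in the cgroup sends packets out. It is very similar to BPF_PROG_TYPE_SOCKET_FILTER; the biggest difference is that a cgroup_skb program is attached to every process in the cgroup rather than to one specific process. This makes it very useful in container environments.</p><ul><li>On ingress, when TCP receives a packet (tcp_v4_rcv), this BPF program is called to filter it. </li><li>On egress, when IP sends a packet (ip_finish_output), it is called to decide whether to drop the packet. The input argument is the skb.</li></ul><h4 id="BPF-PROG-TYPE-CGROUP-SOCK"><a href="#BPF-PROG-TYPE-CGROUP-SOCK" class="headerlink" title="BPF_PROG_TYPE_CGROUP_SOCK"></a>BPF_PROG_TYPE_CGROUP_SOCK</h4><p>Invoked on sock create, release, and post_bind. It is mainly used for permission checks.</p><p>BPF_PROG_TYPE_CGROUP_SOCK programs let you run BPF code whenever any process in a cgroup opens a
socket. The behaviour is similar to CGROUP_SKB, but the hook fires when a process in the cgroup opens a new socket, rather than giving you control over individual packets as they pass through. This is useful for providing security and access control over groups of programs that can open sockets, without having to restrict the capabilities of each process individually.</p><h2 id="eBPF-工具链"><a href="#eBPF-工具链" class="headerlink" title="eBPF 工具链"></a>eBPF toolchain</h2><h3 id="bcc"><a href="#bcc" class="headerlink" title="bcc"></a>bcc</h3><p>BCC (BPF Compiler Collection) is a toolkit for building BPF programs. Its front end offers Python and Lua APIs, while BCC itself is implemented in C/C++ and embeds LLVM/Clang to rewrite, compile, and load BPF programs, exposing friendlier functions to the user.</p><p>Although BCC goes to great lengths to simplify the work of BPF developers, its "black magic" (using the Clang front end to modify the BPF program the user wrote) makes it hard to locate and fix problems when something goes wrong. You also have to remember naming conventions and auto-generated tracepoint structures. And because libbcc embeds the large LLVM/Clang libraries, using it brings several further problems:</p><ol><li>Every tool consumes considerable CPU and memory at startup to compile its BPF program, which can cause trouble on servers that are already short of resources;</li><li>It depends on the kernel headers package, which must be installed on every target host. Even then, if you need something the kernel does not export, you have to copy and paste the type definitions into the BPF code by hand;</li><li>Because the BPF program is compiled only at run time, many simple compilation errors are detected only at run time, which hurts the development experience.</li></ol><p>With the arrival of BPF CO-RE, we can develop BPF programs directly with the libbpf library provided by the kernel developers, in the same way as an ordinary user-space C program: compile once and produce a small binary. As the BPF program loader, libbpf takes over relocation, loading, verification, and so on, so BPF developers only need to care about the correctness and performance of the BPF program. This approach keeps overhead to a minimum and removes the heavy dependencies, making the whole development workflow smoother.</p><p>Performance expert Brendan Gregg published a comparison after converting a BCC tool to libbpf + BPF CO-RE:</p><blockquote><p>As my colleague Jason pointed out, the memory footprint of opensnoop as CO-RE is much lower than opensnoop.py. 9 Mbytes for CO-RE vs 80 Mbytes for Python.</p></blockquote><p>At run time, the libbpf + BPF CO-RE version uses roughly one ninth of the memory of the BCC version (9 MB versus 80 MB), which is much friendlier to servers whose physical memory is already tight.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-31_ebpf-bcc.png"></p><p>For more on BCC, see <a href="https://houmin.cc/posts/6a8748a1/">my article introducing it</a>.</p><h3 id="bpftrace"><a href="#bpftrace" class="headerlink" title="bpftrace"></a>bpftrace</h3><blockquote><p>bpftrace is a high-level tracing language for Linux eBPF and available in recent Linux kernels (4.x). 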
bpftrace uses LLVM as a backend to compile scripts to eBPF bytecode and makes use of BCC for interacting with the Linux eBPF subsystem as well as existing Linux tracing capabilities: kernel dynamic tracing (kprobes), user-level dynamic tracing (uprobes), and tracepoints. The bpftrace language is inspired by awk, C and predecessor tracers such as DTrace and SystemTap.</p></blockquote><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-31_ebpf-bpftrace.png"></p><h3 id="eBPF-Go-Library"><a href="#eBPF-Go-Library" class="headerlink" title="eBPF Go Library"></a>eBPF Go Library</h3><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-21_ebpf-go-library.png"></p><h3 id="libbpf"><a href="#libbpf" class="headerlink" title="libbpf"></a>libbpf</h3><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-31_ebpf-c-library.png"></p><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>References</h2><ul><li><a href="http://www.tcpdump.org/papers/bpf-usenix93.pdf" target="_blank" rel="external nofollow noopener noreferrer">The BSD Packet Filter: A New Architecture for User-level Packet Capture, Steven McCanne and Van Jacobson, December 19, 1992</a></li><li><a href="https://ebpf.io/what-is-ebpf/" target="_blank" rel="external nofollow noopener noreferrer">eBPF Documentation: What is eBPF?</a></li><li><a href="https://lwn.net/Articles/740157/" target="_blank" rel="external nofollow noopener noreferrer">LWN: A thorough introduction to eBPF</a></li><li><a href="https://docs.cilium.io/en/stable/bpf/" target="_blank" rel="external nofollow noopener noreferrer">Cilium Documentation: BPF and XDP Reference Guide</a></li><li><a href="https://www.youtube.com/watch?v=slBAYUDABDA" target="_blank" rel="external nofollow noopener noreferrer">eBPF summit: The Future of eBPF based Networking and Security</a></li><li><a href="https://cilium.io/blog/2020/11/10/ebpf-future-of-networking/" 
target="_blank" rel="external nofollow noopener noreferrer">eBPF - The Future of Networking &amp; Security</a></li><li><a href="https://www.youtube.com/watch?v=f-oTe-dmfyI" target="_blank" rel="external nofollow noopener noreferrer">eBPF - Rethinking the Linux Kernel</a></li><li><a href="https://man7.org/linux/man-pages/man2/bpf.2.html" target="_blank" rel="external nofollow noopener noreferrer">Linux Manual Page: bpf(2)</a></li><li><a href="https://man7.org/linux/man-pages/man7/bpf-helpers.7.html" target="_blank" rel="external nofollow noopener noreferrer">Linux Manual Page: bpf-helpers</a></li><li><a href="https://www.kernel.org/doc/Documentation/networking/filter.txt" target="_blank" rel="external nofollow noopener noreferrer">Linux Kernel Documentation: Linux Socket Filtering aka Berkeley Packet Filter (BPF)</a></li><li><a href="https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into-bpf/" target="_blank" rel="external nofollow noopener noreferrer">Dive into BPF: a list of reading material</a></li><li><a href="https://lwn.net/Kernel/Index/#Berkeley_Packet_Filter" target="_blank" rel="external nofollow noopener noreferrer">LWN: eBPF materials</a></li><li><a href="https://www.ebpf.top/post/ebpf_c_env/" target="_blank" rel="external nofollow noopener noreferrer">基于 Ubuntu 20.04 的 eBPF 环境搭建</a></li></ul>]]></content>
    
    <summary type="html">
    
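      &lt;p&gt;eBPF grew out of &lt;a href=&quot;https://en.wikipedia.org/wiki/Berkeley_Packet_Filter&quot; target=&quot;_blank&quot; rel=&quot;external nofollow noopener noreferrer&quot;&gt;BPF&lt;/a&gt;. In essence it is an efficient and flexible virtual-machine-like component inside the kernel that executes bytecode safely at many kernel hook points. BPF was originally designed for efficient network packet filtering; after being redesigned, eBPF is no longer confined to the network stack and has become a top-level kernel subsystem, evolving into a general-purpose execution engine. Developers can build performance analysis tools, software-defined networking, security tooling, and more on top of eBPF. This article covers the history of eBPF and sets up an eBPF environment for hands-on development; all the code in the article can be found on my &lt;a href=&quot;https://github.com/&quot; target=&quot;_blank&quot; rel=&quot;external nofollow noopener noreferrer&quot;&gt;Github&lt;/a&gt;.&lt;/p&gt;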
    
    </summary>
    
    <content src="https://houmin.cc/https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-31-ebpf.png" type="image" />
    
    
      <category term="术业专攻" scheme="https://houmin.cc/categories/%E6%9C%AF%E4%B8%9A%E4%B8%93%E6%94%BB/"/>
    
    
      <category term="网络" scheme="https://houmin.cc/tags/%E7%BD%91%E7%BB%9C/"/>
    
      <category term="BPF" scheme="https://houmin.cc/tags/BPF/"/>
    
      <category term="linux" scheme="https://houmin.cc/tags/linux/"/>
    
      <category term="tracing" scheme="https://houmin.cc/tags/tracing/"/>
    
      <category term="XDP" scheme="https://houmin.cc/tags/XDP/"/>
    
      <category term="tc" scheme="https://houmin.cc/tags/tc/"/>
    
  </entry>
  
  <entry>
    <title>RDMA 架构与实践</title>
    <link href="https://houmin.cc/posts/454a90d3/"/>
    <id>https://houmin.cc/posts/454a90d3/</id>
    <published>2021-02-15T04:26:26.000Z</published>
    <updated>2022-11-09T15:13:45.389Z</updated>
    
    <content type="html"><![CDATA[<link rel="stylesheet" class="aplayer-secondary-style-marker" href="/assets/css/APlayer.min.css"><script src="/assets/js/APlayer.min.js" class="aplayer-secondary-script-marker"></script><script class="meting-secondary-script-marker" src="/assets/js/Meting.min.js"></script><p>RDMA, short for <code>Remote Direct Memory Access</code>, is a technique for accessing data in the memory of a <strong>remote</strong> host while bypassing that host's <code>OS kernel</code>; the concept derives from <code>DMA</code>. With DMA, an external device (a PCIe device) can access <code>host memory</code> directly, bypassing the CPU. With RDMA, the external device bypasses the CPU to access not only the local host's memory but also user-space memory on another host. Because the operating system is not involved, RDMA saves a large amount of CPU resources, <strong>raises system throughput</strong>, and <strong>lowers network communication latency</strong>, and it is widely used in high-performance computing and deep-learning training. This article introduces the architecture and principles of RDMA and explains how to use an RDMA network; the test code can be found on <a href="https://github.com/SimpCosm/cake/tree/master/rdma" target="_blank" rel="external nofollow noopener noreferrer">Github</a>.</p><a id="more"></a><h2 id="技术背景"><a href="#技术背景" class="headerlink" title="技术背景"></a>Background</h2><p>The two most important metrics in computer network communication are <strong>bandwidth</strong> and <strong>latency</strong>. Communication latency consists mainly of:</p><ul><li><p><strong>Transmission delay</strong>:</p><ul><li>The time taken to transmit a packet from the host to the transmission medium</li><li>Formula: $Delay_{t} = L / Bandwidth$, where L is the size of the packet in bits and Bandwidth is the link bandwidth</li><li>The higher the bandwidth at both ends, the shorter the transmission time and the lower the transmission delay</li></ul></li><li><p><strong>Propagation delay</strong></p><ul><li><p>After the packet is transmitted to the transmission medium, it has to go through the medium to reach the destination. Hence the time taken by the last bit of the packet to reach the destination is called propagation delay. 
</p></li><li><p>Formula: $Delay_{p} = Distance / Velocity$, where Distance is the length of the transmission link and Velocity is the propagation speed in the physical medium</p><figure class="highlight angelscript"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">Velocity = <span class="number">3</span> × <span class="number">10</span>^<span class="number">8</span> m/s (<span class="keyword">for</span> air)</span><br><span class="line">Velocity = <span class="number">2.1</span> × <span class="number">10</span>^<span class="number">8</span> m/s (<span class="keyword">for</span> optical fibre)</span><br></pre></td></tr></table></figure></li></ul></li><li><p><strong>Queueing delay</strong></p><ul><li>Once the packet is received by the destination, it is not processed immediately. It has to wait in a queue, in a buffer. The time it spends waiting in that queue before being processed is called queueing delay. </li><li>In general, queueing delay cannot be computed from a fixed formula. 
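</li></ul></li><li><p><strong>Processing delay</strong></p><ul><li>message handling time at the sending and receiving ends</li><li>buffer management, copying messages between different memory spaces, and the system interrupt raised once a message has been sent</li></ul></li></ul><p>Real-world network communication is dominated by small messages, so processing delay is the key to improving performance.</p><p>In traditional TCP/IP networking, data has to travel from user space on one machine to user space on the remote machine, and it is copied in memory several times along the way:</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-15_traditional-vs-rdma.png"></p><ul><li>The sender copies the data from the user-space buffer into the socket buffer in kernel space</li><li>The sender adds packet headers in kernel space and encapsulates the data</li><li>The data is copied from the kernel-space socket buffer to the NIC buffer for transmission</li><li>On receiving a packet from the remote machine, the receiver copies it from the NIC buffer into the kernel-space socket buffer</li><li>After the layered network protocols have parsed the packet, the parsed data is copied from the kernel-space socket buffer into the user-space buffer</li><li>Only then does a context switch take place and the user application get invoked</li></ul><p>On high-speed networks, the <strong>high overhead of moving and copying data on the host side</strong> in traditional TCP/IP networking limits the bandwidth that can be achieved between machines. Several solutions have been proposed to raise data transfer bandwidth; two of them are introduced here:</p><ul><li>TCP Offloading Engine</li><li>Remote Direct Memory Access</li></ul><h3 id="TCP-Offloading-Engine"><a href="#TCP-Offloading-Engine" class="headerlink" title="TCP Offloading Engine"></a>TCP Offloading Engine</h3><p>When hosts communicate over the network, the CPU spends a great deal of resources on packet processing for the layered network protocols, including data copying, protocol processing, and interrupt handling. When a host receives network packets, a large number of network I/O interrupts are raised, and the CPU has to respond to and acknowledge each of them. To free the CPU from this work, TOE (TCP/IP Offloading Engine) technology was introduced, which moves this host-processor work onto the NIC. TOE requires a NIC that specifically supports offloading, one that can handle the encapsulation of the layered network protocols itself.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-16_toe.png"></p><ul><li>TOE moves IP fragmentation, TCP segmentation, reassembly, checksum verification, and similar operations from the protocol stack into NIC hardware, reducing CPU consumption and improving server performance.</li><li>An ordinary NIC raises an interrupt for every packet it processes, whereas a TOE NIC raises one interrupt only after an application has completed a whole data-processing operation, which significantly reduces the server's interrupt-handling load.</li><li>When receiving data, a TOE NIC performs protocol processing on the card, so it does not need to copy the data into a kernel-space buffer and can copy it directly into the user-space buffer. This "zero copy" approach avoids unnecessary back-and-forth copying between the NIC and the server.</li></ul><h3 id="RDMA"><a href="#RDMA" class="headerlink" title="RDMA"></a>RDMA</h3><p>RDMA was proposed to remove the bottleneck that traditional network communication imposes on compute workloads, with the goal of faster and more lightweight networking. RDMA uses kernel bypass and zero copy to deliver low latency while reducing CPU usage, easing the memory-bandwidth bottleneck, and achieving high bandwidth utilization. RDMA provides a channel-based I/O model, in which a channel allows an application to read and write remote virtual memory directly through an RDMA device.</p><p>RDMA
has the following characteristics:</p><ul><li><strong>CPU Offload</strong>: no CPU intervention is needed. An application can access remote host memory without consuming any CPU on the remote host. Remote memory can be read without involving a process (or the CPU) on the remote host, and the remote CPU's cache is not filled with the contents of the memory being accessed.</li><li><strong>Kernel Bypass</strong>: RDMA provides its own Verbs interface instead of the traditional TCP/IP socket interface. Applications perform data transfers directly from user space, with no context switch between kernel mode and user mode.</li><li><strong>Zero Copy</strong>: every application can directly access the virtual memory of devices in the cluster. Applications therefore perform data transfers directly: without involving the network software stack, data is sent straight from a buffer or received straight into a buffer, with no copy into the network layers.</li></ul><p>The figure below shows the overall RDMA architecture. RDMA exposes a set of Verbs interfaces in application user space for operating the RDMA hardware, and accesses the RDMA NIC directly from user space, bypassing the kernel. The RNIC contains cached page table entries that map virtual pages to the corresponding physical pages.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-16_rdma-arch.png"></p><h2 id="RDMA-详解"><a href="#RDMA-详解" class="headerlink" title="RDMA 详解"></a>RDMA in detail</h2><p>There are currently three hardware implementations of RDMA. All of them can be used through the same API, but they differ in their physical and link layers:</p><ul><li><strong>Infiniband:</strong> RDMA based on the InfiniBand architecture, proposed by the IBTA (InfiniBand Trade Association). Building an IB-based RDMA network requires dedicated IB NICs and IB switches. InfiniBand clearly offers the best performance, but its NICs and switches are expensive; RoCEv2 and iWARP only require special NICs and are considerably cheaper.</li><li><strong>iWARP:</strong> Internet Wide Area RDMA Protocol, RDMA over TCP/IP, defined by IETF standards. iWARP supports RDMA on standard Ethernet infrastructure and does not need switches that support lossless Ethernet, but servers must use iWARP-capable NICs. Its performance is somewhat lower because of TCP.</li><li><strong>RoCE:</strong> RDMA over Ethernet, also proposed by the IBTA. RoCE supports RDMA on standard Ethernet infrastructure but requires switches that support lossless Ethernet and servers with RoCE NICs; its performance is comparable to IB.</li></ul><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-15_rdma-ib-iwarp-roce.jpeg"></p><h3 id="I-O-瓶颈"><a href="#I-O-瓶颈" class="headerlink" title="I/O 瓶颈"></a>The I/O bottleneck</h3><p>Go back to the last year of the twentieth century. With CPU performance growing rapidly, the <a href="https://en.wikipedia.org/wiki/Peripheral_Component_Interconnect" target="_blank" rel="external nofollow noopener noreferrer">PCI</a> technology that Intel had introduced back in 1992 could no longer keep up with ever-growing I/O demands, and I/O performance had become the main constraint on server performance. Although IBM, together with HP and Compaq, proposed <a href="https://en.wikipedia.org/wiki/PCI-X" target="_blank" rel="external nofollow noopener noreferrer">PCI-X</a> in 1998 as an extension of PCI, raising the bandwidth to 1066
MB/sec, it was widely felt that PCI-X still could not meet the needs of high-performance servers, and calls for a next-generation I/O architecture kept growing. After a period of competition, InfiniBand merged the two rival designs of the time, Future I/O and Next Generation I/O, and the InfiniBand industry alliance was formed, namely the <a href="https://www.infinibandta.org/" target="_blank" rel="external nofollow noopener noreferrer">IBTA (InfiniBand Trade Association)</a>, whose members included the major vendors of the day: Compaq, Dell, HP, IBM, Intel, Microsoft, and Sun. At the time InfiniBand was seen as the next-generation I/O architecture that would replace PCI. Version 1.0 of the InfiniBand Architecture Specification was released in 2000, and in 2001 Mellanox shipped devices supporting 10 Gbit/s.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-15_pci-bottleneck.png"></p><p>The good times did not last. The dot-com bubble burst in 2000, and the industry grew hesitant about investing in such a large technological leap. Intel announced that it would instead develop its own <a href="https://en.wikipedia.org/wiki/PCI_Express" target="_blank" rel="external nofollow noopener noreferrer">PCIe</a> architecture, and Microsoft stopped its IB development. Even so, companies such as Sun and Hitachi kept working on InfiniBand, and thanks to its strong performance it gradually became widely used in cluster interconnects, storage systems, and the internal interconnects of supercomputers. Its software stack was standardized, and <a href="https://en.wikipedia.org/wiki/InfiniBand#History" target="_blank" rel="external nofollow noopener noreferrer">Linux added support for InfiniBand</a>. In the 2010s, with the boom in big data and artificial intelligence, InfiniBand spread from supercomputing into much broader use. The market leader Mellanox was acquired by NVIDIA, the InfiniBand business of QLogic, another major player, was acquired by Intel, and Oracle began building its own InfiniBand interconnect chips and switch units. In the 2020s, Mellanox's latest NDR generation reaches a theoretical effective bandwidth of <a href="https://www.nvidia.com/en-us/networking/ndr/" target="_blank" rel="external nofollow noopener noreferrer">400 Gb/s per port</a>; running a 400 Gb/s HCA requires PCIe Gen5x16 or PCIe Gen4x32.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-15_infiniband-roadmap.jpg"></p><h3 id="架构组成"><a href="#架构组成" class="headerlink" title="架构组成"></a>Architecture components</h3><p>The InfiniBand architecture defines several kinds of device for system communication: channel adapter, switch, router, and subnet manager. It provides a channel-based, point-to-point message-queue forwarding model in which each application obtains its own data messages directly through the virtual channel it created, without involving the operating system or any other protocol stack.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-15_infiniband-arch2.png"></p><p>Within a subnet, every node must have at least one channel adapter, and there must be a subnet manager to manage the links.</p><p><img alt 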
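data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-15_infiniband-arch.png"></p><h4 id="Channel-Adapters"><a href="#Channel-Adapters" class="headerlink" title="Channel Adapters"></a>Channel Adapters</h4><p>A channel adapter is a network adapter that can be installed in a host or in any other system (such as a storage device). It is the source or destination of packets and supports all the software Verbs defined by InfiniBand.</p><ul><li>Host Channel Adapter: HCA</li><li>Target Channel Adapter: TCA</li></ul><h4 id="Switch"><a href="#Switch" class="headerlink" title="Switch"></a>Switch</h4><p>A switch contains multiple InfiniBand ports. Based on the LID in each packet's LRH, it forwards packets received on one port to another port. Apart from management packets, a switch neither generates nor consumes packets. It holds forwarding tables configured by the Subnet Manager and can respond to the Subnet Manager's management packets.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-15_infiniband-components.png"></p><h4 id="Router"><a href="#Router" class="headerlink" title="Router"></a>Router</h4><p>Based on the GRH at L3, a router forwards packets from one subnet to another. When a packet is forwarded into another subnet, the router rebuilds the LID in the packet.</p><h4 id="Subnet-Manager"><a href="#Subnet-Manager" class="headerlink" title="Subnet Manager"></a>Subnet Manager</h4><p>The Subnet Manager configures the local subnet and keeps it working:</p><ul><li>Discover the physical topology of the subnet</li><li>Assign an LID and other attributes (such as the active MTU and active speed) to each port in the subnet</li><li>Configure the forwarding tables of the subnet's switches</li><li>Detect topology changes (such as nodes being added to or removed from the subnet)</li><li>Handle the various errors that occur in the subnet</li></ul><h3 id="分层设计"><a href="#分层设计" class="headerlink" title="分层设计"></a>Layered Design</h3><p>InfiniBand has its own protocol stack, which from top to bottom consists of the transport layer, network layer, data link layer, and physical layer:</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-15_infiniband-layers.png"></p><p>Packets are encapsulated layer by layer as shown below; the encapsulation at each layer is described in detail in the following sections:</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-15_infiniband-headers.jpeg"></p><h4 id="Physical-Layer"><a href="#Physical-Layer" class="headerlink" title="Physical Layer"></a>Physical Layer</h4><p>The physical layer defines InfiniBand's electrical and mechanical characteristics; InfiniBand supports both optical fiber and copper as transmission media. The physical layer supports different link speeds. Each link consists of four wires (two per direction), and links can be aggregated for higher rates; the vast majority of systems today use 4 links.</p><p><img alt 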
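data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-15_infiniband-phys.png"></p><p>Take QDR as an example: the <code>Signalling Rate</code> on the wire is 10 Gb/s, and because <code>8b/10b</code> encoding is used, the effective bandwidth of a single link is 10 Gb/s * 8/10 = 8 Gb/s. With 4 links the bandwidth reaches 32 Gb/s, and since links are bidirectional, the full-duplex rate of 4 links reaches 64 Gb/s.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-15_infiniband-performance.png"></p><h4 id="Link-Layer"><a href="#Link-Layer" class="headerlink" title="Link Layer"></a>Link Layer</h4><p>The link layer is the core of the InfiniBand architecture and comprises the following parts:</p><ul><li>Packets: the link layer has two types of packets, data packets and management packets. A data packet can be up to 4KB, and data transfers fall into two types<ul><li>Memory: RDMA read/write, atomic operation</li><li>Channel: send/receive, multicast transmission</li></ul></li><li>Switching: within a subnet, packet forwarding and switching are done at the link layer<ul><li>Every device in a subnet has a 16-bit Local ID (<strong>LID</strong>) assigned by the subnet manager</li><li>Every packet carries a Local Route Header (LRH) that specifies the destination LID</li><li>Addressing within a subnet is done by LID</li></ul></li><li>QoS: the link layer provides QoS guarantees without requiring data buffering<ul><li>Virtual Lanes: a mechanism for creating multiple virtual links on a single physical link. A virtual lane represents a set of buffers that a port uses to send and receive packets. The number of supported VLs is a port attribute.</li><li>Each link supports 15 standard VLs (VL0 to VL14) plus VL15, which is reserved for management; VL15 has the highest priority and VL0 the lowest</li><li>Service Level: InfiniBand supports up to 16 service levels but does not specify a policy for each level. InfiniBand supports QoS by mapping SLs to VLs</li></ul></li><li>Credit Based Flow Control</li><li>Data Integrity: the link layer verifies data integrity using the CRC fields in each packet, which consist of the ICRC and the VCRC.</li></ul><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-15_infiniband-lrh.png"></p><h4 id="Network-Layer"><a href="#Network-Layer" class="headerlink" title="Network Layer"></a>Network Layer</h4><p>The network layer routes packets from one subnet to another:</p><ul><li>Packets that travel between subnets carry a Global Route Header (GRH), which contains the packet's 128-bit source and destination IPv6 addresses</li><li>Every device has a globally unique ID (GUID); routers use the GUID carried by each packet to forward it between subnets</li></ul><p>The figure below shows the format of the GRH. It is 40 bytes long and optional, and is used for multicast packets and for packets that must cross multiple subnets. It describes the source and destination ports using GIDs, and its format is the same as the IPv6 header.</p><p><img alt 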
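data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-15_infiniband-grh.png"></p><h4 id="Transport-Layer"><a href="#Transport-Layer" class="headerlink" title="Transport Layer"></a>Transport Layer</h4><p>The transport layer is responsible for in-order packet delivery, segmentation according to the MTU, and a number of transport services (reliable connection, reliable datagram, unreliable connection, unreliable datagram, raw datagram). InfiniBand's transport layer delivers a large performance gain because all of these functions are implemented in hardware.</p><p><img alt="InfiniBand 支持的服务" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-15_infiniband-tl.png"></p><p>By the two criteria of connection and reliability, transports fall into the four modes shown in the figure below:</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-15_rdma-service.png"></p><ul><li>Reliable Connection (RC): a QP is connected to exactly one other QP, and messages are reliably delivered from the send queue of one QP to the receive queue of the other. Packets are <strong>delivered in order</strong>. An RC connection is very similar to a TCP connection.</li><li>Unreliable Connection (UC): a QP is connected to exactly one other QP, but the connection is unreliable, so packets may be lost. Messages that fail at the transport layer are not retransmitted, and error handling must be done by a higher-level protocol.</li><li>Unreliable Datagram (UD): a QP can send data to, and receive single-packet messages from, any other UD QP. Neither ordering nor delivery is guaranteed, and delivered packets may be dropped by the receiver. Multicast messages (one to many) are supported. A UD connection is very similar to a UDP connection.</li></ul><p>The operations available in each mode are shown in the table below. Current RDMA hardware offers a single datagram transport, Unreliable Datagram (UD), which does not support memory verbs.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-15_rdma-send-recv.png"></p><p>The figure below shows the structure of the transport layer's Base Transport Header (BTH). It is 12 bytes long and specifies the destination QP, the opcode, the packet sequence number, and the partition.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-15_infiniband-bth.png"></p><ul><li>Partition Key: every port of an InfiniBand device has a <code>P_Key</code> table configured by the SM, and each QP is associated with one <code>P_Key</code> index in that table. Two QPs can exchange packets only if their associated <code>P_Key</code> values are the same.</li><li>Destination QP: the 24-bit destination QP ID.</li></ul><p>Depending on the transport service class and the operation, a variable-length Extended Transport Header (ETH) follows the BTH. Some examples are shown below.</p><p>The RDMA ETH, used for RDMA operations:</p><p><img alt="RDMA ETH" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-15_infiniband-reth.png"></p><p>The Datagram ETH, used for UD and RD services:</p><p><img alt="Datagram ETH" 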
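data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-15_infiniband-deth.png"></p><ul><li>Queue Key: two unreliable QPs can receive each other's unicast or multicast messages only if their Q_Keys match; the key authorizes access to the destination QP's queue.</li><li>Source QP: the 24-bit source QP ID, used as the Destination QP of reply packets.</li></ul><p>The Reliable Datagram ETH, used for RD services, which carries the End2End Context field:</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-15_infiniband-rdeth.png"></p><h3 id="RoCE"><a href="#RoCE" class="headerlink" title="RoCE"></a>RoCE</h3><p>The InfiniBand architecture achieves excellent performance, but it requires not only dedicated InfiniBand NICs in the servers but also dedicated switch hardware, which makes it very expensive. Enterprises, however, have deployed Ethernet at scale. To reuse existing Ethernet while still getting InfiniBand's performance, the IBTA introduced RoCE (RDMA over Converged Ethernet). RoCE carries the IB protocol over Ethernet, implementing RDMA over Ethernet, so that only the servers need RoCE-capable NICs while switches and routers remain standard Ethernet infrastructure. The network must support <strong>lossless Ethernet</strong>, because with IB's loss-handling mechanism the loss of any single packet triggers a large number of retransmissions and severely degrades transfer performance.</p><p>RoCE shares the same software application layer and transport control layer as InfiniBand; only the network layer and the Ethernet link layer differ, as shown below:</p><p><img alt="RoCE v1" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-15_rdma-header.png"></p><p>The RoCE protocol comes in two versions:</p><ul><li><strong>RoCE v1:</strong> carries RDMA over Ethernet and can only be deployed in a layer 2 network. Its packet format adds a layer 2 Ethernet header to the original IB packet, and RoCE packets are identified by the Ethertype <code>0x8915</code>. </li><li><strong>RoCE v2:</strong> carries RDMA over UDP/IP and can be deployed in a layer 3 network. Its packet format adds a UDP header, an IP header, and a layer 2 Ethernet header to the original IB packet, and RoCE packets are identified by the UDP destination port 4791. RoCE v2 supports hashing on the source port and uses ECMP for load balancing, which improves network utilization.</li></ul><h3 id="iWARP"><a href="#iWARP" class="headerlink" title="iWARP"></a>iWARP</h3><p>iWARP reduces the host-side network load in the following ways:</p><ul><li>TCP/IP processing is offloaded from the CPU to the RDMA NIC, which lowers the CPU load.</li><li>Memory copies are eliminated: an application can transfer data directly into the memory of the peer application, which significantly lowers the CPU load.</li><li>Application context switches are reduced: an application can bypass the operating system and issue commands to the RDMA NIC directly from user space, which lowers overhead and significantly reduces the latency caused by context switches.</li></ul><p>Because TCP provides flow control and congestion management, iWARP does not need lossless Ethernet. It works with ordinary Ethernet switches and iWARP NICs alone, so it can be used over wide area networks and scales well.</p><h2 id="RDMA-编程"><a href="#RDMA-编程" class="headerlink" title="RDMA 编程"></a>RDMA Programming</h2><h3 id="传输模式"><a href="#传输模式" class="headerlink" title="传输模式"></a>Transfer Modes</h3><p>RDMA has two basic kinds of operations, 
<code>Memory verbs</code> and <code>Messaging verbs</code>:</p><ul><li><code>Memory verbs</code>: the read, write, and atomic operations. These are one-sided operations: only the local side needs to know the source and destination addresses, and the remote application is not aware of the communication. Reading or storing the data is done by DMA between the remote RNIC and the application buffer, and the remote RNIC then wraps the result in a message and returns it to the local side.<ul><li>RDMA Read: reads a region of memory from the remote host. The caller specifies the remote virtual address as well as a local memory address to copy into. Before an RDMA read can be performed, the remote host must grant appropriate permissions to access its memory. Once the permissions are set, RDMA reads are performed without any notification to the remote host. For both RDMA read and RDMA write, the remote host is unaware that the operation is taking place (apart from preparing the permissions and related resources).</li><li>RDMA Write: similar to RDMA Read, except that the data is written to the remote host. An RDMA write is performed without notifying the remote host. An RDMA write with immediate data, however, does notify the remote host of the immediate value.</li><li>RDMA Atomic: atomic Fetch-and-Add and atomic Compare-and-Swap, which are atomic extensions of the RDMA operations.</li></ul></li><li><code>Messaging verbs</code>: the send and receive operations. These are two-sided operations: the remote application must be aware of and take part in the transfer for it to complete.<ul><li>RDMA Send: the send operation lets you send data to the receive queue of a remote QP. The receiver must have <strong>posted</strong> a buffer to receive the data beforehand. The sender has no control over where the data is placed in the remote host. Optionally, a 4-byte immediate value can be transmitted together with the data buffer. The immediate value is delivered to the receiver as part of the receive notification and is not contained in the data buffer.</li><li>RDMA Receive: the operation that corresponds to a send. The receiving host is notified that a data buffer has been received, possibly along with an immediate value. The receiving application is responsible for maintaining and posting receive buffers.</li></ul></li></ul><p>The <a href="http://rdmaconsortium.org/" target="_blank" rel="external nofollow noopener noreferrer">RDMA Consortium</a> and the <a href="http://www.infinibandta.org/" target="_blank" rel="external nofollow noopener noreferrer">IBTA</a> have led the development of RDMA. The RDMAC complements the IETF and mainly defines iWARP and iSER, while the IBTA defines all of the InfiniBand standards and has also standardized RoCE v1 and v2. The transport interface layer between an application and the RNIC (the software transport interface) is called Verbs. The IBTA describes the behavior that RDMA transports must exhibit but does not prescribe the concrete interfaces or data structure prototypes of the Verbs. That work is done by another organization, the OFA (<a href="https://www.openfabrics.org/" target="_blank" rel="external nofollow noopener noreferrer">OpenFabrics Alliance</a>), which provides a set of Verbs APIs for RDMA transports. The OFA developed the OFED (OpenFabrics Enterprise Distribution) stack, which supports multiple RDMA transport protocols.</p><p>Besides providing basic queue-based messaging services downward to the RNIC, OFED also provides ULPs (Upper Layer Protocols) upward. With ULPs, upper-layer applications do not have to interface with the Verbs API directly; they go through a ULP instead, so common applications can run on RDMA transports without modification.</p><h3 id="基本概念"><a href="#基本概念" class="headerlink" title="基本概念"></a>Basic Concepts</h3><h4 id="Send-Request"><a href="#Send-Request" class="headerlink" title="Send Request"></a>Send Request</h4><p>An SR defines how much data will be sent, from where, how, whether RDMA is used, and to where. </p><p>The structure 
<code>ibv_send_wr</code> describes an SR.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">ibv_send_wr</span> &#123;</span></span><br><span class="line">    <span class="keyword">uint64_t</span> wr_id;</span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span 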
class="title">ibv_send_wr</span>     *<span class="title">next</span>;</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">ibv_sge</span>       *<span class="title">sg_list</span>;</span></span><br><span class="line">    <span class="keyword">int</span> num_sge;</span><br><span class="line">    <span class="keyword">enum</span> ibv_wr_opcode opcode;</span><br><span class="line">    <span class="keyword">unsigned</span> <span class="keyword">int</span> send_flags;</span><br><span class="line">    <span class="comment">/* When opcode is *_WITH_IMM: Immediate data in network byte order.</span></span><br><span class="line">    <span class="comment"> * When opcode is *_INV: Stores the rkey to invalidate</span></span><br><span class="line">    <span class="comment"> */</span></span><br><span class="line">    <span class="keyword">union</span> &#123;</span><br><span class="line">        __be32 imm_data;</span><br><span class="line">        <span class="keyword">uint32_t</span> invalidate_rkey;</span><br><span class="line">    &#125;;</span><br><span class="line">    <span class="keyword">union</span> &#123;</span><br><span class="line">        <span class="class"><span class="keyword">struct</span> &#123;</span></span><br><span class="line">            <span class="keyword">uint64_t</span> remote_addr;</span><br><span class="line">            <span class="keyword">uint32_t</span> rkey;</span><br><span class="line">        &#125; rdma;</span><br><span class="line">        <span class="class"><span class="keyword">struct</span> &#123;</span></span><br><span class="line">            <span class="keyword">uint64_t</span> remote_addr;</span><br><span class="line">            <span class="keyword">uint64_t</span> compare_add;</span><br><span class="line">            <span class="keyword">uint64_t</span> swap;</span><br><span class="line">            <span class="keyword">uint32_t</span> rkey;</span><br><span class="line">        &#125; atomic;</span><br><span class="line">        <span class="class"><span class="keyword">struct</span> &#123;</span></span><br><span class="line">            <span 
class="class"><span class="keyword">struct</span> <span class="title">ibv_ah</span>  *<span class="title">ah</span>;</span></span><br><span class="line">            <span class="keyword">uint32_t</span> remote_qpn;</span><br><span class="line">            <span class="keyword">uint32_t</span> remote_qkey;</span><br><span class="line">        &#125; ud;</span><br><span class="line">    &#125; wr;</span><br><span class="line">    <span class="keyword">union</span> &#123;</span><br><span class="line">        <span class="class"><span class="keyword">struct</span> &#123;</span></span><br><span class="line">            <span class="keyword">uint32_t</span>    remote_srqn;</span><br><span class="line">        &#125; xrc;</span><br><span class="line">    &#125; qp_type;</span><br><span class="line">    <span class="keyword">union</span> &#123;</span><br><span class="line">        <span class="class"><span class="keyword">struct</span> &#123;</span></span><br><span class="line">            <span class="class"><span class="keyword">struct</span> <span class="title">ibv_mw</span> *<span class="title">mw</span>;</span></span><br><span class="line">            <span class="keyword">uint32_t</span> rkey;</span><br><span class="line">            <span class="class"><span class="keyword">struct</span> <span class="title">ibv_mw_bind_info</span> <span class="title">bind_info</span>;</span></span><br><span class="line">        &#125; bind_mw;</span><br><span class="line">        <span class="class"><span class="keyword">struct</span> &#123;</span></span><br><span class="line">            <span class="keyword">void</span>       *hdr;</span><br><span class="line">            <span class="keyword">uint16_t</span> hdr_sz;</span><br><span class="line">            <span class="keyword">uint16_t</span> mss;</span><br><span class="line">        &#125; tso;</span><br><span class="line">    &#125;;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><h4 id="Receive-Request"><a href="#Receive-Request" class="headerlink" title="Receive Request"></a>Receive Request</h4><p>An RR defines the buffers in which incoming data is placed for non-RDMA (send/receive) operations. If no buffer has been defined and a sender attempts a send operation or an RDMA 
Write with immediate data, the receiver reports a receiver-not-ready (RNR) error.</p><p>The structure <code>ibv_recv_wr</code> describes an RR.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">ibv_recv_wr</span> &#123;</span></span><br><span class="line">    <span class="keyword">uint64_t</span> wr_id;</span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">ibv_recv_wr</span>     *<span class="title">next</span>;</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">ibv_sge</span>       *<span class="title">sg_list</span>;</span></span><br><span class="line">    <span class="keyword">int</span> num_sge;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><h4 id="Queue-Pairs"><a href="#Queue-Pairs" class="headerlink" title="Queue Pairs"></a>Queue Pairs</h4><p>RDMA provides point-to-point communication based on message queues: each application obtains its own messages directly, without the involvement of the operating system or the protocol stack. The messaging service is built on Channel-IO connections created between the local and remote applications. When applications need to communicate, a channel is created, and the two endpoints of each channel are a pair of Queue Pairs (QP). Each QP consists of a Send Queue (SQ) and a Receive Queue (RQ), which manage the various types of messages. A QP is mapped into the application's virtual address space so that the application can access the RNIC directly through it. Besides the two basic queues described by a QP, RDMA provides another kind of queue, the Completion Queue (CQ), which notifies the user that messages on a WQ have been processed.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-15_infiniband-stack.png"></p><p>RDMA provides a software transport interface that lets users create transfer requests called Work Requests (WR). A WR describes the message content that the application wants to transfer to the other end of the channel, and it is posted to one of the queues in the QP, a Work Queue (WQ). In the WQ, the user's WR is converted into a Work Queue Element (WQE), which waits to be scheduled and parsed asynchronously by the RNIC; the RNIC then fetches the actual message from the buffer the WQE points to and sends it to the other end of the channel.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span 
class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">ibv_qp</span> &#123;</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">ibv_context</span>     *<span class="title">context</span>;</span></span><br><span class="line">    <span class="keyword">void</span>       *qp_context;</span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">ibv_pd</span>       *<span class="title">pd</span>;</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">ibv_cq</span>       *<span class="title">send_cq</span>;</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">ibv_cq</span>       *<span class="title">recv_cq</span>;</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">ibv_srq</span>       *<span class="title">srq</span>;</span></span><br><span class="line">    <span class="keyword">uint32_t</span> handle;</span><br><span class="line">    <span class="keyword">uint32_t</span> qp_num;</span><br><span class="line">    <span class="keyword">enum</span> ibv_qp_state       state;</span><br><span class="line">    <span class="keyword">enum</span> ibv_qp_type qp_type;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">pthread_mutex_t</span> mutex;</span><br><span class="line">    <span class="keyword">pthread_cond_t</span> cond;</span><br><span class="line">    <span 
class="keyword">uint32_t</span> events_completed;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><h4 id="Completion-Queue"><a href="#Completion-Queue" class="headerlink" title="Completion Queue"></a>Completion Queue</h4><p>A work request posted to the SQ or RQ is considered outstanding until it completes, and while it is outstanding the contents of the memory buffers it points to are undefined. The CQ holds the completions of work requests (WR) that were posted to the work queues (WQ). Each completion indicates that a specific WR has finished, whether successfully or unsuccessfully. The completion queue is the mechanism that tells the application about finished work requests (status, opcode, size, source).</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-13_rdma-queue.svg"></p><p>A CQ has n Completion Queue Entries (CQE); the number of CQEs is specified when the CQ is created. When a CQE is <strong>polled</strong>, it is removed from the CQ. A CQ is a FIFO queue of CQEs. A CQ can serve send queues, receive queues, or both, and work queues (WQ) from multiple different QPs can be associated with the same CQ.</p><p>The structure <code>ibv_cq</code> describes a CQ.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">ibv_cq</span> &#123;</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">ibv_context</span>     *<span class="title">context</span>;</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">ibv_comp_channel</span> *<span class="title">channel</span>;</span></span><br><span class="line">    <span class="keyword">void</span>       *cq_context;</span><br><span class="line">    <span class="keyword">uint32_t</span> handle;</span><br><span class="line">    <span class="keyword">int</span> cqe;</span><br><span class="line"></span><br><span class="line">    <span 
class="keyword">pthread_mutex_t</span> mutex;</span><br><span class="line">    <span class="keyword">pthread_cond_t</span> cond;</span><br><span class="line">    <span class="keyword">uint32_t</span> comp_events_completed;</span><br><span class="line">    <span class="keyword">uint32_t</span> async_events_completed;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><h4 id="Memory-Registration"><a href="#Memory-Registration" class="headerlink" title="Memory Registration"></a>Memory Registration</h4><p>Every memory buffer that an RDMA device accesses must be registered. During registration, the following is done to the memory buffer:</p><ul><li>The contiguous memory buffer is split into memory pages, and this memory is presented to the network adapter as a virtually contiguous buffer that is addressed by virtual addresses.</li><li>Virtual memory is mapped to physical memory: the registration process writes the virtual-to-physical address mapping table to the network adapter.</li><li>The permissions of the memory pages are checked to make sure they support the permissions requested for the MR (Memory Region).</li><li>The memory pages are pinned so that they cannot be swapped out, which keeps the virtual-to-physical mapping unchanged.</li></ul><p>After a successful registration, the memory has two keys:</p><ul><li>Local key <code>lkey</code>: the key that local work requests use to access the memory</li><li>Remote key <code>rkey</code>: the key that remote machines use to access the memory through RDMA</li></ul><p>Work requests use these keys to access the memory buffer. The same memory buffer can be registered multiple times (even with different access permissions), and each registration produces different keys.</p><p>The structure <code>ibv_mr</code> describes a memory registration.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">ibv_mr</span> &#123;</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">ibv_context</span>     *<span class="title">context</span>;</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">ibv_pd</span>       *<span class="title">pd</span>;</span></span><br><span class="line">    <span class="keyword">void</span>       *addr;</span><br><span class="line">    <span 
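class="keyword">size_t</span> length;</span><br><span class="line">    <span class="keyword">uint32_t</span> handle;</span><br><span class="line">    <span class="keyword">uint32_t</span> lkey;</span><br><span class="line">    <span class="keyword">uint32_t</span> rkey;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><h4 id="Memory-Window"><a href="#Memory-Window" class="headerlink" title="Memory Window"></a>Memory Window</h4><p>There are two ways to enable remote memory access:</p><ul><li>Register a memory buffer with remote access permissions</li><li>Register a memory region and bind it to a memory window</li></ul><p>Both ways create an rkey that can be used to access the specified memory. However, if you want to invalidate that rkey in order to forbid access to the memory, deregistering the memory region is cumbersome. Using a memory window, and binding and unbinding it as needed, is a much simpler and more flexible way to enable and disable remote memory access.</p><p>Memory windows are intended for the following situations:</p><ul><li>Granting and revoking remote access rights to a registered buffer dynamically, with a lower performance penalty than deregistering and registering the buffer again or reregistering it.</li><li>Granting different remote access rights to different remote agents, or granting those rights over different ranges within a registered buffer.</li></ul><p>The operation that associates a memory window with a memory registration is called binding. Different MWs can cover the same MR, even with different access permissions.</p><h4 id="Address-Vector"><a href="#Address-Vector" class="headerlink" title="Address Vector"></a>Address Vector</h4><p>An address vector describes the route from the local node to a remote node. Every UC/RC QP has an address vector in its QP context. For a UD QP, the address vector must be specified in every posted send request (SR).</p><p>The structure <code>ibv_ah</code> describes an address vector.</p><h4 id="Global-Routing-Header（GRH）"><a href="#Global-Routing-Header（GRH）" class="headerlink" title="Global Routing Header（GRH）"></a>Global Routing Header (GRH)</h4><p>The GRH is used for routing between subnets. With RoCE, the GRH is used for routing inside the subnet and is therefore mandatory; making it mandatory allows an application to support both IB and RoCE. When global routing is used with a UD QP, the first 40 bytes of the receive buffer contain a GRH. This area stores the global routing information, so that an appropriate address vector can be generated to reply to the received packet. If the GRH is used with UD, the receive request (RR) should always have an extra 40 bytes available for the GRH.</p><p>The structure <code>ibv_grh</code> describes a GRH.</p><h4 id="Protection-Domain"><a href="#Protection-Domain" class="headerlink" title="Protection Domain"></a>Protection Domain</h4><p>A protection domain is a collection whose members can interact only with other members of the same collection. The members can be AHs, QPs, MRs, and SRQs. A protection domain associates QPs with memory registrations and memory windows, as a means of enabling and controlling the network adapter's access to host system memory. PDs also associate unreliable datagram (UD) QPs with address handles (AH), as a means of controlling access to UD destinations.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td 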
class="code"><pre><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">ibv_pd</span> &#123;</span></span><br><span class="line">    <span class="class"><span class="keyword">struct</span> <span class="title">ibv_context</span>     *<span class="title">context</span>;</span></span><br><span class="line">    <span class="keyword">uint32_t</span> handle;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><h3 id="通信过程"><a href="#通信过程" class="headerlink" title="通信过程"></a>Communication Workflow</h3><h4 id="获取设备列表"><a href="#获取设备列表" class="headerlink" title="获取设备列表"></a>Get the Device List</h4><p>First, retrieve the list of IB devices available on the local machine. Every device in the list has a name and a GUID.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">/* 1. Get the device list */</span></span><br><span class="line"><span class="keyword">int</span> num_devices;</span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">ibv_device</span> **<span class="title">dev_list</span> = <span class="title">ibv_get_device_list</span>(&amp;<span class="title">num_devices</span>);</span></span><br><span class="line"><span class="keyword">if</span> (!dev_list || !num_devices)</span><br><span class="line">&#123;</span><br><span class="line">    <span class="built_in">fprintf</span>(<span class="built_in">stderr</span>, <span class="string">"failed to get IB devices\n"</span>);</span><br><span class="line">    rc = <span class="number">1</span>;</span><br><span class="line">    <span class="keyword">goto</span> main_exit;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h4 id="打开要请求的设备"><a href="#打开要请求的设备" 
class="headerlink" title="打开要请求的设备"></a>Open the Requested Device</h4><p>Iterate over the device list, select a device by its GUID or name, open it, and obtain a context:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">/* 2. Open the device and get its context (this example simply takes the first device) */</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">ibv_device</span> *<span class="title">ib_dev</span> = <span class="title">dev_list</span>[0];</span></span><br><span class="line">res.ib_ctx = ibv_open_device(ib_dev);</span><br><span class="line"><span class="keyword">if</span> (!res.ib_ctx)</span><br><span class="line">&#123;</span><br><span class="line">    <span class="built_in">fprintf</span>(<span class="built_in">stderr</span>, <span class="string">"failed to open device \n"</span>);</span><br><span class="line">    rc = <span class="number">1</span>;</span><br><span class="line">    <span class="keyword">goto</span> main_exit;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>This is usually the point at which the resources held by the device list should be released:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">/* 3. Free the resources held by the device list */</span></span><br><span class="line">ibv_free_device_list(dev_list);</span><br><span class="line">dev_list = <span class="literal">NULL</span>;</span><br><span class="line">ib_dev = <span class="literal">NULL</span>;</span><br></pre></td></tr></table></figure><h4 id="查询设备的工作能力"><a href="#查询设备的工作能力" class="headerlink" title="查询设备的工作能力"></a>Query the Device Capabilities</h4><p>Querying the device capabilities tells the user which features and capabilities the opened device supports. For a port, they are returned in 
<code>ibv_port_attr</code>.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">/* 4. Query the attributes of port 1 of the device */</span></span><br><span class="line"><span class="keyword">if</span> (ibv_query_port(res.ib_ctx, <span class="number">1</span>, &amp;res.port_attr))</span><br><span class="line">&#123;</span><br><span class="line">    <span class="built_in">fprintf</span>(<span class="built_in">stderr</span>, <span class="string">"ibv_query_port on port failed\n"</span>);</span><br><span class="line">    rc = <span class="number">1</span>;</span><br><span class="line">    <span class="keyword">goto</span> main_exit;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h4 id="分配保护域以及您的资源"><a href="#分配保护域以及您的资源" class="headerlink" title="分配保护域以及您的资源"></a>Allocate a Protection Domain and Your Resources</h4><p>A protection domain (PD) lets the user restrict which components may interact only with each other. These components can be AHs, QPs, MRs, MWs, and SRQs.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">/* 5. Allocate a PD (Protection Domain) */</span></span><br><span class="line">res.pd = ibv_alloc_pd(res.ib_ctx);</span><br><span class="line"><span class="keyword">if</span> (!res.pd)</span><br><span class="line">&#123;</span><br><span class="line">    <span class="built_in">fprintf</span>(<span class="built_in">stderr</span>, <span class="string">"ibv_alloc_pd failed\n"</span>);</span><br><span class="line">    rc = <span class="number">1</span>;</span><br><span class="line">    <span 
class="keyword">goto</span> main_exit;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h4 id="创建-CQ"><a href="#创建-CQ" class="headerlink" title="创建 CQ"></a>Create a CQ</h4><p>A CQ holds the completions of work requests (WRs). Each WR produces a completion queue entry (CQE) that is placed on the CQ and indicates whether the WR completed successfully:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">/* 6 Create the CQ (Completion Queue) */</span></span><br><span class="line"><span class="keyword">int</span> cq_size = <span class="number">10</span>;</span><br><span class="line">res.cq = ibv_create_cq(res.ib_ctx, cq_size, <span class="literal">NULL</span>, <span class="literal">NULL</span>, <span class="number">0</span>);</span><br><span class="line"><span class="keyword">if</span> (!res.cq)</span><br><span class="line">&#123;</span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stderr</span>, <span class="string">"failed to create CQ with %d entries\n"</span>, cq_size);</span><br><span class="line">rc = <span class="number">1</span>;</span><br><span class="line"><span class="keyword">goto</span> main_exit;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h4 id="注册一个内存区域"><a href="#注册一个内存区域" class="headerlink" title="注册一个内存区域"></a>Register a Memory Region</h4><p>During registration the user sets the memory access permissions and receives an <code>lkey</code> and an <code>rkey</code>; these keys are used later to access the memory buffer:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span 
class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">/* 7 Register the MR (Memory Region) */</span></span><br><span class="line"><span class="keyword">size_t</span> <span class="built_in">size</span> = MSG_SIZE;</span><br><span class="line">res.buf = (<span class="keyword">char</span> *)<span class="built_in">malloc</span>(<span class="built_in">size</span>);</span><br><span class="line"><span class="keyword">if</span> (!res.buf)</span><br><span class="line">&#123;</span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stderr</span>, <span class="string">"failed to malloc %zu bytes to memory buffer\n"</span>, <span class="built_in">size</span>);</span><br><span class="line">rc = <span class="number">1</span>;</span><br><span class="line"><span class="keyword">goto</span> main_exit;</span><br><span class="line">&#125;</span><br><span class="line"><span class="built_in">memset</span>(res.buf, <span class="number">0</span>, <span class="built_in">size</span>);</span><br><span class="line"></span><br><span class="line"><span class="keyword">int</span> mr_flags = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE;</span><br><span class="line">res.mr = ibv_reg_mr(res.pd, res.buf, <span class="built_in">size</span>, mr_flags);</span><br><span class="line"><span class="keyword">if</span> (!res.mr)</span><br><span class="line">&#123;</span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stderr</span>, <span class="string">"ibv_reg_mr failed with mr_flags=0x%x\n"</span>, 
mr_flags);</span><br><span class="line">rc = <span class="number">1</span>;</span><br><span class="line"><span class="keyword">goto</span> main_exit;</span><br><span class="line">&#125;</span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stdout</span>, <span class="string">"MR was registered with addr=%p, lkey=0x%x, rkey=0x%x, flags=0x%x\n"</span>,</span><br><span class="line">res.buf, res.mr-&gt;lkey, res.mr-&gt;rkey, mr_flags);</span><br></pre></td></tr></table></figure><h4 id="创建-QP"><a href="#创建-QP" class="headerlink" title="创建 QP"></a>Create a QP</h4><p>Creating a QP also creates its associated send queue and receive queue:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">/* 8 Create the QP (Queue Pair) */</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">ibv_qp_init_attr</span> <span class="title">qp_init_attr</span>;</span></span><br><span class="line"><span class="built_in">memset</span>(&amp;qp_init_attr, <span class="number">0</span>, <span class="keyword">sizeof</span>(qp_init_attr));</span><br><span class="line">qp_init_attr.qp_type = IBV_QPT_RC;</span><br><span class="line">qp_init_attr.sq_sig_all = <span class="number">1</span>;</span><br><span class="line">qp_init_attr.send_cq = res.cq;</span><br><span class="line">qp_init_attr.recv_cq = 
res.cq;</span><br><span class="line">qp_init_attr.cap.max_send_wr = <span class="number">1</span>;</span><br><span class="line">qp_init_attr.cap.max_recv_wr = <span class="number">1</span>;</span><br><span class="line">qp_init_attr.cap.max_send_sge = <span class="number">1</span>;</span><br><span class="line">qp_init_attr.cap.max_recv_sge = <span class="number">1</span>;</span><br><span class="line">res.qp = ibv_create_qp(res.pd, &amp;qp_init_attr);</span><br><span class="line"><span class="keyword">if</span> (!res.qp)</span><br><span class="line">&#123;</span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stderr</span>, <span class="string">"failed to create QP\n"</span>);</span><br><span class="line">rc = <span class="number">1</span>;</span><br><span class="line"><span class="keyword">goto</span> main_exit;</span><br><span class="line">&#125;</span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stdout</span>, <span class="string">"QP was created, QP number=0x%x\n"</span>, res.qp-&gt;qp_num);</span><br></pre></td></tr></table></figure><h4 id="交换控制信息"><a href="#交换控制信息" class="headerlink" title="交换控制信息"></a>Exchange Control Information</h4><p>Control information can be exchanged over a socket or through the RDMA_CM API; this example uses a socket:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span 
class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">/* 9 Exchange control information */</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">cm_con_data_t</span> <span class="title">local_con_data</span>;</span>  <span class="comment">// data sent to the remote host</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">cm_con_data_t</span> <span class="title">remote_con_data</span>;</span> <span class="comment">// data received from the remote host</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">cm_con_data_t</span> <span class="title">tmp_con_data</span>;</span></span><br><span class="line"></span><br><span class="line">local_con_data.addr = htonll((<span class="keyword">uintptr_t</span>)res.buf);</span><br><span class="line">local_con_data.rkey = htonl(res.mr-&gt;rkey);</span><br><span class="line">local_con_data.qp_num = htonl(res.qp-&gt;qp_num);</span><br><span class="line">local_con_data.lid = htons(res.port_attr.lid);</span><br><span class="line"><span class="keyword">if</span> (sock_sync_data(server_ip, <span class="keyword">sizeof</span>(struct <span class="keyword">cm_con_data_t</span>), (<span class="keyword">char</span> *)&amp;local_con_data, (<span class="keyword">char</span> *)&amp;tmp_con_data) &lt; <span class="number">0</span>)</span><br><span class="line">&#123;</span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stderr</span>, <span class="string">"failed to exchange connection data between sides\n"</span>);</span><br><span class="line">rc = <span class="number">1</span>;</span><br><span class="line"><span class="keyword">goto</span> main_exit;</span><br><span class="line">&#125;</span><br><span 
class="line">remote_con_data.addr = ntohll(tmp_con_data.addr);</span><br><span class="line">remote_con_data.rkey = ntohl(tmp_con_data.rkey);</span><br><span class="line">remote_con_data.qp_num = ntohl(tmp_con_data.qp_num);</span><br><span class="line">remote_con_data.lid = ntohs(tmp_con_data.lid);</span><br><span class="line"><span class="comment">/* save the remote side attributes, we will need it for the post SR */</span></span><br><span class="line">res.remote_props = remote_con_data;</span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stdout</span>, <span class="string">"Remote address = 0x%"</span> PRIx64 <span class="string">"\n"</span>, remote_con_data.addr);</span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stdout</span>, <span class="string">"Remote rkey = 0x%x\n"</span>, remote_con_data.rkey);</span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stdout</span>, <span class="string">"Remote QP number = 0x%x\n"</span>, remote_con_data.qp_num);</span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stdout</span>, <span class="string">"Remote LID = 0x%x\n"</span>, remote_con_data.lid);</span><br></pre></td></tr></table></figure><h4 id="转换-QP-状态"><a href="#转换-QP-状态" class="headerlink" title="转换 QP 状态"></a>Transition the QP State</h4><p>A QP has a state machine that defines what the QP can do in each state:</p><ul><li>RESET: the reset state. A newly created QP starts in RESET. Neither send requests nor receive requests can be posted to the QP, and all incoming messages are silently dropped.</li><li>INIT: the initialized state. Send requests cannot be posted. Receive requests can be posted but are not processed, and all incoming messages are silently dropped. It is good practice to post receive requests while the QP is in this state, before moving it to RTR, so that a remote QP that sends a message never finds the receive queue empty.</li><li>RTR: the Ready To Receive state. Send requests cannot be posted, but receive requests can be posted and are processed, and all incoming messages are handled. The first message received in this state triggers the asynchronous "communication established" event.</li><li>RTS: the Ready To Send state. Both send and receive requests can be posted and are processed, and all incoming messages are handled.</li><li>SQD: the Send Queue Drained state. The QP finishes processing every send request whose processing has already started.</li><li>SQE: the Send Queue Error state. For a QP with an unreliable transport type, the RDMA device automatically moves it to this state when its send queue hits an error.</li><li>ERROR: the error state. All outstanding work requests are flushed.</li></ul><p><img alt 
data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-16_rdma-qp-state-machine.png"></p><ul><li>States: RESET -&gt; INIT -&gt; RTR -&gt; RTS</li><li>The transitions must be made strictly in this order</li><li>Once the QP is in INIT, ibv_post_recv can be called to post a receive buffer</li><li>Once the QP enters RTR (Ready To Receive), it starts processing incoming messages</li><li>After RTR the QP can move to RTS (Ready To Send); in RTS, ibv_post_send can be called</li></ul><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">/* 10 Transition the QP state */</span></span><br><span class="line"><span class="comment">// RESET -&gt; 
INIT</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">ibv_qp_attr</span> <span class="title">attr</span>;</span></span><br><span class="line"><span class="keyword">int</span> flags;</span><br><span class="line"><span class="built_in">memset</span>(&amp;attr, <span class="number">0</span>, <span class="keyword">sizeof</span>(attr));</span><br><span class="line">attr.qp_state = IBV_QPS_INIT;</span><br><span class="line">attr.port_num = <span class="number">1</span>; <span class="comment">// IB port number</span></span><br><span class="line">attr.pkey_index = <span class="number">0</span>;</span><br><span class="line">attr.qp_access_flags = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE;</span><br><span class="line">flags = IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT | IBV_QP_ACCESS_FLAGS;</span><br><span class="line">rc = ibv_modify_qp(res.qp, &amp;attr, flags);</span><br><span class="line"><span class="keyword">if</span> (rc)</span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stderr</span>, <span class="string">"failed to modify QP state to INIT\n"</span>);</span><br><span class="line"></span><br><span class="line"><span class="comment">//INIT -&gt; RTR(Ready To Receive)</span></span><br><span class="line"><span class="built_in">memset</span>(&amp;attr, <span class="number">0</span>, <span class="keyword">sizeof</span>(attr));</span><br><span class="line">attr.qp_state = IBV_QPS_RTR;</span><br><span class="line">attr.path_mtu = IBV_MTU_256;</span><br><span class="line">attr.dest_qp_num = res.remote_props.qp_num;</span><br><span class="line">attr.rq_psn = <span class="number">0</span>;</span><br><span class="line">attr.max_dest_rd_atomic = <span class="number">1</span>;</span><br><span class="line">attr.min_rnr_timer = <span class="number">0x12</span>;</span><br><span class="line">attr.ah_attr.is_global = <span 
class="number">0</span>;</span><br><span class="line">attr.ah_attr.dlid = res.remote_props.lid;</span><br><span class="line">attr.ah_attr.sl = <span class="number">0</span>;</span><br><span class="line">attr.ah_attr.src_path_bits = <span class="number">0</span>;</span><br><span class="line">attr.ah_attr.port_num = <span class="number">1</span>;</span><br><span class="line">flags = IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN | IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER;</span><br><span class="line">rc = ibv_modify_qp(res.qp, &amp;attr, flags);</span><br><span class="line"><span class="keyword">if</span> (rc)</span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stderr</span>, <span class="string">"failed to modify QP state to RTR\n"</span>);</span><br><span class="line"></span><br><span class="line"><span class="comment">//RTR -&gt; RTS(Ready To Send)</span></span><br><span class="line"><span class="built_in">memset</span>(&amp;attr, <span class="number">0</span>, <span class="keyword">sizeof</span>(attr));</span><br><span class="line">attr.qp_state = IBV_QPS_RTS;</span><br><span class="line">attr.timeout = <span class="number">0x12</span>;</span><br><span class="line">attr.retry_cnt = <span class="number">6</span>;</span><br><span class="line">attr.rnr_retry = <span class="number">0</span>;</span><br><span class="line">attr.sq_psn = <span class="number">0</span>;</span><br><span class="line">attr.max_rd_atomic = <span class="number">1</span>;</span><br><span class="line">flags = IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC;</span><br><span class="line">rc = ibv_modify_qp(res.qp, &amp;attr, flags);</span><br><span class="line"><span class="keyword">if</span> (rc)</span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stderr</span>, <span class="string">"failed to modify QP state to 
RTS\n"</span>);</span><br></pre></td></tr></table></figure><h4 id="创建发送-接收任务"><a href="#创建发送-接收任务" class="headerlink" title="创建发送/接收任务"></a>Create Send/Receive Work Requests</h4><ul><li>ibv_send_wr (send work request)</li><li>The request is posted to the QP's SQ (Send Queue)</li><li>A send request can carry one of three operations: Send, Read, or Write</li><li>A Send requires the peer to post a matching Receive</li><li>Read and Write operate directly on the peer's memory without the peer being aware of it</li><li>The HCA is told the memory address, size, and key of the data to send</li><li>Read and Write additionally need to tell the HCA the remote memory address and key</li></ul><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">/* 11 Build the send work request (ibv_send_wr) */</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">ibv_send_wr</span> <span class="title">sr</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">ibv_sge</span> <span class="title">sge</span>;</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">ibv_send_wr</span> *<span class="title">bad_wr</span> = <span class="title">NULL</span>;</span></span><br><span class="line"><span class="keyword">int</span> rc;</span><br><span class="line"><span class="comment">/* prepare the scatter/gather entry 
*/</span></span><br><span class="line"><span class="built_in">memset</span>(&amp;sge, <span class="number">0</span>, <span class="keyword">sizeof</span>(sge));</span><br><span class="line">sge.addr = (<span class="keyword">uintptr_t</span>)res-&gt;buf;</span><br><span class="line">sge.length = MSG_SIZE;</span><br><span class="line">sge.lkey = res-&gt;mr-&gt;lkey;</span><br><span class="line"><span class="comment">/* prepare the send work request */</span></span><br><span class="line"><span class="built_in">memset</span>(&amp;sr, <span class="number">0</span>, <span class="keyword">sizeof</span>(sr));</span><br><span class="line">sr.next = <span class="literal">NULL</span>;</span><br><span class="line">sr.wr_id = <span class="number">0</span>;</span><br><span class="line">sr.sg_list = &amp;sge;</span><br><span class="line">sr.num_sge = <span class="number">1</span>;</span><br><span class="line">sr.opcode = opcode;</span><br><span class="line">sr.send_flags = IBV_SEND_SIGNALED;</span><br><span class="line"><span class="keyword">if</span> (opcode != IBV_WR_SEND)</span><br><span class="line">&#123;</span><br><span class="line">sr.wr.rdma.remote_addr = res-&gt;remote_props.addr;</span><br><span class="line">sr.wr.rdma.rkey = res-&gt;remote_props.rkey;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h4 id="提交发送-接收任务"><a href="#提交发送-接收任务" class="headerlink" title="提交发送/接收任务"></a>Post Send/Receive Work Requests</h4><ul><li>Send: <code>ibv_post_send</code></li><li>Receive: <code>ibv_post_recv</code></li></ul><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">rc = ibv_post_send(res-&gt;qp, &amp;sr, &amp;bad_wr);</span><br><span class="line"><span class="keyword">if</span> (rc)</span><br><span class="line"><span class="built_in">fprintf</span>(<span class="built_in">stderr</span>, <span 
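class="string">"failed to post SR\n"</span>);</span><br><span class="line"><span class="keyword">return</span> rc;</span><br></pre></td></tr></table></figure><h4 id="轮询任务完成信息"><a href="#轮询任务完成信息" class="headerlink" title="轮询任务完成信息"></a>Poll for Completions</h4><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">/* 13 Poll for the work completion */</span></span><br><span class="line"><span class="class"><span class="keyword">struct</span> <span class="title">ibv_wc</span> <span class="title">wc</span>;</span></span><br><span class="line"><span class="keyword">int</span> poll_result;</span><br><span class="line"><span class="keyword">int</span> rc = <span class="number">0</span>;</span><br><span class="line"><span class="keyword">do</span></span><br><span class="line">&#123;</span><br><span class="line">poll_result = ibv_poll_cq(res-&gt;cq, <span class="number">1</span>, &amp;wc);</span><br><span class="line">&#125; <span class="keyword">while</span> (poll_result == <span class="number">0</span>);</span><br></pre></td></tr></table></figure><h3 id="RDMA-单边操作"><a href="#RDMA-单边操作" class="headerlink" title="RDMA 单边操作"></a>RDMA One-Sided Operations</h3><blockquote><p>One-sided transfers are the biggest difference between RDMA and traditional network transport: they give direct access to a remote virtual address without involving the remote application, which makes them well suited to bulk data transfer.</p></blockquote><p>READ and WRITE are one-sided operations: only the local side needs to know the source and destination addresses, and the remote application is unaware of the transfer. The data is read or written by RDMA directly between the RNIC and the application buffer, and the remote RNIC then wraps the result in a message returned to the local side.</p><h4 id="RDMA-Read"><a href="#RDMA-Read" class="headerlink" title="RDMA Read"></a>RDMA Read</h4><p>Taking storage in a storage-network environment as an example, the data flow of a one-sided read is as follows:</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-16_rdma-read.png"></p><ol><li>A and B first establish a connection; their QPs have been created and initialized.</li><li>The data is stored in B's buffer at address VB. VB must be registered with B's RNIC beforehand (as a Memory Region), which returns the keys; the remote key (rkey) is what grants a peer RDMA access to this buffer.</li><li>B packs the data address VB and the key into a dedicated message and sends it to A, which effectively hands A the right to operate on the buffer. B also posts a WR to its WQ to receive the transfer status that A returns.</li><li>After A receives VB and R_key from B, its RNIC packs them, together with A's own destination address VA, into an RDMA READ request and sends it to B. Without any software involvement on either side, B's data is then stored at A's virtual address VA.</li><li>Once the data has been stored, A returns the status of the whole transfer to B.</li></ol><h4 id="RDMA-Write"><a href="#RDMA-Write" class="headerlink" title="RDMA Write"></a>RDMA Write</h4><p>Taking storage in a storage-network environment as an example, the data flow of a one-sided write is as follows:</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-16_rdma-write.png"></p><ol><li>A and B first establish a connection; their QPs have been created and initialized.</li><li>The remote target buffer for the data is at address VB. VB must be registered with B's RNIC beforehand (as a Memory Region), which returns the keys; the remote key (rkey) is what grants a peer RDMA access to this buffer.</li><li>B packs the data address VB and the key into a dedicated message and sends it to A, which effectively hands A the right to operate on the buffer. B also posts a WR to its WQ to receive the transfer status that A returns.</li><li>After A receives VB and R_key from B, its RNIC packs them, together with A's own source address VA, into an RDMA WRITE request. Without any software involvement on either side, A's data is then written to B's virtual address VB.</li><li>Once A has finished sending the data, it returns the status of the whole transfer to B.</li></ol><h3 id="RDMA-双边操作"><a href="#RDMA-双边操作" class="headerlink" title="RDMA 双边操作"></a>RDMA Two-Sided Operations</h3><blockquote><p>Two-sided operations resemble the low-level buffer pool of a traditional network: the sender and receiver participate in the same way, and the differences are zero copy and kernel bypass. In fact, some advanced network SoCs in traditional networking already implement similar features. For RDMA, this is a relatively complex message transfer mode, used mostly for short control messages.</p></blockquote><p>In RDMA, SEND/RECEIVE are two-sided operations: the remote application must be aware of the transfer and take part in it for the exchange to complete. In practice, SEND/RECEIVE is used mostly for connection control messages, while data is mostly moved with READ/WRITE. As an example of a two-sided operation, host A sends data to host B as follows:</p><ul><li>First, A and B each create and initialize their own QP and CQ and register a Memory Region for RDMA. A wants to send data to B.</li></ul><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-16_rdma-send-example-step-11.jpg"></p><ul><li>A and B each post a WQE to their own WQ. For A, the WQ is the SQ and the WQE points to the data waiting to be sent; for B, the WQ is the RQ and the WQE points to a buffer that will hold the received data.</li></ul><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-16_rdma-send-example-step-2.jpg"></p><ul><li>A's HCA, being hardware, takes the WQE from the SQ, sees that it is a SEND, and sends the data straight from A's buffer to B. When the data stream reaches B's RNIC, B's HCA takes a WQE from the RQ and stores the data directly at the location the WQE points to.</li></ul><p><img alt 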
data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-16_rdma-send-example-step-3.jpg"></p><ul><li>When the transfer between A and B completes, a CQE is generated in A's CQ to indicate that the send completed, and at the same time a CQE is generated in B's CQ to indicate that the receive completed. Every WQE processed from a WQ produces a CQE. A CQE is produced even if the transfer fails, and a field in the CQE reports the status of the transfer.</li></ul><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-16_rdma-send-example-step-4.jpg"></p><h2 id="工具使用"><a href="#工具使用" class="headerlink" title="工具使用"></a>Using the Tools</h2><h3 id="带宽测试"><a href="#带宽测试" class="headerlink" title="带宽测试"></a>Bandwidth Tests</h3><h4 id="ib-read-bw"><a href="#ib-read-bw" class="headerlink" title="ib_read_bw"></a>ib_read_bw</h4><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">ServerA: ib_read_bw -a -d mlx4_0</span><br><span class="line">ServerB: ib_read_bw -a -F &lt;ServerAIP&gt; -d mlx4_0 --report_gbits</span><br></pre></td></tr></table></figure><p>For example:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span 
class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Server A</span></span><br><span class="line"><span class="comment"># Here -q sets the number of QPs to 2 and -x sets the GID index</span></span><br><span class="line">$ ib_read_bw -q 2 -x 3 --report_g --run_infinitely</span><br><span class="line"></span><br><span class="line">************************************</span><br><span class="line">* Waiting <span class="keyword">for</span> client to connect... 
*</span><br><span class="line">************************************</span><br><span class="line">---------------------------------------------------------------------------------------</span><br><span class="line">                    RDMA_Read BW Test</span><br><span class="line"> Dual-port       : OFF          Device         : mlx5_1</span><br><span class="line"> Number of qps   : 2            Transport <span class="built_in">type</span> : IB</span><br><span class="line"> Connection <span class="built_in">type</span> : RC           Using SRQ      : OFF</span><br><span class="line"> PCIe relax order: ON</span><br><span class="line"> CQ Moderation   : 1</span><br><span class="line"> Mtu             : 1024[B]</span><br><span class="line"> Link <span class="built_in">type</span>       : Ethernet</span><br><span class="line"> GID index       : 3</span><br><span class="line"> Outstand reads  : 16</span><br><span class="line"> rdma_cm QPs : OFF</span><br><span class="line"> Data ex. method : Ethernet</span><br><span class="line">---------------------------------------------------------------------------------------</span><br><span class="line"> <span class="built_in">local</span> address: LID 0000 QPN 0x02d9 PSN 0xf326ce OUT 0x10 RKey 0x060d13 VAddr 0x007fa3bccfc000</span><br><span class="line"> GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:00:12</span><br><span class="line"> <span class="built_in">local</span> address: LID 0000 QPN 0x02da PSN 0x1403a0 OUT 0x10 RKey 0x060d13 VAddr 0x007fa3bcd0c000</span><br><span class="line"> GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:00:12</span><br><span class="line"> remote address: LID 0000 QPN 0x02c9 PSN 0x861fee OUT 0x10 RKey 0x050f13 VAddr 0x007f55afc0f000</span><br><span class="line"> GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:05</span><br><span class="line"> remote address: LID 0000 QPN 0x02ca PSN 0xfad640 OUT 0x10 RKey 0x050f13 VAddr 0x007f55afc1f000</span><br><span class="line"> GID: 
00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:05</span><br><span class="line">---------------------------------------------------------------------------------------</span><br><span class="line"> <span class="comment">#bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Server B</span></span><br><span class="line">$ ib_read_bw 172.18.0.237 -q 2 -x 3 --report_g --run_infinitely</span><br><span class="line">---------------------------------------------------------------------------------------</span><br><span class="line">                    RDMA_Read BW Test</span><br><span class="line"> Dual-port       : OFF          Device         : mlx5_3</span><br><span class="line"> Number of qps   : 2            Transport <span class="built_in">type</span> : IB</span><br><span class="line"> Connection <span class="built_in">type</span> : RC           Using SRQ      : OFF</span><br><span class="line"> PCIe relax order: ON</span><br><span class="line"> TX depth        : 128</span><br><span class="line"> CQ Moderation   : 1</span><br><span class="line"> Mtu             : 1024[B]</span><br><span class="line"> Link <span class="built_in">type</span>       : Ethernet</span><br><span class="line"> GID index       : 3</span><br><span class="line"> Outstand reads  : 16</span><br><span class="line"> rdma_cm QPs : OFF</span><br><span class="line"> Data ex. 
method : Ethernet</span><br><span class="line">---------------------------------------------------------------------------------------</span><br><span class="line"> <span class="built_in">local</span> address: LID 0000 QPN 0x02c9 PSN 0x861fee OUT 0x10 RKey 0x050f13 VAddr 0x007f55afc0f000</span><br><span class="line"> GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:05</span><br><span class="line"> <span class="built_in">local</span> address: LID 0000 QPN 0x02ca PSN 0xfad640 OUT 0x10 RKey 0x050f13 VAddr 0x007f55afc1f000</span><br><span class="line"> GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:05</span><br><span class="line"> remote address: LID 0000 QPN 0x02d9 PSN 0xf326ce OUT 0x10 RKey 0x060d13 VAddr 0x007fa3bccfc000</span><br><span class="line"> GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:00:12</span><br><span class="line"> remote address: LID 0000 QPN 0x02da PSN 0x1403a0 OUT 0x10 RKey 0x060d13 VAddr 0x007fa3bcd0c000</span><br><span class="line"> GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:00:12</span><br><span class="line">---------------------------------------------------------------------------------------</span><br><span class="line"> <span class="comment">#bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]</span></span><br><span class="line"> 65536      829982           0.00               87.06     0.166061</span><br><span class="line"> 65536      828835           0.00               86.94     0.165832</span><br><span class="line"> 65536      828849           0.00               86.95     0.165835</span><br><span class="line"> 65536      828828           0.00               86.94     0.165831</span><br><span class="line"> 65536      828801           0.00               86.94     0.165825</span><br><span class="line"> 65536      828795           0.00               86.94     0.165824</span><br><span class="line"> 65536      828852           0.00               86.95     
0.165835</span><br></pre></td></tr></table></figure><h4 id="ib-write-bw"><a href="#ib-write-bw" class="headerlink" title="ib_write_bw"></a>ib_write_bw</h4><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">ServerA: ib_write_bw -a -d mlx4_0</span><br><span class="line">ServerB: ib_write_bw -a -F &lt;ServerAIP&gt; -d mlx4_0 --report_gbits</span><br></pre></td></tr></table></figure><h4 id="ib-send-bw"><a href="#ib-send-bw" class="headerlink" title="ib_send_bw"></a>ib_send_bw</h4><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">ServerA: ib_send_bw -a -d mlx4_0</span><br><span class="line">ServerB: ib_send_bw -a -F &lt;ServerAIP&gt; -d mlx4_0 --report_gbits</span><br></pre></td></tr></table></figure><h3 id="延迟测试"><a href="#延迟测试" class="headerlink" title="延迟测试"></a>Latency Tests</h3><p>Latency testing likewise has three commands, used the same way as the bandwidth tests above:</p><ul><li><code>ib_read_lat</code></li><li><code>ib_write_lat</code></li><li><code>ib_send_lat</code></li></ul><p>Taking <code>ib_read_lat</code> as an example, the test output is:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span 
class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Server A</span></span><br><span class="line">$ ib_read_lat -x 3 --report_g</span><br><span class="line"></span><br><span class="line">************************************</span><br><span class="line">* Waiting <span class="keyword">for</span> client to connect... 
*</span><br><span class="line">************************************</span><br><span class="line">---------------------------------------------------------------------------------------</span><br><span class="line">                    RDMA_Read Latency Test</span><br><span class="line"> Dual-port       : OFF        Device         : mlx5_1</span><br><span class="line"> Number of qps   : 1        Transport <span class="built_in">type</span> : IB</span><br><span class="line"> Connection <span class="built_in">type</span> : RC        Using SRQ      : OFF</span><br><span class="line"> PCIe relax order: ON</span><br><span class="line"> Mtu             : 1024[B]</span><br><span class="line"> Link <span class="built_in">type</span>       : Ethernet</span><br><span class="line"> GID index       : 3</span><br><span class="line"> Outstand reads  : 16</span><br><span class="line"> rdma_cm QPs : OFF</span><br><span class="line"> Data ex. method : Ethernet</span><br><span class="line">---------------------------------------------------------------------------------------</span><br><span class="line"> <span class="built_in">local</span> address: LID 0000 QPN 0x02db PSN 0x144f6a OUT 0x10 RKey 0x060d14 VAddr 0x00000001849000</span><br><span class="line"> GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:00:12</span><br><span class="line"> remote address: LID 0000 QPN 0x02cb PSN 0x1c38a6 OUT 0x10 RKey 0x050f14 VAddr 0x00000000e59000</span><br><span class="line"> GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:05</span><br><span class="line">---------------------------------------------------------------------------------------</span><br><span class="line"></span><br><span class="line"><span class="comment"># Server B</span></span><br><span class="line">$ ib_read_lat 172.18.0.237 -x 3 --report_g</span><br><span class="line">---------------------------------------------------------------------------------------</span><br><span class="line">                    RDMA_Read Latency Test</span><br><span 
class="line"> Dual-port       : OFF        Device         : mlx5_3</span><br><span class="line"> Number of qps   : 1        Transport <span class="built_in">type</span> : IB</span><br><span class="line"> Connection <span class="built_in">type</span> : RC        Using SRQ      : OFF</span><br><span class="line"> PCIe relax order: ON</span><br><span class="line"> TX depth        : 1</span><br><span class="line"> Mtu             : 1024[B]</span><br><span class="line"> Link <span class="built_in">type</span>       : Ethernet</span><br><span class="line"> GID index       : 3</span><br><span class="line"> Outstand reads  : 16</span><br><span class="line"> rdma_cm QPs : OFF</span><br><span class="line"> Data ex. method : Ethernet</span><br><span class="line">---------------------------------------------------------------------------------------</span><br><span class="line"> <span class="built_in">local</span> address: LID 0000 QPN 0x02cb PSN 0x1c38a6 OUT 0x10 RKey 0x050f14 VAddr 0x00000000e59000</span><br><span class="line"> GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:05</span><br><span class="line"> remote address: LID 0000 QPN 0x02db PSN 0x144f6a OUT 0x10 RKey 0x060d14 VAddr 0x00000001849000</span><br><span class="line"> GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:00:12</span><br><span class="line">---------------------------------------------------------------------------------------</span><br><span class="line"> <span class="comment">#bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]</span></span><br><span class="line"> 2       1000          10.87          14.62        11.40           11.48       0.34   12.74   14.62</span><br><span class="line">---------------------------------------------------------------------------------------</span><br></pre></td></tr></table></figure><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>References</h2><ul><li><a 
href="https://www.mellanox.com/pdf/whitepapers/IB_Intro_WP_190.pdf" target="_blank" rel="external nofollow noopener noreferrer">Introduction to InfiniBand White Paper</a></li><li><a href="https://www.snia.org/sites/default/files/files2/files2/SDC2013/presentations/Hardware/DavidDeming_Infiniband_Architectural_Overview.pdf" target="_blank" rel="external nofollow noopener noreferrer">InfiniBand Architecture Overview</a></li><li><a href="https://www.afs.enea.it/asantoro/V1r1_2_1.Release_12062007.pdf" target="_blank" rel="external nofollow noopener noreferrer">InfiniBand Architecture Specification Release 1.2.1</a></li><li><a href="https://zcopy.wordpress.com/2010/10/08/quick-concepts-part-1-%E2%80%93-introduction-to-rdma/" target="_blank" rel="external nofollow noopener noreferrer">Introduction to RDMA</a></li><li><a href="https://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf" target="_blank" rel="external nofollow noopener noreferrer">RDMA Aware Network Programming User Manual</a></li><li><a href="https://github.com/jcxue/RDMA-Tutorial" target="_blank" rel="external nofollow noopener noreferrer">RDMA-Tutorial</a></li><li><a href="http://www.rdmamojo.com/" target="_blank" rel="external nofollow noopener noreferrer">RDMA Mojo Blog</a></li></ul>]]></content>
    
    <summary type="html">
    
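      &lt;p&gt;RDMA (&lt;code&gt;Remote Direct Memory Access&lt;/code&gt;) is a technique for accessing data in the memory of a &lt;strong&gt;remote&lt;/strong&gt; host while bypassing that host's &lt;code&gt;OS kernel&lt;/code&gt;. The idea derives from &lt;code&gt;DMA&lt;/code&gt;. With DMA, a peripheral (a PCIe device) can access &lt;code&gt;host memory&lt;/code&gt; directly without going through the CPU; with RDMA, the device can bypass the CPU to access not only the local host's memory but also user-space memory on another host. Because the operating system is not on the data path, RDMA saves a large amount of CPU time, &lt;strong&gt;raises system throughput&lt;/strong&gt;, and &lt;strong&gt;lowers network communication latency&lt;/strong&gt;, and it is widely used in high-performance computing and deep learning training. This article introduces the architecture and principles of RDMA and explains how to use an RDMA network. The test code is available on &lt;a href=&quot;https://github.com/SimpCosm/cake/tree/master/rdma&quot; target=&quot;_blank&quot; rel=&quot;external nofollow noopener noreferrer&quot;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;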
    
    </summary>
    
    <content src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-13_rdma-queue.svg" type="image" />
    
    
      <category term="术业专攻" scheme="https://houmin.cc/categories/%E6%9C%AF%E4%B8%9A%E4%B8%93%E6%94%BB/"/>
    
    
      <category term="RDMA" scheme="https://houmin.cc/tags/RDMA/"/>
    
      <category term="网络" scheme="https://houmin.cc/tags/%E7%BD%91%E7%BB%9C/"/>
    
  </entry>
  
  <entry>
    <title>[Heterogeneous Computing] NVIDIA XID Message</title>
    <link href="https://houmin.cc/posts/feaa4605/"/>
    <id>https://houmin.cc/posts/feaa4605/</id>
    <published>2021-01-21T06:19:26.000Z</published>
    <updated>2022-11-09T15:13:45.391Z</updated>
    
    <content type="html"><![CDATA[<link rel="stylesheet" class="aplayer-secondary-style-marker" href="/assets/css/APlayer.min.css"><script src="/assets/js/APlayer.min.js" class="aplayer-secondary-script-marker"></script><script class="meting-secondary-script-marker" src="/assets/js/Meting.min.js"></script><p>An <code>Xid Message</code> is an error report from the NVIDIA driver that is written to the operating system's kernel log or event log. An Xid message indicates that a general GPU error occurred, most often because the driver programmed the GPU incorrectly or because the commands sent to the GPU were corrupted. The message may point to a hardware problem, an NVIDIA software problem, or a user application problem.</p><a id="more"></a><p>An Xid message can have one of three kinds of cause:</p><ul><li>Hardware Problem</li><li>NVIDIA Software Problem</li><li>User Application Problem</li></ul><p>Xid messages are meant for error diagnosis and help in debugging reported errors. The meaning of each Xid message stays the same across NVIDIA driver versions.</p><h2 id="查看-Xid-Errors"><a href="#查看-Xid-Errors" class="headerlink" title="查看 Xid Errors"></a>Viewing Xid Errors</h2><p>On Linux, Xid errors are recorded in <code>/var/log/messages</code>. The example below shows the log entries for XID 14:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$ grep <span class="string">"NVRM: Xid"</span> /var/<span class="built_in">log</span>/messages</span><br><span class="line">[…] NVRM: GPU at 0000:03:00: GPU-b850f46d-d5ea-c752-ddf3-c4453e44d3f7 </span><br><span class="line">[…] NVRM: Xid (0000:03:00): 14, Channel 00000001</span><br></pre></td></tr></table></figure><p>The NVML library provided by NVIDIA can listen for GPU Xid errors. The following Go code is an example listener:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span 
class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br></pre></td><td class="code"><pre><span class="line">eventSet := nvml.NewEventSet()</span><br><span class="line"><span class="keyword">defer</span> nvml.DeleteEventSet(eventSet)</span><br><span class="line"></span><br><span class="line"><span class="keyword">for</span> _, gpu := <span class="keyword">range</span> devices &#123;</span><br><span class="line">err = nvml.RegisterEventForDevice(eventSet, nvml.XidCriticalError, gpu)</span><br><span class="line"><span class="comment">// ...</span></span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">for</span> &#123;</span><br><span class="line"><span class="keyword">select</span> &#123;</span><br><span class="line"><span class="keyword">case</span> &lt;-stop:</span><br><span class="line"><span class="keyword">return</span></span><br><span class="line"><span class="keyword">default</span>:</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">e, err := nvml.WaitForEvent(eventSet, <span class="number">5000</span>)</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> &amp;&amp; e.Etype != nvml.XidCriticalError &#123;</span><br><span class="line"><span 
class="keyword">continue</span></span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// <span class="doctag">FIXME:</span> formalize the full list and document it.</span></span><br><span class="line"><span class="comment">// http://docs.nvidia.com/deploy/xid-errors/index.html#topic_4</span></span><br><span class="line"><span class="comment">// Application errors: the GPU should still be healthy</span></span><br><span class="line"><span class="keyword">if</span> e.Edata == <span class="number">31</span> || e.Edata == <span class="number">43</span> || e.Edata == <span class="number">45</span> &#123;</span><br><span class="line"><span class="keyword">continue</span></span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> e.UUID == <span class="literal">nil</span> || <span class="built_in">len</span>(*e.UUID) == <span class="number">0</span> &#123;</span><br><span class="line"><span class="comment">// All devices are unhealthy</span></span><br><span class="line">log.Printf(<span class="string">"XidCriticalError: Xid=%d, All devices will go unhealthy."</span>, e.Edata)</span><br><span class="line"><span class="keyword">for</span> _, d := <span class="keyword">range</span> devices &#123;</span><br><span class="line">unhealthy &lt;- d</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">continue</span></span><br><span class="line">&#125;</span><br><span class="line">   <span class="comment">//...</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h2 id="Common-Xid-Errors"><a href="#Common-Xid-Errors" class="headerlink" title="Common Xid Errors"></a>Common Xid Errors</h2><h3 id="XID-13：GR-SW-Notify-Error"><a href="#XID-13：GR-SW-Notify-Error" class="headerlink" title="XID 13：GR: SW Notify Error"></a>XID 13：GR: SW Notify Error</h3><p>XID 13 
is a general user-process error, typically an out-of-bounds array access, an illegal instruction, or an illegal register access. Only rarely is it a hardware or kernel driver problem; it is almost always a problem in the user process.</p><p>When this error occurs, NVIDIA recommends the following steps:</p><ol><li>Run the application in cuda-gdb or cuda-memcheck, or</li><li>Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and then attach later with cuda-gdb, or</li><li>File a bug if the previous two come back inconclusive to eliminate potential NVIDIA driver or hardware bug.</li></ol><h3 id="XID-31-Fifo-MMU-Error"><a href="#XID-31-Fifo-MMU-Error" class="headerlink" title="XID 31: Fifo: MMU Error"></a>XID 31: Fifo: MMU Error</h3><p>XID 31 is logged when the MMU reports a fault, for example when a user process accesses an illegal address. It is usually an application-level bug, but it can also be a driver or hardware bug.</p><p>When this error occurs, NVIDIA recommends the following steps:</p><ol><li>Run the application in cuda-gdb or cuda-memcheck, or</li><li>Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1 and then attach later with cuda-gdb, or</li><li>File a bug if the previous two come back inconclusive to eliminate potential NVIDIA driver or hardware bug.</li></ol><h3 id="XID-32-PBDMA-Error"><a href="#XID-32-PBDMA-Error" class="headerlink" title="XID 32: PBDMA Error"></a>XID 32: PBDMA Error</h3><p>XID 32 is reported by the DMA controller, which manages communication between the NVIDIA driver and the GPU over the PCIe bus.</p><p>These failures mainly stem from PCI quality issues and are generally not caused by the user application.</p><h3 id="XID-43-Reset-Channel-VERIF-Error"><a href="#XID-43-Reset-Channel-VERIF-Error" class="headerlink" title="XID 43: Reset Channel VERIF Error"></a>XID 43: Reset Channel VERIF Error</h3><p>XID 43 is logged when a user application hits a software-induced fault and has to be terminated. The GPU itself remains healthy.</p><p>In most cases this indicates a user application error rather than a driver bug.</p><h3 id="XID-45-OS-Preemptive-Channel-Removal"><a href="#XID-45-OS-Preemptive-Channel-Removal" class="headerlink" title="XID 45: OS: Preemptive Channel Removal"></a>XID 45: OS: Preemptive Channel Removal</h3><p>XID 45 is logged when the user application aborts and the kernel driver has to tear down the GPU application running on the GPU. <code>Ctrl-C</code>, a GPU reset, and SIGKILL all produce this event.</p><p>In most cases this is caused by the user process rather than by a driver bug.</p><h3 id="XID-48-DBE-Double-Bit-Error-ECC-Error"><a href="#XID-48-DBE-Double-Bit-Error-ECC-Error" class="headerlink" 
title="XID 48: DBE(Double Bit Error) ECC Error"></a>XID 48: DBE(Double Bit Error) ECC Error</h3><p>XID 48 is logged when the GPU detects an uncorrectable error on the GPU; the error is also reported back to the user application. A GPU reset or a node reboot is needed to clear it. The <code>nvidia-smi</code> tool provides a summary of ECC errors.</p><h2 id="Xid-Error-Listing"><a href="#Xid-Error-Listing" class="headerlink" title="Xid Error Listing"></a>Xid Error Listing</h2><p>The following table lists the Xid errors and their possible causes:</p><div class="table-container"><table><thead><tr><th rowspan="2">XID</th><th rowspan="2">Failure</th><th colspan="7">Causes</th></tr><tr><th>HW Error</th><th>Driver Error</th><th>User App Error</th><th>System Memory Corruption</th><th>Bus Error</th><th>Thermal Issue</th><th>FB Corruption</th></tr></thead><tbody><tr><td>1</td><td>Invalid or corrupted push buffer stream</td><td></td><td>X</td><td></td><td>X</td><td>X</td><td></td><td>X</td></tr><tr><td>2</td><td>Invalid or corrupted push buffer stream</td><td></td><td>X</td><td></td><td>X</td><td>X</td><td></td><td>X</td></tr><tr><td>3</td><td>Invalid or corrupted push buffer stream</td><td></td><td>X</td><td></td><td>X</td><td>X</td><td></td><td>X</td></tr><tr><td>4</td><td>Invalid or corrupted push buffer stream</td><td></td><td>X</td><td></td><td>X</td><td>X</td><td></td><td>X</td></tr><tr><td></td><td>GPU semaphore timeout</td><td></td><td>X</td><td>X</td><td>X</td><td>X</td><td></td><td>X</td></tr><tr><td>5</td><td>Unused</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>6</td><td>Invalid or corrupted push buffer stream</td><td></td><td>X</td><td></td><td>X</td><td>X</td><td></td><td>X</td></tr><tr><td>7</td><td>Invalid or corrupted push buffer address</td><td></td><td>X</td><td></td><td></td><td>X</td><td></td><td>X</td></tr><tr><td>8</td><td>GPU stopped processing</td><td></td><td>X</td><td>X</td><td></td><td>X</td><td>X</td><td></td></tr><tr><td>9</td><td>Driver error programming 
GPU</td><td></td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>10</td><td>Unused</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>11</td><td>Invalid or corrupted push buffer stream</td><td></td><td>X</td><td></td><td>X</td><td>X</td><td></td><td>X</td></tr><tr><td>12</td><td>Driver error handling GPU exception</td><td></td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>13</td><td>Graphics Engine Exception</td><td></td><td>X</td><td>X</td><td>X</td><td>X</td><td>X</td><td>X</td></tr><tr><td>14</td><td>Unused</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>15</td><td>Unused</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>16</td><td>Display engine hung</td><td></td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>17</td><td>Unused</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>18</td><td>Bus mastering disabled in PCI Config Space</td><td></td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>19</td><td>Display Engine error</td><td></td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>20</td><td>Invalid or corrupted Mpeg push buffer</td><td></td><td>X</td><td></td><td>X</td><td>X</td><td></td><td>X</td></tr><tr><td>21</td><td>Invalid or corrupted Motion Estimation push buffer</td><td></td><td>X</td><td></td><td>X</td><td>X</td><td></td><td>X</td></tr><tr><td>22</td><td>Invalid or corrupted Video Processor push buffer</td><td></td><td>X</td><td></td><td>X</td><td>X</td><td></td><td>X</td></tr><tr><td>23</td><td>Unused</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>24</td><td>GPU semaphore timeout</td><td></td><td>X</td><td>X</td><td>X</td><td>X</td><td>X</td><td>X</td></tr><tr><td>25</td><td>Invalid or illegal push buffer 
stream</td><td></td><td>X</td><td>X</td><td>X</td><td>X</td><td></td><td>X</td></tr><tr><td>26</td><td>Framebuffer timeout</td><td></td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>27</td><td>Video processor exception</td><td></td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>28</td><td>Video processor exception</td><td></td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>29</td><td>Video processor exception</td><td></td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>30</td><td>GPU semaphore access error</td><td></td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>31</td><td>GPU memory page fault</td><td></td><td>X</td><td>X</td><td></td><td></td><td></td><td></td></tr><tr><td>32</td><td>Invalid or corrupted push buffer stream</td><td></td><td>X</td><td></td><td>X</td><td>X</td><td>X</td><td>X</td></tr><tr><td>33</td><td>Internal micro-controller error</td><td></td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>34</td><td>Video processor exception</td><td></td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>35</td><td>Video processor exception</td><td></td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>36</td><td>Video processor exception</td><td></td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>37</td><td>Driver firmware error</td><td></td><td>X</td><td></td><td>X</td><td>X</td><td></td><td></td></tr><tr><td>38</td><td>Driver firmware error</td><td></td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>39</td><td>Unused</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>40</td><td>Unused</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>41</td><td>Unused</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>42</td><td>Video processor 
exception</td><td></td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>43</td><td>GPU stopped processing</td><td></td><td>X</td><td>X</td><td></td><td></td><td></td><td></td></tr><tr><td>44</td><td>Graphics Engine fault during context switch</td><td></td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>45</td><td>Preemptive cleanup, due to previous errors — Most likely to see when running multiple cuda applications and hitting a DBE</td><td></td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>46</td><td>GPU stopped processing</td><td></td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>47</td><td>Video processor exception</td><td></td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>48</td><td>Double Bit ECC Error</td><td>X</td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>49</td><td>Unused</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>50</td><td>Unused</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>51</td><td>Unused</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>52</td><td>Unused</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>53</td><td>Unused</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>54</td><td>Auxiliary power is not connected to the GPU board</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>55</td><td>Unused</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>56</td><td>Display Engine error</td><td>X</td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>57</td><td>Error programming video memory interface</td><td>X</td><td>X</td><td></td><td></td><td></td><td></td><td>X</td></tr><tr><td>58</td><td>Unstable video memory interface 
detected</td><td>X</td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td></td><td>EDC error – clarified in printout</td><td>X</td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>59</td><td>Internal micro-controller error(older drivers)</td><td></td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>60</td><td>Video processor exception</td><td></td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>61</td><td>Internal micro-controller breakpoint/warning(newer drivers)</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>62</td><td>Internal micro-controller halt(newer drivers)</td><td>X</td><td>X</td><td></td><td></td><td></td><td>X</td><td></td></tr><tr><td>63</td><td>ECC page retirement recording event</td><td>X</td><td>X</td><td></td><td></td><td></td><td></td><td>X</td></tr><tr><td>64</td><td>ECC page retirement recording failure</td><td>X</td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>65</td><td>Video processor exception</td><td>X</td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>66</td><td>Illegal access by driver</td><td></td><td>X</td><td>X</td><td></td><td></td><td></td><td></td></tr><tr><td>67</td><td>Illegal access by driver</td><td></td><td>X</td><td>X</td><td></td><td></td><td></td><td></td></tr><tr><td>68</td><td>Video processor exception</td><td>X</td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>69</td><td>Graphics Engine class error</td><td>X</td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>70</td><td>CE3: Unknown Error</td><td>X</td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>71</td><td>CE4: Unknown Error</td><td>X</td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>72</td><td>CE5: Unknown Error</td><td>X</td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>73</td><td>NVENC2 
Error</td><td>X</td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>74</td><td>NVLINK Error</td><td>X</td><td>X</td><td></td><td></td><td>X</td><td></td><td></td></tr><tr><td>75</td><td>Reserved</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>76</td><td>Reserved</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>77</td><td>Reserved</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>78</td><td>vGPU Start Error</td><td></td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>79</td><td>GPU has fallen off the bus</td><td>X</td><td>X</td><td></td><td>X</td><td>X</td><td>X</td><td></td></tr><tr><td>80</td><td>Corrupted data sent to GPU</td><td>X</td><td>X</td><td></td><td>X</td><td>X</td><td></td><td>X</td></tr><tr><td>81</td><td>VGA Subsystem Error</td><td>X</td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>82</td><td>Reserved</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>83</td><td>Reserved</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>84</td><td>Reserved</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>85</td><td>Reserved</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>86</td><td>Reserved</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>87</td><td>Reserved</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>88</td><td>Reserved</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>89</td><td>Reserved</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>90</td><td>Reserved</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>91</td><td>Reserved</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr><tr><td>92</td><td>High single-bit ECC error 
rate</td><td>X</td><td>X</td><td></td><td></td><td></td><td></td><td></td></tr></tbody></table></div><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>References</h2><ul><li><a href="https://docs.nvidia.com/deploy/xid-errors/index.html" target="_blank" rel="external nofollow noopener noreferrer">NVIDIA XID Errors</a></li></ul>]]></content>
    
    <summary type="html">
    
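      &lt;p&gt;An &lt;code&gt;Xid Message&lt;/code&gt; is an error report from the NVIDIA driver that is written to the operating system's kernel log or event log. An Xid message indicates that a general GPU error occurred, most often because the driver programmed the GPU incorrectly or because the commands sent to the GPU were corrupted. The message may point to a hardware problem, an NVIDIA software problem, or a user application problem.&lt;/p&gt;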
    
    </summary>
    
    
      <category term="术业专攻" scheme="https://houmin.cc/categories/%E6%9C%AF%E4%B8%9A%E4%B8%93%E6%94%BB/"/>
    
    
      <category term="GPU" scheme="https://houmin.cc/tags/GPU/"/>
    
      <category term="NVIDIA" scheme="https://houmin.cc/tags/NVIDIA/"/>
    
      <category term="XID" scheme="https://houmin.cc/tags/XID/"/>
    
  </entry>
  
  <entry>
    <title>[System Monitoring] GPU Monitoring</title>
    <link href="https://houmin.cc/posts/b4058e1b/"/>
    <id>https://houmin.cc/posts/b4058e1b/</id>
    <published>2021-01-06T03:01:35.000Z</published>
    <updated>2022-11-09T15:13:45.393Z</updated>
    
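    <content type="html"><![CDATA[<link rel="stylesheet" class="aplayer-secondary-style-marker" href="/assets/css/APlayer.min.css"><script src="/assets/js/APlayer.min.js" class="aplayer-secondary-script-marker"></script><script class="meting-secondary-script-marker" src="/assets/js/Meting.min.js"></script><h2 id="问题背景"><a href="#问题背景" class="headerlink" title="问题背景"></a>Background</h2><p>When GPUs are used for deep learning training and inference, we need visibility into how the GPUs in the cluster are being used:</p><ul><li>GPU resource usage tells us whether new applications can still be deployed and whether the cluster needs to be scaled out, so that GPU services get the same capacity guarantees as CPU services and the GPU gap in capacity planning is closed</li><li>GPU resource usage also lets us analyze bottlenecks and weak points, drive optimization, and improve resource utilization and service performance</li></ul><a id="more"></a><p>NVIDIA offers the following three ways to obtain GPU monitoring data:</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-01-06_nvidia-managing-tools.png"></p><ul><li><a href="https://developer.nvidia.com/nvidia-management-library-nvml" target="_blank" rel="external nofollow noopener noreferrer">NVML</a>: NVIDIA Management Library, a C-based library for monitoring and managing GPUs; the <code>nvidia-smi</code> command is built on top of it</li><li><a href="https://developer.nvidia.com/dcgm" target="_blank" rel="external nofollow noopener noreferrer">DCGM</a>: Data Center GPU Manager, a complete suite of GPU monitoring and management tools built on NVML and CUDA</li><li>Third-party tools: monitoring tools built on DCGM or NVML, which can be combined with tools such as Prometheus to provide a database, a UI, and more</li></ul><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-01-06_nvidia-dcgm.png"></p><p>The three options compare as follows:</p><ul><li>NVML<ul><li>Stateless queries; only current data can be queried</li><li>A low-level API for controlling GPUs</li><li>Management tools built on NVML are cheap to run but expensive to develop</li><li>Management tools built on NVML must run on the same node as the GPU</li></ul></li><li>DCGM<ul><li>Can query several hours of metric history</li><li>Provides GPU health checks and diagnostics</li><li>Can query a group of GPUs in a single batch</li><li>Can run in either of two modes, <code>remote/local</code></li></ul></li><li>Third-party tools<ul><li>Provide databases, graphs, and polished UIs</li></ul></li></ul><p>The rest of this article focuses on DCGM.</p><h2 id="DCGM"><a href="#DCGM" class="headerlink" title="DCGM"></a>DCGM</h2><p>The figure below shows how <code>DCGM</code> runs in a cluster: <code>DCGM</code> is deployed as an agent on the compute nodes, and tools on the management node manage and monitor GPUs through the API that <code>DCGM</code> exposes.</p><p><img alt 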
data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-01-06_nvidia-dcgm-icon.png"></p><p><code>DCGM</code> provides the following four key features:</p><ul><li><strong>Active Health Monitoring</strong></li><li><strong>GPU Diagnostics</strong></li><li><strong>Policy and Alerting</strong></li><li><strong>Configuration Management</strong></li></ul><h3 id="安装部署"><a href="#安装部署" class="headerlink" title="安装部署"></a>Installation and Deployment</h3><p>DCGM has to be downloaded and installed separately. Download the matching package from the <a href="https://developer.nvidia.com/dcgm#Downloads" target="_blank" rel="external nofollow noopener noreferrer">NVIDIA</a> website (the rpm package is used here), then run:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Remove any previously installed DCGM version</span></span><br><span class="line">$ yum remove datacenter-gpu-manager</span><br><span class="line"><span class="comment"># Install</span></span><br><span class="line">$ rpm -ivh datacenter-gpu-manager-2.0.13-1-x86_64.rpm</span><br></pre></td></tr></table></figure><ul><li>DCGM's shared libraries are installed under <code>/usr/lib64</code></li><li>The Python bindings are installed under <code>/usr/local/dcgm/bindings</code></li></ul><p>DCGM is designed for cluster management, so before using it you first need to start an agent, <code>nv-hostengine</code>, on the target machine. The commands are as follows:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span 
class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># 启动 nv-hostengine</span></span><br><span class="line">$ nv-hostengine --port 39999 --<span class="built_in">bind</span>-interface 127.0.0.1</span><br><span class="line">Host Engine Listener Started</span><br><span class="line">Started host engine version 2.0.13 using port number: 39999</span><br><span class="line"></span><br><span class="line"><span class="comment"># 查看设备列表</span></span><br><span class="line">$ dcgmi discovery --host 127.0.0.1:39999 -l</span><br><span class="line">4 GPUs found.</span><br><span class="line">+--------+----------------------------------------------------------------------+</span><br><span class="line">| GPU ID | Device Information                                                   |</span><br><span class="line">+--------+----------------------------------------------------------------------+</span><br><span class="line">| 0      | Name: Tesla T4                                                       |</span><br><span class="line">|        | PCI Bus ID: 00000000:00:08.0                                         |</span><br><span class="line">|        | Device UUID: GPU-0bf43c76-0f1a-f49f-a362-92d5b9bbbc9f                |</span><br><span class="line">+--------+----------------------------------------------------------------------+</span><br><span class="line">| 1      | Name: Tesla T4                        
                               |</span><br><span class="line">|        | PCI Bus ID: 00000000:00:09.0                                         |</span><br><span class="line">|        | Device UUID: GPU-c55a4e5e-47dd-48c2-99d0-2630042bf619                |</span><br><span class="line">+--------+----------------------------------------------------------------------+</span><br><span class="line">| 2      | Name: Tesla T4                                                       |</span><br><span class="line">|        | PCI Bus ID: 00000000:00:0A.0                                         |</span><br><span class="line">|        | Device UUID: GPU-95e5fe58-f03d-815e-c871-65637b623aca                |</span><br><span class="line">+--------+----------------------------------------------------------------------+</span><br><span class="line">| 3      | Name: Tesla T4                                                       |</span><br><span class="line">|        | PCI Bus ID: 00000000:00:0B.0                                         |</span><br><span class="line">|        | Device UUID: GPU-70747d1b-2b7a-9895-29f5-485608c1742e                |</span><br><span class="line">+--------+----------------------------------------------------------------------+</span><br><span class="line">0 NvSwitches found.</span><br><span class="line">+-----------+</span><br><span class="line">| Switch ID |</span><br><span class="line">+-----------+</span><br><span class="line">+-----------+</span><br><span class="line"></span><br><span class="line"><span class="comment"># Stop nv-hostengine (for demonstration only; it must be running again for the steps that follow)</span></span><br><span class="line">$ nv-hostengine -t</span><br><span class="line">Host engine successfully terminated.</span><br></pre></td></tr></table></figure><p>Here, <code>--port</code> and <code>--bind-interface</code> set the listening port and the IP address to bind to, respectively. Communication over a <code>UNIX_SOCKET</code> is also supported.</p><p>Once <code>nv-hostengine</code> is running, we can work with it through <code>dcgmi</code>.</p><h3 id="组操作"><a href="#组操作" class="headerlink" 
title="组操作"></a>Group Operations</h3><p>Unlike NVML, most DCGM features operate on groups, so before using DCGM you first need to create a group; only then can you use the features DCGM provides.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span 
class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># After listing the devices, create a group with the following command</span></span><br><span class="line"><span class="comment"># On success the command prints the new group's ID, which later operations need; here the group ID is 2</span></span><br><span class="line">$ dcgmi group --host 127.0.0.1:39999 -c GPU_GROUP</span><br><span class="line">Successfully created group <span class="string">"GPU_GROUP"</span> with a group ID of 2</span><br><span class="line"></span><br><span class="line">$ dcgmi group --host 127.0.0.1:39999 -l</span><br><span class="line">+-------------------+----------------------------------------------------------+</span><br><span class="line">| GROUPS                                                                       |</span><br><span class="line">| 1 group found.                                                               |</span><br><span class="line">+===================+==========================================================+</span><br><span class="line">| Groups            |                                                          |</span><br><span class="line">| -&gt; 2              |                                                          |</span><br><span class="line">|    -&gt; Group ID    | 2                                                        |</span><br><span class="line">|    -&gt; Group Name  | GPU_GROUP                                                |</span><br><span class="line">|    -&gt; Entities    | None                                                     |</span><br><span class="line">+-------------------+----------------------------------------------------------+</span><br><span class="line"></span><br><span class="line">$ dcgmi discovery --host 127.0.0.1:39999 -l </span><br><span class="line">4 GPUs found.</span><br><span 
class="line">+--------+----------------------------------------------------------------------+</span><br><span class="line">| GPU ID | Device Information                                                   |</span><br><span class="line">+--------+----------------------------------------------------------------------+</span><br><span class="line">| 0      | Name: Tesla T4                                                       |</span><br><span class="line">|        | PCI Bus ID: 00000000:00:08.0                                         |</span><br><span class="line">|        | Device UUID: GPU-0bf43c76-0f1a-f49f-a362-92d5b9bbbc9f                |</span><br><span class="line">+--------+----------------------------------------------------------------------+</span><br><span class="line">| 1      | Name: Tesla T4                                                       |</span><br><span class="line">|        | PCI Bus ID: 00000000:00:09.0                                         |</span><br><span class="line">|        | Device UUID: GPU-c55a4e5e-47dd-48c2-99d0-2630042bf619                |</span><br><span class="line">+--------+----------------------------------------------------------------------+</span><br><span class="line">| 2      | Name: Tesla T4                                                       |</span><br><span class="line">|        | PCI Bus ID: 00000000:00:0A.0                                         |</span><br><span class="line">|        | Device UUID: GPU-95e5fe58-f03d-815e-c871-65637b623aca                |</span><br><span class="line">+--------+----------------------------------------------------------------------+</span><br><span class="line">| 3      | Name: Tesla T4                                                       |</span><br><span class="line">|        | PCI Bus ID: 00000000:00:0B.0                                         |</span><br><span class="line">|        | Device UUID: GPU-70747d1b-2b7a-9895-29f5-485608c1742e                |</span><br><span 
class="line">+--------+----------------------------------------------------------------------+</span><br><span class="line">0 NvSwitches found.</span><br><span class="line">+-----------+</span><br><span class="line">| Switch ID |</span><br><span class="line">+-----------+</span><br><span class="line">+-----------+</span><br><span class="line"></span><br><span class="line"><span class="comment"># After creating the group, add devices to it with the following command</span></span><br><span class="line">$ dcgmi group --host 127.0.0.1:39999 -g 2 -a 0,1</span><br><span class="line">Add to group operation successful.</span><br><span class="line"></span><br><span class="line">$ dcgmi group --host 127.0.0.1:39999 -g 2 -i</span><br><span class="line">+-------------------+----------------------------------------------------------+</span><br><span class="line">| GROUP INFO                                                                   |</span><br><span class="line">+===================+==========================================================+</span><br><span class="line">| 2                 |                                                          |</span><br><span class="line">| -&gt; Group ID       | 2                                                        |</span><br><span class="line">| -&gt; Group Name     | GPU_GROUP                                                |</span><br><span class="line">| -&gt; Entities       | GPU 0, GPU 1                                             |</span><br><span class="line">+-------------------+----------------------------------------------------------+</span><br><span class="line"></span><br><span class="line"><span class="comment"># Remove devices from the group with the following command</span></span><br><span class="line">$ dcgmi group --host 127.0.0.1:39999 -g 2 -r 0,1</span><br><span class="line">Remove from group operation successful.</span><br><span class="line"></span><br><span class="line"><span class="comment"># Delete the group with the following command</span></span><br><span class="line">$ dcgmi group --host 127.0.0.1:39999 -d 
2</span><br></pre></td></tr></table></figure><blockquote><p>Note: the relationship between groups and devices is many-to-many.</p></blockquote><h3 id="Job-Statistics"><a href="#Job-Statistics" class="headerlink" title="Job Statistics"></a>Job Statistics</h3><p>When a job uses GPUs to accelerate its computation, we want to know:</p><ul><li>Which GPUs did my job run on?</li><li>How much GPU did my job use?</li><li>Were there any errors or warnings while my job was running?</li><li>Are all GPUs in the system healthy and ready for the next job?</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span 
class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Group 3 currently looks like this</span></span><br><span class="line">$ dcgmi group --host 127.0.0.1:39999 -g 3 -i</span><br><span class="line">+-------------------+----------------------------------------------------------+</span><br><span class="line">| GROUP INFO                                                                   |</span><br><span class="line">+===================+==========================================================+</span><br><span class="line">| 3                 |                                                          |</span><br><span class="line">| -&gt; Group ID       | 3                                                        |</span><br><span class="line">| -&gt; Group Name     | GPU_GROUP                                                |</span><br><span class="line">| -&gt; Entities       | GPU 0, GPU 1, GPU 2, GPU 3                               |</span><br><span class="line">+-------------------+----------------------------------------------------------+</span><br><span class="line"></span><br><span class="line"><span class="comment"># Before dcgmi can report GPU statistics, stats collection must be enabled with the following command</span></span><br><span class="line">$ dcgmi stats --host 127.0.0.1:39999 -g 3 --<span class="built_in">enable</span></span><br><span class="line">Successfully started process watches.</span><br><span class="line"></span><br><span class="line"><span class="comment"># Once stats collection is enabled, the following command shows the statistics for a specific process</span></span><br><span class="line"><span class="comment"># This assumes a CUDA application process has been started and is using the GPU</span></span><br><span class="line">$ dcgmi stats --host 127.0.0.1:39999 -g 3 -p 41861 -v</span><br><span class="line">Successfully retrieved process 
info <span class="keyword">for</span> PID: 41861. Process ran on 1 GPUs.</span><br><span class="line">+------------------------------------------------------------------------------+</span><br><span class="line">| GPU ID: 3                                                                    |</span><br><span class="line">+====================================+=========================================+</span><br><span class="line">|-----  Execution Stats  ------------+-----------------------------------------|</span><br><span class="line">| Start Time                     *   | Wed Jan  6 16:54:16 2021                |</span><br><span class="line">| End Time                       *   | Still Running                           |</span><br><span class="line">| Total Execution Time (sec)     *   | Still Running                           |</span><br><span class="line">| No. of Conflicting Processes   *   | 0                                       |</span><br><span class="line">+-----  Performance Stats  ----------+-----------------------------------------+</span><br><span class="line">| Energy Consumed (Joules)           | 2985                                    |</span><br><span class="line">| Max GPU Memory Used (bytes)    *   | 12107907072                             |</span><br><span class="line">| SM Clock (MHz)                     | Avg: 1590, Max: 1590, Min: 1590         |</span><br><span class="line">| Memory Clock (MHz)                 | Avg: 5000, Max: 5000, Min: 5000         |</span><br><span class="line">| SM Utilization (%)                 | Avg: 100, Max: 100, Min: 100            |</span><br><span class="line">| Memory Utilization (%)             | Avg: 5, Max: 5, Min: 5                  |</span><br><span class="line">| PCIe Rx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |</span><br><span class="line">| PCIe Tx Bandwidth (megabytes)      | Avg: N/A, Max: N/A, Min: N/A            |</span><br><span class="line">+-----  Event Stats  
----------------+-----------------------------------------+</span><br><span class="line">| Double Bit ECC Errors              | 0                                       |</span><br><span class="line">| PCIe Replay Warnings               | 0                                       |</span><br><span class="line">| Critical XID Errors                | 0                                       |</span><br><span class="line">+-----  Slowdown Stats  -------------+-----------------------------------------+</span><br><span class="line">| Due to - Power (%)                 | 0                                       |</span><br><span class="line">|        - Thermal (%)               | 0                                       |</span><br><span class="line">|        - Reliability (%)           | 0                                       |</span><br><span class="line">|        - Board Limit (%)           | 0                                       |</span><br><span class="line">|        - Low Utilization (%)       | 0                                       |</span><br><span class="line">|        - Sync Boost (%)            | 0                                       |</span><br><span class="line">+-----  Process Utilization  --------+-----------------------------------------+</span><br><span class="line">| PID                                | 41861                                   |</span><br><span class="line">|     Avg SM Utilization (%)         | 99                                      |</span><br><span class="line">|     Avg Memory Utilization (%)     | 3                                       |</span><br><span class="line">+-----  Overall Health  -------------+-----------------------------------------+</span><br><span class="line">| Overall Health                     | Healthy                                 |</span><br><span class="line">+------------------------------------+-----------------------------------------+</span><br><span class="line"></span><br><span class="line">(*) 
Represents a process statistic. Otherwise device statistic during</span><br><span class="line">    process lifetime listed.</span><br></pre></td></tr></table></figure><h3 id="Configuration-Managerment"><a href="#Configuration-Managerment" class="headerlink" title="Configuration Managerment"></a>Configuration Management</h3><p>DCGM can change GPU settings. The supported settings appear as the fields in the output below; first, view the current settings:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line">$ dcgmi config  --host 127.0.0.1:39999 -g 3 --get</span><br><span class="line">+------------------------------+------------------------------+------------------------------+</span><br><span class="line">| GPU_GROUP                                                                                  |</span><br><span class="line">| Group of 4 GPUs                                                                            |</span><br><span class="line">+==============================+==============================+==============================+</span><br><span class="line">| Field                        | Target                       | Current                      |</span><br><span class="line">+------------------------------+------------------------------+------------------------------+</span><br><span class="line">| Compute Mode                 | Not Specified                | Unrestricted                 |</span><br><span class="line">| ECC Mode                     | Not Specified                | Enabled                      |</span><br><span class="line">| 
Sync Boost                   | Not Specified                | Not Supported                |</span><br><span class="line">| Memory Application Clock     | Not Specified                | 5001                         |</span><br><span class="line">| SM Application Clock         | Not Specified                | 585                          |</span><br><span class="line">| Power Limit                  | Not Specified                | 70                           |</span><br><span class="line">+------------------------------+------------------------------+------------------------------+</span><br></pre></td></tr></table></figure><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span 
class="line">41</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># 具体参数说明</span></span><br><span class="line">$ dcgmi config -h</span><br><span class="line"></span><br><span class="line"> config -- Used to configure settings <span class="keyword">for</span> groups of GPUs.</span><br><span class="line"></span><br><span class="line">Usage: dcgmi config</span><br><span class="line">   dcgmi config [--host &lt;IP/FQDN&gt;] [-g &lt;groupId&gt;] --enforce</span><br><span class="line">   dcgmi config [--host &lt;IP/FQDN&gt;] [-g &lt;groupId&gt;] --get [-v] [-j]</span><br><span class="line">   dcgmi config [--host &lt;IP/FQDN&gt;] [-g &lt;groupId&gt;] --<span class="built_in">set</span> [-e &lt;0/1&gt;] [-s</span><br><span class="line">        &lt;0/1&gt;] [-a &lt;mem,proc&gt;] [-P &lt;<span class="built_in">limit</span>&gt;] [-c &lt;mode&gt;]</span><br><span class="line">  </span><br><span class="line">  ...</span><br><span class="line">  -c  --compmode   mode       Configure Compute Mode. Can be any of the</span><br><span class="line">                               following:</span><br><span class="line">                               0 - Unrestricted</span><br><span class="line">                               1 - Prohibited</span><br><span class="line">                               2 - Exclusive Process</span><br><span class="line">  -P  --powerlimit <span class="built_in">limit</span>      Configure Power Limit (Watts).</span><br><span class="line">  -a  --appclocks  mem,proc   Configure Application Clocks. Must use memory,proc</span><br><span class="line">                               clocks (csv) format(MHz).</span><br><span class="line">  -s  --syncboost  0/1        Configure Syncboost. (1 to Enable, 0 to Disable)</span><br><span class="line">  -e  --eccmode    0/1        Configure Ecc mode. 
(1 to Enable, 0 to Disable)</span><br><span class="line">  </span><br><span class="line"><span class="comment"># Change a setting</span></span><br><span class="line">$ dcgmi config  --host 127.0.0.1:39999 -g 3 --<span class="built_in">set</span> -c 2</span><br><span class="line"></span><br><span class="line"><span class="comment"># Query the result</span></span><br><span class="line">$ dcgmi config  --host 127.0.0.1:39999 -g 3 --get</span><br><span class="line">+------------------------------+------------------------------+------------------------------+</span><br><span class="line">| GPU_GROUP                                                                                  |</span><br><span class="line">| Group of 4 GPUs                                                                            |</span><br><span class="line">+==============================+==============================+==============================+</span><br><span class="line">| Field                        | Target                       | Current                      |</span><br><span class="line">+------------------------------+------------------------------+------------------------------+</span><br><span class="line">| Compute Mode                 | E. Process                   | E. 
Process                   |</span><br><span class="line">| ECC Mode                     | Not Specified                | Enabled                      |</span><br><span class="line">| Sync Boost                   | Not Specified                | Not Supported                |</span><br><span class="line">| Memory Application Clock     | Not Specified                | 5001                         |</span><br><span class="line">| SM Application Clock         | Not Specified                | 585                          |</span><br><span class="line">| Power Limit                  | Not Specified                | 70                           |</span><br><span class="line">+------------------------------+------------------------------+------------------------------+</span><br></pre></td></tr></table></figure><blockquote><p>Note: DCGM changes settings declaratively. The user specifies the desired target settings through dcgmi, and nv-hostengine automatically adjusts the GPUs so that the current settings match the target settings.</p></blockquote><h3 id="Policy-and-Alerting"><a href="#Policy-and-Alerting" class="headerlink" title="Policy and Alerting"></a>Policy and Alerting</h3><p>DCGM provides a policy feature. A policy is essentially a watch mechanism: you first define a <code>violation</code> condition and can then configure how to respond when that condition is violated. Typically, you set a condition, register a listener, and wait for DCGM to notify you.</p><p><img alt data-src="https://developer-blogs.nvidia.com/wp-content/uploads/2016/08/image01-300x125.png"></p><p>For example:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span 
class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># 通过如下命令设置最大温度50度的条件</span></span><br><span class="line">$ dcgmi policy --host 127.0.0.1:39999 -g 3 --<span class="built_in">set</span> 0,0 -T 50</span><br><span class="line"></span><br><span class="line"><span class="comment"># 设置后的policy，通过如下命令查询</span></span><br><span class="line">$ dcgmi policy --host 127.0.0.1:39999 -g 2 --get</span><br><span class="line">Policy information</span><br><span class="line">+-----------------------------+------------------------------------------------+</span><br><span class="line">| Policy Information                                                           |</span><br><span class="line">| GPU_GROUP                                                                    |</span><br><span class="line">+=============================+================================================+</span><br><span class="line">| Violation conditions        | Max temperature threshold - 50                 |</span><br><span class="line">| Isolation mode              | Manual                                         |</span><br><span class="line">| Action on violation         | None                                           |</span><br><span class="line">| Validation after action     | None                                           |</span><br><span class="line">| Validation failure action   | None                                           |</span><br><span class="line">+-----------------------------+------------------------------------------------+</span><br><span class="line"></span><br><span class="line">$ dcgmi policy --host 127.0.0.1:39999 -g 2 --reg</span><br><span class="line">Timestamp: Wed Jan  6 17:02:27 2021</span><br><span class="line">The maximum thermal 
<span class="built_in">limit</span> has violated policy manager values.</span><br><span class="line">Temperature: 65</span><br><span class="line">Listening <span class="keyword">for</span> violations.</span><br><span class="line">Timestamp: Wed Jan  6 17:02:37 2021</span><br><span class="line">The maximum thermal <span class="built_in">limit</span> has violated policy manager values.</span><br><span class="line">Temperature: 65</span><br><span class="line">...</span><br></pre></td></tr></table></figure><p>参数设置</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line"> --<span class="built_in">set</span>        actn,val   (OR required)  Set the current violation policy.</span><br><span class="line">                             Use csv action,validation (ie. 
1,2)</span><br><span class="line">                             -----</span><br><span class="line">                             Action to take when any of the violations</span><br><span class="line">                             specified occur.</span><br><span class="line">                             0 - None</span><br><span class="line">                             1 - GPU Reset</span><br><span class="line">                             -----</span><br><span class="line">                             Validation to take after the violation action has</span><br><span class="line">                             been performed.</span><br><span class="line">                             0 - None</span><br><span class="line">                             1 - System Validation (short)</span><br><span class="line">                             2 - System Validation (medium)</span><br><span class="line">                             3 - System Validation (long)</span><br><span class="line">-x  --xiderrors             Add XID errors to the policy conditions.</span><br><span class="line">-n  --nvlinkerrors           Add NVLink errors to the policy conditions.</span><br><span class="line">-p  --pcierrors             Add PCIe replay errors to the policy conditions.</span><br><span class="line">-e  --eccerrors             Add ECC double bit errors to the policy</span><br><span class="line">                             conditions.</span><br><span class="line">-P  --maxpower   max        Specify the maximum power a group<span class="string">'s GPUs can reach</span></span><br><span class="line"><span class="string">                             before triggering a violation.</span></span><br><span class="line"><span class="string">-T  --maxtemp    max        Specify the maximum temperature a group'</span>s GPUs can</span><br><span class="line">                             reach before triggering a violation.</span><br><span class="line">-M  --maxpages   max        Specify the maximum 
number of retired pages that</span><br><span class="line">                             will trigger a violation.</span><br></pre></td></tr></table></figure><h3 id="Health-check"><a href="#Health-check" class="headerlink" title="Health check"></a>Health check</h3><p><code>DCGM</code> health checks are non-invasive and provide real-time monitoring and aggregated health data. They work as follows:</p><ol><li>Enable health checks and select the items to watch.</li><li>DCGM runs in the background and monitors the state of the selected components.</li><li>The user queries the errors found so far with the <code>dcgmi health</code> command.</li></ol><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line">$ dcgmi health --check -g 1</span><br><span class="line">Health Monitor Report</span><br><span class="line">+------------------+---------------------------------------------------------+</span><br><span class="line">| Overall Health:   Healthy                                                  |</span><br><span class="line">+==================+=========================================================+</span><br><span class="line"></span><br><span class="line">$ dcgmi health --check -g 1 </span><br><span class="line">Health Monitor Report</span><br><span class="line">+----------------------------------------------------------------------+</span><br><span class="line">| Group 1       | Overall Health: Warning                              |</span><br><span 
class="line">+==================+===================================================+</span><br><span class="line">| GPU ID: 0     | Warning                                              |</span><br><span class="line">|               | PCIe system: Warning - Detected more than 8 PCIe     |</span><br><span class="line">|               | replays per minute <span class="keyword">for</span> GPU 0: 13                     |</span><br><span class="line">+---------------+------------------------------------------------------+</span><br><span class="line">| GPU ID: 1     | Warning                                              |</span><br><span class="line">|               | InfoROM system: Warning - A corrupt InfoROM has been |</span><br><span class="line">|               | detected <span class="keyword">in</span> GPU 1.                                   |</span><br><span class="line">+---------------+------------------------------------------------------+</span><br></pre></td></tr></table></figure><h3 id="GPU-Diagnostics"><a href="#GPU-Diagnostics" class="headerlink" title="GPU Diagnostics"></a>GPU Diagnostics</h3><p>Diagnostics are active checks. Three levels are available, and each run executes the test programs for the selected level to uncover problems.</p><p>Run the diagnostics as follows:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line">$ dcgmi diag --host 127.0.0.1:39999 -g 3 -r 1</span><br><span class="line">Successfully ran diagnostic <span class="keyword">for</span> group.</span><br><span 
class="line">+---------------------------+------------------------------------------------+</span><br><span class="line">| Diagnostic                | Result                                         |</span><br><span class="line">+===========================+================================================+</span><br><span class="line">|-----  Deployment  --------+------------------------------------------------|</span><br><span class="line">| Blacklist                 | Pass                                           |</span><br><span class="line">| NVML Library              | Pass                                           |</span><br><span class="line">| CUDA Main Library         | Pass                                           |</span><br><span class="line">| Permissions and OS Blocks | Pass                                           |</span><br><span class="line">| Persistence Mode          | Pass                                           |</span><br><span class="line">| Environment Variables     | Pass                                           |</span><br><span class="line">| Page Retirement           | Pass                                           |</span><br><span class="line">| Graphics Processes        | Pass                                           |</span><br><span class="line">| Inforom                   | Pass                                           |</span><br><span class="line">+---------------------------+------------------------------------------------+</span><br></pre></td></tr></table></figure><h3 id="Profile"><a href="#Profile" class="headerlink" title="Profile"></a>Profile</h3><p>The profiling feature collects GPU utilization data and per-process performance data with low overhead. It imposes some hard requirements on the driver version and the GPU model:</p><ol><li>DCGM version later than 1.7</li><li>Driver version later than 418.43</li><li>nv-hostengine started as root</li><li>Currently only Tesla V100 and Tesla T4 GPUs are supported</li></ol><p>The following performance metrics are available:</p><div class="table-container"><table><thead><tr><th style="text-align:left">Metric</th><th style="text-align:left">Description</th><th 
style="text-align:left">FIELD_NAME</th></tr></thead><tbody><tr><td style="text-align:left">Graphics Engine Activity</td><td style="text-align:left">Ratio of time the graphics engine is active. The graphics engine is active if a graphics/compute context is bound and the graphics pipe or compute pipe is busy.</td><td style="text-align:left">PROF_GR_ENGINE_ACTIVE (ID: 1001)</td></tr><tr><td style="text-align:left">SM Activity</td><td style="text-align:left">The ratio of cycles an SM has at least 1 warp assigned (computed from the number of cycles and elapsed cycles)</td><td style="text-align:left">PROF_SM_ACTIVE (ID: 1002)</td></tr><tr><td style="text-align:left">SM Occupancy</td><td style="text-align:left">The ratio of number of warps resident on an SM. (number of resident warps as a percentage of the theoretical maximum number of warps per elapsed cycle)</td><td style="text-align:left">PROF_SM_OCCUPANCY (ID: 1003)</td></tr><tr><td style="text-align:left">Tensor Activity</td><td style="text-align:left">The ratio of cycles the tensor (HMMA) pipe is active (off the peak sustained elapsed cycles)</td><td style="text-align:left">PROF_PIPE_TENSOR_ACTIVE (ID: 1004)</td></tr><tr><td style="text-align:left">Memory BW Utilization</td><td style="text-align:left">The ratio of cycles the device memory interface is active sending or receiving data.</td><td style="text-align:left">PROF_DRAM_ACTIVE (ID: 1005)</td></tr><tr><td style="text-align:left">Engine Activity</td><td style="text-align:left">Ratio of cycles the fp64 /fp32 / fp16 / HMMA / IMMA pipes are active.</td><td style="text-align:left">PROF_PIPE_FPXY_ACTIVE (ID: 1006 (FP64); 1007 (FP32); 1008 (FP16))</td></tr><tr><td style="text-align:left">NVLink Activity</td><td style="text-align:left">The number of bytes of active NVLink rx or tx data including both header and payload.</td><td style="text-align:left">DEV_NVLINK_BANDWIDTH_L0</td></tr><tr><td style="text-align:left">PCIe Bandwidth pci__bytes_{rx, tx}</td><td 
style="text-align:left">The number of bytes of active PCIe rx or tx data including both header and payload.</td><td style="text-align:left">PROF_PCIE_[TR]X_BYTES (ID: 1009 (TX); 1010 (RX))</td></tr></tbody></table></div><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-01-06_dcgm-modularity.png"></p><h2 id="在-k8s-中集成-GPU-Telemetry"><a href="#在-k8s-中集成-GPU-Telemetry" class="headerlink" title="在 k8s 中集成 GPU Telemetry"></a>Integrating GPU Telemetry in k8s</h2><p>System monitoring usually requires the following components:</p><ul><li>Data collection: a collector that acts as the data source</li><li>Time-series database: stores the collected metrics</li><li>Visualization: presents the collected data in a user-friendly interface</li></ul><p>Prometheus is a widely used monitoring solution in the cloud-native era. Combined with components such as Grafana and Alertmanager, it provides system monitoring for k8s clusters. Its component architecture is shown below; for more details, see <a href="https://houmin.cc/posts/18c039ab/">another post of mine</a>.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-16_prometheus-architecture.png"></p><p>Similarly, to collect GPU monitoring data, NVIDIA released <a href="https://github.com/NVIDIA/gpu-monitoring-tools" target="_blank" rel="external nofollow noopener noreferrer"><code>dcgm-exporter</code></a>. It wraps <code>DCGM</code> and, like <code>node-exporter</code>, exposes GPU data to Prometheus:</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-01-06_nvidia-gpu-telemetry.png"></p><h3 id="部署-dcgm-exporter"><a href="#部署-dcgm-exporter" class="headerlink" title="部署 dcgm-exporter"></a>Deploying dcgm-exporter</h3><p><code>dcgm-exporter</code> runs as a <code>DaemonSet</code> on every Node that has GPUs. A <code>Service</code> is created alongside it so that Prometheus can scrape the data it collects.</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span 
class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">apps/v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">DaemonSet</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">namespace:</span> <span class="string">kube-system</span></span><br><span class="line">  <span class="attr">name:</span> <span 
class="string">"dcgm-exporter"</span></span><br><span class="line">  <span class="attr">labels:</span></span><br><span class="line">    <span class="attr">app.kubernetes.io/name:</span> <span class="string">"dcgm-exporter"</span></span><br><span class="line">    <span class="attr">app.kubernetes.io/version:</span> <span class="string">"2.1.1"</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">updateStrategy:</span></span><br><span class="line">    <span class="attr">type:</span> <span class="string">RollingUpdate</span></span><br><span class="line">  <span class="attr">selector:</span></span><br><span class="line">    <span class="attr">matchLabels:</span></span><br><span class="line">      <span class="attr">app.kubernetes.io/name:</span> <span class="string">"dcgm-exporter"</span></span><br><span class="line">      <span class="attr">app.kubernetes.io/version:</span> <span class="string">"2.1.1"</span></span><br><span class="line">  <span class="attr">template:</span></span><br><span class="line">    <span class="attr">metadata:</span></span><br><span class="line">      <span class="attr">labels:</span></span><br><span class="line">        <span class="attr">app.kubernetes.io/name:</span> <span class="string">"dcgm-exporter"</span></span><br><span class="line">        <span class="attr">app.kubernetes.io/version:</span> <span class="string">"2.1.1"</span></span><br><span class="line">      <span class="attr">name:</span> <span class="string">"dcgm-exporter"</span></span><br><span class="line">    <span class="attr">spec:</span></span><br><span class="line">      <span class="attr">containers:</span></span><br><span class="line">      <span class="bullet">-</span> <span class="attr">image:</span> <span class="string">"nvidia/dcgm-exporter:2.0.13-2.1.1-ubuntu18.04"</span></span><br><span class="line">        <span class="attr">env:</span></span><br><span class="line">        <span 
class="bullet">-</span> <span class="attr">name:</span> <span class="string">"DCGM_EXPORTER_LISTEN"</span></span><br><span class="line">          <span class="attr">value:</span> <span class="string">":9400"</span></span><br><span class="line">        <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">"DCGM_EXPORTER_KUBERNETES"</span></span><br><span class="line">          <span class="attr">value:</span> <span class="string">"true"</span></span><br><span class="line">        <span class="attr">name:</span> <span class="string">"dcgm-exporter"</span></span><br><span class="line">        <span class="attr">ports:</span></span><br><span class="line">        <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">"metrics"</span></span><br><span class="line">          <span class="attr">containerPort:</span> <span class="number">9400</span></span><br><span class="line">        <span class="attr">securityContext:</span></span><br><span class="line">          <span class="attr">runAsNonRoot:</span> <span class="literal">false</span></span><br><span class="line">          <span class="attr">runAsUser:</span> <span class="number">0</span></span><br><span class="line">        <span class="attr">volumeMounts:</span></span><br><span class="line">        <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">"pod-gpu-resources"</span></span><br><span class="line">          <span class="attr">readOnly:</span> <span class="literal">true</span></span><br><span class="line">          <span class="attr">mountPath:</span> <span class="string">"/var/lib/kubelet/pod-resources"</span></span><br><span class="line">      <span class="attr">volumes:</span></span><br><span class="line">      <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">"pod-gpu-resources"</span></span><br><span class="line">        <span class="attr">hostPath:</span></span><br><span class="line">    
      <span class="attr">path:</span> <span class="string">"/var/lib/kubelet/pod-resources"</span></span><br><span class="line"><span class="meta">---</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Service</span></span><br><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">namespace:</span> <span class="string">kube-system</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">"dcgm-exporter"</span></span><br><span class="line">  <span class="attr">labels:</span></span><br><span class="line">    <span class="attr">app.kubernetes.io/name:</span> <span class="string">"dcgm-exporter"</span></span><br><span class="line">    <span class="attr">app.kubernetes.io/version:</span> <span class="string">"2.1.1"</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">selector:</span></span><br><span class="line">    <span class="attr">app.kubernetes.io/name:</span> <span class="string">"dcgm-exporter"</span></span><br><span class="line">    <span class="attr">app.kubernetes.io/version:</span> <span class="string">"2.1.1"</span></span><br><span class="line">  <span class="attr">ports:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">"metrics"</span></span><br><span class="line">    <span class="attr">port:</span> <span class="number">9400</span></span><br></pre></td></tr></table></figure><p>After this step, the metrics of each Node can be retrieved:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure>
<p>Once the deployment is complete, add a <code>gpu-metrics</code> job to <code>scrape_configs</code> in the Prometheus configuration. The job uses the <code>kubernetes_sd_configs</code> service discovery mechanism to find the service that corresponds to <code>dcgm-exporter</code>.</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line"><span class="bullet">-</span> <span class="attr">job_name:</span> <span class="string">gpu-metrics</span></span><br><span class="line">  <span class="attr">scrape_interval:</span> <span class="string">1s</span></span><br><span class="line">  <span class="attr">metrics_path:</span> <span class="string">/metrics</span></span><br><span class="line">  <span class="attr">scheme:</span> <span class="string">http</span></span><br><span class="line">  <span class="attr">kubernetes_sd_configs:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">role:</span> <span class="string">endpoints</span></span><br><span class="line">    <span class="attr">namespaces:</span></span><br><span class="line">      <span class="attr">names:</span></span><br><span class="line">      <span class="bullet">-</span> <span class="string">kube-system</span></span><br><span class="line">    <span class="attr">selectors:</span></span><br><span class="line">      <span class="bullet">-</span> <span class="attr">role:</span> <span class="string">pod</span></span><br><span class="line">        <span class="attr">label:</span> <span class="string">"app.kubernetes.io/name=dcgm-exporter"</span></span><br><span 
class="line">  <span class="attr">relabel_configs:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">source_labels:</span> <span class="string">[__meta_kubernetes_pod_node_name]</span></span><br><span class="line">    <span class="attr">action:</span> <span class="string">replace</span></span><br><span class="line">    <span class="attr">target_label:</span> <span class="string">kubernetes_node</span></span><br></pre></td></tr></table></figure><h3 id="使用-grafana-监控"><a href="#使用-grafana-监控" class="headerlink" title="使用 grafana 监控"></a>Monitoring with Grafana</h3><p>NVIDIA provides a <a href="https://grafana.com/grafana/dashboards/12239" target="_blank" rel="external nofollow noopener noreferrer">Grafana dashboard dedicated to GPU monitoring</a>. After importing it into Grafana, the GPU monitoring dashboard is available:</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-01-06_nvidia-dcgm-grafana.png"></p><h2 id="OpenFalcon-GPU-监控插件"><a href="#OpenFalcon-GPU-监控插件" class="headerlink" title="OpenFalcon GPU 监控插件"></a>OpenFalcon GPU Monitoring Plugin</h2><p>OpenFalcon is a monitoring system open-sourced by Xiaomi; its architecture is shown in the figure below. A <code>falcon-agent</code> daemon runs on every node and collects that node's data.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-01-07_falcon-arch.png"></p><p>To support GPU monitoring, OpenFalcon has a dedicated <a href="https://github.com/open-falcon/gpu-mon" target="_blank" rel="external nofollow noopener noreferrer">GPU monitoring plugin</a>, which relies on <code>DCGM</code> to obtain its metrics. Some commonly used metrics are listed below:</p><figure class="highlight properties"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">GPUUtils</span>             <span class="string">GPU utilization (%)</span></span><br><span class="line"><span class="attr">MemUtils</span>             <span 
class="string">GPU memory utilization (%)</span></span><br><span class="line"><span class="attr">FBUsed</span>               <span class="string">GPU memory used (MB)</span></span><br><span class="line"><span class="attr">Performance</span>          <span class="string">GPU performance state (0-15, where 0 is the highest)</span></span><br><span class="line"><span class="attr">DeviceTemperature</span>    <span class="string">Current GPU temperature (℃)</span></span><br><span class="line"><span class="attr">PowerUsed</span>            <span class="string">GPU power usage</span></span><br><span class="line"><span class="attr">SingleBitError</span>       <span class="string">Total accumulated single-bit ECC errors</span></span><br><span class="line"><span class="attr">DoubleBitError</span>       <span class="string">Total accumulated double-bit ECC errors</span></span><br></pre></td></tr></table></figure><h2 id="GPU-Manager-监控数据分析"><a href="#GPU-Manager-监控数据分析" class="headerlink" title="GPU Manager 监控数据分析"></a>GPU Manager Monitoring Data Analysis</h2><p>Unlike <code>OpenFalcon</code>, GPU Manager is built on the <code>NVML</code> library and collects GPU monitoring data at the Pod level.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span 
class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(disp *Display)</span> <span class="title">getDeviceUsage</span><span class="params">(pidsInCont []<span class="keyword">int</span>, deviceIdx <span class="keyword">int</span>)</span> *<span class="title">displayapi</span>.<span class="title">DeviceInfo</span></span> &#123;</span><br><span class="line">nvml.Init()</span><br><span class="line"><span class="keyword">defer</span> nvml.Shutdown()</span><br><span class="line"></span><br><span class="line">dev, err := nvml.DeviceGetHandleByIndex(<span class="keyword">uint</span>(deviceIdx))</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">klog.Warningf(<span class="string">"can't 
find device %d, error %s"</span>, deviceIdx, err)</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span></span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">processSamples, err := dev.DeviceGetProcessUtilization(<span class="number">1024</span>, time.Second)</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">klog.Warningf(<span class="string">"can't get processes utilization from device %d, error %s"</span>, deviceIdx, err)</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span></span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">processOnDevices, err := dev.DeviceGetComputeRunningProcesses(<span class="number">1024</span>)</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">klog.Warningf(<span class="string">"can't get processes info from device %d, error %s"</span>, deviceIdx, err)</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span></span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">busID, err := dev.DeviceGetPciInfo()</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">klog.Warningf(<span class="string">"can't get pci info from device %d, error %s"</span>, deviceIdx, err)</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span></span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">sort.Slice(pidsInCont, <span class="function"><span class="keyword">func</span><span class="params">(i, j <span class="keyword">int</span>)</span> <span class="title">bool</span></span> 
&#123;</span><br><span class="line"><span class="keyword">return</span> pidsInCont[i] &lt; pidsInCont[j]</span><br><span class="line">&#125;)</span><br><span class="line"></span><br><span class="line">usedMemory := <span class="keyword">uint64</span>(<span class="number">0</span>)</span><br><span class="line">usedPids := <span class="built_in">make</span>([]<span class="keyword">int32</span>, <span class="number">0</span>)</span><br><span class="line">usedGPU := <span class="keyword">uint</span>(<span class="number">0</span>)</span><br><span class="line"><span class="keyword">for</span> _, info := <span class="keyword">range</span> processOnDevices &#123;</span><br><span class="line">idx := sort.Search(<span class="built_in">len</span>(pidsInCont), <span class="function"><span class="keyword">func</span><span class="params">(pivot <span class="keyword">int</span>)</span> <span class="title">bool</span></span> &#123;</span><br><span class="line"><span class="keyword">return</span> pidsInCont[pivot] &gt;= <span class="keyword">int</span>(info.Pid)</span><br><span class="line">&#125;)</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> idx &lt; <span class="built_in">len</span>(pidsInCont) &amp;&amp; pidsInCont[idx] == <span class="keyword">int</span>(info.Pid) &#123;</span><br><span class="line">usedPids = <span class="built_in">append</span>(usedPids, <span class="keyword">int32</span>(pidsInCont[idx]))</span><br><span class="line">usedMemory += info.UsedGPUMemory</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">for</span> _, sample := <span class="keyword">range</span> processSamples &#123;</span><br><span class="line">idx := sort.Search(<span class="built_in">len</span>(pidsInCont), <span class="function"><span class="keyword">func</span><span class="params">(pivot <span class="keyword">int</span>)</span> <span 
class="title">bool</span></span> &#123;</span><br><span class="line"><span class="keyword">return</span> pidsInCont[pivot] &gt;= <span class="keyword">int</span>(sample.Pid)</span><br><span class="line">&#125;)</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> idx &lt; <span class="built_in">len</span>(pidsInCont) &amp;&amp; pidsInCont[idx] == <span class="keyword">int</span>(sample.Pid) &#123;</span><br><span class="line">usedGPU += sample.SmUtil</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> &amp;displayapi.DeviceInfo&#123;</span><br><span class="line">Id:      busID.BusID,</span><br><span class="line">CardIdx: fmt.Sprintf(<span class="string">"%d"</span>, deviceIdx),</span><br><span class="line">Gpu:     <span class="keyword">float32</span>(usedGPU),</span><br><span class="line">Mem:     <span class="keyword">float32</span>(usedMemory &gt;&gt; <span class="number">20</span>),</span><br><span class="line">Pids:    usedPids,</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h2 id="GPU-监控指标探讨"><a href="#GPU-监控指标探讨" class="headerlink" title="GPU 监控指标探讨"></a>GPU 监控指标探讨</h2><p>对于 k8s 的 GPU 监控，我们到底需要哪些指标：</p><ul><li>集群级别<ul><li>整个集群有多少GPU，各种GPU的型号是怎样的</li><li>集群级别GPU算力使用量（绝对值），算力使用率（相对值）</li><li>集群级别GPU显存使用量（绝对值），显存使用率（相对值）</li></ul></li><li>单机级别<ul><li>Node上有多少GPU，各种GPU的型号是怎样的</li><li>单机级别GPU算力使用量（绝对值），算力使用率（相对值）</li><li>单机级别GPU显存使用量（绝对值），显存使用率（相对值）</li></ul></li><li>Pod级别<ul><li>Pod 运行在哪个GPU上</li><li>Pod级别GPU算力使用量（绝对值），算力使用率（相对值）</li><li>Pod级别GPU显存使用量（绝对值），显存使用率（相对值）</li></ul></li><li>其他相关统计数据<ul><li>GPU的功率、温度、主频、风扇转速等</li></ul></li></ul><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>参考资料</h2><ul><li><a href="https://github.com/NVIDIA/gpu-monitoring-tools" target="_blank" rel="external nofollow noopener noreferrer">NVIDIA 
GPU Monitoring Tools</a></li><li><a href="https://developer.nvidia.com/blog/monitoring-gpus-in-kubernetes-with-dcgm/" target="_blank" rel="external nofollow noopener noreferrer">Monitoring GPUs in Kubernetes with DCGM</a></li><li><a href="https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html" target="_blank" rel="external nofollow noopener noreferrer">Integrating GPU Telemetry into Kubernetes</a></li><li><a href="http://on-demand.gputechconf.com/gtc/2018/presentation/s8505-gpu-monitoring-and-management-with-nvidia-data-center-gpu-manager-dcgm-v2.pdf" target="_blank" rel="external nofollow noopener noreferrer">GTC 2018 Talk: GPU Monitoring and Management with NVIDIA Data Center GPU Manager</a></li><li><a href="https://book.open-falcon.org/zh_0_2/" target="_blank" rel="external nofollow noopener noreferrer">OpenFalcon 说明书</a></li><li><a href="https://github.com/open-falcon/gpu-mon" target="_blank" rel="external nofollow noopener noreferrer">OpenFalcon GPU监控插件</a></li></ul>]]></content>
    
    <summary type="html">
    
      &lt;link rel=&quot;stylesheet&quot; class=&quot;aplayer-secondary-style-marker&quot; href=&quot;/assets/css/APlayer.min.css&quot;&gt;&lt;script src=&quot;/assets/js/APlayer.min.js&quot; class=&quot;aplayer-secondary-script-marker&quot;&gt;&lt;/script&gt;&lt;script class=&quot;meting-secondary-script-marker&quot; src=&quot;/assets/js/Meting.min.js&quot;&gt;&lt;/script&gt;&lt;h2 id=&quot;问题背景&quot;&gt;&lt;a href=&quot;#问题背景&quot; class=&quot;headerlink&quot; title=&quot;问题背景&quot;&gt;&lt;/a&gt;问题背景&lt;/h2&gt;&lt;p&gt;在使用GPU进行深度学习相关的训练与推理时，需要查看当前集群中GPU的使用情况：&lt;/p&gt;&lt;ul&gt;
&lt;li&gt;需要通过当前GPU设备资源使用情况判断是否可以再部署新的应用，判断集群是否需要扩容，为GPU服务提供对齐CPU的容量保障服务，补齐容量保障中的GPU短板&lt;/li&gt;
&lt;li&gt;需要通过当前GPU设备资源使用情况分析使用中存在的瓶颈和短板，推进优化，提高资源利用率和服务性能&lt;/li&gt;
&lt;/ul&gt;
    
    </summary>
    
    <content src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-01-06_dcgm-modularity.png" type="image" />
    
    
      <category term="术业专攻" scheme="https://houmin.cc/categories/%E6%9C%AF%E4%B8%9A%E4%B8%93%E6%94%BB/"/>
    
    
      <category term="GPU" scheme="https://houmin.cc/tags/GPU/"/>
    
      <category term="监控" scheme="https://houmin.cc/tags/%E7%9B%91%E6%8E%A7/"/>
    
  </entry>
  
  <entry>
    <title>【Service Mesh】Istio 流量控制</title>
    <link href="https://houmin.cc/posts/151719f0/"/>
    <id>https://houmin.cc/posts/151719f0/</id>
    <published>2020-11-24T08:47:28.000Z</published>
    <updated>2022-11-09T15:13:45.392Z</updated>
    
    <content type="html"><![CDATA[<link rel="stylesheet" class="aplayer-secondary-style-marker" href="/assets/css/APlayer.min.css"><script src="/assets/js/APlayer.min.js" class="aplayer-secondary-script-marker"></script><script class="meting-secondary-script-marker" src="/assets/js/Meting.min.js"></script><p>流量控制是指对系统流量的管控，包括了对网格入口的流量、网格出口的流量以及在网格内部微服务间相互调用流量的控制。在 <a href="../22cae0b8">Istio 入门</a> 中我们知道，Istio 架构在逻辑上分为 Control plane 和 Data plane，Control plane 负责整体管理和配置代理，Data plane 负责网格内所有微服务间的网络通信，同时还收集报告网络请求的遥测数据等。流量控制在 Data plane 层实现。</p><a id="more"></a><p><img alt="Istio Architecture" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-10_istio-arch.svg"></p><h2 id="路由和流量转移"><a href="#路由和流量转移" class="headerlink" title="路由和流量转移"></a>路由和流量转移</h2><p>Istio 为了控制服务请求，引入了服务版本（version）的概念，可以通过版本这一标签将服务进行区分。版本的设置是非常灵活的，以下是几种典型的设置方式：</p><ul><li>根据服务的迭代编号进行定义（如 v1、v2 版本）</li><li>根据部署环境进行定义（比如 dev、staging、production）</li><li>自定义的任何用于区分服务的某种标记</li></ul><p>通过版本标签，Istio 就可以定义灵活的路由规则来控制流量，金丝雀发布这类应用场景就很容易实现了。</p><p>下图展示了使用服务版本实现路由分配的例子。服务版本定义了版本号（v1.5、v2.0-alpha）和环境（us-prod、us-staging）两种信息。服务 B 包含了 4 个 Pod，其中 3 个是部署在生产环境的 v1.5 版本，而 Pod4 是部署在预生产环境的 v2.0-alpha 版本。运维人员可以根据服务版本来指定路由规则，使 99% 的流量流向 v1.5 版本，而 1% 的流量进入 v2.0-alpha 版本。</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-20_istio-routing.png"></p><p>除了上面介绍的服务间流量控制外，还能控制与网格边界交互的流量。可以在系统的入口和出口处部署 Sidecar 代理，让所有流入和流出的流量都由代理进行转发。负责入和出的代理就叫做入口网关和出口网关，它们把守着进入和流出网格的流量。下图展示了 Ingress 和 Egress 在请求流中的位置，有了它们，就可以控制出入网格的流量了。</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-20_istio-gateway.png"></p><p>Istio 还能设置流量策略。比如可以对连接池相关的属性进行设置，通过修改最大连接数等参数，实现对请求负载的控制。还可以对负载均衡策略进行设置，在轮询、随机、最少访问等方式之间进行切换。还能设置异常探测策略，将满足异常条件的实例从负载均衡池中摘除，以保证服务的稳定性。</p><hr><p>Istio 的流量路由规则可以让您很容易地控制服务之间的流量和 API 调用。Istio 在服务层面提供了断路器、超时、重试等功能，通过这些功能可以简单地实现 A/B 测试、金丝雀发布、基于百分比的流量分割等，此外还提供了开箱即用的故障恢复功能，用于增加应用的健壮性，以应对服务故障或网络故障。这些功能都可以通过 Istio 的流量管理 API 
添加流量配置来实现。</p><p>跟其他 Istio 配置一样，流量管理 API 也使用 CRD 指定。本小节主要介绍下面几个典型的流量管理 API 资源，以及这些 API 的功能和使用示例。</p><h3 id="VirtualService"><a href="#VirtualService" class="headerlink" title="VirtualService"></a>VirtualService</h3><p>VirtualService 由一组 <strong>路由规则</strong> 组成，描述了 <strong>用户请求的目标地址</strong> 到 <strong>服务网格中实际工作负载</strong> 之间的映射。在这个映射中，VirtualService 提供了丰富的配置方式，可以为发送到这些 Workloads 的流量指定不同的路由规则。对应于具体的配置，用户请求的目标地址用 <code>hosts</code> 字段来表示，网格内的实际负载由每个 <code>route</code> 配置项中的 <code>destination</code> 字段指定。</p><pre class="mermaid">graph LR
subgraph VirtualService
ClientRequests -- DifferentTrafficRoutingRules --> DestinationWorkloads
Hosts -- DifferentTrafficRoutingRules --> RouteDestination
end</pre><p>VirtualService 通过解耦 <strong>用户请求的目标地址</strong> 和 <strong>真实响应请求的目标工作负载</strong>，为服务提供了统一的抽象层，并为流量路由的管理提供了一致的配置模型。对于原生 Kubernetes 而言，只有在 Ingress 处有这种路由规则的定义，对于集群内部不同 Service 的不同版本之间，并没有类似 VirtualService 的定义。</p><p>使用 VirtualService，可以为一个或多个主机名指定流量行为。在 VirtualService 中使用路由规则，告诉 Envoy 如何发送 VirtualService 的流量到适当的目标。路由目标可以是相同服务的不同版本，或者是完全不同的服务。</p><p>一个典型的应用场景是将流量发送到被指定为服务子集的服务的不同版本。客户端将 VirtualService 视为一个单一实体，将请求发送至 VirtualService 主机，然后 Envoy 根据 VirtualService 规则把流量路由到不同的版本中。</p><p>这种方式可以方便地实现金丝雀发布策略，逐步提高发送到新版本的流量比重。流量路由完全独立于实例部署，这意味着实现新版本服务的实例可以根据流量的负载来伸缩，完全不影响流量路由。相比之下，类似 Kubernetes 的容器调度平台仅支持基于实例数量比例的流量分发，这种方式很快会变得复杂。关于使用 VirtualService 实现金丝雀部署，可以参考 <a href="https://istio.io/latest/blog/2017/0.1-canary/" target="_blank" rel="external nofollow noopener noreferrer">Canary</a> 。</p><p>VirtualService 也提供了如下功能。</p><ul><li>通过单个 VirtualService 处理多个应用程序服务。例如，如果您的服务网格使用的是 Kubernetes，您可以配置一个 VirtualService 来处理一个特定命名空间的所有服务。将单一的 VirtualService 映射为多个“真实”的服务特别有用，可以在服务的调用方无需适应这一转换的情况下，将单体应用转换为微服务构建的复合应用系统。您的路由规则可以指定“请求到 <code>monolith.com</code> 的 URLs 跳转至 <code>microservice A</code> 中”。</li><li>和 Gateway 一起配置流量规则来控制入口和出口流量。</li></ul><p>在一些应用场景中，还需要配置 DestinationRule 才能使用这些功能，因为服务子集是在 DestinationRule 中指定的。在独立的对象中指定服务子集以及其他特定于目标的策略，可以让您在不同的 VirtualService 之间清晰地复用它们。</p><p>下面的 VirtualService 
根据是否来自于特定用户路由请求到不同的服务版本中（如果请求来自用户 <code>jason</code> ，则访问 <code>v2</code> 版本的 <code>reviews</code>，否则访问 <code>v3</code> 版本）：</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">networking.istio.io/v1alpha3</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">VirtualService</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">reviews</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">hosts:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="string">reviews</span></span><br><span class="line">  <span class="attr">http:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">match:</span></span><br><span class="line">    <span class="bullet">-</span> <span class="attr">headers:</span></span><br><span class="line">        <span class="attr">end-user:</span></span><br><span class="line">          <span class="attr">exact:</span> <span class="string">jason</span></span><br><span class="line">    <span class="attr">route:</span></span><br><span class="line">    
<span class="bullet">-</span> <span class="attr">destination:</span></span><br><span class="line">        <span class="attr">host:</span> <span class="string">reviews</span></span><br><span class="line">        <span class="attr">subset:</span> <span class="string">v2</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">route:</span></span><br><span class="line">    <span class="bullet">-</span> <span class="attr">destination:</span></span><br><span class="line">        <span class="attr">host:</span> <span class="string">reviews</span></span><br><span class="line">        <span class="attr">subset:</span> <span class="string">v3</span></span><br></pre></td></tr></table></figure><p>下面对这些字段依次解释：</p><h4 id="Hosts"><a href="#Hosts" class="headerlink" title="Hosts"></a>Hosts</h4><p>用来配置 Downstream 访问的可寻址地址，也就是用户请求的目标地址。</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">hosts:</span></span><br><span class="line"><span class="bullet">-</span> <span class="string">reviews</span></span><br></pre></td></tr></table></figure><ul><li>VirtualService 主机名可以是 IP 地址、 DNS 域名、完全限定域名（FQDN)</li><li>也可以是 依赖于平台的一个简称（例如 Kubernetes 服务的短名称）</li><li>也可以使用通配符 <code>*</code>前缀，创建一组匹配所有服务的路由规则</li><li>VirtualService 的 <code>hosts</code> 实际上不必是 Istio 服务注册的一部分，它只是虚拟的目标地址。这可以为没有路由到网格内部的虚拟主机建模。</li></ul><h4 id="路由规则"><a href="#路由规则" class="headerlink" title="路由规则"></a>路由规则</h4><p><code>http</code> 字段用来配置路由规则，通常情况下配置一组路由规则，当请求到来时，自上而下依次进行匹配，直到匹配成功后跳出匹配。它可以对请求的 uri、method、authority、headers、port、queryParams 以及是否对 uri 大小写敏感等进行配置。</p><blockquote><p>除了HTTP协议，也可以使用 <code>tcp</code> 和 <code>tls</code> 片段为 <a href="https://istio.io/latest/docs/reference/config/networking/virtual-service/#TCPRoute" target="_blank" rel="external nofollow noopener noreferrer">TCP</a> 和未终止的 <a 
href="https://istio.io/docs/reference/config/networking/virtual-service/#TLSRoute" target="_blank" rel="external nofollow noopener noreferrer">TLS</a> 流量设置路由规则</p></blockquote><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">http:</span></span><br><span class="line"><span class="bullet">-</span> <span class="attr">match:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">headers:</span></span><br><span class="line">      <span class="attr">end-user:</span></span><br><span class="line">        <span class="attr">exact:</span> <span class="string">jason</span></span><br><span class="line">  <span class="attr">route:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">destination:</span></span><br><span class="line">      <span class="attr">host:</span> <span class="string">reviews</span></span><br><span class="line">      <span class="attr">subset:</span> <span class="string">v2</span></span><br><span class="line"><span class="bullet">-</span> <span class="attr">route:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">destination:</span></span><br><span class="line">      <span class="attr">host:</span> <span class="string">reviews</span></span><br><span class="line">      <span class="attr">subset:</span> <span class="string">v3</span></span><br></pre></td></tr></table></figure><p>我们推荐在每个 VirtualService 中配置一条默认「无条件的」或者基于权重的规则以确保 VirtualService 至少有一条匹配的路由。</p><h5 
id="Destination"><a href="#Destination" class="headerlink" title="Destination"></a>Destination</h5><p>路由片段的 <code>destination</code> 字段指定符合匹配条件的流量目标地址。与 VirtualService 的 <code>hosts</code> 不同，Destination 的 <code>host</code> 必须是存在于 Istio 服务注册中心的实际目标地址，否则 Envoy 不知道该将请求发送到哪里。这个目标地址可以是带有代理的网格服务，也可以是通过 ServiceEntry 添加的非网格服务。下面的场景中我们运行在 Kubernetes 平台上，主机名是 Kubernetes 的服务名。</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">route:</span></span><br><span class="line"><span class="bullet">-</span> <span class="attr">destination:</span></span><br><span class="line">    <span class="attr">host:</span> <span class="string">reviews</span></span><br><span class="line">    <span class="attr">subset:</span> <span class="string">v2</span></span><br></pre></td></tr></table></figure><blockquote><figure class="highlight gams"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">*Note for Kubernetes users*: When short names are used (e.g. "reviews" instead of "reviews.default.svc.cluster.local"), Istio will interpret the short name based on the namespace of the rule, not the service. A rule in the "default" namespace containing a host "reviews" will be interpreted as "reviews.default.svc.cluster.local", irrespective of the actual namespace associated with the reviews service. 
To avoid potential misconfiguration, it is recommended to always use fully qualified domain names over short names.</span></span><br></pre></td></tr></table></figure></blockquote><h5 id="Match"><a href="#Match" class="headerlink" title="Match"></a>Match</h5><p>路由规则是将特定流量子集路由到特定目标地址的强大工具。可以在流量端口、<code>header</code> 字段、URL 等内容上设置匹配条件。例如，下面的 VirtualService 让用户可以将流量发送到 ratings 和 reviews 这两个独立的服务，就好像它们是 <code>http://bookinfo.com/</code> 这个更大的 VirtualService 的一部分。VirtualService 规则根据请求的 URL 匹配流量，并将请求转发到相应的服务。</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">networking.istio.io/v1alpha3</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">VirtualService</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">bookinfo</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">hosts:</span></span><br><span class="line">    <span class="bullet">-</span> <span class="string">bookinfo.com</span></span><br><span class="line">  <span class="attr">http:</span></span><br><span class="line">  <span class="bullet">-</span> <span 
class="attr">match:</span></span><br><span class="line">    <span class="bullet">-</span> <span class="attr">uri:</span></span><br><span class="line">        <span class="attr">prefix:</span> <span class="string">/reviews</span></span><br><span class="line">    <span class="attr">route:</span></span><br><span class="line">    <span class="bullet">-</span> <span class="attr">destination:</span></span><br><span class="line">        <span class="attr">host:</span> <span class="string">reviews</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">match:</span></span><br><span class="line">    <span class="bullet">-</span> <span class="attr">uri:</span></span><br><span class="line">        <span class="attr">prefix:</span> <span class="string">/ratings</span></span><br><span class="line">    <span class="attr">route:</span></span><br><span class="line">    <span class="bullet">-</span> <span class="attr">destination:</span></span><br><span class="line">        <span class="attr">host:</span> <span class="string">ratings</span></span><br></pre></td></tr></table></figure><p>对于匹配条件，您可以使用精确值、前缀或正则表达式。</p><p>向同一个 <code>match</code> 块添加多个匹配条件时，各条件之间是 <code>AND</code> 关系；向同一条规则添加多个 <code>match</code> 块时，各块之间是 <code>OR</code> 关系。对于任意给定的 VirtualService，您可以配置多条路由规则，因此可以根据业务场景的复杂度，在一个 VirtualService 中配置相应的路由条件。可以在 <a href="https://istio.io/docs/reference/config/networking/virtual-service/#HTTPMatchRequest" target="_blank" rel="external nofollow noopener noreferrer">HTTPMatchRequest 参考</a>中查看匹配条件字段和它们可能的值。</p><p>除了使用匹配条件，您还可以按“权重”百分比分发流量。这在 A/B 测试和金丝雀部署中非常有用。</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span 
class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">hosts:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="string">reviews</span></span><br><span class="line">  <span class="attr">http:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">route:</span></span><br><span class="line">    <span class="bullet">-</span> <span class="attr">destination:</span></span><br><span class="line">        <span class="attr">host:</span> <span class="string">reviews</span></span><br><span class="line">        <span class="attr">subset:</span> <span class="string">v1</span></span><br><span class="line">      <span class="attr">weight:</span> <span class="number">75</span></span><br><span class="line">    <span class="bullet">-</span> <span class="attr">destination:</span></span><br><span class="line">        <span class="attr">host:</span> <span class="string">reviews</span></span><br><span class="line">        <span class="attr">subset:</span> <span class="string">v2</span></span><br><span class="line">      <span class="attr">weight:</span> <span class="number">25</span></span><br></pre></td></tr></table></figure><p>您也可以使用路由规则在流量上执行一些操作，例如</p><ul><li>扩展或者删除 <code>headers</code></li><li>重写 URL</li><li>为调用这个目标地址设置重试策略</li></ul><h3 id="DestinationRule"><a href="#DestinationRule" class="headerlink" title="DestinationRule"></a>DestinationRule</h3><p><code>DestinationRule</code> 是 Istio 流量路由功能的重要组成部分。一个 <code>VirtualService</code> 可以看作是如何将流量分发到给定的目标地址，然后调用 <code>DestinationRule</code> 来配置分发到该目标地址的流量。<code>DestinationRule</code> 在 <code>VirtualService</code> 的路由规则之后起作用(即在 <code>VirtualService</code> 的 <code>match</code> -&gt; <code>route</code> -&gt; <code>destination</code> 之后起作用，此时流量已经分发到真实的 <code>Service</code> 上)，应用于真实的目标地址。</p><p>特别地，可以使用 
<code>DestinationRule</code> 来指定命名的服务子集，例如根据版本对服务的实例进行分组，然后通过 <code>VirtualService</code> 的路由规则中的服务子集将控制流量分发到不同服务的实例中。</p><p><code>DestinationRule</code> 允许在调用完整的目标服务或特定的服务子集(如倾向使用的负载均衡模型，TLS 安全模型或断路器)时自定义 Envoy流量策略。Istio 默认会使用轮询策略，此外 Istio 也支持如下负载均衡模型，可以在 <code>DestinationRule</code> 中使用这些模型，将请求分发到特定的服务或服务子集。</p><ul><li>Random：将请求转发到一个随机的实例上</li><li>Weighted：按照指定的百分比将请求转发到实例上</li><li>Least requests：将请求转发到具有最少请求数目的实例上</li></ul><p>下面的 <code>DestinationRule</code> 使用不同的负载均衡策略为 my-svc 目的服务配置了3个不同的 Subset</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">networking.istio.io/v1alpha3</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">DestinationRule</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">my-destination-rule</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">host:</span> <span class="string">my-svc</span></span><br><span class="line">  <span class="attr">trafficPolicy:</span>     <span class="comment">#默认的负载均衡策略模型为随机</span></span><br><span 
class="line">    <span class="attr">loadBalancer:</span></span><br><span class="line">      <span class="attr">simple:</span> <span class="string">RANDOM</span></span><br><span class="line">  <span class="attr">subsets:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">v1</span>  <span class="comment">#subset1，将流量转发到具有标签 version:v1 的 deployment 对应的服务上</span></span><br><span class="line">    <span class="attr">labels:</span></span><br><span class="line">      <span class="attr">version:</span> <span class="string">v1</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">v2</span>  <span class="comment">#subset2，将流量转发到具有标签 version:v2 的 deployment 对应的服务上,指定负载均衡为轮询</span></span><br><span class="line">    <span class="attr">labels:</span></span><br><span class="line">      <span class="attr">version:</span> <span class="string">v2</span></span><br><span class="line">    <span class="attr">trafficPolicy:</span></span><br><span class="line">      <span class="attr">loadBalancer:</span></span><br><span class="line">        <span class="attr">simple:</span> <span class="string">ROUND_ROBIN</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">v3</span>   <span class="comment">#subset3，将流量转发到具有标签 version:v3 的 deployment 对应的服务上</span></span><br><span class="line">    <span class="attr">labels:</span></span><br><span class="line">      <span class="attr">version:</span> <span class="string">v3</span></span><br></pre></td></tr></table></figure><p>每个子集由一个或多个 <code>labels</code> 定义，对应 Kubernetes 中的对象(如 <code>Pod</code> )的 key/value 对。这些标签定义在 Kubernetes 服务的 deployment 的 metadata 中，用于标识不同的版本。</p><p>除了定义子集外，<code>DestinationRule</code> 还定义了该目的地中所有子集的默认流量策略，以及仅覆盖该子集的特定策略。默认的策略定义在 <code>subset</code> 字段之上，为 <code>v1</code> 和 <code>v3</code> 子集设置了随机负载均衡策略，在 <code>v2</code> 
子集中使用了轮询负载均衡。</p><h3 id="Gateway"><a href="#Gateway" class="headerlink" title="Gateway"></a>Gateway</h3><p>Gateway 用于管理进出网格的流量，指定可以进入或离开网格的流量。Gateway 配置应用于网格边缘的独立的 Envoy 代理上，而不是服务负载的 Envoy 代理上。</p><p>与其他控制进入系统的流量的机制(如 Kubernetes Ingress API)不同，Istio gateway 允许利用 Istio 的流量路由的强大功能和灵活性。Istio 的 gateway 资源仅允许配置 4-6 层的负载均衡属性，如暴露的端口、TLS 配置等等，但结合 Istio 的 <code>VirtualService</code>，就可以像管理 Istio 网格中的其他数据面流量一样管理 Gateway 的流量。</p><p>Gateway 主要用于管理 Ingress 流量，但也可以配置 Egress Gateway。通过 Egress Gateway 可以为离开网格的流量配置专用的出口节点，限制哪些服务可以访问外部网络，或通过 Egress 安全控制来提高网格的安全性。Gateway 也可以配置为一个纯粹的内部代理。</p><p>Istio (通过 <code>istio-ingressgateway</code> 和 <code>istio-egressgateway</code> 参数)提供了一些预配置的 Gateway 代理，<code>default</code> profile 下仅会部署 Ingress Gateway。Gateway 可以通过部署文件进行部署，也可以单独部署。</p><p>下面列出当前命名空间中的 Gateway 资源，其中的 <code>bookinfo-gateway</code> 由 Bookinfo 示例创建：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$ kubectl get gw</span><br><span class="line">NAME               AGE</span><br><span class="line">bookinfo-gateway   28h</span><br></pre></td></tr></table></figure><p><code>default</code> profile 默认安装的 Ingress Gateway 就是一个普通的 <code>Pod</code>，该 <code>Pod</code> 仅包含一个 istio-proxy 容器：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">$ kubectl get pod -n istio-system |grep ingress</span><br><span class="line">istio-ingressgateway-64f6f9d5c6-qrnw2 1/1 Running 0 4d20h</span><br></pre></td></tr></table></figure><p>下面是一个 Gateway 的例子，用于配置外部 HTTPS 的 ingress 流量：</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span 
class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">networking.istio.io/v1alpha3</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Gateway</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">ext-host-gwy</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">selector:</span>              <span class="comment">#指定 gateway 配置下发的代理，如具有标签 app: my-gateway-controller 的 pod</span></span><br><span class="line">    <span class="attr">app:</span> <span class="string">my-gateway-controller</span></span><br><span class="line">  <span class="attr">servers:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">port:</span>                <span class="comment">#gateway pod 暴露的端口信息</span></span><br><span class="line">      <span class="attr">number:</span> <span class="number">443</span></span><br><span class="line">      <span class="attr">name:</span> <span class="string">https</span></span><br><span class="line">      <span class="attr">protocol:</span> <span class="string">HTTPS</span></span><br><span class="line">    <span class="attr">hosts:</span>                <span class="comment">#外部流量</span></span><br><span class="line">    <span class="bullet">-</span> <span class="string">ext-host.example.com</span></span><br><span class="line">    <span class="attr">tls:</span></span><br><span class="line">      <span class="attr">mode:</span> <span 
class="string">SIMPLE</span></span><br><span class="line">      <span class="attr">serverCertificate:</span> <span class="string">/tmp/tls.crt</span></span><br><span class="line">      <span class="attr">privateKey:</span> <span class="string">/tmp/tls.key</span></span><br></pre></td></tr></table></figure><p>上述 Gateway 配置允许来自 <code>ext-host.example.com</code> 的 HTTPS 流量通过 443 端口进入网格，但没有指定该流量的路由。(此时流量只能进入网格，但没有指定处理该流量的服务，因此需要与 <code>VirtualService</code> 进行绑定)</p><p>要为 Gateway 指定路由，需要通过 <code>VirtualService</code> 的 <code>gateways</code> 字段，将 <code>Gateway</code> 绑定到一个 <code>VirtualService</code> 上，把来自 <code>ext-host.example.com</code> 的流量引入该 <code>VirtualService</code>。<code>hosts</code> 可以是通配符，表示引入匹配到的流量。</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">networking.istio.io/v1alpha3</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">VirtualService</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">virtual-svc</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">hosts:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="string">ext-host.example.com</span></span><br><span class="line">  <span class="attr">gateways:</span>        <span class="comment">#将 gateway "ext-host-gwy" 绑定到 virtual service "virtual-svc"上</span></span><br><span class="line">  <span class="bullet">-</span> <span 
class="string">ext-host-gwy</span></span><br></pre></td></tr></table></figure><p>The Egress Gateway provides centralized control over traffic leaving the mesh. It is not enabled by default when Istio is installed. Use the following command to check whether it is running.</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> kubectl get pod -l istio=egressgateway -n istio-system</span></span><br></pre></td></tr></table></figure><p>If it is not enabled, add it with the following command.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">$ istioctl manifest apply --<span class="built_in">set</span> values.global.istioNamespace=istio-system \</span><br><span class="line">    --<span class="built_in">set</span> values.gateways.istio-egressgateway.enabled=<span class="literal">true</span></span><br></pre></td></tr></table></figure><p>A simple Egress Gateway example looks like this:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">networking.istio.io/v1alpha3</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Gateway</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">istio-egressgateway</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span 
class="line">  <span class="attr">selector:</span></span><br><span class="line">    <span class="attr">istio:</span> <span class="string">egressgateway</span></span><br><span class="line">  <span class="attr">servers:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">port:</span></span><br><span class="line">      <span class="attr">number:</span> <span class="number">80</span></span><br><span class="line">      <span class="attr">name:</span> <span class="string">http</span></span><br><span class="line">      <span class="attr">protocol:</span> <span class="string">HTTP</span></span><br><span class="line">    <span class="attr">hosts:</span></span><br><span class="line">    <span class="bullet">-</span> <span class="string">edition.cnn.com</span></span><br></pre></td></tr></table></figure><p>As you can see, unlike the Ingress Gateway, the Egress Gateway proxies traffic through Pods labeled <code>istio: egressgateway</code>, which are themselves Envoy proxies. When a workload inside the mesh needs to reach <code>edition.cnn.com</code>, the traffic is first forwarded to the Egress Gateway (provided a VirtualService routes that host through it), and the Egress Gateway then forwards it to <code>edition.cnn.com</code>.</p><h3 id="ServiceEntry"><a href="#ServiceEntry" class="headerlink" title="ServiceEntry"></a>ServiceEntry</h3><p>Istio can integrate with several service registries, such as Kubernetes and Consul. When the control plane component <code>Pilot</code> starts, it fetches the service information and instance lists of the <code>Service Mesh</code> cluster from the configured registry, processes and converts them, and pushes the result to the data plane via xDS so that services can discover and reach one another.</p><p>However, because this service and instance information all comes from inside the mesh, Istio cannot obtain services outside the mesh directly from the registry, which hampers communication and traffic management between the mesh and external services. Istio introduces ServiceEntry to address this.</p><p>A ServiceEntry adds an entry for an external service to Istio's internal service registry so that services in the mesh can access and route to these manually specified services. A ServiceEntry describes the properties of a service (DNS name, VIPs, ports, protocols, endpoints). Such services may be outside the mesh (for example, web APIs) or inside the mesh but absent from the platform's service registry (for example, a set of VMs that need to talk to Kubernetes services).</p><h4 id="ServiceEntry-示例和属性介绍"><a href="#ServiceEntry-示例和属性介绍" class="headerlink" title="ServiceEntry 示例和属性介绍"></a>ServiceEntry examples and attributes</h4><p>For a service outside the mesh, the following ServiceEntry lets applications inside the mesh access an external API over HTTPS.</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 
class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">networking.istio.io/v1alpha3</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">ServiceEntry</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">google</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">hosts:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="string">www.google.com</span></span><br><span class="line">  <span class="attr">ports:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">number:</span> <span class="number">443</span></span><br><span class="line">    <span class="attr">name:</span> <span class="string">https</span></span><br><span class="line">    <span class="attr">protocol:</span> <span class="string">HTTPS</span></span><br><span class="line">  <span class="attr">resolution:</span> <span class="string">DNS</span></span><br><span class="line">  <span class="attr">location:</span> <span class="string">MESH_EXTERNAL</span></span><br></pre></td></tr></table></figure><p>For services inside the mesh that are not in the platform's service registry, the following example adds a set of MongoDB instances running on unmanaged VMs to Istio's registry so that they can be treated like any other service in the mesh.</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span 
class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">networking.istio.io/v1alpha3</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">ServiceEntry</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">external-svc-mongocluster</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">hosts:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="string">mymongodb.somedomain</span></span><br><span class="line">  <span class="attr">addresses:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="number">192.192</span><span class="number">.192</span><span class="number">.192</span><span class="string">/24</span> <span class="comment"># VIPs</span></span><br><span class="line">  <span class="attr">ports:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">number:</span> <span class="number">27018</span></span><br><span class="line">    <span class="attr">name:</span> <span class="string">mongodb</span></span><br><span class="line">    <span class="attr">protocol:</span> <span class="string">MONGO</span></span><br><span class="line">  <span class="attr">location:</span> <span class="string">MESH_INTERNAL</span></span><br><span class="line">  <span class="attr">resolution:</span> <span class="string">STATIC</span></span><br><span 
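class="line">  <span class="attr">endpoints:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">address:</span> <span class="number">2.2</span><span class="number">.2</span><span class="number">.2</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">address:</span> <span class="number">3.3</span><span class="number">.3</span><span class="number">.3</span></span><br></pre></td></tr></table></figure><p>Based on the examples above, the key ServiceEntry attributes are:</p><ul><li><code>hosts</code>: the host names associated with this ServiceEntry; an entry may be a DNS name with a wildcard prefix.</li><li><code>addresses</code>: the virtual IP addresses associated with the service; they may be given as CIDR prefixes.</li><li><code>ports</code>: the ports associated with the external service. If the service's endpoints are Unix socket addresses, there must be exactly one port.</li><li><code>location</code>: whether the service is inside the mesh (MESH_INTERNAL) or outside it (MESH_EXTERNAL).</li><li><code>resolution</code>: the service discovery mode for the hosts: NONE, STATIC, or DNS.</li><li><code>endpoints</code>: one or more endpoints associated with the service.</li><li><code>exportTo</code>: controls the visibility of the ServiceEntry across namespaces, that is, whether a resource defined in one namespace can be used by <code>Sidecar</code>, Gateway and VirtualService resources in other namespaces. Two values are currently supported: "." exports to the current namespace only, and "*" exports to all namespaces.</li></ul><h4 id="使用-ServiceEntry-访问外部服务"><a href="#使用-ServiceEntry-访问外部服务" class="headerlink" title="使用 ServiceEntry 访问外部服务"></a>Accessing external services with ServiceEntry</h4><p>Istio offers three ways to access external services:</p><ol><li>Allow the <code>Sidecar</code> to pass requests through to any external service that has not been configured in the mesh. With this approach, access to external services cannot be monitored and Istio's traffic control features cannot be used.</li><li>Configure a ServiceEntry to provide controlled access to external services. This is the approach Istio officially recommends.</li><li>Bypass the <code>Sidecar</code> entirely for a specific range of IPs. This is recommended only when external access cannot be configured through the <code>Sidecar</code> for performance or other reasons.</li></ol><p>Here we focus on the second approach: using a ServiceEntry for controlled access to services outside the mesh.</p><p>For how the <code>Sidecar</code> handles external services, Istio provides two options (the mesh's <code>outboundTrafficPolicy.mode</code> setting):</p><ul><li><code>ALLOW_ANY</code>: the default. The Istio proxy allows calls to unknown external services. The first approach above relies on this setting.</li><li><code>REGISTRY_ONLY</code>: the Istio proxy blocks any host that has no HTTP service or ServiceEntry defined in the mesh.</li></ul><p>Use the following command to check which mode is currently in effect:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 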
class="line">2</span><br></pre></td><td class="code"><pre><span class="line">$ kubectl get configmap istio -n istio-system -o yaml | grep -o <span class="string">"mode: ALLOW_ANY"</span></span><br><span class="line">mode: ALLOW_ANY</span><br></pre></td></tr></table></figure><p>If the current mode is <code>ALLOW_ANY</code>, switch to <code>REGISTRY_ONLY</code> with the following command:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">$ kubectl get configmap istio -n istio-system -o yaml | sed <span class="string">'s/mode: ALLOW_ANY/mode: REGISTRY_ONLY/g'</span> | kubectl replace -n istio-system -f -</span><br><span class="line">configmap <span class="string">"istio"</span> replaced</span><br></pre></td></tr></table></figure><p>In <code>REGISTRY_ONLY</code> mode, a ServiceEntry is required to access an external service. Once the following ServiceEntry is created, applications inside the mesh can access the httpbin.org service normally.</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">networking.istio.io/v1alpha3</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">ServiceEntry</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">httpbin-ext</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">hosts:</span></span><br><span 
class="line">  <span class="bullet">-</span> <span class="string">httpbin.org</span></span><br><span class="line">  <span class="attr">ports:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">number:</span> <span class="number">80</span></span><br><span class="line">    <span class="attr">name:</span> <span class="string">http</span></span><br><span class="line">    <span class="attr">protocol:</span> <span class="string">HTTP</span></span><br><span class="line">  <span class="attr">resolution:</span> <span class="string">DNS</span></span><br><span class="line">  <span class="attr">location:</span> <span class="string">MESH_EXTERNAL</span></span><br></pre></td></tr></table></figure><h4 id="管理外部流量"><a href="#管理外部流量" class="headerlink" title="管理外部流量"></a>Managing external traffic</h4><p>A ServiceEntry lets services inside the mesh discover and reach external services. Beyond that, the traffic to those external services can also be managed: combine a VirtualService with the corresponding ServiceEntry to configure access rules such as request timeouts and fault injection, which gives controlled access to the designated service.</p><p>The example below sets a timeout for the external service httpbin.org. When a request takes longer than 3s it is aborted, so that high latency on the external service does not degrade internal services. Because the stability of external services usually cannot be controlled or monitored, this timeout matters a great deal for keeping internal services running normally.</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">networking.istio.io/v1alpha3</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">VirtualService</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">httpbin-ext</span></span><br><span 
class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">hosts:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="string">httpbin.org</span></span><br><span class="line">  <span class="attr">http:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">timeout:</span> <span class="string">3s</span></span><br><span class="line">    <span class="attr">route:</span></span><br><span class="line">      <span class="bullet">-</span> <span class="attr">destination:</span></span><br><span class="line">          <span class="attr">host:</span> <span class="string">httpbin.org</span></span><br><span class="line">        <span class="attr">weight:</span> <span class="number">100</span></span><br></pre></td></tr></table></figure><p>Similarly, we can configure fault injection rules for a ServiceEntry as a basis for system testing. The example below injects a 403 error into every request to the <code>httpbin.org</code> service.</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">networking.istio.io/v1alpha3</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">VirtualService</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line"> <span class="attr">name:</span> <span class="string">httpbin-service</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line"> <span 
class="attr">hosts:</span></span><br><span class="line"> <span class="bullet">-</span> <span class="string">httpbin.org</span></span><br><span class="line"> <span class="attr">http:</span></span><br><span class="line"> <span class="bullet">-</span> <span class="attr">route:</span></span><br><span class="line">   <span class="bullet">-</span> <span class="attr">destination:</span></span><br><span class="line">       <span class="attr">host:</span> <span class="string">httpbin.org</span></span><br><span class="line">   <span class="attr">fault:</span></span><br><span class="line">     <span class="attr">abort:</span></span><br><span class="line">       <span class="attr">percent:</span> <span class="number">100</span></span><br><span class="line">       <span class="attr">httpStatus:</span> <span class="number">403</span></span><br></pre></td></tr></table></figure><h3 id="Sidecar"><a href="#Sidecar" class="headerlink" title="Sidecar"></a>Sidecar</h3><p>By default, the Envoy proxies in all Pods in an Istio mesh are reachable from one another. In some scenarios, however, we want resource isolation, so that workloads can only access resources in certain namespaces. The Sidecar resource handles this. Here is a simple example:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">networking.istio.io/v1alpha3</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Sidecar</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">default</span></span><br><span class="line">  <span class="attr">namespace:</span> <span class="string">bookinfo</span></span><br><span 
class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">egress:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">hosts:</span></span><br><span class="line">    <span class="bullet">-</span> <span class="string">"./*"</span></span><br><span class="line">    <span class="bullet">-</span> <span class="string">"istio-system/*"</span></span><br></pre></td></tr></table></figure><p>This example specifies that all services in the bookinfo namespace can only access services in their own namespace and in the <code>istio-system</code> namespace.</p><h2 id="弹性功能"><a href="#弹性功能" class="headerlink" title="弹性功能"></a>Resilience features</h2><p>Besides its core routing and traffic-shifting capabilities, Istio also provides some resilience features. It currently supports timeouts, retries and circuit breaking.</p><h3 id="Request-Timeouts"><a href="#Request-Timeouts" class="headerlink" title="Request Timeouts"></a>Request Timeouts</h3><p>If a request goes a long time without returning a result, a timeout is needed so that an error is returned once the configured time is exceeded. This saves the resources spent waiting and avoids the problems caused by cascading failures.</p><p>There are many ways to set a timeout. One is to change application code to set a request timeout, but that is inflexible and easy to miss in places. Istio solves the problem at the infrastructure layer: adding a timeout only takes a <code>timeout</code> field in the route configuration.</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">networking.istio.io/v1alpha3</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">VirtualService</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">ratings</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span 
class="line">  <span class="attr">hosts:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="string">ratings</span></span><br><span class="line">  <span class="attr">http:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">route:</span></span><br><span class="line">    <span class="bullet">-</span> <span class="attr">destination:</span></span><br><span class="line">        <span class="attr">host:</span> <span class="string">ratings</span></span><br><span class="line">        <span class="attr">subset:</span> <span class="string">v1</span></span><br><span class="line">    <span class="attr">timeout:</span> <span class="string">10s</span></span><br></pre></td></tr></table></figure><h3 id="Retries"><a href="#Retries" class="headerlink" title="Retries"></a>Retries</h3><p>On an unstable network, destinations can be temporarily unreachable, and a retry mechanism makes several attempts to obtain a correct response. Retry logic can live in application code; the <code>productpage</code> service in the Bookinfo application, for example, has hard-coded retries. Istio provides retries through simple configuration, so developers can ignore the retry implementation and focus on business code. As with timeouts, adding retries in Istio only takes a <code>retries</code> field in the route configuration.</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">networking.istio.io/v1alpha3</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">VirtualService</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> 
<span class="string">ratings</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">hosts:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="string">ratings</span></span><br><span class="line">  <span class="attr">http:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">route:</span></span><br><span class="line">    <span class="bullet">-</span> <span class="attr">destination:</span></span><br><span class="line">        <span class="attr">host:</span> <span class="string">ratings</span></span><br><span class="line">        <span class="attr">subset:</span> <span class="string">v1</span></span><br><span class="line">    <span class="attr">retries:</span></span><br><span class="line">      <span class="attr">attempts:</span> <span class="number">3</span></span><br><span class="line">      <span class="attr">perTryTimeout:</span> <span class="string">2s</span></span><br></pre></td></tr></table></figure><h3 id="Circuit-Breaking"><a href="#Circuit-Breaking" class="headerlink" title="Circuit Breaking"></a>Circuit Breaking</h3><p>Circuit breaking is a very useful overload-protection mechanism that prevents cascading failures. A circuit breaker sets limits on calls to an individual host within a service, such as the number of concurrent connections or the number of failed calls to that host. Once a limit is hit, the breaker "trips" and stops connecting to that host. The pattern lets clients fail fast instead of continuing to connect to an overloaded or failing host. Circuit breaking applies to the "real" mesh destinations in a load-balancing pool; breaker thresholds are configured in a DestinationRule and apply to each individual host of the service.</p><p>Circuit breaking in Istio is configured in the <code>TrafficPolicy</code> of the <code>DestinationRule</code> custom resource. The example below limits the number of concurrent connections to the workloads of the <code>reviews</code> service's v1 subset to 100:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span 
class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">networking.istio.io/v1alpha3</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">DestinationRule</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">reviews</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">host:</span> <span class="string">reviews</span></span><br><span class="line">  <span class="attr">subsets:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">v1</span></span><br><span class="line">    <span class="attr">labels:</span></span><br><span class="line">      <span class="attr">version:</span> <span class="string">v1</span></span><br><span class="line">    <span class="attr">trafficPolicy:</span></span><br><span class="line">      <span class="attr">connectionPool:</span></span><br><span class="line">        <span class="attr">tcp:</span></span><br><span class="line">          <span class="attr">maxConnections:</span> <span class="number">100</span></span><br></pre></td></tr></table></figure><h2 id="调试能力"><a href="#调试能力" class="headerlink" title="调试能力"></a>Debugging capabilities</h2><p>Istio also provides ways to debug traffic, including fault injection and traffic mirroring. Debugging traffic makes the system more fault tolerant and helps us locate root causes quickly during troubleshooting.</p><h3 id="Fault-Injection"><a href="#Fault-Injection" class="headerlink" title="Fault Injection"></a>Fault Injection</h3><p>In a microservice system, reaching a high level of robustness usually requires targeted failure testing. A failure in, say, the order or payment system of an e-commerce platform would be a very serious production incident. The range of possible failures therefore has to be considered early in system design, with a thorough recovery strategy or graceful fallback designed for each one, so that such incidents are avoided as far as possible and the system keeps working when something does fail. Simulating service failures has long been tedious work, and fault injection emerged against this background: it is a way to simulate abnormal request and response behavior of an upstream service. By artificially simulating failures of requests to the upstream service, we can check whether the downstream service's failure handling can withstand them and recover on its own.</p><p>Istio 
provides a non-intrusive fault injection mechanism that lets developers and testers simulate service failures through configuration alone, without changing the service code. Istio 1.5 supports only network-level fault simulation: it can simulate the upstream service's processing time, abnormal service states, custom response status codes and similar faults, but it cannot yet simulate faults in host resources such as memory or CPU. All of these are configured through the VirtualService of the upstream host. Once fault injection is configured in a VirtualService, the Envoy proxy handling requests to the upstream service responds accordingly after intercepting a request.</p><p>Istio currently provides two types of fault injection: abort and delay.</p><ul><li><strong>abort</strong>: optional. An Abort object used to inject request-failure faults. Put simply, it simulates the upstream service returning a given error code so that we can check whether the current service can cope. It corresponds to the <a href="https://www.envoyproxy.io/docs/envoy/latest/api-v2/config/filter/http/fault/v2/fault.proto#envoy-api-msg-config-filter-http-fault-v2-faultabort" target="_blank" rel="external nofollow noopener noreferrer">config.filter.http.fault.v2.FaultAbort</a> setting of the Envoy filter. When the VirtualService resource is applied, Envoy loads this setting into the filter and applies it to the traffic it receives.</li><li><strong>delay</strong>: optional. A Delay object used to inject latency faults. In plain terms, it artificially lengthens the upstream service's response time to test whether the current service tolerates high latency. It corresponds to the <a href="https://www.envoyproxy.io/docs/envoy/latest/api-v2/config/filter/fault/v2/fault.proto#envoy-api-msg-config-filter-fault-v2-faultdelay" target="_blank" rel="external nofollow noopener noreferrer">config.filter.fault.v2.FaultDelay</a> setting of the Envoy filter. Likewise, Envoy adds this setting to the filter when the Istio VirtualService resource is applied.</li></ul><p>In fact, Istio's fault injection is built on Envoy's config.filter.http.fault.v2.HTTPFault filter, and its limitations come from those of Envoy's fault injection mechanism. For details on Envoy's HTTPFault, see the <a href="https://www.envoyproxy.io/docs/envoy/latest/api-v2/config/filter/http/fault/v2/fault.proto#envoy-api-msg-config-filter-http-fault-v2-httpfault" target="_blank" rel="external nofollow noopener noreferrer">Envoy documentation</a>. Comparing Istio's fault injection options with Envoy's shows that Istio simplifies fault control by dropping Envoy's options for controlling fault injection through HTTP headers.</p><h4 id="HTTPFaultInjection-Abort"><a href="#HTTPFaultInjection-Abort" class="headerlink" title="HTTPFaultInjection.Abort"></a>HTTPFaultInjection.Abort</h4><ul><li><strong>httpStatus</strong>: required. An integer giving the HTTP status code returned for the injected fault.</li><li><strong>percentage</strong>: optional. A Percent value giving the share of requests into which the fault is injected. If it is not specified, the fault is injected into all requests.</li><li><strong>percent</strong>: deprecated. It does the same thing as percentage, which has replaced it.</li></ul>
<p>The following configuration injects a fault into requests to the <code>v1</code> subset of the <code>ratings.prod.svc.cluster.local</code> service. <code>0.1</code> means that one in a thousand requests (0.1%) has the fault injected, and <code>400</code> means that the injected fault is an HTTP <code>400</code> response.</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">networking.istio.io/v1alpha3</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">VirtualService</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">ratings-route</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">hosts:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="string">ratings.prod.svc.cluster.local</span></span><br><span class="line">  <span class="attr">http:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">route:</span></span><br><span class="line">    <span class="bullet">-</span> <span class="attr">destination:</span></span><br><span class="line">        <span class="attr">host:</span> <span class="string">ratings.prod.svc.cluster.local</span></span><br><span class="line">        <span class="attr">subset:</span> <span class="string">v1</span></span><br><span class="line">    
<span class="attr">fault:</span></span><br><span class="line">      <span class="attr">abort:</span></span><br><span class="line">        <span class="attr">percentage:</span></span><br><span class="line">          <span class="attr">value:</span> <span class="number">0.1</span></span><br><span class="line">        <span class="attr">httpStatus:</span> <span class="number">400</span></span><br></pre></td></tr></table></figure><h4 id="HTTPFaultInjection-Delay"><a href="#HTTPFaultInjection-Delay" class="headerlink" title="HTTPFaultInjection.Delay"></a>HTTPFaultInjection.Delay</h4><ul><li><strong>fixedDelay</strong>: required. The simulated processing time for the request. The format is <code>1h/1m/1s/1ms</code>, and the value must not be less than <code>1ms</code>.</li><li><strong>percentage</strong>: optional. A Percent value giving the share of requests into which the fault is injected. If it is not specified, the fault is injected into all requests.</li><li><strong>percent</strong>: deprecated. It does the same thing as <code>percentage</code>, which has replaced it.</li></ul><p>The following configuration injects a delay fault into requests from workloads labeled <code>env: prod</code> to the <code>v1</code> subset of the <code>reviews.prod.svc.cluster.local</code> service. <code>0.1</code> means that one in a thousand requests (0.1%) has the fault injected, and <code>5s</code> means that the response from <code>reviews.prod.svc.cluster.local</code> is delayed by <code>5s</code>.</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">networking.istio.io/v1alpha3</span></span><br><span class="line"><span 
class="attr">kind:</span> <span class="string">VirtualService</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">reviews-route</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">hosts:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="string">reviews.prod.svc.cluster.local</span></span><br><span class="line">  <span class="attr">http:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">match:</span></span><br><span class="line">    <span class="bullet">-</span> <span class="attr">sourceLabels:</span></span><br><span class="line">        <span class="attr">env:</span> <span class="string">prod</span></span><br><span class="line">    <span class="attr">route:</span></span><br><span class="line">    <span class="bullet">-</span> <span class="attr">destination:</span></span><br><span class="line">        <span class="attr">host:</span> <span class="string">reviews.prod.svc.cluster.local</span></span><br><span class="line">        <span class="attr">subset:</span> <span class="string">v1</span></span><br><span class="line">    <span class="attr">fault:</span></span><br><span class="line">      <span class="attr">delay:</span></span><br><span class="line">        <span class="attr">percentage:</span></span><br><span class="line">          <span class="attr">value:</span> <span class="number">0.1</span></span><br><span class="line">        <span class="attr">fixedDelay:</span> <span class="string">5s</span></span><br></pre></td></tr></table></figure><h3 id="Mirroring"><a href="#Mirroring" class="headerlink" title="Mirroring"></a>Mirroring</h3><p>Traffic mirroring (Mirroring / 
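traffic-shadow）, also known as shadow traffic, copies a request and sends the copy to a mirror service, thereby replicating traffic. It has several use cases, the most important of which is <strong>troubleshooting production issues</strong>.</p><p>Because of differences in the system environment, especially in data and in how users behave, it is usually hard to reproduce tricky production problems in a development environment. Production also cannot log in great detail, so such problems are hard to pin down. With traffic mirroring, we can send real requests to a mirror service and turn on debug logging there to inspect the details. Mirroring can also be used to observe how well a service handles production requests, for example by stress testing the mirror service, and the copied requests can be used for data analysis. Traffic mirroring is also simple to set up in Istio: just add the <code>mirror</code> keyword to the route configuration.</p><h4 id="流量镜像能够为我们带来什么"><a href="#流量镜像能够为我们带来什么" class="headerlink" title="流量镜像能够为我们带来什么"></a>What traffic mirroring gives us</h4><p>After refactoring a service or making a major optimization to a project, how do we make sure the service is still robust? With traditional services, all we can do is test heavily and simulate how the service responds under various conditions. Manual testing, automated testing, and stress testing all help, but testing is inherently a sampling exercise: however thorough the test cases, they cannot fully represent the shape of real production traffic. Surprises tend to appear after release. The service receives special characters that the customer's database does not recognize; a user types “—” into a field that expects a date; an attacker floods the service with requests whose token has been replaced with garbage. Cases like these are common. Data is so diverse and complex that developers simply cannot anticipate everything during development.</p><p>Traffic mirroring goes a long way toward solving this class of problem. Instead of judging a service's robustness from a small sample, it continuously mirrors production traffic into the pre-release environment without affecting production. The refactored service is exercised by real traffic before it goes live, so risks surface before release, and by repeatedly exposing and fixing problems the service can approach the robustness of the production service by the time it launches. Because the test environment receives real traffic, the diversity, realism, and complexity of that traffic are all represented, and the pre-release service shows its true processing capacity and how it really handles errors. With this approach we no longer have to go into a release feeling uneasy and hoping nothing breaks afterwards. In addition, once a large volume of traffic flows into the refactored service, performance problems that were hard to assess during development show up in full, which prompts developers to look at the service's performance and testers to improve their test cases. Exposing a problem, fixing it, and repeating the cycle gradually hardens the pre-release service and raises the success rate of releases. As a side effect, it also helps developers and testers improve their skills.</p><p>Traffic mirroring is useful beyond this scenario. Suppose we find a production problem after release that we cannot reproduce in the test environment. We can mirror the problematic traffic to a branch service and analyze and debug freely there. The branch service can be a copy of the original service that is used only for analysis and handles no real business, or a utility service that only collects mirrored traffic. Or suppose we suddenly need to collect characteristics of certain traffic during a certain period for analysis. Traffic mirroring suits such ad hoc needs well: production keeps running normally and we still get the data we need.</p><h4 id="流量镜像的实现原理"><a href="#流量镜像的实现原理" class="headerlink" title="流量镜像的实现原理"></a>How traffic mirroring works</h4><p>In Istio, all communication between services is intercepted and handled by Envoy proxies, and Istio's traffic mirroring is built on an Envoy feature. The forwarding flow is as follows. When traffic enters <code>Service A</code>, the Envoy proxy of <code>Service A</code> has a mirroring rule configured, so it first forwards the original traffic to the <code>v1</code> subset of <code>Service B</code>. At the same time it makes a copy of that traffic and sends it asynchronously to the <code>v2</code> subset of <code>Service B</code>. Note that <code>Service A</code> 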
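does not care about the response once it has sent the mirrored traffic.</p><p>We often need to collect and analyze both real and mirrored traffic, so once it is collected, how do we tell which is which? The Envoy team anticipated this: to distinguish mirrored traffic from real traffic, Envoy modifies the <code>host</code> request header of mirrored requests. The rule is that <code>-shadow</code> is appended to the value of the <code>host</code> header of the original request, and the result is used as the <code>host</code> header of the mirrored request.</p><p>To make the difference between original and mirrored traffic clearer, consider the following example:</p><p>As the figure below shows, we send a request to <code>http://istio.gateway.xxxx.tech/serviceB/request/info</code>. The request first reaches <code>istio-ingressgateway</code>, a service backing Istio's Gateway resource type, which is itself an Envoy proxy. In this example it is the component that mirrors the traffic: it forwards the traffic to the <code>v1</code> subset of <code>Service B</code> and also sends a copy to the <code>v2</code> subset of <code>Service B</code>.</p><p><img alt="concepts-traffic-shadow-request" data-src="https://www.servicemesher.com/istio-handbook/images/concepts-traffic-shadow-request.png"></p><p>How do the request headers change along this request chain? The figure below collects all the request headers, so the difference in the <code>host</code> header between real and mirrored traffic is easy to see (some identical values are too long and are truncated here). First, the <code>host</code> values differ: the mirrored one has an extra <code>-shadow</code> suffix. Second, the <code>x-forwarded-for</code> header also differs. Its format is <code>x-forwarded-for: client1, proxy1, proxy2</code>, and each Envoy proxy that the traffic passes through adds an IP to it. In this example, <code>10.10.2.151</code> is the IP of our cloud host and <code>10.10.2.121</code> is the IP of the <code>Pod</code> running <code>istio-ingressgateway</code>. This also shows that the mirrored traffic is initiated by <code>istio-ingressgateway</code>. Apart from these two headers, everything else is identical.</p><p><img alt="concepts-traffic-shadow-header" data-src="https://www.servicemesher.com/istio-handbook/images/concepts-traffic-shadow-header.png"></p><h4 id="流量镜像的配置"><a href="#流量镜像的配置" class="headerlink" title="流量镜像的配置"></a>Configuring traffic mirroring</h4><p>Having covered how traffic mirroring works and where it is useful, we now look at how to configure it. In Istio, mirroring is implemented by two fields of the <code>HTTPRoute</code> section of the VirtualService resource, <code>mirror</code> and <code>mirrorPercent</code>, both of which are simple to define.</p><ul><li><strong>mirror</strong>: A Destination object, which is the service address that mirrored traffic is forwarded to. For the configuration attributes of <strong>VirtualService</strong> and <strong>DestinationRule</strong> 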
objects, see their reference pages.</li><li><strong>mirrorPercent</strong>: A number that specifies what percentage of the original traffic is also sent to the mirror service. Valid values are <code>0~100</code>; <code>0</code> means no mirrored traffic is sent.</li></ul><p>The example below is the mirroring configuration for <code>Service B</code> used in the walkthrough above. <code>mirror.host</code> takes a domain name or a service name registered in the Istio service registry; here it sets the target of the mirrored traffic to <code>serviceB</code>. <code>mirror.subset</code> names a subset of <code>Service B</code>; here it sends the mirrored traffic to the <code>v2</code> subset. <code>mirror_percent</code> (the snake_case spelling of <code>mirrorPercent</code>) is set so that <code>100%</code> of real traffic is mirrored. Taken together, the configuration means that an incoming request is forwarded to the <code>v1</code> subset of <code>service B</code>, a mirrored copy is sent to the <code>v2</code> subset of <code>service B</code>, and all real traffic is mirrored.</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">networking.istio.io/v1alpha3</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">VirtualService</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">serviceB</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">hosts:</span></span><br><span 
class="line">  <span class="bullet">-</span> <span class="string">istio.gateway.xxxx.tech</span></span><br><span class="line">  <span class="attr">gateways:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="string">ingressgateway.istio-system.svc.cluster.local</span></span><br><span class="line">  <span class="attr">http:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">match:</span></span><br><span class="line">    <span class="bullet">-</span> <span class="attr">uri:</span></span><br><span class="line">        <span class="attr">prefix:</span> <span class="string">/serviceB</span></span><br><span class="line">    <span class="attr">rewrite:</span></span><br><span class="line">      <span class="attr">uri:</span> <span class="string">/</span></span><br><span class="line">    <span class="attr">route:</span></span><br><span class="line">    <span class="bullet">-</span> <span class="attr">destination:</span></span><br><span class="line">        <span class="attr">host:</span> <span class="string">serviceB</span></span><br><span class="line">        <span class="attr">subset:</span> <span class="string">v1</span></span><br><span class="line">    <span class="attr">mirror:</span></span><br><span class="line">      <span class="attr">host:</span> <span class="string">serviceB</span></span><br><span class="line">      <span class="attr">subset:</span> <span class="string">v2</span></span><br><span class="line">    <span class="attr">mirror_percent:</span> <span class="number">100</span></span><br></pre></td></tr></table></figure><p><code>service B</code> 服务对应的 DestinationRule 配置如下 ：</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span 
class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">networking.istio.io/v1alpha3</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">DestinationRule</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">serviceB</span></span><br><span class="line">  <span class="attr">namespace:</span> <span class="string">default</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">host:</span> <span class="string">serviceB</span></span><br><span class="line">  <span class="attr">subsets:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">v2</span></span><br><span class="line">    <span class="attr">labels:</span></span><br><span class="line">      <span class="attr">version:</span> <span class="string">v2</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">v1</span></span><br><span class="line">    <span class="attr">labels:</span></span><br><span class="line">      <span class="attr">version:</span> <span class="string">v1</span></span><br></pre></td></tr></table></figure>]]></content>
    
    <summary type="html">
    
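      &lt;p&gt;Traffic control means managing the traffic of a system: traffic entering the mesh, traffic leaving the mesh, and the calls between microservices inside the mesh. As described in &lt;a href=&quot;../22cae0b8&quot;&gt;Getting Started with Istio&lt;/a&gt;, the Istio architecture is logically split into a control plane and a data plane. The control plane manages and configures the proxies, while the data plane carries all network communication between microservices in the mesh and also collects and reports telemetry for network requests. Traffic control is implemented in the data plane.&lt;/p&gt;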
    
    </summary>
    
    <content src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-10_istio-bookinfo.svg" type="image" />
    
    
      <category term="术业专攻" scheme="https://houmin.cc/categories/%E6%9C%AF%E4%B8%9A%E4%B8%93%E6%94%BB/"/>
    
    
      <category term="service mesh" scheme="https://houmin.cc/tags/service-mesh/"/>
    
      <category term="istio" scheme="https://houmin.cc/tags/istio/"/>
    
      <category term="envoy" scheme="https://houmin.cc/tags/envoy/"/>
    
      <category term="网络" scheme="https://houmin.cc/tags/%E7%BD%91%E7%BB%9C/"/>
    
  </entry>
  
  <entry>
    <title>【Service Mesh】Getting Started with Istio</title>
    <link href="https://houmin.cc/posts/22cae0b8/"/>
    <id>https://houmin.cc/posts/22cae0b8/</id>
    <published>2020-11-23T02:44:08.000Z</published>
    <updated>2022-11-09T15:13:45.391Z</updated>
    
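    <content type="html"><![CDATA[<link rel="stylesheet" class="aplayer-secondary-style-marker" href="/assets/css/APlayer.min.css"><script src="/assets/js/APlayer.min.js" class="aplayer-secondary-script-marker"></script><script class="meting-secondary-script-marker" src="/assets/js/Meting.min.js"></script><p>Istio is a fully open source service mesh that layers transparently onto existing distributed applications. It is also a platform, with APIs that let it integrate with any logging, telemetry, or policy system. Istio's diverse feature set lets you run a distributed microservice architecture successfully and efficiently, and gives you a uniform way to secure, connect, and monitor microservices.</p><a id="more"></a><h2 id="核心功能"><a href="#核心功能" class="headerlink" title="核心功能"></a>Core features</h2><h3 id="流量控制"><a href="#流量控制" class="headerlink" title="流量控制"></a>Traffic control</h3><p>The biggest pain point of a microservice application is handling communication between services, and at the heart of that problem is traffic management. First, consider how a traditional microservice application implements a routing feature such as a canary release without a <a href="https://www.servicemesher.com/istio-handbook/GLOSSARY.html#service-mesh" target="_blank" rel="external nofollow noopener noreferrer">Service Mesh</a>. Assuming no off-the-shelf third-party framework, the simplest approach is to put a load balancer (such as Nginx) between the services as a proxy and split traffic by changing the weights in its configuration. This ties traffic management to the infrastructure and makes it hard to maintain.</p><p>With <a href="https://www.servicemesher.com/istio-handbook/GLOSSARY.html#istio" target="_blank" rel="external nofollow noopener noreferrer">Istio</a>, traffic can be controlled easily along many dimensions. A typical canary release strategy routes 5% of traffic to the new version by weight and, if the service behaves normally, gradually shifts more traffic to it.</p><p>Traffic control in <a href="https://www.servicemesher.com/istio-handbook/GLOSSARY.html#istio" target="_blank" rel="external nofollow noopener noreferrer">Istio</a> falls into three areas:</p><ul><li>Request routing and traffic shifting</li><li>Resilience features, including circuit breaking, timeouts, and retries</li><li>Debugging features, including fault injection and traffic mirroring</li></ul><p>For more on traffic control, see <a href="../151719f0">Istio Traffic Control</a>.</p><h3 id="安全管理"><a href="#安全管理" class="headerlink" title="安全管理"></a>Security management</h3><p>Security is critical for a distributed system such as a microservice application. Unlike a monolith, whose components communicate in-process, microservices communicate over the network, which makes security a more pressing need. To defend against outside attacks, traffic must be encrypted; to make sure both ends of a service-to-service call can be trusted, services should communicate over mTLS; to control access for different identities, authorization policies of different granularity are needed. As a service mesh, <a href="https://www.servicemesher.com/istio-handbook/GLOSSARY.html#istio" target="_blank" rel="external nofollow noopener noreferrer">Istio</a> provides a complete security solution and can add security policies to our microservice applications transparently.</p><p><a href="https://www.servicemesher.com/istio-handbook/GLOSSARY.html#istio" 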
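target="_blank" rel="external nofollow noopener noreferrer">Istio</a>'s security architecture is implemented by several components working together. Citadel is the main security component and manages keys and certificates; <a href="https://www.servicemesher.com/istio-handbook/GLOSSARY.html#pilot" target="_blank" rel="external nofollow noopener noreferrer">Pilot</a> distributes security policy configuration to the <a href="https://www.servicemesher.com/istio-handbook/GLOSSARY.html#envoy" target="_blank" rel="external nofollow noopener noreferrer">Envoy</a> proxies; <a href="https://www.servicemesher.com/istio-handbook/GLOSSARY.html#envoy" target="_blank" rel="external nofollow noopener noreferrer">Envoy</a> enforces the policies to implement access control. The figure below shows the security architecture of <a href="https://www.servicemesher.com/istio-handbook/GLOSSARY.html#istio" target="_blank" rel="external nofollow noopener noreferrer">Istio</a> and how it operates.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-10_istio-secure-arch.svg"></p><p>For more on security management, see <a href="../">Istio Security Management</a>.</p><h3 id="可观测性"><a href="#可观测性" class="headerlink" title="可观测性"></a>Observability</h3><p>With complex application environments and ever-growing business requirements, even the most thorough testing cannot cover every scenario or guarantee that a service will never fail. That is why observability is needed: monitoring, reporting, and analyzing the runtime state of services to improve their reliability. An observable system makes it much easier to locate a problem when a service fails, and can even reveal a problem before it causes damage. Specifically, observability makes it possible to:</p><ul><li>Report anomalies and risks promptly, so that developers can notice, fix, and resolve problems in time (alerting);</li><li>Help locate the root cause quickly when a problem occurs and resolve it, reducing service loss (damage control);</li><li>Collect and analyze data that helps developers keep tuning and improving the service (continuous optimization).</li></ul><p>In microservice governance, as the number of services grows and the service topology becomes more complex, observability matters even more. <a href="https://www.servicemesher.com/istio-handbook/GLOSSARY.html#istio" target="_blank" rel="external nofollow noopener noreferrer">Istio</a> naturally supports it. It generates detailed telemetry for all communication between services, so that every service request in the mesh can be observed and traced. Developers can use this to locate faults and to maintain and optimize the services involved, and none of it requires changes to the services being observed.</p><p><a href="https://www.servicemesher.com/istio-handbook/GLOSSARY.html#istio" target="_blank" rel="external nofollow noopener noreferrer">Istio</a> provides three types of data that support observability from different angles:</p><ul><li>Metrics</li><li>Access logs</li><li>Distributed traces</li></ul><p>For more on observability, see <a href="../">Istio Observability</a>.</p><h2 id="架构解析"><a href="#架构解析" class="headerlink" 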
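title="架构解析"></a>Architecture</h2><p>The Istio architecture consists of two parts, the <strong>control plane</strong> and the <strong>data plane</strong>.</p><ul><li>Data plane: made up of the sidecar proxies across the mesh. Each sidecar proxy takes over the traffic flowing into and out of its service and works with the control plane to implement traffic control and related functions.</li><li>Control plane: controls and manages the sidecar proxies of the data plane, handling configuration distribution, service discovery, authentication, and authorization.</li></ul><p>The control plane is an architectural layer that Istio introduced on top of earlier service mesh products; it provides unified management of the data plane.</p><p><img alt="Istio Arch" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-10_istio-arch.svg"></p><h3 id="控制平面"><a href="#控制平面" class="headerlink" title="控制平面"></a>Control plane</h3><h4 id="Pilot"><a href="#Pilot" class="headerlink" title="Pilot"></a>Pilot</h4><p>The main job of the <code>Pilot</code> component is to convert routing rules and other configuration into a form the sidecars understand and push it to the data plane. It can be thought of simply as a <strong>configuration dispatcher</strong> that helps the sidecars carry out traffic control. It manages the traffic routing rules between sidecar proxies and configures failure recovery features such as timeouts, retries, and circuit breaking.</p><p><img alt="Istio Pilot Arch" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-10_istio-pilot-arch.svg"></p><p>The figure above shows the basic architecture of Pilot, which consists of the following parts:</p><h5 id="Abstract-Model"><a href="#Abstract-Model" class="headerlink" title="Abstract Model"></a>Abstract Model</h5><p>To support different service registries (Kubernetes, Consul), <a href="https://www.servicemesher.com/istio-handbook/GLOSSARY.html#pilot" target="_blank" rel="external nofollow noopener noreferrer">Pilot</a> needs a single storage format for data from different input sources: the abstract model. Its key members include HostName (the Service name), Ports (the Service ports), Address (the Service ClusterIP), and Resolution (the load-balancing policy).</p><h5 id="Platform-Adapters"><a href="#Platform-Adapters" class="headerlink" title="Platform Adapters"></a>Platform Adapters</h5><p>Platform adapters let Pilot convert service registry data into the abstract model. For example, the Kubernetes adapter in Pilot obtains information about Services and Pods from the Kubernetes API server and translates it into the abstract model for Pilot to use. Through the adapter pattern, Pilot can also obtain service information from platforms such as Consul, and adapters can be written to integrate other service discovery components into Pilot.</p><h5 id="xDS-API"><a href="#xDS-API" class="headerlink" title="xDS API"></a>xDS API</h5><p>Pilot uses a set of standard data plane APIs that originated in the Envoy project to push service information and traffic rules to the sidecars in the data plane. This set of APIs is known as xDS. Through the xDS API, a sidecar can dynamically obtain Listener, Route, <a 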
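href="https://www.servicemesher.com/istio-handbook/GLOSSARY.html#cluster" target="_blank" rel="external nofollow noopener noreferrer">Cluster</a>, and Endpoint (cluster member) configuration:</p><ul><li>LDS, the Listener Discovery Service: a Listener makes the <a href="https://www.servicemesher.com/istio-handbook/GLOSSARY.html#sidecar" target="_blank" rel="external nofollow noopener noreferrer">sidecar</a> listen on a port (currently only TCP is supported) and configures L3/L4 filters. When a network connection arrives, the configured network filter stack handles the subsequent events.</li><li>RDS, the Route Discovery Service: used by the HTTP connection manager filter to obtain route configuration dynamically. Route configuration includes HTTP header modification (adding and removing header key-value pairs), virtual hosts, and the route entries defined for each virtual host.</li><li>CDS, the Cluster Discovery Service: used to obtain Cluster information dynamically.</li><li>EDS, the Endpoint Discovery Service: used to maintain endpoint information dynamically. Endpoint information also includes load-balancing weight, canary status, and so on, which the sidecar can use to make intelligent load-balancing decisions.</li></ul><h5 id="User-API"><a href="#User-API" class="headerlink" title="User API"></a>User API</h5><p>Pilot also defines a user API that provides high-level, business-oriented abstractions that operators can understand and use.</p><p>Operators use this API to define traffic rules and submit them to Pilot. Pilot translates the rules into data plane configuration and distributes it to the sidecar instances through the standard data plane API, so that microservice traffic can be controlled and adjusted at runtime.</p><p>By applying different traffic rules, operators can control traffic to the microservices in the mesh at a fine granularity, for example with version-based routing, circuit breakers, fault injection, and gradual rollouts.</p><p>For details of how Pilot is implemented, see <a href="../">Istio Pilot Module Analysis</a>.</p><h4 id="Citadel"><a href="#Citadel" class="headerlink" title="Citadel"></a>Citadel</h4><p><code>Citadel</code> is the Istio component dedicated to security. It has built-in identity and certificate management and supports fairly powerful authentication and authorization. Since version 1.5 it no longer runs as a separate process and is integrated into istiod as a module.</p><p>Overall, the Istio security architecture includes the following:</p><ul><li>A certificate authority (CA) handles key and certificate management</li><li>The API server distributes security configuration to the data plane</li><li>Clients and servers communicate securely through the proxies</li><li>Envoy proxies manage telemetry and auditing</li></ul><p>Istio's identity model uses first-class service identity to determine the origin of a request, and it can flexibly identify end users, workloads, and so on. At the platform level, Istio can use something like the service name as the identity, or use the service identity provided by the platform directly, for example a Kubernetes ServiceAccount or an AWS IAM user or role account.</p><p>For identity and certificate management, Istio uses X.509 certificates and supports automatic rotation of keys and certificates. Istio has supported the Secret Discovery Service (SDS) since version 1.1; after continued improvement, SDS became enabled by default in version 1.5. Citadel used to have two functions: mounting certificates into namespaces as Secrets, and communicating with nodeagent (now deprecated) over the SDS gRPC interface. Citadel now only handles SDS-related work, and its other functions have moved into istiod.</p><p>For more on Citadel, see <a href="../">Istio Security Management</a>.</p><h4 id="Galley"><a href="#Galley" class="headerlink" title="Galley"></a>Galley</h4><p><code>Galley</code> was introduced in Istio 1.1 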
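to decouple <code>Pilot</code> from the underlying platform (such as Kubernetes). It takes over part of what <code>Pilot</code> used to do and is mainly responsible for validating, ingesting, and processing configuration.</p><h3 id="数据平面"><a href="#数据平面" class="headerlink" title="数据平面"></a>Data plane</h3><p>The core of the Istio data plane is an intelligent proxy running in the sidecar pattern. The sidecar pattern deploys the core data plane component in a separate process or container to provide isolation and encapsulation. A sidecar shares the lifecycle of its parent application: it is created and exits together with it. The sidecar is attached to the parent application and provides it with additional features.</p><p>As the figure below shows, the sidecar proxies of the data plane mediate and control all network communication between microservices. When a service Pod starts, an <code>istio-init</code> container and a proxy container start along with it. </p><ul><li>The <code>istio-init</code> container initializes the Pod network and sets up iptables rules for the Pod, then exits once setup is complete.</li><li>The proxy container starts two services: <code>istio-agent</code> and the network proxy component<ul><li><code>istio-agent</code> synchronizes management data, starts and manages the network proxy process, and reports telemetry</li><li>The network proxy component controls traffic according to the management policies and generates telemetry.</li></ul></li></ul><p>The data plane is where network packets are actually handled; it is the component that enforces the policies of the control plane above it.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-10_istio-data-plane-arch.png"></p><p>Envoy is the default sidecar proxy of the Istio data plane. For how the sidecar is injected automatically, how it hijacks traffic, and how its traffic routing works, see the <a href="../">Envoy series</a>.</p><h2 id="安装部署"><a href="#安装部署" class="headerlink" title="安装部署"></a>Installation and deployment</h2><h3 id="下载安装"><a href="#下载安装" class="headerlink" title="下载安装"></a>Download and install</h3><p>This section describes installing Istio on Kubernetes. Before you begin, you need a running Kubernetes environment.</p><p>Starting with Istio v1.7, the Istio project recommends installing with istioctl. The steps are:</p><ul><li>Download the installation package that matches your operating system from the <a href="https://github.com/istio/istio/releases" target="_blank" rel="external nofollow noopener noreferrer">Istio release</a> page and extract it. You can also use the script that Istio provides:</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span 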
class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br></pre></td><td class="code"><pre><span class="line">$ curl -L https://raw.githubusercontent.com/istio/istio/release-1.7/release/downloadIstioCandidate.sh | sh -</span><br><span class="line">$ ls</span><br><span class="line">istio-1.7.0  istio-1.7.0-linux-amd64.tar.gz</span><br><span class="line">$ <span class="built_in">cd</span> istio-1.7.0</span><br><span class="line">$ tree -L 2</span><br><span class="line">.</span><br><span class="line">├── bin</span><br><span class="line">│   └── istioctl</span><br><span class="line">├── LICENSE</span><br><span class="line">├── manifests</span><br><span class="line">│   ├── charts</span><br><span class="line">│   ├── deploy</span><br><span class="line">│   ├── examples</span><br><span class="line">│   └── profiles</span><br><span class="line">├── manifest.yaml</span><br><span class="line">├── README.md</span><br><span class="line">├── samples</span><br><span class="line">│   
├── addons</span><br><span class="line">│   ├── bookinfo</span><br><span class="line">│   ├── certs</span><br><span class="line">│   ├── cross-network-gateway</span><br><span class="line">│   ├── custom-bootstrap</span><br><span class="line">│   ├── external</span><br><span class="line">│   ├── fortio</span><br><span class="line">│   ├── health-check</span><br><span class="line">│   ├── helloworld</span><br><span class="line">│   ├── httpbin</span><br><span class="line">│   ├── https</span><br><span class="line">│   ├── kubernetes-blog</span><br><span class="line">│   ├── operator</span><br><span class="line">│   ├── rawvm</span><br><span class="line">│   ├── README.md</span><br><span class="line">│   ├── security</span><br><span class="line">│   ├── sleep</span><br><span class="line">│   ├── tcp-echo</span><br><span class="line">│   └── websockets</span><br><span class="line">└── tools</span><br><span class="line">    ├── certs</span><br><span class="line">    ├── convert_RbacConfig_to_ClusterRbacConfig.sh</span><br><span class="line">    ├── dump_kubernetes.sh</span><br><span class="line">    ├── _istioctl</span><br><span class="line">    └── istioctl.bash</span><br><span class="line"></span><br><span class="line">27 directories, 9 files</span><br></pre></td></tr></table></figure><p>Contents of the installation directory:</p><div class="table-container"><table><thead><tr><th>Directory</th><th>Contents</th></tr></thead><tbody><tr><td><code>bin</code></td><td>The istioctl client binary</td></tr><tr><td><code>manifests</code></td><td>Manifests for the various deployments</td></tr><tr><td><code>samples</code></td><td>Sample applications</td></tr><tr><td><code>tools</code></td><td>Scripts for performance testing and for testing on a local machine</td></tr></tbody></table></div><ul><li>Add the <code>istioctl</code> client to your <code>$PATH</code> so that the istioctl command-line tool is available</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">$ <span class="built_in">export</span> PATH=<span 
class="variable">$PATH</span>:$(<span class="built_in">pwd</span>)/bin</span><br></pre></td></tr></table></figure><ul><li>Install the <code>demo</code> profile</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">$ istioctl install --<span class="built_in">set</span> profile=demo</span><br><span class="line">✔ Istio core installed</span><br><span class="line">✔ Istiod installed</span><br><span class="line">✔ Egress gateways installed</span><br><span class="line">✔ Ingress gateways installed</span><br><span class="line">✔ Installation complete</span><br></pre></td></tr></table></figure><ul><li>Add a namespace label so that Istio automatically injects the Envoy sidecar proxy when you deploy your application later</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">$ kubectl label namespace default istio-injection=enabled</span><br></pre></td></tr></table></figure><h3 id="部署-Bookinfo"><a href="#部署-Bookinfo" class="headerlink" title="部署 Bookinfo"></a>Deploy Bookinfo</h3><p>Bookinfo is one of the sample applications officially recommended by the <a href="https://www.servicemesher.com/istio-handbook/GLOSSARY.html#istio" target="_blank" rel="external nofollow noopener noreferrer">Istio</a> community. It can be used to demonstrate many <a href="https://www.servicemesher.com/istio-handbook/GLOSSARY.html#istio" target="_blank" rel="external nofollow noopener noreferrer">Istio</a> features, and it is a heterogeneous microservice application made up of four separate microservices. The application mimics an online bookstore and displays information about a book: the page shows a description of the book, book details (ISBN, number of pages, and so on), and a few reviews of the book.</p><p>None of the four Bookinfo microservices depends on <a href="https://www.servicemesher.com/istio-handbook/GLOSSARY.html#istio" target="_blank" rel="external nofollow noopener noreferrer">Istio</a>, but together they make a representative service mesh example: the services are written in different languages, and one of them has multiple versions.</p><ul><li><code>productpage</code> calls 
<code>details</code> and <code>reviews</code> to render the page.</li><li><code>details</code> contains the book information.</li><li><code>reviews</code> contains the book reviews. It also calls the <code>ratings</code> microservice.</li><li><code>ratings</code> contains the rating information that accompanies the book reviews.</li></ul><p>The <code>reviews</code> microservice has 3 versions, which are used to show different call chains between the services:</p><ul><li>v1 does not call the <code>ratings</code> service.</li><li>v2 calls the <code>ratings</code> service and displays each rating as 1 to 5 black stars.</li><li>v3 calls the <code>ratings</code> service and displays each rating as 1 to 5 red stars.</li></ul><p>The figure below shows the end-to-end architecture of the application:</p><p><img alt="Bookinfo Application without Istio" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-10_istio-bookinfo-noistio.svg"></p><ul><li>Deploy the sample application</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line">$ kubectl apply -f samples/bookinfo/platform/kube/bookinfo.yaml</span><br><span class="line">service/details created</span><br><span class="line">serviceaccount/bookinfo-details unchanged</span><br><span class="line">deployment.apps/details-v1 created</span><br><span class="line">service/ratings created</span><br><span class="line">serviceaccount/bookinfo-ratings unchanged</span><br><span class="line">deployment.apps/ratings-v1 created</span><br><span class="line">service/reviews created</span><br><span class="line">serviceaccount/bookinfo-reviews unchanged</span><br><span class="line">deployment.apps/reviews-v1 created</span><br><span class="line">deployment.apps/reviews-v2 created</span><br><span 
class="line">deployment.apps/reviews-v3 created</span><br><span class="line">service/productpage created</span><br><span class="line">serviceaccount/bookinfo-productpage unchanged</span><br><span class="line">deployment.apps/productpage-v1 created</span><br></pre></td></tr></table></figure><ul><li>The application then starts up. Once every Pod is Ready, its sidecar has been deployed successfully as well.</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line">$ kubectl get svc</span><br><span class="line">NAME          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE</span><br><span class="line">details       ClusterIP   172.18.252.45    &lt;none&gt;        9080/TCP   97s</span><br><span class="line">kubernetes    ClusterIP   172.18.252.1     &lt;none&gt;        443/TCP    51d</span><br><span class="line">productpage   ClusterIP   172.18.253.238   &lt;none&gt;        9080/TCP   97s</span><br><span class="line">ratings       ClusterIP   172.18.254.131   &lt;none&gt;        9080/TCP   97s</span><br><span class="line">reviews       ClusterIP   172.18.255.63    &lt;none&gt;        9080/TCP   97s</span><br><span class="line">$ kubectl get pods</span><br><span class="line">NAME                              READY   STATUS    RESTARTS   AGE</span><br><span class="line">details-v1-5974b67c8-z67st        2/2     Running   0          2m8s</span><br><span class="line">productpage-v1-797898bc54-frzdz   2/2     Running   0          2m8s</span><br><span class="line">ratings-v1-c6cdf8d98-xmhz8        2/2     
Running   0          2m8s</span><br><span class="line">reviews-v1-8bdc65f7b-mjktx        2/2     Running   0          2m8s</span><br><span class="line">reviews-v2-868d77d678-4dzmn       2/2     Running   0          2m8s</span><br><span class="line">reviews-v3-6c9b646cb4-5tp9q       2/2     Running   0          2m8s</span><br></pre></td></tr></table></figure><ul><li>Check that the application is running by sending a request to productpage and inspecting the response</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">$ kubectl <span class="built_in">exec</span> <span class="string">"<span class="variable">$(kubectl get pod -l app=ratings -o jsonpath='&#123;.items[0].metadata.name&#125;')</span>"</span> -c ratings -- curl -s productpage:9080/productpage | grep -o <span class="string">"&lt;title&gt;.*&lt;/title&gt;"</span></span><br><span class="line">&lt;title&gt;Simple Bookstore App&lt;/title&gt;</span><br></pre></td></tr></table></figure><h3 id="集群外部访问应用"><a href="#集群外部访问应用" class="headerlink" title="集群外部访问应用"></a>Accessing the application from outside the cluster</h3><p>The Bookinfo application is now deployed and reachable from inside the cluster, but not yet from outside it. To make it accessible externally, we need to create an <a href="https://istio.io/latest/docs/concepts/traffic-management/#gateways" target="_blank" rel="external nofollow noopener noreferrer">Istio Ingress Gateway</a>.</p><ul><li>Associate the application with the Istio gateway</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$ kubectl apply -f samples/bookinfo/networking/bookinfo-gateway.yaml</span><br><span class="line">gateway.networking.istio.io/bookinfo-gateway created</span><br><span class="line">virtualservice.networking.istio.io/bookinfo created</span><br></pre></td></tr></table></figure><ul><li>Make sure there are no problems with the configuration</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">$ istioctl analyze</span><br><span class="line">✔ No validation issues found when analyzing namespace: default.</span><br></pre></td></tr></table></figure><ul><li>Determine the ingress IP and ports</li></ul><p>The <code>INGRESS_HOST</code> and <code>INGRESS_PORT</code> environment variables are set with the commands below. First inspect the ingress gateway service:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$ kubectl get svc istio-ingressgateway -n istio-system</span><br><span class="line">NAME                   TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)                                                                      AGE</span><br><span class="line">istio-ingressgateway   LoadBalancer   172.18.252.12   49.233.242.233   15021:32663/TCP,80:31968/TCP,443:31588/TCP,31400:32002/TCP,15443:30652/TCP   18m</span><br></pre></td></tr></table></figure><p>The <code>EXTERNAL-IP</code> column is populated, which means an external load balancer is available in this environment, so the variables can be read directly from the service:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$ <span class="built_in">export</span> INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath=<span class="string">'&#123;.status.loadBalancer.ingress[0].ip&#125;'</span>)</span><br><span class="line">$ <span class="built_in">export</span> INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath=<span class="string">'&#123;.spec.ports[?(@.name=="http2")].port&#125;'</span>)</span><br><span class="line">$ <span class="built_in">export</span> SECURE_INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath=<span 
class="string">'&#123;.spec.ports[?(@.name=="https")].port&#125;'</span>)</span><br></pre></td></tr></table></figure><ul><li>Set GATEWAY_URL</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$ <span class="built_in">export</span> GATEWAY_URL=<span class="variable">$INGRESS_HOST</span>:<span class="variable">$INGRESS_PORT</span></span><br><span class="line">$ <span class="built_in">echo</span> <span class="variable">$GATEWAY_URL</span></span><br><span class="line">49.233.242.233:80</span><br></pre></td></tr></table></figure><ul><li>Verify external access: open <code>http://&lt;GATEWAY_URL&gt;/productpage</code> in a browser to reach the Bookinfo application</li></ul><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-10_istio-external-access.png"></p><h3 id="查看Dashboard"><a href="#查看Dashboard" class="headerlink" title="查看Dashboard"></a>View the Dashboard</h3><p>Istio integrates with <a href="https://istio.io/latest/docs/ops/integrations/" target="_blank" rel="external nofollow noopener noreferrer">several</a> telemetry applications. They help you get an intuitive picture of your service mesh, display its topology, and analyze its health.</p><ul><li>Install Kiali and the other addons</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">$ kubectl apply -f samples/addons</span><br><span class="line">$ <span class="keyword">while</span> ! 
kubectl <span class="built_in">wait</span> --<span class="keyword">for</span>=condition=available --timeout=600s deployment/kiali -n istio-system; <span class="keyword">do</span> sleep 1; <span class="keyword">done</span></span><br></pre></td></tr></table></figure><ul><li>Access Kiali</li></ul><p>The official tutorial uses the <code>istioctl dashboard kiali</code> command to open Kiali in a local browser. My Kubernetes cluster runs on a remote server, so that does not work here; the Kiali service has to be exposed outside the cluster instead. Traefik is already installed in the cluster, so an Ingress can be used to expose it.</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">extensions/v1beta1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Ingress</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">kiali</span></span><br><span class="line">  <span class="attr">namespace:</span> <span class="string">istio-system</span></span><br><span class="line">  <span class="attr">annotations:</span></span><br><span class="line">    <span class="attr">kubernetes.io/ingress.class:</span> <span class="string">traefik</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">rules:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">http:</span></span><br><span class="line">      <span class="attr">paths:</span></span><br><span 
class="line">      <span class="bullet">-</span> <span class="attr">path:</span> <span class="string">/kiali</span></span><br><span class="line">        <span class="attr">backend:</span></span><br><span class="line">          <span class="attr">serviceName:</span> <span class="string">kiali</span></span><br><span class="line">          <span class="attr">servicePort:</span> <span class="number">20001</span></span><br></pre></td></tr></table></figure><p>After creating the Ingress from the command line, open <code>http://&lt;NodeIP&gt;:&lt;TraefikWebNodePort&gt;/kiali</code> in a browser to reach Kiali.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-10_traefik-kiali.png"></p><p>Click Graph in the left navigation bar and select the default namespace to see how the services in the <code>Bookinfo</code> application relate to each other:</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-10_istio-kiali.png"></p><p>At this point Istio and its related services are fully deployed in the cluster. For a demonstration of specific features, see <a href="../151719f0">Istio traffic management</a>.</p><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>References</h2><ul><li><a href="https://istio.io/latest/docs/setup/getting-started" target="_blank" rel="external nofollow noopener noreferrer">https://istio.io/latest/docs/setup/getting-started</a></li></ul>]]></content>
    
    <summary type="html">
    
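      &lt;p&gt;Istio is a fully open-source service mesh that layers transparently onto existing distributed applications. It is also a platform, with APIs that let it integrate with any logging, telemetry, or policy system. Istio's diverse feature set lets you run a distributed microservice architecture successfully and efficiently, and it provides a uniform way to secure, connect, and monitor microservices.&lt;/p&gt;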
    
    </summary>
    
    <content src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-10_istio.png" type="image" />
    
    
      <category term="术业专攻" scheme="https://houmin.cc/categories/%E6%9C%AF%E4%B8%9A%E4%B8%93%E6%94%BB/"/>
    
    
      <category term="service mesh" scheme="https://houmin.cc/tags/service-mesh/"/>
    
      <category term="istio" scheme="https://houmin.cc/tags/istio/"/>
    
  </entry>
  
  <entry>
    <title>[Service Mesh] Series Introduction</title>
    <link href="https://houmin.cc/posts/ac3e3d15/"/>
    <id>https://houmin.cc/posts/ac3e3d15/</id>
    <published>2020-11-22T06:24:34.000Z</published>
    <updated>2022-11-09T15:13:45.393Z</updated>
    
    <content type="html"><![CDATA[<link rel="stylesheet" class="aplayer-secondary-style-marker" href="/assets/css/APlayer.min.css"><script src="/assets/js/APlayer.min.js" class="aplayer-secondary-script-marker"></script><script class="meting-secondary-script-marker" src="/assets/js/Meting.min.js"></script><blockquote><p>Service Mesh 是一个<strong>基础设施层</strong>，用于处理<strong>服务到服务间</strong>的网络通信。<strong>云原生应用</strong>有着复杂的服务拓扑，Service Mesh负责在这些<strong>网络拓扑中实现请求的可靠传递</strong>。在实践中，Service Mesh通常实现为一组轻量级的<strong>网络代理</strong>，它们与应用程序部署在一起，但是<strong>对应用保持透明</strong>。</p></blockquote><p>本文作为 「Service Mesh」系列开篇，将理清 Service Mesh 的前世今生，通过对其概念与原理的理解，开始上手 Service Mesh的工作。与此同时，我们也会讨论 Service Mesh 在业界当前的应用现状，探讨其落地的难点与痛点。</p><a id="more"></a><h2 id="历史演进"><a href="#历史演进" class="headerlink" title="历史演进"></a>历史演进</h2><p>随着行业需求的推动，互联网服务从最早的仅有少数几台的大型服务器演变到成百上千的小型服务，服务架构也从最早期的单体式（Monolithic）到分布式（Distributed），再到微服务（Microservices）、容器化（Containerization）、容器编排（Container Orchestration），最后到服务网格（Service Mesh）、无服务器（Serverless）。</p><p>总结分布式系统的演进过程，我们可以看到一种通用的发展规律：</p><ul><li>首先是对每种情况提出临时解决方案</li><li>然后是更复杂的解决方案，类似于 library 以实现统一复用</li><li>随着对问题有更多的了解，开始将这些解决方案落实到 platform</li></ul><p>接下来我们会回顾从早期TCP/IP协议栈的广泛应用，到微服务时代从容器编排到服务网格的演进过程，并再次体会上述规律。</p><h3 id="计算机网络系统的演进"><a href="#计算机网络系统的演进" class="headerlink" title="计算机网络系统的演进"></a>计算机网络系统的演进</h3><p>从多台计算机开始通信以来，服务间通信是应用最为广泛的模式。以下图为例，ServiceA 和 ServiceB 可以是我们提供应用的服务端与客户端。在开发者开发这些服务的时候，需要借助底层的网络硬件和协议进行通信。这张图只是一个简化的师徒，省略了在代码操作的数据和通过线路发送接收的电信号之间转换的很多层。</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-08_service-mesh-svc2svc.png"></p><p>更加具体一点，把底层的网络协议栈加入，我们会看到下图：</p><p><img alt 
data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-08_service-mesh-svc2svc-stack.png"></p><p>从上世纪50年代起，上述的模型就一直在使用。最开始，由于计算机系统规模相对较小，每个节点之间的链路协议都是经过专门设计和维护的。随着计算机规模的迅速扩大，很多个小的网络系统开始连接起来。在这个过程中，不同主机间如何找到彼此，跨网络间如何路由转发，如何实现流量控制等问题，成了摆在网络系统设计人员面前亟需解决的难题。</p><p>为了实现各个网络节点的路由转发，屏蔽链路层协议，人们发明了IP网络协议。然而，IP网络协议还不能够解决流量控制的问题。这里的流量控制，值得是防止一台服务器发送过多的数据包，超出下游服务器的处理能力。在最开始，编写网络服务和应用程序的开发者来负责处理上述流量控制的问题。这就意味着在编写应用程序过程中，网络处理的逻辑和应用自身的业务逻辑被耦合在一起，如下图所示。</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-08_service-mesh-flow-control.png"></p><p>然而，这种每个开发人员都要去考虑流量处理等传输层的问题太过复杂，程序开发的成本太高。随着技术的快速发展，流量处理和其他网络问题相关的解决方案被整合到网络协议栈，TCP/IP席卷了世界，成为互联网事实上的协议标准。流量控制等网络问题的代码仍在，但是你不再需要自己去开发与维护这段代码，而是直接调用系统提供的网络协议栈。</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-08_service-mesh-tcp.png"></p><h3 id="微服务架构的演进"><a href="#微服务架构的演进" class="headerlink" title="微服务架构的演进"></a>微服务架构的演进</h3><p>确定于上世界80年代的TCP/IP网络协议栈和通用的网络模型对于互联网的发展发挥了巨大的作用，极大了促进了互联网应用的繁荣。网络应用的功能逐渐复杂起来，人们把所有的组件都集中在一个应用当中，这即是<code>单体应用 Monolithic</code>。单体应用基于相同技术栈开发、访问共享的数据库、共同部署运维和扩容。同时，组件之间的通信也趋于频繁和耦合，所有的交互都是以函数调用的形式来实现。</p><p>然而，随着互联网的迅猛发展，网络应用中需要添加越来越多的功能，应用的复杂度不断提升，参与软件开发的协作人数也越来越多，单体应用开始爆发出其固有局限性。在这种背景下，微服务的思潮降临，让软件开发重新变得小而美：</p><ul><li>单⼀职责：拆分后的单个微服务，通常只负责单个高内聚自闭环功能，因此很易于开发、理解和维护。</li><li>架构灵活：不同微服务应用之间在技术选型层面几乎是独立的，可以⾃由选择最适合的技术栈。</li><li>部署隔离：相比巨无霸单体应用，单个微服务应用的代码和产物体积大大减少，更容易持续集成和快速部署；同时，通过进程级别的隔离，也不再像单体应用一样只能同生共死，故障隔离效果显著提升。</li><li>独⽴扩展：单体应用时代，某个模块如果存在资源瓶颈（e.g. CPU/内存），只能跟随整个应用一起扩容，白白浪费很多资源。微服务化后，扩展的粒度细化到了微服务级别，可以更精确地按需独立扩展。</li></ul><p>然而，微服务也不是银弹，在微服务落地的过程中，也产生了很多的问题，其中主要的问题就是服务间通信：</p><ul><li><p><strong>如何找到服务的提供⽅？</strong></p><p>微服务通讯必须走远程过程调用（HTTP/REST本质上也属于RPC），当其中一个应用需要消费另一个应用的服务时，无法再像单体应用一样通过简单的进程内机制（e.g. 
Spring的依赖注入）就能获取到服务实例；你甚至都不知道有没有这个服务方。</p></li><li><p><strong>如何保证远程调⽤的可靠性?</strong></p><p>既然是RPC，那必然要走IP网络，而我们都知道网络（相比计算和存储）是软件世界里最不可靠的东西。虽然有TCP这种可靠传输协议，但频繁丢包、交换机故障甚至电缆被挖断也常有发生；即使网络是好的，如果对方机器宕机了，或者进程负载过高不响应呢？</p></li><li><p><strong>如何降低服务调⽤的延迟？</strong></p><p>网络不只是不可靠，还有延迟的问题。虽然相同系统内的微服务应用通常都部署在一起，同机房内调用延迟很小；但对于较复杂的业务链路，很可能一次业务访问就会包括数十次RPC调用，累积起来的延迟就很可观了。</p></li><li><p><strong>如何保证服务调⽤的安全性？</strong></p><p>网络不只是不可靠和有延迟，还是不安全的。互联网时代，你永远不知道屏幕对面坐的是人还是狗；同样，微服务间通讯时，如果直接走裸的通讯协议，你也永远不知道对端是否真的就是自己人，或者传输的机密信息是否有被中间人偷听。</p></li></ul><h4 id="服务通信：耦合业务逻辑"><a href="#服务通信：耦合业务逻辑" class="headerlink" title="服务通信：耦合业务逻辑"></a>服务通信：耦合业务逻辑</h4><p>就像历史总是会重演，为了解决上述微服务引入的问题，最早需要工程师独立去完成对应的服务，在业务逻辑中实现下列逻辑：</p><ul><li>服务发现（Service Discovery）：解决“我想调用你，如何找到你”的问题。</li><li>服务熔断（Circuit Breaker）：缓解服务之间依赖的不可靠问题。</li><li>负载均衡（Load Balancing）：通过均匀分配流量，让请求处理更加及时。</li><li>安全通讯：包括协议加密（TLS）、身份认证（证书/签名）、访问鉴权（RBAC）等</li></ul><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-08_service-mesh-micro-service.png"></p><p>然而，随着分布式程度的增加，这些服务的复杂度也越来越高，一些问题不得不考虑：</p><ul><li>重复造轮子：需要编写和维护⼤量非功能性代码，如何集中精力专注业务创新?</li><li>与业务耦合：服务通讯逻辑与业务代码逻辑混在一起，动不动还会遇到点匪夷所思的分布式bug。</li></ul><h4 id="服务通信：独立Library"><a href="#服务通信：独立Library" class="headerlink" title="服务通信：独立Library"></a>服务通信：独立Library</h4><p>为了解决重复造轮子的问题，集成了服务通信中各种问题的Library开始变得十分流行，包括 Apache Dubbo（手动置顶）、Spring Cloud、Netflix OSS、gRPC 等等。</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-08_service-mesh-micro-service-lib.png"></p><p>这些可复用的类库和框架，确确实实带来了质量和效率上的大幅提升，但是也存在着下列问题：</p><ul><li>并非完全透明：程序员们仍然需要正确理解和使⽤这些库，上手成本和出错概率依然很高。</li><li>限制技术选择：使用这些技术后，应用很容易就会被对应的语⾔和框架强绑定（vendor-lock）。</li><li>维护成本高：库版本升级，需要牵连应⽤一起重新构建和部署；麻烦不说，还要祈祷别出故障。</li></ul><h4 id="服务通信：Sidecar"><a href="#服务通信：Sidecar" class="headerlink" 
title="服务通信：Sidecar"></a>服务通信：Sidecar</h4><p>像网络协议栈发展的过程一样，将大规模分布式服务所需要的功能剥离出来集成到底层平台是一个众望所归的选择。人们通过应用层的协议(例如HTTP)写出了很多复杂的应用程序和服务，甚至不用考虑TCP是如何控制数据包在网络上传输的。这就是我们微服务所需要的，从事服务开发的工程师们可以专注于业务逻辑的开发，避免浪费时间去编写服务基础设施代码或者管理这些库和框架。</p><p>在这个想法下，我们可以得到类似于如下的图：</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-08_service-mesh-protocol.png"></p><p>不幸的是，更改协议栈来增加微服务的功能不是一个可行的方案，许多开发者是通过一组代理来实现此功能。这里的设计思想是<strong>服务不需要和下游服务直连，所有的流量都通过该代理透明的来实现对应的功能</strong>。这里的透明代理，通过一种叫做 <code>Sidecar</code> 的模式来运行，Sidecar将上述类库和框架要干的事情从应用中彻底剥离了出来，并统一下沉到了基础设施层，这其中的典型代表就是 Linkerd 和 Envoy。</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-08_service-mesh-sidecar.png"></p><h4 id="服务通信：Service-Mesh"><a href="#服务通信：Service-Mesh" class="headerlink" title="服务通信：Service Mesh"></a>服务通信：Service Mesh</h4><p>在这种模型中，每个服务都会有一个配套的代理SideCar。考虑到服务之间的通信仅仅通过SideCar代理，我们最终得到如下的部署图：</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-08_service-mesh-data.png"></p><p>Buoyant的CEO William Morgan ，发现了各个SideCar代理之间互联组成了一个网状网络，<strong>2017初，William为这个网状的平台起了一个<a href="https://buoyant.io/2017/04/25/whats-a-service-mesh-and-why-do-i-need-one/" target="_blank" rel="external nofollow noopener noreferrer">“Service Mesh”的定义</a></strong>。</p><blockquote><p>Service Mesh是一个用于服务和服务之间通信的专用基础设施层。它负责服务请求能够在复杂的服务拓扑(组成了云原生应用)中可靠的进行投递。在实践中，Serivce Mesh的典型实现是作为轻量级网络代理阵列，部署在应用程序旁边，不需要业务进程感知到。</p></blockquote><p>William关于Service Mesh的定义中，最有说服力的一点是，他不再将SideCar代理视为一个独立组件，而是承认了<strong>它们组成的网络像它们自身一样是有价值的</strong></p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-08_service-mesh-data2.png"></p><p>随着很多公司将它们的微服务部署到更复杂的系统运行环境中，例如Kubernetes和Mesos，人们开始使用这些平台提供的工具来实现合适的Serivce Mesh的想法。它们将独立的SideCar代理从独立的工作环境中转移到一个适当的，有集中的控制面。</p><p><img alt 
data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-08_service-mesh-control.png"></p><p>In this bird's-eye view, traffic between services is still forwarded through the sidecar proxies, but the control plane knows about every sidecar instance. The control plane lets the proxies carry out tasks that require coordination, such as access control and metrics collection. Istio is the typical implementation of this model.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-08_service-mesh-control2.png"> </p><h2 id="主流实现"><a href="#主流实现" class="headerlink" title="主流实现"></a>Mainstream Implementations</h2><p>The mainstream service mesh implementations include:</p><ul><li>Linkerd: backed by Buoyant, written in Scala, first released on January 15, 2016, joined the CNCF on January 23, 2017.</li><li>Envoy: backed by Lyft, written in C++ 11, first released on September 13, 2016, joined the CNCF on September 14, 2017.</li><li>Istio: backed by Google and IBM, written in Go, first released on May 24, 2017.</li><li>Conduit: also backed by Buoyant, written in Rust and Go, first released on December 5, 2017, and since merged into the <code>Linkerd</code> project.</li></ul><h3 id="Linkerd"><a href="#Linkerd" class="headerlink" title="Linkerd"></a>Linkerd</h3><p>As of 2020.09.08, <code>Linkerd</code> has reached version 2.8 and consists of a control plane and a data plane. For details, see <a href="https://linkerd.io/2/reference/architecture/" target="_blank" rel="external nofollow noopener noreferrer">here</a>.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-08_linkerd-control-plane.png"></p><h3 id="Envoy"><a href="#Envoy" class="headerlink" title="Envoy"></a>Envoy</h3><p>Envoy is a high-performance service mesh component, now used mainly as the sidecar proxy in the data plane. For details, see <a href="../7beb34d2/">here</a>.</p><p><img alt data-src="https://cdn.jsdelivr.net/gh/yangchuansheng/imghosting/img/20200504160047.png"></p><h3 id="Istio"><a href="#Istio" class="headerlink" title="Istio"></a>Istio</h3><p>Istio is a second-generation service mesh and was the first to introduce the concept of a control plane. For details, see <a href="../22cae0b8/">here</a>.</p><p><img alt="Istio Arch" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-10_istio-arch.svg"></p><h3 id="NginMesh"><a href="#NginMesh" class="headerlink" title="NginMesh"></a>NginMesh</h3><p>Service Mesh 最基础的功能毕竟是 sidecar proxy. 提到 proxy 怎么能够少了 nginx? 
我想nginx自己也是这么想的吧 毫不意外，nginx也推出了其 service mesh 的开源实现：nginMesh.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-08_nginx-sidecar.png"></p><p>不过，与 William Morgan 的死磕策略不同，nginMesh 从一开始就没有想过要做一套完整的第二代Service Mesh 开源方案，而是直接宣布兼容Istio, 作为Istio的 sidecar proxy. 由于 nginx 在反向代理方面广泛的使用，以及运维技术的相对成熟，nginMesh在sidecar proxy领域应该会有一席之地。</p><h2 id="对比Kubernetes原生架构"><a href="#对比Kubernetes原生架构" class="headerlink" title="对比Kubernetes原生架构"></a>对比Kubernetes原生架构</h2><h3 id="Kube-proxy-vs-Sidecar"><a href="#Kube-proxy-vs-Sidecar" class="headerlink" title="Kube-proxy vs Sidecar"></a>Kube-proxy vs Sidecar</h3><p>下图展示的是 Kubernetes 与 Service Mesh 中的的服务访问关系：</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-08_k8s-vs-service-mesh.png"></p><ul><li>Kubernetes 集群的每个节点都部署了一个 <code>kube-proxy</code> 组件，该组件会与 Kubernetes API Server 通信，获取集群中的 <code>Service</code> 信息，然后设置 iptables 规则，直接将对某个 <code>Service</code> 的请求发送到对应的 Endpoint（属于同一组 <code>Service</code> 的 <code>Pod</code>）上。</li><li>Kube-proxy 实现了流量在 Kubernetes <code>Service</code> 多个 <code>Pod</code> 实例间的负载均衡，但是如何对这些 <code>Service</code> 间的流量做细粒度的控制，比如按照百分比划分流量到不同的应用版本（这些应用都属于同一个 <code>Service</code>，但位于不同的 deployment 上），做金丝雀发布（灰度发布）和蓝绿发布？Kubernetes 社区给出了 <a href="https://kubernetes.io/docs/concepts/cluster-administration/manage-deployment/#canary-deployments" target="_blank" rel="external nofollow noopener noreferrer">使用 Deployment 做金丝雀发布的方法</a>，该方法本质上就是通过修改 <code>Pod</code> 的 label 来将不同的 <code>Pod</code> 划归到 Deployment 的 <code>Service</code> 上。</li></ul><p><code>kube-proxy</code> 的设置都是全局生效的，无法对每个服务做细粒度的控制，而 <code>Service Mesh</code> 通过 <code>Sidecar</code> proxy 的方式将 Kubernetes 中对流量的控制从 <code>Service</code> 一层抽离出来，可以做更多的扩展。</p><h3 id="Ingress-vs-Gateway"><a href="#Ingress-vs-Gateway" class="headerlink" title="Ingress vs Gateway"></a>Ingress vs Gateway</h3><p> <code>kube-proxy</code> 只能路由 Kubernetes 集群内部的流量，而我们知道 Kubernetes 集群的 <code>Pod</code> 位于 CNI 
创建的外网络中，集群外部是无法直接与其通信的，因此 Kubernetes 中创建了 Ingress 这个资源对象，它由位于 Kubernetes 边缘节点（这样的节点可以是很多个也可以是一组）的 Ingress controller 驱动，负责管理 <strong>南北向流量</strong>，Ingress 必须对接各种 Ingress Controller 才能使用，比如 <a href="https://github.com/kubernetes/ingress-nginx" target="_blank" rel="external nofollow noopener noreferrer">nginx ingress controller</a>、<a href="https://traefik.io/" target="_blank" rel="external nofollow noopener noreferrer">traefik</a>。</p><ul><li>Ingress 只适用于 HTTP 流量，使用方式也很简单，只能对 <code>Service</code>、port、HTTP 路径等有限字段匹配来路由流量，这导致它无法路由如 MySQL、Redis 和各种私有 RPC 等 TCP 流量。</li><li>要想直接路由南北向的流量，只能使用 <code>Service</code> 的 LoadBalancer 或 NodePort，前者需要云厂商支持，后者需要进行额外的端口管理。</li><li>有些 Ingress controller 支持暴露 TCP 和 UDP 服务，但是只能使用 <code>Service</code> 来暴露，Ingress 本身是不支持的，例如 <a href="https://kubernetes.github.io/ingress-nginx/user-guide/exposing-tcp-udp-services/" target="_blank" rel="external nofollow noopener noreferrer">nginx ingress controller</a>，服务暴露的端口是通过创建 ConfigMap 的方式来配置的。</li></ul><p><code>Istio</code> Gateway 的功能与 Kubernetes Ingress 类似，都是负责集群的南北向流量。<code>Istio</code> <code>Gateway</code> 描述的负载均衡器用于承载进出网格边缘的连接。该规范中描述了一系列开放端口和这些端口所使用的协议、负载均衡的 SNI 配置等内容。Gateway 是一种 CRD 扩展，它同时复用了 <code>Sidecar</code> proxy 的能力，详细配置请参考 <a href="https://istio.io/docs/reference/config/networking/gateway/" target="_blank" rel="external nofollow noopener noreferrer">Istio 官网</a>。</p><h2 id="落地问题"><a href="#落地问题" class="headerlink" title="落地问题"></a>落地问题</h2><p>服务网格的出现带来的变革：</p><p>第一，<strong>微服务治理与业务逻辑的解耦</strong>。服务网格把 SDK 中的<strong>大部分</strong>能力从应用中剥离出来，拆解为独立进程，以 Sidecar 的模式进行部署。服务网格通过将服务通信及相关管控功能从业务程序中分离并下沉到基础设施层，使其和业务系统完全解耦，使开发人员更加专注于业务本身。</p><blockquote><p>注意，这里提到了一个词“大部分”，SDK 中往往还需要保留<strong>协议编解码</strong>的逻辑，甚至在某些场景下还需要一个轻量级的 SDK 来实现细粒度的治理与监控策略。例如，要想实现方法级别的调用链追踪，服务网格则需要业务应用实现 trace ID 的传递，而这部分实现逻辑也可以通过轻量级的 SDK 实现。因此，从代码层面来讲，服务网格并非是零侵入的。</p></blockquote><p>第二，<strong>异构系统的统一治理</strong>。随着新技术的发展和人员更替，在同一家公司中往往会出现不同语言、不同框架的应用和服务，为了能够统一管控这些服务，以往的做法是为每种语言、每种框架都开发一套完整的 
SDK，维护成本非常之高，而且给公司的中间件团队带来了很大的挑战。有了服务网格之后，通过将主体的服务治理能力下沉到基础设施，多语言的支持就轻松很多了。只需要提供一个非常轻量级的 SDK，甚至很多情况下都不需要一个单独的 SDK，就可以方便地实现多语言、多协议的统一流量管控、监控等需求。</p><p>此外，服务网格相对于传统微服务框架，还拥有三大技术优势：</p><ul><li>可观察性。因为服务网格是一个专用的基础设施层，所有的服务间通信都要通过它，所以它在技术堆栈中处于独特的位置，以便在服务调用级别上提供统一的遥测指标。这意味着，所有服务都被监控为“黑盒”。服务网格捕获诸如来源、目的地、协议、URL、状态码、延迟、持续时间等线路数据。这本质上等同于 web 服务器日志可以提供的数据，但是服务网格可以为所有服务捕获这些数据，而不仅仅是单个服务的 web 层。需要指出的是，收集数据仅仅是解决微服务应用程序中可观察性问题的一部分。存储与分析这些数据则需要额外能力的机制的补充，然后作用于警报或实例自动伸缩等。</li><li>流量控制。通过 <code>Service Mesh</code>，可以为服务提供智能路由（蓝绿部署、金丝雀发布、A/B test）、超时重试、熔断、故障注入、流量镜像等各种控制能力。而以上这些往往是传统微服务框架不具备，但是对系统来说至关重要的功能。例如，服务网格承载了微服务之间的通信流量，因此可以在网格中通过规则进行故障注入，模拟部分微服务出现故障的情况，对整个应用的健壮性进行测试。由于服务网格的设计目的是有效地将来源请求调用连接到其最优目标服务实例，所以这些流量控制特性是“面向目的地的”。这正是服务网格流量控制能力的一大特点。</li><li>安全。在某种程度上，单体架构应用受其单地址空间的保护。然而，一旦单体架构应用被分解为多个微服务，网络就会成为一个重要的攻击面。更多的服务意味着更多的网络流量，这对黑客来说意味着更多的机会来攻击信息流。而服务网格恰恰提供了保护网络调用的能力和基础设施。服务网格的安全相关的好处主要体现在以下三个核心领域：服务的认证、服务间通讯的加密、安全相关策略的强制执行。</li></ul><p>服务网格带来了巨大变革并且拥有其强大的技术优势，被称为第二代“微服务架构”。然而就像之前说的软件开发没有银弹，传统微服务架构有许多痛点，而服务网格也不例外，也有它的局限性。</p><ul><li>增加了复杂度。服务网格将 <code>Sidecar</code> 代理和其它组件引入到已经很复杂的分布式环境中，会极大地增加整体链路和操作运维的复杂性。</li><li>运维人员需要更专业。在容器编排器（如 Kubernetes）上添加 <code>Istio</code> 之类的服务网格，通常需要运维人员成为这两种技术的专家，以便充分使用二者的功能以及定位环境中遇到的问题。</li><li>延迟。从链路层面来讲，服务网格是一种侵入性的、复杂的技术，可以为系统调用增加显著的延迟。这个延迟是毫秒级别的，但是在特殊业务场景下，这个延迟可能也是难以容忍的。</li><li>平台的适配。服务网格的侵入性迫使开发人员和运维人员适应高度自治的平台并遵守平台的规则。</li></ul><h2 id="展望未来"><a href="#展望未来" class="headerlink" title="展望未来"></a>展望未来</h2><p>展望未来，Kubernetes 正在爆炸式发展，它已经成为企业绿地应用的容器编排的首选。如果说 Kubernetes 已经彻底赢得了市场，并且基于 Kubernetes 的应用程序的规模和复杂性持续增加，那么就会有一个临界点，而服务网格则将是有效管理这些应用程序所必需的。随着服务网格技术的持续发展，其实现产品（如 <code>Istio</code>）的架构与功能的不断优化，服务网格将完全取代传统微服务架构，成为大小企业微服务化和上云改造的首选架构。</p><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>参考资料</h2><ul><li><a href="https://philcalcado.com/2017/08/03/pattern_service_mesh.html" target="_blank" rel="external nofollow noopener noreferrer">https://philcalcado.com/2017/08/03/pattern_service_mesh.html</a></li></ul>]]></content>
    
    <summary type="html">
    
      &lt;blockquote&gt;
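&lt;p&gt;A service mesh is an &lt;strong&gt;infrastructure layer&lt;/strong&gt; for handling &lt;strong&gt;service-to-service&lt;/strong&gt; network communication. &lt;strong&gt;Cloud-native applications&lt;/strong&gt; have complex service topologies, and the service mesh is responsible for &lt;strong&gt;delivering requests reliably across those topologies&lt;/strong&gt;. In practice, a service mesh is usually implemented as a set of lightweight &lt;strong&gt;network proxies&lt;/strong&gt; that are deployed alongside the application while remaining &lt;strong&gt;transparent to it&lt;/strong&gt;.&lt;/p&gt;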
&lt;/blockquote&gt;
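&lt;p&gt;This post opens the Service Mesh series. It traces where the service mesh came from, explains its concepts and principles as a basis for hands-on work, and discusses how the industry is adopting it today, including the difficulties and pain points of putting it into production.&lt;/p&gt;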
    
    </summary>
    
    <content src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-09-08_service-mesh.png" type="image" />
    
    
      <category term="术业专攻" scheme="https://houmin.cc/categories/%E6%9C%AF%E4%B8%9A%E4%B8%93%E6%94%BB/"/>
    
    
      <category term="service mesh" scheme="https://houmin.cc/tags/service-mesh/"/>
    
      <category term="sidecar" scheme="https://houmin.cc/tags/sidecar/"/>
    
      <category term="service" scheme="https://houmin.cc/tags/service/"/>
    
  </entry>
  
  <entry>
    <title>[Heterogeneous Computing] NVIDIA GPU MIG</title>
    <link href="https://houmin.cc/posts/4e8612ed/"/>
    <id>https://houmin.cc/posts/4e8612ed/</id>
    <published>2020-11-19T09:03:47.000Z</published>
    <updated>2022-11-09T15:13:45.392Z</updated>
    
    <content type="html"><![CDATA[<link rel="stylesheet" class="aplayer-secondary-style-marker" href="/assets/css/APlayer.min.css"><script src="/assets/js/APlayer.min.js" class="aplayer-secondary-script-marker"></script><script class="meting-secondary-script-marker" src="/assets/js/Meting.min.js"></script><p>MIG，也就是 <code>Multi-Instance GPU</code> 是 NVIDIA 在 <code>NVIDIA GTC 2020</code> 发布的最新 Ampere 架构的 <code>NVIDIA A100 GPU</code> 推出的新特性。当配置为 MIG 运行状态时，A100 可以通过分出最多 7 个核心来帮助供应商提高 GPU 服务器的利用率，无需额外投入。MIG 提供了一种多用户使用隔离的GPU资源、提高GPU资源使用率的新的方式，特别适合于云服务提供商的多租户场景，保证一个租户的运行不干扰另一个租户。本文将介绍 MIG 的新特性和使用方法，以及在容器和 k8s 中使用 MIG 的方案。 </p><a id="more"></a><h2 id="MIG-技术简介"><a href="#MIG-技术简介" class="headerlink" title="MIG 技术简介"></a>MIG 技术简介</h2><p>随着深度学习的广泛应用，使用GPU加速训练和推理越来越普遍。然而，高昂的GPU价格在这里成为了不可忽视的成本，有时候单个GPU并没有得到充分的利用，在多租户之间如何能够共享GPU并且互不干扰成为了一个重要课题，尤其是在云服务环境使用GPU的场景下。针对这个问题，有很多种解决方案，分别是软件级虚拟化GPU和硬件级虚拟化GPU，而 MIG 即是硬件级虚拟化GPU的一种方式：</p><blockquote><p>Data center managers aim to keep resource utilization high, so an ideal data center accelerator doesn’t just go big- it also efficiently accelerates many smaller workloads.</p></blockquote><p><strong>MIG主要技术特点</strong></p><ol><li>每个GI独立的SM，完全隔离的显存（包括隔离的显存，L2cache，独立的DMA控制器等），从而可以保证每个GI的QoS</li><li>支持虚拟机，容器，进程层面的使用</li></ol><p>首先看一下传统GPU的内部架构，<strong>MIG的目的是使虚拟的每个GPU实例都拥有上面类似的架构。</strong></p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-01-14_nvidia-gpu.png"></p><h3 id="基本概念"><a href="#基本概念" class="headerlink" title="基本概念"></a><strong>基本概念</strong></h3><p>MIG对资源的划分可以分为两级，分别是GPU Instance、Compute Instance</p><h4 id="GPU-Instance"><a href="#GPU-Instance" class="headerlink" title="GPU Instance"></a>GPU Instance</h4><p>MIG功能可以将单个GPU划分为多个GPU分区，称为 <code>GPU Insance</code>。创建GPU实例可以认为是将一个大GPU拆分为多个较小的GPU，每个GPU实例都具有专用的计算和内存资源。</p><p><img alt 
data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-01-04_nvidia-mig-compare.png"></p><p>每个GPU实例的行为就像一个较小的，功能齐全的独立GPU，其中包括：</p><ul><li>预定义数量的GPC</li><li>SMs</li><li>L2 Cache</li><li>Frame buffer</li></ul><p>注意：在MIG操作模式下，每个GPU实例中的单个GPC启用了7个TPC（14个SM），这使所有GPU切片具有相同的一致计算性能。</p><ul><li><strong>GPU Engine</strong>：一个 GPU Engine 是 GPU 中执行工作的组件，常用的GPU Engine 如下，每个Engine都能够被独立地调度和为不同 GPU Context 执行工作<ul><li><strong>Compute/Graphics engine</strong> that executes the compute instructions</li><li>the copy engine (<strong>CE</strong>) that is responsible for performing DMAs</li><li><strong>NVDEC</strong> for video decoding</li><li><strong>NVENC</strong> for encoding</li></ul></li><li><strong>GPU Memory Slice</strong>：一个 GPU Memory Slice 是 A100 GPU Memory 的一个最小片段，包括对应的 <code>memory controllers</code> 和 <code>cache</code>，粗略来说一个 GPU Memory Slice 大致是总的GPU Memory资源的 1/8，包括memory的 capacity 和 bandwidth。</li><li><p><strong>GPU SM Slice</strong>：一个 GPU SM Slice 是 A100 GPU SMs 的一个最小片段，粗略来说一个 GPU SM Slice 大致是总的GPU SM资源的 1/7</p></li><li><p><strong>GPU Slice</strong>：一个 GPU Slice 是 A100 GPU 中集合一个 <code>GPU Memory Slice</code> 和 一个 <code>GPU SM Slice</code> 的最小片段</p></li><li><strong>GPU Instance</strong>：一个 GPU Instance 是 GPU Slices 和 GPU Engines (DMAs, NVDECs, etc.)的结合</li></ul><h4 id="Compute-Instance"><a href="#Compute-Instance" class="headerlink" title="Compute Instance"></a>Compute Instance</h4><p>一个 GPU Instance 可以被划分为多个 Compute Instance，多个Compute Instance之间共享Memory和Engine，它包含了原来GPU Instance里面 <code>GPU SM slices</code> 和 <code>GPU Engines</code> 的一个子集(DMAs, NVDECs, etc.)：</p><ul><li>默认情况下，将在每个GPU实例下创建一个 Compute Instances，从而公开GPU实例中可用的所有GPU计算资源。</li><li>可以将GPU实例细分为多个较小的 Compute Instances，以进一步拆分其计算资源。</li></ul><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-01-14_nvidia-compute-instance.png"></p><h3 id="架构对比"><a href="#架构对比" class="headerlink" title="架构对比"></a>架构对比</h3><p>pre-A100 GPU每个用户独占SM、Frame Buffer、L2 Cache。</p><p><img 
alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-01-14_csp-multi-user-today.png"></p><p>A100 MIG将GPU进行物理切割，每个虚拟GPU instance具有独立的SM、L2 Cache、DRAM。</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-01-14_csp-mig.png"></p><p>下面是MIG 配置多个独立的GPU Compute workloads。每个GPC分配固定的CE和DEC。A100中有5个decoder。</p><p>当1个GPU instance中包含2个Compute instance时，2个Compute instance共享CE、DEC和L2、Frame Buffer。</p><ul><li>GPC：Graphics Processor Cluster</li><li>TPC：Texture Processor Cluster</li></ul><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-01-14_nvidia-mig-partition.png"></p><p>Compute instance使多个上下文可以在GPU实例上同时运行。</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-01-14_nvidia-mig.png"></p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-01-13_nvidia-mig.png"></p><h3 id="MIG-隔离"><a href="#MIG-隔离" class="headerlink" title="MIG 隔离"></a>MIG 隔离</h3><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-01-14_mig-isolation.png"></p><p><strong>和上一代Volta MPS技术的对比</strong></p><blockquote><p>MPS was designed for sharing the GPU among applications from <strong>a single user</strong>, but not for multi-user or <strong>multi-tenant use</strong> cases.</p></blockquote><p>解决了MPS存在的memory system resources were shared across all the applications问题，同时继承了Volta MPS所有功能</p><div class="table-container"><table><thead><tr><th style="text-align:left">对比项</th><th style="text-align:left">MPS</th><th style="text-align:left">MIG</th></tr></thead><tbody><tr><td style="text-align:left">Partition Type</td><td style="text-align:left">Logical</td><td style="text-align:left">Physical</td></tr><tr><td style="text-align:left">Max Partitions</td><td style="text-align:left">48</td><td style="text-align:left">7</td></tr><tr><td style="text-align:left">SM Performance Isolation</td><td style="text-align:left">Yes (by 
percentage, not partitioning)</td><td style="text-align:left">Yes</td></tr><tr><td style="text-align:left">Memory Protection</td><td style="text-align:left">Yes</td><td style="text-align:left">Yes</td></tr><tr><td style="text-align:left">Memory Bandwidth QoS</td><td style="text-align:left">No</td><td style="text-align:left">Yes</td></tr><tr><td style="text-align:left">Error Isolation</td><td style="text-align:left">No</td><td style="text-align:left">Yes</td></tr><tr><td style="text-align:left">Cross-Partition Interop</td><td style="text-align:left">IPC</td><td style="text-align:left">Limited IPC</td></tr><tr><td style="text-align:left">Reconfigure</td><td style="text-align:left">Process Launch</td><td style="text-align:left">When Idle</td></tr></tbody></table></div><h3 id="GPU-Partitioning"><a href="#GPU-Partitioning" class="headerlink" title="GPU Partitioning"></a>GPU Partitioning</h3><p>每个 GI 包括的资源不是随意定义的，NVIDIA 提供了 一系列的 <code>GPU Instance Profiles</code>，用户在创建 GI 时必须按照这个 Profile 来切割。我们知道，A100 总共有 8 个 GPU Memory Slice 和 7 个 SM Slice，那么切分总共有5种 Profile：</p><div class="table-container"><table><thead><tr><th>Profile Name</th><th>Fraction of Memory</th><th>Fraction of SMs</th><th>Hardware Units</th><th>Number of Instances Available</th></tr></thead><tbody><tr><td>MIG 1g.5gb</td><td>1/8</td><td>1/7</td><td>0 NVDECs</td><td>7</td></tr><tr><td>MIG 2g.10gb</td><td>2/8</td><td>2/7</td><td>1 NVDECs</td><td>3</td></tr><tr><td>MIG 3g.20gb</td><td>4/8</td><td>3/7</td><td>2 NVDECs</td><td>2</td></tr><tr><td>MIG 4g.20gb</td><td>4/8</td><td>4/7</td><td>2 NVDECs</td><td>1</td></tr><tr><td>MIG 7g.40gb</td><td>Full</td><td>7/7</td><td>5 NVDECs</td><td>1</td></tr></tbody></table></div><p>注意：这里对于 <code>A100-SXM4-40GB</code> 总的 Memory大小是40GB，所以最小单位是 <code>1g.5gb</code>，如果对于 <code>A100-SXM4-80GB</code>，则最小单位是 <code>1g.10gb</code>。</p><p>也就是说，这几种 Profile 确定了 A100 GPU 可以被切分的方式，如下图，所有可以切分的方式只是下图从左到右选择不同的Profile，并且两个Profile上下不重叠。唯一的例外是，现在 NVIDIA 不支持 (4 memory, 4 compute) 和 (4 memory, 3 
compute) 的组合：</p><p><img alt data-src="https://docs.nvidia.com/datacenter/tesla/mig-user-guide/graphics/gpu-instances-combo-pic.png"></p><p>下图就是组合的一种方式：A100 GPU 被切割成了3个GPU Instance，分别的大小是</p><ul><li>4 memory，4 compute</li><li>2 memory，2 compute</li><li>1 memory，1 compute</li></ul><p><img alt="Example Configuration of GPU Instances." data-src="https://docs.nvidia.com/datacenter/tesla/mig-user-guide/graphics/gpu-instances-example-pic.png"></p><p>下图也是组合的一种可能：</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-01-14_mig-profile0.png"></p><p>前面提到， 硬件上 NVIDIA 不支持 (4 memory, 4 compute) 和 (4 memory, 3 compute) 的组合，但是支持两个  (4 memory, 3 compute) 的组合，这里左边的一个  (4 memory, 3 compute) 是将  (4 memory, 4 compute) 示例化为一个  (4 memory, 3 compute)。如下图就将 A100 切分成两个 GPU Instance，每个GPU Instance都有 (4 memory, 3 compute)</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-01-14_mig-profile1.png"></p><p>或者切分成3个GPU Instance：</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-01-14_mig-profile2.png"></p><p>也可以切分成下面这种4个GPU Instance：</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-01-14_mig-profile3.png"></p><p>总的来说，一共有 18 种切分方法：</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-01-14_mig-profiles.png"></p><p>注意，下图中的两种切分并不相同，因为每个切分的Instance 的 <code>physical layout</code> 也很重要：</p><p><img alt="Placement of GPU Instances." 
data-src="https://docs.nvidia.com/datacenter/tesla/mig-user-guide/graphics/gpu-instances-placement-pic.png"></p><h2 id="MIG-技术使用"><a href="#MIG-技术使用" class="headerlink" title="MIG 技术使用"></a>Setting Up MIG</h2><p>The A100 chip comes in two variants:</p><ul><li>GA100 Full GPU with 128 SMs</li><li>A100 Tensor Core GPU with 108 SMs</li></ul><p>The card used in this investigation is the 108-SM version.</p><h3 id="驱动安装"><a href="#驱动安装" class="headerlink" title="驱动安装"></a>Driver Installation</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line">$ nvidia-smi</span><br><span class="line">Wed Jan 13 11:42:34 2021</span><br><span class="line">+-----------------------------------------------------------------------------+</span><br><span class="line">| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |</span><br><span class="line">|-------------------------------+----------------------+----------------------+</span><br><span class="line">| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |</span><br><span class="line">| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |</span><br><span class="line">|                               |                      |               MIG M. 
|</span><br><span class="line">|===============================+======================+======================|</span><br><span class="line">|   0  A100-SXM4-40GB      Off  | 00000000:00:08.0 Off |                    0 |</span><br><span class="line">| N/A   26C    P0    43W / 400W |      0MiB / 40536MiB |      0%      Default |</span><br><span class="line">|                               |                      |             Disabled |</span><br><span class="line">+-------------------------------+----------------------+----------------------+</span><br><span class="line"></span><br><span class="line">+-----------------------------------------------------------------------------+</span><br><span class="line">| Processes:                                                                  |</span><br><span class="line">|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |</span><br><span class="line">|        ID   ID                                                   Usage      |</span><br><span class="line">|=============================================================================|</span><br><span class="line">|  No running processes found                                                 |</span><br><span class="line">+-----------------------------------------------------------------------------+</span><br></pre></td></tr></table></figure><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">$ tree /dev/</span><br><span class="line">├── nvidia0</span><br><span class="line">├── nvidia-caps</span><br><span class="line">│   ├── nvidia-cap1</span><br><span class="line">│   └── nvidia-cap2</span><br><span 
class="line">├── nvidiactl</span><br><span class="line">├── nvidia-modeset</span><br><span class="line">├── nvidia-uvm</span><br><span class="line">├── nvidia-uvm-tools</span><br></pre></td></tr></table></figure><h3 id="开启MIG支持"><a href="#开启MIG支持" class="headerlink" title="开启MIG支持"></a>Enabling MIG</h3><p>Check whether MIG mode is enabled:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$ nvidia-smi -i 0 --query-gpu=pci.bus_id,mig.mode.current --format=csv</span><br><span class="line">pci.bus_id, mig.mode.current</span><br><span class="line">00000000:00:08.0, Disabled</span><br></pre></td></tr></table></figure><p>Enable MIG on a specific GPU. The MIG mode setting can only be changed while the GPU is idle; in the output below the GPU was still in use, so the change is left pending:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">$ nvidia-smi -i 0 -mig 1</span><br><span class="line">Warning: MIG mode is <span class="keyword">in</span> pending <span class="built_in">enable</span> state <span class="keyword">for</span> GPU 00000000:00:08.0:In use by another client</span><br><span class="line">00000000:00:08.0 is currently being used by one or more other processes (e.g. CUDA application or a monitoring application such as another instance of nvidia-smi). Please first <span class="built_in">kill</span> all processes using the device and retry the <span class="built_in">command</span> or reboot the system to make MIG mode effective.</span><br><span class="line">All <span class="keyword">done</span>.</span><br></pre></td></tr></table></figure><blockquote><p>If you are using MIG inside a VM with GPU passthrough, then you may need to reboot the VM to allow the GPU to be in MIG mode as in some cases, GPU reset is not allowed via the hypervisor for security reasons. 
This can be seen in the following example:</p></blockquote><p>After a reboot, MIG mode is enabled:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$ nvidia-smi -i 0 --query-gpu=pci.bus_id,mig.mode.current --format=csv</span><br><span class="line">pci.bus_id, mig.mode.current</span><br><span class="line">00000000:00:08.0, Enabled</span><br></pre></td></tr></table></figure><h3 id="查询可分配-GI-信息"><a href="#查询可分配-GI-信息" class="headerlink" title="查询可分配 GI 信息"></a>Listing Available GI Profiles</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># nvidia-smi mig -lgip</span></span><br><span class="line">+--------------------------------------------------------------------------+</span><br><span class="line">| GPU instance profiles:                                                   |</span><br><span class="line">| GPU   Name          ID    Instances   Memory     P2P    SM    DEC   ENC  |</span><br><span class="line">|                           Free/Total   GiB              CE    JPEG  OFA  |</span><br><span class="line">|==========================================================================|</span><br><span 
class="line">|   0  MIG 1g.5gb     19     7/7        4.75       No     14     0     0   |</span><br><span class="line">|                                                          1     0     0   |</span><br><span class="line">+--------------------------------------------------------------------------+</span><br><span class="line">|   0  MIG 2g.10gb    14     3/3        9.75       No     28     1     0   |</span><br><span class="line">|                                                          2     0     0   |</span><br><span class="line">+--------------------------------------------------------------------------+</span><br><span class="line">|   0  MIG 3g.20gb     9     2/2        19.62      No     42     2     0   |</span><br><span class="line">|                                                          3     0     0   |</span><br><span class="line">+--------------------------------------------------------------------------+</span><br><span class="line">|   0  MIG 4g.20gb     5     1/1        19.62      No     56     2     0   |</span><br><span class="line">|                                                          4     0     0   |</span><br><span class="line">+--------------------------------------------------------------------------+</span><br><span class="line">|   0  MIG 7g.40gb     0     1/1        39.50      No     98     5     0   |</span><br><span class="line">|                                                          7     1     1   |</span><br><span class="line">+--------------------------------------------------------------------------+</span><br></pre></td></tr></table></figure><h3 id="查询-GI-placements"><a href="#查询-GI-placements" class="headerlink" title="查询 GI placements"></a>Listing GI Placements</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span 
class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># nvidia-smi mig -lgipp</span></span><br><span class="line">GPU  0 Profile ID 19 Placements: &#123;0,1,2,3,4,5,6&#125;:1</span><br><span class="line">GPU  0 Profile ID 14 Placements: &#123;0,2,4&#125;:2</span><br><span class="line">GPU  0 Profile ID  9 Placements: &#123;0,4&#125;:4</span><br><span class="line">GPU  0 Profile ID  5 Placement : &#123;0&#125;:4</span><br><span class="line">GPU  0 Profile ID  0 Placement : &#123;0&#125;:8</span><br></pre></td></tr></table></figure><h3 id="创建-GPU-Instances"><a href="#创建-GPU-Instances" class="headerlink" title="创建 GPU Instances"></a>Creating GPU Instances</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># nvidia-smi mig -cgi 9,14,19</span></span><br><span class="line">Successfully created GPU instance ID  2 on GPU  0 using profile MIG 3g.20gb (ID  9)</span><br><span class="line">Successfully created GPU instance ID  3 on GPU  0 using profile MIG 2g.10gb (ID 14)</span><br><span class="line">Successfully created GPU instance ID  9 on GPU  0 using profile MIG 1g.5gb (ID 19)</span><br></pre></td></tr></table></figure><h3 id="查询-GPU-Instance"><a href="#查询-GPU-Instance" class="headerlink" title="查询 GPU Instance"></a>Listing GPU Instances</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span 
class="line"><span class="comment"># nvidia-smi mig -lgi</span></span><br><span class="line">+----------------------------------------------------+</span><br><span class="line">| GPU instances:                                     |</span><br><span class="line">| GPU   Name          Profile  Instance   Placement  |</span><br><span class="line">|                       ID       ID       Start:Size |</span><br><span class="line">|====================================================|</span><br><span class="line">|   0  MIG 1g.5gb       19        9          2:1     |</span><br><span class="line">+----------------------------------------------------+</span><br><span class="line">|   0  MIG 2g.10gb      14        3          0:2     |</span><br><span class="line">+----------------------------------------------------+</span><br><span class="line">|   0  MIG 3g.20gb       9        2          4:4     |</span><br><span class="line">+----------------------------------------------------+</span><br></pre></td></tr></table></figure><h3 id="创建-Compute-Instance"><a href="#创建-Compute-Instance" class="headerlink" title="创建 Compute Instance"></a>Creating Compute Instances</h3><p>Before creating a CI, first list the CI profiles that the target GI supports. The output shows that GI 2, created above, supports three CI profiles:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span 
class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># nvidia-smi mig -lcip -gi 2</span></span><br><span class="line">+--------------------------------------------------------------------------------------+</span><br><span class="line">| Compute instance profiles:                                                           |</span><br><span class="line">| GPU     GPU       Name             Profile  Instances   Exclusive       Shared       |</span><br><span class="line">|       Instance                       ID     Free/Total     SM       DEC   ENC   OFA  |</span><br><span class="line">|         ID                                                          CE    JPEG       |</span><br><span class="line">|======================================================================================|</span><br><span class="line">|   0      2       MIG 1c.3g.20gb       0      3/3           14        2     0     0   |</span><br><span class="line">|                                                                      3     0         |</span><br><span class="line">+--------------------------------------------------------------------------------------+</span><br><span class="line">|   0      2       MIG 2c.3g.20gb       1      1/1           28        2     0     0   |</span><br><span class="line">|                                                                      3     0         |</span><br><span class="line">+--------------------------------------------------------------------------------------+</span><br><span class="line">|   0      2       MIG 3g.20gb          2*     1/1           42        2     0     0   |</span><br><span class="line">|                                                                      3     
0         |</span><br><span class="line">+--------------------------------------------------------------------------------------+</span><br><span class="line"></span><br><span class="line"><span class="comment"># nvidia-smi mig -lcip -gi 3</span></span><br><span class="line">+--------------------------------------------------------------------------------------+</span><br><span class="line">| Compute instance profiles:                                                           |</span><br><span class="line">| GPU     GPU       Name             Profile  Instances   Exclusive       Shared       |</span><br><span class="line">|       Instance                       ID     Free/Total     SM       DEC   ENC   OFA  |</span><br><span class="line">|         ID                                                          CE    JPEG       |</span><br><span class="line">|======================================================================================|</span><br><span class="line">|   0      3       MIG 1c.2g.10gb       0      2/2           14        1     0     0   |</span><br><span class="line">|                                                                      2     0         |</span><br><span class="line">+--------------------------------------------------------------------------------------+</span><br><span class="line">|   0      3       MIG 2g.10gb          1*     1/1           28        1     0     0   |</span><br><span class="line">|                                                                      2     0         |</span><br><span class="line">+--------------------------------------------------------------------------------------+</span><br></pre></td></tr></table></figure><p>Next, split GI 2 into two CIs using the profiles 1c.3g.20gb and 2c.3g.20gb:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span 
class="comment"># nvidia-smi mig -cci 0,1 -gi 2</span></span><br><span class="line">Successfully created compute instance ID  0 on GPU  0 GPU instance ID  2 using profile MIG 1c.3g.20gb (ID  0)</span><br><span class="line">Successfully created compute instance ID  1 on GPU  0 GPU instance ID  2 using profile MIG 2c.3g.20gb (ID  1)</span><br></pre></td></tr></table></figure><h3 id="查询-Compute-Instance"><a href="#查询-Compute-Instance" class="headerlink" title="查询 Compute Instance"></a>Listing Compute Instances</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># nvidia-smi mig -lci -gi 2</span></span><br><span class="line">+--------------------------------------------------------------------+</span><br><span class="line">| Compute instances:                                                 |</span><br><span class="line">| GPU     GPU       Name             Profile   Instance   Placement  |</span><br><span class="line">|       Instance                       ID        ID       Start:Size |</span><br><span class="line">|         ID                                                         |</span><br><span class="line">|====================================================================|</span><br><span class="line">|   0      2       MIG 1c.3g.20gb       0         0          0:1     |</span><br><span class="line">+--------------------------------------------------------------------+</span><br><span class="line">|   0      2       MIG 2c.3g.20gb       1         1          1:2     |</span><br><span 
class="line">+--------------------------------------------------------------------+</span><br></pre></td></tr></table></figure><p>Running <code>nvidia-smi</code> also shows the newly created MIG devices:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># nvidia-smi</span></span><br><span class="line">Wed Jan 13 12:04:54 2021</span><br><span class="line">+-----------------------------------------------------------------------------+</span><br><span class="line">| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |</span><br><span class="line">|-------------------------------+----------------------+----------------------+</span><br><span class="line">| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. 
ECC |</span><br><span class="line">| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |</span><br><span class="line">|                               |                      |               MIG M. |</span><br><span class="line">|===============================+======================+======================|</span><br><span class="line">|   0  A100-SXM4-40GB      On   | 00000000:00:08.0 Off |                   On |</span><br><span class="line">| N/A   26C    P0    43W / 400W |     11MiB / 40536MiB |     N/A      Default |</span><br><span class="line">|                               |                      |              Enabled |</span><br><span class="line">+-------------------------------+----------------------+----------------------+</span><br><span class="line"></span><br><span class="line">+-----------------------------------------------------------------------------+</span><br><span class="line">| MIG devices:                                                                |</span><br><span class="line">+------------------+----------------------+-----------+-----------------------+</span><br><span class="line">| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |</span><br><span class="line">|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|</span><br><span class="line">|                  |                      |        ECC|                       |</span><br><span class="line">|==================+======================+===========+=======================|</span><br><span class="line">|  0    2   0   0  |      5MiB / 20096MiB | 14      0 |  3   0    2    0    0 |</span><br><span class="line">|                  |      0MiB / 32767MiB |           |                       |</span><br><span class="line">+------------------+                      +-----------+-----------------------+</span><br><span class="line">|  0    2   1   1  |                      | 28      0 |  3   0    2    0    0 
|</span><br><span class="line">|                  |                      |           |                       |</span><br><span class="line">+------------------+----------------------+-----------+-----------------------+</span><br><span class="line"></span><br><span class="line">+-----------------------------------------------------------------------------+</span><br><span class="line">| Processes:                                                                  |</span><br><span class="line">|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |</span><br><span class="line">|        ID   ID                                                   Usage      |</span><br><span class="line">|=============================================================================|</span><br><span class="line">|  No running processes found                                                 |</span><br><span class="line">+-----------------------------------------------------------------------------+</span><br></pre></td></tr></table></figure><p>Running <code>nvidia-smi -L</code> lists the UUID of each device, which is needed later to target a specific MIG device for compute:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># nvidia-smi -L</span></span><br><span class="line">GPU 0: A100-SXM4-40GB (UUID: GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181)</span><br><span class="line">  MIG 1c.3g.20gb Device 0: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/2/0)</span><br><span class="line">  MIG 2c.3g.20gb Device 1: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/2/1)</span><br></pre></td></tr></table></figure><h3 id="删除-CPU-Instance"><a href="#删除-CPU-Instance" class="headerlink" title="删除 CPU Instance"></a>Deleting a Compute Instance</h3><p>The following command deletes CI 0 on GI 1:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br></pre></td><td class="code"><pre><span class="line">nvidia-smi mig -dci -ci 0 -gi 1</span><br></pre></td></tr></table></figure><h2 id="使用-MIG"><a href="#使用-MIG" class="headerlink" title="使用 MIG"></a>Using MIG</h2><h3 id="Bare-Metal"><a href="#Bare-Metal" class="headerlink" title="Bare-Metal"></a>Bare-Metal</h3><p>TODO: no bare-metal A100 machine was available at the time of writing.</p><h3 id="Container"><a href="#Container" class="headerlink" title="Container"></a>Container</h3><h4 id="前置条件"><a href="#前置条件" class="headerlink" title="前置条件"></a>Prerequisites</h4><ul><li>Install Docker</li><li>Install the NVIDIA Container Toolkit:<ul><li>nvidia-docker2 v2.5.0 or later is recommended</li></ul></li></ul><h4 id="运行容器"><a href="#运行容器" class="headerlink" title="运行容器"></a>Running a Container</h4><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/2/1 nvidia/cuda nvidia-smi</span></span><br><span class="line">docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused <span class="string">"process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/2/1 --compute --utility --require=cuda&gt;=11.1 brand=tesla,driver&gt;=418,driver&lt;419 brand=tesla,driver&gt;=440,driver&lt;441 brand=tesla,driver&gt;=450,driver&lt;451 --pid=11936 /var/lib/docker/overlay2/5ee3e036c29f6cd488a3ad1ab1c55a47e595ffff530075853396745de546e4a8/merged]\\\\nnvidia-container-cli: device error: unknown device id: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/2/1\\\\n\\\"\""</span>: 
unknown.</span><br><span class="line">ERRO[0000] error waiting <span class="keyword">for</span> container: context canceled</span><br></pre></td></tr></table></figure><p>The <code>unknown device id</code> error suggests that the installed NVIDIA Container Toolkit is too old:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># /usr/bin/nvidia-container-runtime -v</span></span><br><span class="line">runc version 1.0.0-rc10</span><br><span class="line">commit: dc9208a3303feef5b3839f4323d9beb36df0a9dd</span><br><span class="line">spec: 1.0.1-dev</span><br></pre></td></tr></table></figure><p>Install a newer version of the NVIDIA Container Toolkit:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line">Dependencies Resolved</span><br><span class="line"></span><br><span class="line">==========================================================================================================================================================================</span><br><span class="line"> Package                                       Arch                       Version                                      Repository                                    Size</span><br><span 
class="line">==========================================================================================================================================================================</span><br><span class="line">Installing:</span><br><span class="line"> nvidia-docker2                                noarch                     2.5.0-1                                      nvidia-docker                                8.4 k</span><br><span class="line">Installing <span class="keyword">for</span> dependencies:</span><br><span class="line"> container-selinux                             noarch                     2:2.119.1-1.c57a6f9.tl2                      tlinux                                        39 k</span><br><span class="line"> containerd.io                                 x86_64                     1.2.5-3.1.el7                                tlinux                                        22 M</span><br><span class="line"> docker-ce                                     x86_64                     3:18.09.5-3.el7                              tlinux                                        19 M</span><br><span class="line"> docker-ce-cli                                 x86_64                     1:18.09.5-3.el7                              tlinux                                        14 M</span><br><span class="line">Updating <span class="keyword">for</span> dependencies:</span><br><span class="line"> libnvidia-container-tools                     x86_64                     1.3.1-1                                      libnvidia-container                           42 k</span><br><span class="line"> libnvidia-container1                          x86_64                     1.3.1-1                                      libnvidia-container                           86 k</span><br><span class="line"> nvidia-container-runtime                      x86_64                     3.4.0-1                                      nvidia-container-runtime                     693 
k</span><br><span class="line"> nvidia-container-toolkit                      x86_64                     1.4.0-2                                      nvidia-container-runtime                     819 k</span><br></pre></td></tr></table></figure><p>Once the environment is set up, a container started with <code>docker</code> can use the MIG device:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br></pre></td><td class="code"><pre><span class="line">$ docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=MIG-GPU-61148488-f8ba-c817-b2e8-18f59e2b66b1/2/0 nvidia/cuda nvidia-smi</span><br><span class="line">Wed Jan 13 11:30:19 2021</span><br><span class="line">+-----------------------------------------------------------------------------+</span><br><span class="line">| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |</span><br><span class="line">|-------------------------------+----------------------+----------------------+</span><br><span class="line">| GPU  Name        Persistence-M| 
Bus-Id        Disp.A | Volatile Uncorr. ECC |</span><br><span class="line">| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |</span><br><span class="line">|                               |                      |               MIG M. |</span><br><span class="line">|===============================+======================+======================|</span><br><span class="line">|   0  A100-SXM4-40GB      On   | 00000000:00:08.0 Off |                   On |</span><br><span class="line">| N/A   26C    P0    42W / 400W |                  N/A |     N/A      Default |</span><br><span class="line">|                               |                      |              Enabled |</span><br><span class="line">+-------------------------------+----------------------+----------------------+</span><br><span class="line"></span><br><span class="line">+-----------------------------------------------------------------------------+</span><br><span class="line">| MIG devices:                                                                |</span><br><span class="line">+------------------+----------------------+-----------+-----------------------+</span><br><span class="line">| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |</span><br><span class="line">|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|</span><br><span class="line">|                  |                      |        ECC|                       |</span><br><span class="line">|==================+======================+===========+=======================|</span><br><span class="line">|  0    2   0   0  |      5MiB / 20096MiB | 14      0 |  3   0    2    0    0 |</span><br><span class="line">|                  |      0MiB / 32767MiB |           |                       |</span><br><span class="line">+------------------+----------------------+-----------+-----------------------+</span><br><span class="line"></span><br><span 
class="line">+-----------------------------------------------------------------------------+</span><br><span class="line">| Processes:                                                                  |</span><br><span class="line">|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |</span><br><span class="line">|        ID   ID                                                   Usage      |</span><br><span class="line">|=============================================================================|</span><br><span class="line">|  No running processes found                                                 |</span><br><span class="line">+-----------------------------------------------------------------------------+</span><br></pre></td></tr></table></figure><h3 id="Kubernetes"><a href="#Kubernetes" class="headerlink" title="Kubernetes"></a>Kubernetes</h3><h4 id="前置依赖"><a href="#前置依赖" class="headerlink" title="前置依赖"></a>Prerequisites</h4><ul><li>NVIDIA R450+ datacenter driver: 450.80.02+</li><li>NVIDIA Container Toolkit (nvidia-docker2): v2.5.0+</li><li>NVIDIA k8s-device-plugin: v0.7.0+</li><li>NVIDIA gpu-feature-discovery: v0.2.0+</li></ul><h4 id="None"><a href="#None" class="headerlink" title="None"></a>None</h4><p>Confirm that MIG is enabled on the Node; at this point no GPU instances (GIs) have been created:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span 
class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br></pre></td><td class="code"><pre><span class="line">Thu Jan 14 16:35:34 2021</span><br><span class="line">+-----------------------------------------------------------------------------+</span><br><span class="line">| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |</span><br><span class="line">|-------------------------------+----------------------+----------------------+</span><br><span class="line">| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |</span><br><span class="line">| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |</span><br><span class="line">|                               |                      |               MIG M. 
|</span><br><span class="line">|===============================+======================+======================|</span><br><span class="line">|   0  A100-SXM4-40GB      On   | 00000000:00:08.0 Off |                   On |</span><br><span class="line">| N/A   26C    P0    43W / 400W |      0MiB / 40536MiB |     N/A      Default |</span><br><span class="line">|                               |                      |              Enabled |</span><br><span class="line">+-------------------------------+----------------------+----------------------+</span><br><span class="line"></span><br><span class="line">+-----------------------------------------------------------------------------+</span><br><span class="line">| MIG devices:                                                                |</span><br><span class="line">+------------------+----------------------+-----------+-----------------------+</span><br><span class="line">| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |</span><br><span class="line">|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|</span><br><span class="line">|                  |                      |        ECC|                       |</span><br><span class="line">|==================+======================+===========+=======================|</span><br><span class="line">|  No MIG devices found                                                       |</span><br><span class="line">+-----------------------------------------------------------------------------+</span><br><span class="line"></span><br><span class="line">+-----------------------------------------------------------------------------+</span><br><span class="line">| Processes:                                                                  |</span><br><span class="line">|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |</span><br><span class="line">|        ID   ID                                        
           Usage      |</span><br><span class="line">|=============================================================================|</span><br><span class="line">|  No running processes found                                                 |</span><br><span class="line">+-----------------------------------------------------------------------------+</span><br></pre></td></tr></table></figure><p>Start the <code>Device Plugin</code>; at this point <code>mig-strategy</code> is <code>none</code>:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span 
class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">apps/v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">DaemonSet</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">nvidia-device-plugin-daemonset</span></span><br><span class="line">  <span class="attr">namespace:</span> <span class="string">kube-system</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">selector:</span></span><br><span class="line">    <span class="attr">matchLabels:</span></span><br><span class="line">      <span class="attr">name:</span> <span class="string">nvidia-device-plugin-ds</span></span><br><span class="line">  <span class="attr">updateStrategy:</span></span><br><span class="line">    <span class="attr">type:</span> <span class="string">RollingUpdate</span></span><br><span class="line">  <span class="attr">template:</span></span><br><span class="line">    <span class="attr">metadata:</span></span><br><span class="line">      <span class="comment"># This annotation is deprecated. 
Kept here for backward compatibility</span></span><br><span class="line">      <span class="comment"># See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/</span></span><br><span class="line">      <span class="attr">annotations:</span></span><br><span class="line">        <span class="attr">scheduler.alpha.kubernetes.io/critical-pod:</span> <span class="string">""</span></span><br><span class="line">      <span class="attr">labels:</span></span><br><span class="line">        <span class="attr">name:</span> <span class="string">nvidia-device-plugin-ds</span></span><br><span class="line">    <span class="attr">spec:</span></span><br><span class="line">      <span class="attr">tolerations:</span></span><br><span class="line">      <span class="comment"># This toleration is deprecated. Kept here for backward compatibility</span></span><br><span class="line">      <span class="comment"># See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/</span></span><br><span class="line">      <span class="bullet">-</span> <span class="attr">key:</span> <span class="string">CriticalAddonsOnly</span></span><br><span class="line">        <span class="attr">operator:</span> <span class="string">Exists</span></span><br><span class="line">      <span class="bullet">-</span> <span class="attr">key:</span> <span class="string">nvidia.com/gpu</span></span><br><span class="line">        <span class="attr">operator:</span> <span class="string">Exists</span></span><br><span class="line">        <span class="attr">effect:</span> <span class="string">NoSchedule</span></span><br><span class="line">      <span class="comment"># Mark this pod as a critical add-on; when enabled, the critical add-on</span></span><br><span class="line">      <span class="comment"># scheduler reserves resources for critical add-on pods so that they can</span></span><br><span class="line">      <span class="comment"># be 
rescheduled after a failure.</span></span><br><span class="line">      <span class="comment"># See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/</span></span><br><span class="line">      <span class="attr">priorityClassName:</span> <span class="string">"system-node-critical"</span></span><br><span class="line">      <span class="attr">containers:</span></span><br><span class="line">      <span class="bullet">-</span> <span class="attr">image:</span> <span class="string">nvidia/k8s-device-plugin:v0.7.0</span></span><br><span class="line">        <span class="attr">name:</span> <span class="string">nvidia-device-plugin-ctr</span></span><br><span class="line">        <span class="attr">args:</span> <span class="string">["--fail-on-init-error=false"]</span></span><br><span class="line">        <span class="attr">securityContext:</span></span><br><span class="line">          <span class="attr">allowPrivilegeEscalation:</span> <span class="literal">false</span></span><br><span class="line">          <span class="attr">capabilities:</span></span><br><span class="line">            <span class="attr">drop:</span> <span class="string">["ALL"]</span></span><br><span class="line">        <span class="attr">volumeMounts:</span></span><br><span class="line">          <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">device-plugin</span></span><br><span class="line">            <span class="attr">mountPath:</span> <span class="string">/var/lib/kubelet/device-plugins</span></span><br><span class="line">      <span class="attr">volumes:</span></span><br><span class="line">        <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">device-plugin</span></span><br><span class="line">          <span class="attr">hostPath:</span></span><br><span class="line">            <span class="attr">path:</span> <span 
class="string">/var/lib/kubelet/device-plugins</span></span><br></pre></td></tr></table></figure><p>The <code>Node</code> now reports the number of available <code>nvidia.com/gpu</code> resources:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">Capacity:</span></span><br><span class="line"><span class="string">...</span></span><br><span class="line">  <span class="attr">nvidia.com/gpu:</span>          <span class="number">1</span></span><br><span class="line"><span class="attr">Allocatable:</span></span><br><span class="line"><span class="string">...</span></span><br><span class="line">  <span class="attr">nvidia.com/gpu:</span>          <span class="number">1</span></span><br></pre></td></tr></table></figure><p>Deploy a <code>Pod</code>:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">$ kubectl run -it --rm \</span><br><span class="line">   --image=nvidia/cuda \</span><br><span class="line">   --restart=Never \</span><br><span class="line">   --limits=nvidia.com/gpu=1 \</span><br><span class="line">   mig-none-example -- nvidia-smi -L</span><br><span class="line">GPU 0: A100-SXM4-40GB (UUID: GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181)</span><br><span class="line">pod <span class="string">"mig-none-example"</span> deleted</span><br></pre></td></tr></table></figure><h4 id="Single"><a href="#Single" class="headerlink" title="Single"></a>Single</h4><p>After confirming that MIG is enabled on the Node, create seven GPU instances (GIs) of the same size, each with one compute instance (CI):</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span 
class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br></pre></td><td class="code"><pre><span class="line">$ nvidia-smi mig -cgi 19,19,19,19,19,19,19 -C</span><br><span class="line">Successfully created GPU instance ID 13 on GPU  0 using profile MIG 1g.5gb (ID 19)</span><br><span class="line">Successfully created compute instance ID  0 on GPU  0 GPU instance ID 13 using profile MIG 1g.5gb (ID  0)</span><br><span class="line">Successfully created GPU instance ID 11 on GPU  0 using profile MIG 1g.5gb (ID 19)</span><br><span class="line">Successfully created compute instance ID  0 on GPU  0 GPU instance ID 11 using profile MIG 1g.5gb (ID  0)</span><br><span class="line">Successfully created GPU instance ID 12 on GPU  0 using profile MIG 1g.5gb (ID 19)</span><br><span class="line">Successfully created compute instance ID  0 on GPU  0 GPU instance ID 12 using profile MIG 1g.5gb (ID  0)</span><br><span class="line">Successfully created GPU instance ID  7 on GPU  0 using profile MIG 1g.5gb (ID 19)</span><br><span class="line">Successfully created compute instance ID  0 on GPU  0 GPU instance ID  7 using profile MIG 1g.5gb (ID  0)</span><br><span class="line">Successfully created GPU instance ID  8 on GPU  0 using profile MIG 1g.5gb (ID 19)</span><br><span class="line">Successfully created compute instance ID  0 on GPU  0 GPU instance ID  8 using 
profile MIG 1g.5gb (ID  0)</span><br><span class="line">Successfully created GPU instance ID  9 on GPU  0 using profile MIG 1g.5gb (ID 19)</span><br><span class="line">Successfully created compute instance ID  0 on GPU  0 GPU instance ID  9 using profile MIG 1g.5gb (ID  0)</span><br><span class="line">Successfully created GPU instance ID 10 on GPU  0 using profile MIG 1g.5gb (ID 19)</span><br><span class="line">Successfully created compute instance ID  0 on GPU  0 GPU instance ID 10 using profile MIG 1g.5gb (ID  0)</span><br><span class="line">$ nvidia-smi -L</span><br><span class="line">GPU 0: A100-SXM4-40GB (UUID: GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181)</span><br><span class="line">  MIG 1g.5gb Device 0: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/7/0)</span><br><span class="line">  MIG 1g.5gb Device 1: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/8/0)</span><br><span class="line">  MIG 1g.5gb Device 2: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/9/0)</span><br><span class="line">  MIG 1g.5gb Device 3: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/10/0)</span><br><span class="line">  MIG 1g.5gb Device 4: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/11/0)</span><br><span class="line">  MIG 1g.5gb Device 5: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/12/0)</span><br><span class="line">  MIG 1g.5gb Device 6: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/13/0)</span><br></pre></td></tr></table></figure><p>Deploy the <code>Device Plugin</code>, this time with <code>mig-strategy</code> set to <code>single</code>:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span 
class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">apps/v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">DaemonSet</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">nvidia-device-plugin-daemonset</span></span><br><span class="line">  <span class="attr">namespace:</span> <span class="string">kube-system</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">selector:</span></span><br><span class="line">    <span class="attr">matchLabels:</span></span><br><span class="line">      <span class="attr">name:</span> <span class="string">nvidia-device-plugin-ds</span></span><br><span class="line">  <span 
class="attr">updateStrategy:</span></span><br><span class="line">    <span class="attr">type:</span> <span class="string">RollingUpdate</span></span><br><span class="line">  <span class="attr">template:</span></span><br><span class="line">    <span class="attr">metadata:</span></span><br><span class="line">      <span class="comment"># This annotation is deprecated. Kept here for backward compatibility</span></span><br><span class="line">      <span class="comment"># See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/</span></span><br><span class="line">      <span class="attr">annotations:</span></span><br><span class="line">        <span class="attr">scheduler.alpha.kubernetes.io/critical-pod:</span> <span class="string">""</span></span><br><span class="line">      <span class="attr">labels:</span></span><br><span class="line">        <span class="attr">name:</span> <span class="string">nvidia-device-plugin-ds</span></span><br><span class="line">    <span class="attr">spec:</span></span><br><span class="line">      <span class="attr">tolerations:</span></span><br><span class="line">      <span class="comment"># This toleration is deprecated. 
Kept here for backward compatibility</span></span><br><span class="line">      <span class="comment"># See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/</span></span><br><span class="line">      <span class="bullet">-</span> <span class="attr">key:</span> <span class="string">CriticalAddonsOnly</span></span><br><span class="line">        <span class="attr">operator:</span> <span class="string">Exists</span></span><br><span class="line">      <span class="bullet">-</span> <span class="attr">key:</span> <span class="string">nvidia.com/gpu</span></span><br><span class="line">        <span class="attr">operator:</span> <span class="string">Exists</span></span><br><span class="line">        <span class="attr">effect:</span> <span class="string">NoSchedule</span></span><br><span class="line">      <span class="comment"># Mark this pod as a critical add-on; when enabled, the critical add-on</span></span><br><span class="line">      <span class="comment"># scheduler reserves resources for critical add-on pods so that they can</span></span><br><span class="line">      <span class="comment"># be rescheduled after a failure.</span></span><br><span class="line">      <span class="comment"># See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/</span></span><br><span class="line">      <span class="attr">priorityClassName:</span> <span class="string">"system-node-critical"</span></span><br><span class="line">      <span class="attr">containers:</span></span><br><span class="line">      <span class="bullet">-</span> <span class="attr">image:</span> <span class="string">nvidia/k8s-device-plugin:v0.7.0</span></span><br><span class="line">        <span class="attr">name:</span> <span class="string">nvidia-device-plugin-ctr</span></span><br><span class="line">        <span class="attr">args:</span> <span class="string">["--fail-on-init-error=false",</span> <span 
class="string">"--mig-strategy=single"</span><span class="string">]</span></span><br><span class="line">        <span class="attr">securityContext:</span></span><br><span class="line">          <span class="attr">allowPrivilegeEscalation:</span> <span class="literal">false</span></span><br><span class="line">          <span class="attr">capabilities:</span></span><br><span class="line">            <span class="attr">drop:</span> <span class="string">["ALL"]</span></span><br><span class="line">        <span class="attr">volumeMounts:</span></span><br><span class="line">          <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">device-plugin</span></span><br><span class="line">            <span class="attr">mountPath:</span> <span class="string">/var/lib/kubelet/device-plugins</span></span><br><span class="line">      <span class="attr">volumes:</span></span><br><span class="line">        <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">device-plugin</span></span><br><span class="line">          <span class="attr">hostPath:</span></span><br><span class="line">            <span class="attr">path:</span> <span class="string">/var/lib/kubelet/device-plugins</span></span><br></pre></td></tr></table></figure><p>The Node now reports 7 <code>nvidia.com/gpu</code> resources, one per MIG device:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">Capacity:</span></span><br><span class="line"><span class="string">...</span></span><br><span class="line">  <span class="attr">nvidia.com/gpu:</span>          <span class="number">7</span></span><br><span class="line"><span class="attr">Allocatable:</span></span><br><span class="line"><span 
class="string">...</span></span><br><span class="line">  <span class="attr">nvidia.com/gpu:</span>          <span class="number">7</span></span><br></pre></td></tr></table></figure><p>Deploy <code>gpu-feature-discovery</code>:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">apps/v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">DaemonSet</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> 
<span class="string">gpu-feature-discovery</span></span><br><span class="line">  <span class="attr">labels:</span></span><br><span class="line">    <span class="attr">app.kubernetes.io/name:</span> <span class="string">gpu-feature-discovery</span></span><br><span class="line">    <span class="attr">app.kubernetes.io/version:</span> <span class="number">0.2</span><span class="number">.0</span></span><br><span class="line">    <span class="attr">app.kubernetes.io/part-of:</span> <span class="string">nvidia-gpu</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">selector:</span></span><br><span class="line">    <span class="attr">matchLabels:</span></span><br><span class="line">      <span class="attr">app.kubernetes.io/name:</span> <span class="string">gpu-feature-discovery</span></span><br><span class="line">      <span class="attr">app.kubernetes.io/part-of:</span> <span class="string">nvidia-gpu</span></span><br><span class="line">  <span class="attr">template:</span></span><br><span class="line">    <span class="attr">metadata:</span></span><br><span class="line">      <span class="attr">labels:</span></span><br><span class="line">        <span class="attr">app.kubernetes.io/name:</span> <span class="string">gpu-feature-discovery</span></span><br><span class="line">        <span class="attr">app.kubernetes.io/version:</span> <span class="number">0.2</span><span class="number">.0</span></span><br><span class="line">        <span class="attr">app.kubernetes.io/part-of:</span> <span class="string">nvidia-gpu</span></span><br><span class="line">    <span class="attr">spec:</span></span><br><span class="line">      <span class="attr">containers:</span></span><br><span class="line">        <span class="bullet">-</span> <span class="attr">image:</span> <span class="string">nvidia/gpu-feature-discovery:v0.2.0</span></span><br><span class="line">          <span class="attr">name:</span> <span 
class="string">gpu-feature-discovery</span></span><br><span class="line">          <span class="attr">args:</span> <span class="string">["--mig-strategy=single"]</span></span><br><span class="line">          <span class="attr">volumeMounts:</span></span><br><span class="line">            <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">output-dir</span></span><br><span class="line">              <span class="attr">mountPath:</span> <span class="string">"/etc/kubernetes/node-feature-discovery/features.d"</span></span><br><span class="line">            <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">dmi-product-name</span></span><br><span class="line">              <span class="attr">mountPath:</span> <span class="string">"/sys/class/dmi/id/product_name"</span></span><br><span class="line">          <span class="attr">env:</span></span><br><span class="line">            <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">NVIDIA_MIG_MONITOR_DEVICES</span></span><br><span class="line">              <span class="attr">value:</span> <span class="string">all</span></span><br><span class="line">          <span class="attr">securityContext:</span></span><br><span class="line">            <span class="attr">privileged:</span> <span class="literal">true</span></span><br><span class="line">      <span class="attr">nodeSelector:</span></span><br><span class="line">        <span class="attr">feature.node.kubernetes.io/pci-10de.present:</span> <span class="string">"true"</span> <span class="comment"># NVIDIA vendor ID</span></span><br><span class="line">      <span class="attr">volumes:</span></span><br><span class="line">        <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">output-dir</span></span><br><span class="line">          <span class="attr">hostPath:</span></span><br><span class="line">            <span class="attr">path:</span> <span 
class="string">"/etc/kubernetes/node-feature-discovery/features.d"</span></span><br><span class="line">        <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">dmi-product-name</span></span><br><span class="line">          <span class="attr">hostPath:</span></span><br><span class="line">            <span class="attr">path:</span> <span class="string">"/sys/class/dmi/id/product_name"</span></span><br></pre></td></tr></table></figure><p>Run Pods that each request one GPU:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br></pre></td><td class="code"><pre><span class="line">$  <span class="keyword">for</span> i <span 
class="keyword">in</span> $(seq 7); <span class="keyword">do</span></span><br><span class="line">   kubectl run \</span><br><span class="line">      --image=nvidia/cuda:11.0-base \</span><br><span class="line">      --restart=Never \</span><br><span class="line">      --limits=nvidia.com/gpu=1 \</span><br><span class="line">      mig-single-example-<span class="variable">$&#123;i&#125;</span> -- bash -c <span class="string">"nvidia-smi -L; sleep infinity"</span></span><br><span class="line"><span class="keyword">done</span></span><br><span class="line">pod/mig-single-example-1 created</span><br><span class="line">pod/mig-single-example-2 created</span><br><span class="line">pod/mig-single-example-3 created</span><br><span class="line">pod/mig-single-example-4 created</span><br><span class="line">pod/mig-single-example-5 created</span><br><span class="line">pod/mig-single-example-6 created</span><br><span class="line">pod/mig-single-example-7 created</span><br><span class="line"></span><br><span class="line">$ <span class="keyword">for</span> i <span class="keyword">in</span> $(seq 7); <span class="keyword">do</span></span><br><span class="line"><span class="built_in">echo</span> <span class="string">"mig-single-example-<span class="variable">$&#123;i&#125;</span>"</span>;</span><br><span class="line">kubectl logs mig-single-example-<span class="variable">$&#123;i&#125;</span></span><br><span class="line"><span class="built_in">echo</span> <span class="string">""</span>;</span><br><span class="line"><span class="keyword">done</span></span><br><span class="line"></span><br><span class="line">mig-single-example-1</span><br><span class="line">GPU 0: A100-SXM4-40GB (UUID: GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181)</span><br><span class="line">  MIG 1g.5gb Device 0: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/11/0)</span><br><span class="line"></span><br><span class="line">mig-single-example-2</span><br><span class="line">GPU 0: A100-SXM4-40GB (UUID: 
GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181)</span><br><span class="line">  MIG 1g.5gb Device 0: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/7/0)</span><br><span class="line"></span><br><span class="line">mig-single-example-3</span><br><span class="line">GPU 0: A100-SXM4-40GB (UUID: GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181)</span><br><span class="line">  MIG 1g.5gb Device 0: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/8/0)</span><br><span class="line">  </span><br><span class="line">...</span><br><span class="line"></span><br><span class="line">$ <span class="keyword">for</span> i <span class="keyword">in</span> $(seq 7); <span class="keyword">do</span></span><br><span class="line">kubectl delete pod mig-single-example-<span class="variable">$&#123;i&#125;</span>;</span><br><span class="line"><span class="keyword">done</span></span><br><span class="line"></span><br><span class="line">pod <span class="string">"mig-single-example-1"</span> deleted</span><br><span class="line">pod <span class="string">"mig-single-example-2"</span> deleted</span><br><span class="line">...</span><br></pre></td></tr></table></figure><h4 id="Mixed"><a href="#Mixed" class="headerlink" title="Mixed"></a>Mixed</h4><p>After confirming that MIG is enabled on the Node, create three GIs of different sizes, each with one corresponding CI:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line">$ nvidia-smi mig -cgi 9,14,19 -C</span><br><span class="line">Successfully created GPU instance ID  2 on GPU  0 using profile MIG 3g.20gb (ID  9)</span><br><span class="line">Successfully created compute instance ID  0 on GPU  0 GPU 
instance ID  2 using profile MIG 3g.20gb (ID  2)</span><br><span class="line">Successfully created GPU instance ID  3 on GPU  0 using profile MIG 2g.10gb (ID 14)</span><br><span class="line">Successfully created compute instance ID  0 on GPU  0 GPU instance ID  3 using profile MIG 2g.10gb (ID  1)</span><br><span class="line">Successfully created GPU instance ID  9 on GPU  0 using profile MIG 1g.5gb (ID 19)</span><br><span class="line">Successfully created compute instance ID  0 on GPU  0 GPU instance ID  9 using profile MIG 1g.5gb (ID  0)</span><br><span class="line">$ nvidia-smi -L</span><br><span class="line">GPU 0: A100-SXM4-40GB (UUID: GPU-61148488-f8ba-c817-b2e8-18f59e2b66b1)</span><br><span class="line">  MIG 3g.20gb Device 0: (UUID: MIG-GPU-61148488-f8ba-c817-b2e8-18f59e2b66b1/2/0)</span><br><span class="line">  MIG 2g.10gb Device 1: (UUID: MIG-GPU-61148488-f8ba-c817-b2e8-18f59e2b66b1/3/0)</span><br><span class="line">  MIG 1g.5gb Device 2: (UUID: MIG-GPU-61148488-f8ba-c817-b2e8-18f59e2b66b1/9/0)</span><br></pre></td></tr></table></figure><p>Start the <code>Device Plugin</code>:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span 
class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">apps/v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">DaemonSet</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">nvidia-device-plugin-daemonset</span></span><br><span class="line">  <span class="attr">namespace:</span> <span class="string">kube-system</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">selector:</span></span><br><span class="line">    <span class="attr">matchLabels:</span></span><br><span class="line">      <span class="attr">name:</span> <span class="string">nvidia-device-plugin-ds</span></span><br><span class="line">  <span class="attr">updateStrategy:</span></span><br><span class="line">    <span class="attr">type:</span> <span class="string">RollingUpdate</span></span><br><span class="line">  <span class="attr">template:</span></span><br><span class="line">    <span class="attr">metadata:</span></span><br><span class="line">      <span class="comment"># This annotation is deprecated. 
Kept here for backward compatibility</span></span><br><span class="line">      <span class="comment"># See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/</span></span><br><span class="line">      <span class="attr">annotations:</span></span><br><span class="line">        <span class="attr">scheduler.alpha.kubernetes.io/critical-pod:</span> <span class="string">""</span></span><br><span class="line">      <span class="attr">labels:</span></span><br><span class="line">        <span class="attr">name:</span> <span class="string">nvidia-device-plugin-ds</span></span><br><span class="line">    <span class="attr">spec:</span></span><br><span class="line">      <span class="attr">tolerations:</span></span><br><span class="line">      <span class="comment"># This toleration is deprecated. Kept here for backward compatibility</span></span><br><span class="line">      <span class="comment"># See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/</span></span><br><span class="line">      <span class="bullet">-</span> <span class="attr">key:</span> <span class="string">CriticalAddonsOnly</span></span><br><span class="line">        <span class="attr">operator:</span> <span class="string">Exists</span></span><br><span class="line">      <span class="bullet">-</span> <span class="attr">key:</span> <span class="string">nvidia.com/gpu</span></span><br><span class="line">        <span class="attr">operator:</span> <span class="string">Exists</span></span><br><span class="line">        <span class="attr">effect:</span> <span class="string">NoSchedule</span></span><br><span class="line">      <span class="comment"># Mark this pod as a critical add-on; when enabled, the critical add-on</span></span><br><span class="line">      <span class="comment"># scheduler reserves resources for critical add-on pods so that they can</span></span><br><span class="line">      <span class="comment"># be 
rescheduled after a failure.</span></span><br><span class="line">      <span class="comment"># See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/</span></span><br><span class="line">      <span class="attr">priorityClassName:</span> <span class="string">"system-node-critical"</span></span><br><span class="line">      <span class="attr">containers:</span></span><br><span class="line">      <span class="bullet">-</span> <span class="attr">image:</span> <span class="string">nvidia/k8s-device-plugin:v0.7.0</span></span><br><span class="line">        <span class="attr">name:</span> <span class="string">nvidia-device-plugin-ctr</span></span><br><span class="line">        <span class="attr">args:</span> <span class="string">["--fail-on-init-error=false",</span> <span class="string">"--mig-strategy=mixed"</span><span class="string">]</span></span><br><span class="line">        <span class="attr">securityContext:</span></span><br><span class="line">          <span class="attr">allowPrivilegeEscalation:</span> <span class="literal">false</span></span><br><span class="line">          <span class="attr">capabilities:</span></span><br><span class="line">            <span class="attr">drop:</span> <span class="string">["ALL"]</span></span><br><span class="line">        <span class="attr">volumeMounts:</span></span><br><span class="line">          <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">device-plugin</span></span><br><span class="line">            <span class="attr">mountPath:</span> <span class="string">/var/lib/kubelet/device-plugins</span></span><br><span class="line">      <span class="attr">volumes:</span></span><br><span class="line">        <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">device-plugin</span></span><br><span class="line">          <span class="attr">hostPath:</span></span><br><span class="line">            <span 
class="attr">path:</span> <span class="string">/var/lib/kubelet/device-plugins</span></span><br></pre></td></tr></table></figure><p>Once the <code>Device Plugin</code> is running, the Node exposes the MIG <code>resource type</code>s:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line">Capacity:</span><br><span class="line">...</span><br><span class="line">  nvidia.com/gpu:          0</span><br><span class="line">  nvidia.com/mig-1g.5gb:   1</span><br><span class="line">  nvidia.com/mig-2g.10gb:  1</span><br><span class="line">  nvidia.com/mig-3g.20gb:  1</span><br><span class="line">  pods:                    61</span><br><span class="line">Allocatable:</span><br><span class="line">...</span><br><span class="line">  nvidia.com/gpu:          0</span><br><span class="line">  nvidia.com/mig-1g.5gb:   1</span><br><span class="line">  nvidia.com/mig-2g.10gb:  1</span><br><span class="line">  nvidia.com/mig-3g.20gb:  1</span><br><span class="line">  pods:                    61</span><br></pre></td></tr></table></figure><p>Next, start <code>gpu-feature-discovery</code> with the <code>mixed</code> strategy:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span 
class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">apps/v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">DaemonSet</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">gpu-feature-discovery</span></span><br><span class="line">  <span class="attr">labels:</span></span><br><span class="line">    <span class="attr">app.kubernetes.io/name:</span> <span class="string">gpu-feature-discovery</span></span><br><span class="line">    <span class="attr">app.kubernetes.io/version:</span> <span class="number">0.2</span><span class="number">.0</span></span><br><span class="line">    <span class="attr">app.kubernetes.io/part-of:</span> <span class="string">nvidia-gpu</span></span><br><span class="line"><span 
class="attr">spec:</span></span><br><span class="line">  <span class="attr">selector:</span></span><br><span class="line">    <span class="attr">matchLabels:</span></span><br><span class="line">      <span class="attr">app.kubernetes.io/name:</span> <span class="string">gpu-feature-discovery</span></span><br><span class="line">      <span class="attr">app.kubernetes.io/part-of:</span> <span class="string">nvidia-gpu</span></span><br><span class="line">  <span class="attr">template:</span></span><br><span class="line">    <span class="attr">metadata:</span></span><br><span class="line">      <span class="attr">labels:</span></span><br><span class="line">        <span class="attr">app.kubernetes.io/name:</span> <span class="string">gpu-feature-discovery</span></span><br><span class="line">        <span class="attr">app.kubernetes.io/version:</span> <span class="number">0.2</span><span class="number">.0</span></span><br><span class="line">        <span class="attr">app.kubernetes.io/part-of:</span> <span class="string">nvidia-gpu</span></span><br><span class="line">    <span class="attr">spec:</span></span><br><span class="line">      <span class="attr">containers:</span></span><br><span class="line">        <span class="bullet">-</span> <span class="attr">image:</span> <span class="string">nvidia/gpu-feature-discovery:v0.2.0</span></span><br><span class="line">          <span class="attr">name:</span> <span class="string">gpu-feature-discovery</span></span><br><span class="line">          <span class="attr">args:</span> <span class="string">["--mig-strategy=mixed"]</span></span><br><span class="line">          <span class="attr">volumeMounts:</span></span><br><span class="line">            <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">output-dir</span></span><br><span class="line">              <span class="attr">mountPath:</span> <span class="string">"/etc/kubernetes/node-feature-discovery/features.d"</span></span><br><span 
class="line">            <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">dmi-product-name</span></span><br><span class="line">              <span class="attr">mountPath:</span> <span class="string">"/sys/class/dmi/id/product_name"</span></span><br><span class="line">          <span class="attr">env:</span></span><br><span class="line">            <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">NVIDIA_MIG_MONITOR_DEVICES</span></span><br><span class="line">              <span class="attr">value:</span> <span class="string">all</span></span><br><span class="line">          <span class="attr">securityContext:</span></span><br><span class="line">            <span class="attr">privileged:</span> <span class="literal">true</span></span><br><span class="line">      <span class="comment">#nodeSelector:</span></span><br><span class="line">      <span class="comment">#  feature.node.kubernetes.io/pci-10de.present: "true" # NVIDIA vendor ID</span></span><br><span class="line">      <span class="attr">volumes:</span></span><br><span class="line">        <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">output-dir</span></span><br><span class="line">          <span class="attr">hostPath:</span></span><br><span class="line">            <span class="attr">path:</span> <span class="string">"/etc/kubernetes/node-feature-discovery/features.d"</span></span><br><span class="line">        <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">dmi-product-name</span></span><br><span class="line">          <span class="attr">hostPath:</span></span><br><span class="line">            <span class="attr">path:</span> <span class="string">"/sys/class/dmi/id/product_name"</span></span><br></pre></td></tr></table></figure><p>Now check the Node's labels to see whether the MIG-related labels have been applied (in this run the query returned an empty object):</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$ kubectl get node -o json | \</span><br><span class="line">   jq <span class="string">'.items[0].metadata.labels | with_entries(select(.key | startswith("nvidia.com")))'</span></span><br><span class="line">&#123;&#125;</span><br></pre></td></tr></table></figure><p>Start a Pod with <code>kubectl</code>:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">$ kubectl run -it --rm \</span><br><span class="line">   --image=nvidia/cuda:11.0-base \</span><br><span class="line">   --restart=Never \</span><br><span class="line">   --limits=nvidia.com/mig-1g.5gb=1 \</span><br><span class="line">   mig-mixed-example -- nvidia-smi -L</span><br><span class="line">GPU 0: A100-SXM4-40GB (UUID: GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181)</span><br><span class="line">  MIG 1g.5gb Device 0: (UUID: MIG-GPU-ed92375c-b61c-7a27-2611-bc72ad3ea181/9/0)</span><br><span class="line">pod <span class="string">"mig-mixed-example"</span> deleted</span><br></pre></td></tr></table></figure><h4 id="当前TKE的问题"><a href="#当前TKE的问题" class="headerlink" title="当前TKE的问题"></a>Current issues on TKE</h4><ul><li>The driver and the NVIDIA Container Toolkit are both old and need to be updated</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span 
class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Driver information as seen on a TKE GPU Node</span></span><br><span class="line">$ nvidia-smi -a</span><br><span class="line">Driver Version                      : 418.67</span><br><span class="line">CUDA Version                        : 10.1</span><br><span class="line"></span><br><span class="line"><span class="comment"># NVIDIA Container Toolkit versions</span></span><br><span class="line">nvidia-container-runtime-3.1.0-1</span><br><span class="line">nvidia-container-toolkit-1.0.1-2</span><br><span class="line">libnvidia-container-tools-1.0.2-1</span><br><span class="line">libnvidia-container1-1.0.2-1</span><br><span class="line"></span><br><span class="line"><span class="comment"># The NVIDIA Device Plugin version is old</span></span><br><span class="line">nvidia/k8s-device-plugin:1.10</span><br><span class="line"></span><br><span class="line"><span class="comment"># NVIDIA k8s-device-plugin v0.7.0 or later should be used instead</span></span><br></pre></td></tr></table></figure><ul><li>When MIG is used inside a VM, enabling MIG requires rebooting the VM</li></ul><blockquote><p> If you are using MIG inside a VM with GPU passthrough, then you may <strong>need to reboot the VM</strong> to allow the GPU to be in MIG mode as in some cases, GPU reset is not allowed via the hypervisor for security reasons. 
This can be seen in the following example:</p></blockquote><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">$ sudo nvidia-smi -i 0 -mig 1</span><br><span class="line">Warning: MIG mode is <span class="keyword">in</span> pending <span class="built_in">enable</span> state <span class="keyword">for</span> GPU 00000000:00:03.0:Not Supported</span><br><span class="line">Reboot the system or try nvidia-smi --gpu-reset to make MIG mode effective on GPU 00000000:00:03.0</span><br><span class="line">All <span class="keyword">done</span>.</span><br><span class="line"></span><br><span class="line">$ sudo nvidia-smi --gpu-reset</span><br><span class="line">Resetting GPU 00000000:00:03.0 is not supported.</span><br></pre></td></tr></table></figure><h2 id="划分MIG后的性能对比"><a href="#划分MIG后的性能对比" class="headerlink" title="划分MIG后的性能对比"></a>Performance comparison after MIG partitioning</h2><h3 id="整块卡的性能"><a href="#整块卡的性能" class="headerlink" title="整块卡的性能"></a>Whole-card performance</h3><div class="table-container"><table><thead><tr><th style="text-align:left">Test item</th><th style="text-align:left">Measured performance</th><th style="text-align:left">Official spec</th></tr></thead><tbody><tr><td style="text-align:left">FP32MAD</td><td style="text-align:left">19.436 TF</td><td style="text-align:left">19.5 TF</td></tr><tr><td style="text-align:left">FP64MAD</td><td style="text-align:left">9.690 TF</td><td style="text-align:left">9.7 TF</td></tr><tr><td style="text-align:left">INT32MAD</td><td style="text-align:left">19.446 TF</td><td style="text-align:left">-</td></tr><tr><td style="text-align:left">INT32ADD</td><td style="text-align:left">18.906 TF</td><td style="text-align:left">-</td></tr><tr><td style="text-align:left">FP32GEMMTensor (matrix size meets the optimal-performance requirement)</td><td 
style="text-align:left">158.426 TF</td><td style="text-align:left">156 TF</td></tr><tr><td style="text-align:left">FP32GEMMTensor (matrix size does not meet the optimal-performance requirement)</td><td style="text-align:left">68.054 TF</td><td style="text-align:left">-</td></tr><tr><td style="text-align:left">FP32GEMM</td><td style="text-align:left">19.047 TF</td><td style="text-align:left">-</td></tr></tbody></table></div><p><strong>Notes</strong></p><ol><li>GEMM must be recompiled under CUDA 11.0 to reach the figures above</li><li>GEMM size parameters that meet the optimal-performance requirement: 9000 * 6000 * 6000</li><li>GEMM size parameters that do not meet the optimal-performance requirement: 8997 * 5998 * 5998</li></ol><h3 id="MIG卡的性能"><a href="#MIG卡的性能" class="headerlink" title="MIG卡的性能"></a>MIG instance performance</h3><p>To measure the performance of individual CIs and GIs, the 3g.20gb GI is further split into two CIs, 2c.3g.20gb and 1c.3g.20gb. The other two GIs are not subdivided; a single CI is created directly on each of them.</p><p>The GPU is thus divided into four CIs:</p><ul><li>MIG 1c.3g.20gb</li><li>MIG 2c.3g.20gb</li><li>MIG 2g.10gb</li><li>MIG 1g.5gb</li></ul><h4 id="各CI串行执行"><a href="#各CI串行执行" class="headerlink" title="各CI串行执行"></a>Running the CIs serially</h4><p><strong>MIG 1c.3g.20gb</strong></p><div class="table-container"><table><thead><tr><th style="text-align:left">Test item</th><th style="text-align:left">Measured performance (OPS)</th></tr></thead><tbody><tr><td style="text-align:left">FP32MAD</td><td style="text-align:left">2.523 T</td></tr><tr><td style="text-align:left">FP64MAD</td><td style="text-align:left">1.261 T</td></tr><tr><td style="text-align:left">INT32MAD</td><td style="text-align:left">2.524 T</td></tr><tr><td style="text-align:left">INT32ADD</td><td style="text-align:left">2.455 T</td></tr><tr><td style="text-align:left">FP32GEMMTensor (matrix size meets the optimal-performance requirement)</td><td style="text-align:left">23.081 T</td></tr><tr><td style="text-align:left">FP32GEMMTensor (matrix size does not meet the optimal-performance requirement)</td><td style="text-align:left">8.940 T</td></tr><tr><td style="text-align:left">FP32GEMM</td><td style="text-align:left">2.476 T</td></tr></tbody></table></div><p><strong>MIG 2c.3g.20gb</strong></p><div class="table-container"><table><thead><tr><th style="text-align:left">Test item</th><th 
style="text-align:left">Measured performance (OPS)</th></tr></thead><tbody><tr><td style="text-align:left">FP32MAD</td><td style="text-align:left">5.046 T</td></tr><tr><td style="text-align:left">FP64MAD</td><td style="text-align:left">2.521 T</td></tr><tr><td style="text-align:left">INT32MAD</td><td style="text-align:left">5.049 T</td></tr><tr><td style="text-align:left">INT32ADD</td><td style="text-align:left">4.908 T</td></tr><tr><td style="text-align:left">FP32GEMMTensor (matrix size meets the optimal-performance requirement)</td><td style="text-align:left">44.941 T</td></tr><tr><td style="text-align:left">FP32GEMMTensor (matrix size does not meet the optimal-performance requirement)</td><td style="text-align:left">18.920 T</td></tr><tr><td style="text-align:left">FP32GEMM</td><td style="text-align:left">4.909 T</td></tr></tbody></table></div><p><strong>MIG 2g.10gb</strong></p><div class="table-container"><table><thead><tr><th style="text-align:left">Test item</th><th style="text-align:left">Measured performance (OPS)</th></tr></thead><tbody><tr><td style="text-align:left">FP32MAD</td><td style="text-align:left">5.046 T</td></tr><tr><td style="text-align:left">FP64MAD</td><td style="text-align:left">2.521 T</td></tr><tr><td style="text-align:left">INT32MAD</td><td style="text-align:left">5.049 T</td></tr><tr><td style="text-align:left">INT32ADD</td><td style="text-align:left">4.908 T</td></tr><tr><td style="text-align:left">FP32GEMMTensor (matrix size meets the optimal-performance requirement)</td><td style="text-align:left">40.151 T</td></tr><tr><td style="text-align:left">FP32GEMMTensor (matrix size does not meet the optimal-performance requirement)</td><td style="text-align:left">17.514 T</td></tr><tr><td style="text-align:left">FP32GEMM</td><td style="text-align:left">4.909 T</td></tr></tbody></table></div><p><strong>MIG 1g.5gb</strong></p><div class="table-container"><table><thead><tr><th style="text-align:left">Test item</th><th style="text-align:left">Measured performance (OPS)</th></tr></thead><tbody><tr><td style="text-align:left">FP32MAD</td><td style="text-align:left">2.523 T</td></tr><tr><td style="text-align:left">FP64MAD</td><td style="text-align:left">1.261 T</td></tr><tr><td 
style="text-align:left">INT32MAD</td><td style="text-align:left">2.524 T</td></tr><tr><td style="text-align:left">INT32ADD</td><td style="text-align:left">2.454 T</td></tr><tr><td style="text-align:left">FP32GEMMTensor (matrix size meets the optimal-performance requirement)</td><td style="text-align:left">16.453 T</td></tr><tr><td style="text-align:left">FP32GEMMTensor (matrix size does not meet the optimal-performance requirement)</td><td style="text-align:left">8.261 T</td></tr><tr><td style="text-align:left">FP32GEMM</td><td style="text-align:left">2.476 T</td></tr></tbody></table></div><p><strong>Notes</strong></p><p>When the FP32MAD task is run serially, the monitored utilization holds at about 14.3% (roughly 1/7) during the 1c.3g.20gb and 1g.5gb tests, and at about 28.6% (roughly 2/7) during the 2c.3g.20gb and 2g.10gb tests.</p><h4 id="各CI并行执行"><a href="#各CI并行执行" class="headerlink" title="各CI并行执行"></a>Running the CIs in parallel</h4><p>All CIs running FP32MAD</p><div class="table-container"><table><thead><tr><th style="text-align:left">CI</th><th style="text-align:left">Measured performance (OPS)</th></tr></thead><tbody><tr><td style="text-align:left">1c.3g.20gb</td><td style="text-align:left">2.523 T</td></tr><tr><td style="text-align:left">2c.3g.20gb</td><td style="text-align:left">5.044 T</td></tr><tr><td style="text-align:left">2g.10gb</td><td style="text-align:left">5.046 T</td></tr><tr><td style="text-align:left">1g.5gb</td><td style="text-align:left">2.523 T</td></tr></tbody></table></div><p>All CIs running FP32GEMMTensor</p><div class="table-container"><table><thead><tr><th style="text-align:left">CI</th><th style="text-align:left">Measured performance (OPS)</th></tr></thead><tbody><tr><td style="text-align:left">1c.3g.20gb</td><td style="text-align:left">20.450 T</td></tr><tr><td style="text-align:left">2c.3g.20gb</td><td style="text-align:left">41.194 T</td></tr><tr><td style="text-align:left">2g.10gb</td><td style="text-align:left">39.773 T</td></tr><tr><td style="text-align:left">1g.5gb</td><td style="text-align:left">16.336 T</td></tr></tbody></table></div><p><strong>Notes</strong></p><p>When FP32MAD is run in parallel on all CIs, the three monitored metrics SmActivity, SmOccupancy, and FP32Activity hold at about 85.7% (roughly 6/7).</p><p>Each CI running a different type of computation</p><div class="table-container"><table><thead><tr><th 
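style="text-align:left">Test item</th><th style="text-align:left">Measured performance (OPS)</th></tr></thead><tbody><tr><td style="text-align:left">1c.3g.20gb FP32MAD</td><td style="text-align:left">2.523 T</td></tr><tr><td style="text-align:left">2c.3g.20gb FP64MAD</td><td style="text-align:left">2.521 T</td></tr><tr><td style="text-align:left">2g.10gb INT32MAD</td><td style="text-align:left">5.048 T</td></tr><tr><td style="text-align:left">1g.5gb INT32ADD</td><td style="text-align:left">2.454 T</td></tr></tbody></table></div><p><strong>Remarks</strong></p><p>When different compute tasks run in parallel, SmActivity, SmOccupancy, FP64Activity, and FP32Activity are 85.7%, 78.1%, 28.5%, and 57.0% respectively.</p><p>The test results confirm that CI and GI isolation is effective. The specific conclusions are:</p><ol><li>Comparing the performance data for tasks run serially on each MIG instance with the data for tasks run in parallel confirms that CI and GI isolation is effective.</li><li>Partitioning into CIs and GIs costs some performance: the performance measured on 1g.5gb is not equal to 1/7 of the whole card, and from the whole-card perspective there is a loss of about 10%. The likely reason is that an A100 card has 108 SMs in total, but once it is split into 7 MIG instances each instance gets only 14 SMs. That makes 14 * 7 = 98 usable SMs, so 10 SMs (about 9% of 108) cannot be used, and these 10 wasted SMs are the source of the performance loss.</li><li>Comparing 2c.3g.20gb with 2g.10gb shows that Tensor performance on 2c.3g.20gb (a CI soft-isolated within one GI) is somewhat better than on 2g.10gb (a fully isolated GI). Comparing this with the run where all CIs execute FP32GemmTensor at the same time shows that when the CIs on the same GI run GEMM simultaneously (which, unlike FP32MAD, involves a certain amount of GPU memory reads and writes), the performance of the two CIs drops relative to running alone and gets closer to that of a standalone GI. In other words, 2c.3g.20gb outperforms 2g.10gb because CI isolation is incomplete.</li></ol><h3 id="A100卡可以为后续工作带来的价值"><a href="#A100卡可以为后续工作带来的价值" class="headerlink" title="A100卡可以为后续工作带来的价值"></a>What the A100 can bring to follow-up work</h3><ol><li>Each MIG instance is fully isolated, which supports a variety of virtualization scenarios, including virtual machines and containers.</li><li>The base compute capability of the smallest instance is roughly one third of a T4 and one half of a P4, which is moderate. Memory bandwidth is 1,555 GB/s, compared with 192 GB/s on the P4 and 320+ GB/s on the T4, so bandwidth is ample and will not become the bottleneck.</li><li>It becomes possible to use the same kind of GPU for both offline computation and online inference, unifying the offline and online GPU resource pools.<ul><li>T4 specifications: 16 GB memory, 40 SMs, 8 Tensor Cores/SM, 64 INT32 cores/SM, 64 FP32 cores/SM</li><li>A100 specifications: 40 GB memory, 108 SMs, 4 third-generation Tensor Cores/SM, 64 FP32 CUDA cores/SM</li></ul></li></ol><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>References</h2><ul><li><a href="https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html" target="_blank" rel="external nofollow noopener noreferrer">NVIDIA Multi-Instance GPU User Guide</a></li><li><a 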
href="https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf" target="_blank" rel="external nofollow noopener noreferrer">NVIDIA Ampere Architecture WhitePaper</a></li><li><a href="https://docs.google.com/document/d/1Dxx5MwG_GiBeKOuMNwv4QbO8OqA7XFdzn7fzzI7AQDg" target="_blank" rel="external nofollow noopener noreferrer">Design Document: Challenges Supporting MIG in Kubernetes</a></li><li><a href="https://docs.nvidia.com/datacenter/cloud-native/kubernetes/mig-k8s.html" target="_blank" rel="external nofollow noopener noreferrer">User Guide: MIG Support in Kubernetes</a></li><li><a href="https://github.com/NVIDIA/k8s-device-plugin/issues/180" target="_blank" rel="external nofollow noopener noreferrer">Github Issue: k8s device plugin Supporting MIG</a></li><li><a href="https://docs.google.com/document/d/1mdgMQ8g7WmaI_XVVRrCvHPFPOMCm5LQD5JefgAh6N8g" target="_blank" rel="external nofollow noopener noreferrer">PoC: Supporting MIG in Kubernetes</a></li><li><a href="https://docs.google.com/document/d/1bshSIcWNYRZGfywgwRHa07C0qRyOYKxWYxClbeJM-WM" target="_blank" rel="external nofollow noopener noreferrer">Steps to Enable MIG Support in Kubernetes</a></li><li><a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html" target="_blank" rel="external nofollow noopener noreferrer">Install NVIDIA Container Toolkit Guide</a></li><li><a href="https://developer.nvidia.com/zh-cn/blog/nvidia-ampere-architecture-in-depth/" target="_blank" rel="external nofollow noopener noreferrer">深度了解 NVIDIA Ampere 架构</a></li><li><a href="https://blog.csdn.net/han2529386161/article/details/106411138" target="_blank" rel="external nofollow noopener noreferrer">NVIDIA GPU A100 Ampere 架构深度解析</a></li><li><a href="https://help.didiyun.com/hc/kb/article/1414838/" target="_blank" rel="external nofollow noopener noreferrer">https://help.didiyun.com/hc/kb/article/1414838/</a></li></ul>]]></content>
    
    <summary type="html">
    
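      &lt;p&gt;MIG, short for &lt;code&gt;Multi-Instance GPU&lt;/code&gt;, is a new feature of the &lt;code&gt;NVIDIA A100 GPU&lt;/code&gt;, which is built on the latest Ampere architecture that NVIDIA announced at &lt;code&gt;NVIDIA GTC 2020&lt;/code&gt;. When running in MIG mode, an A100 can be partitioned into up to 7 instances, helping providers raise the utilization of their GPU servers without additional investment. MIG gives multiple users a new way to share isolated GPU resources and improve GPU utilization. It is especially suited to the multi-tenant scenarios of cloud service providers, because it ensures that one tenant's workload does not interfere with another's. This article introduces MIG's features and usage, as well as approaches to using MIG in containers and k8s. &lt;/p&gt;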
    
    </summary>
    
    <content src="https://houmin.cc/https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-12-16_gpu-mig-overview.jpg" type="image" />
    
    
      <category term="术业专攻" scheme="https://houmin.cc/categories/%E6%9C%AF%E4%B8%9A%E4%B8%93%E6%94%BB/"/>
    
    
      <category term="虚拟化" scheme="https://houmin.cc/tags/%E8%99%9A%E6%8B%9F%E5%8C%96/"/>
    
      <category term="GPU" scheme="https://houmin.cc/tags/GPU/"/>
    
      <category term="MIG" scheme="https://houmin.cc/tags/MIG/"/>
    
  </entry>
  
  <entry>
    <title>[Heterogeneous Computing] GPU Sharing</title>
    <link href="https://houmin.cc/posts/cf391335/"/>
    <id>https://houmin.cc/posts/cf391335/</id>
    <published>2020-11-18T03:11:07.000Z</published>
    <updated>2022-11-09T15:13:45.391Z</updated>
    
    <content type="html"><![CDATA[<link rel="stylesheet" class="aplayer-secondary-style-marker" href="/assets/css/APlayer.min.css"><script src="/assets/js/APlayer.min.js" class="aplayer-secondary-script-marker"></script><script class="meting-secondary-script-marker" src="/assets/js/Meting.min.js"></script><p>Stock k8s supports GPUs in containers through the <code>Device Plugin</code> and <code>Extended Resource</code> mechanisms, but only for exclusive use: a GPU cannot be shared between Pods, which greatly reduces GPU utilization in the cluster. To share GPUs at the cluster level, we need both isolation and scheduling of GPU resources. This article looks in turn at Alibaba's <a href="https://github.com/AliyunContainerService/gpushare-scheduler-extender" target="_blank" rel="external nofollow noopener noreferrer">GPUShare</a> and Tencent's <a href="https://github.com/tkestack/gpu-manager" target="_blank" rel="external nofollow noopener noreferrer">GPUManager</a> and analyzes how they are implemented.</p><a id="more"></a><h2 id="阿里GPUShare"><a href="#阿里GPUShare" class="headerlink" title="阿里GPUShare"></a>Alibaba GPUShare</h2><p>Alibaba's <a href="https://github.com/AliyunContainerService/gpushare-scheduler-extender" target="_blank" rel="external nofollow noopener noreferrer">GPUShare</a> is built on <a href="https://github.com/NVIDIA/nvidia-docker" target="_blank" rel="external nofollow noopener noreferrer">Nvidia Docker2</a> and their <a href="https://docs.google.com/document/d/1ZgKH_K4SEfdiE_OfxQ836s4yQWxZfSjS288Tq9YIWCA/edit#heading=h.r88v2xgacqr" target="_blank" rel="external nofollow noopener noreferrer">gpu sharing design</a>. To use it, first configure the Docker Runtime on the Node and install <code>NVIDIA Docker 2</code>; see <a href="../574111db">Using GPUs in Docker</a> for the details.</p><h3 id="架构设计"><a href="#架构设计" class="headerlink" title="架构设计"></a>Architecture</h3><h4 id="假设条件"><a href="#假设条件" class="headerlink" title="假设条件"></a>Assumptions</h4><ul><li>Although a GPU's capability can be measured along two dimensions, CUDA Cores and GPU Memory, <strong>in inference scenarios we can assume that the number of CUDA cores is proportional to the amount of GPU Memory</strong></li><li>In model development and inference scenarios, <strong>a user requests no more than one GPU; that is, the resource limit is one GPU</strong></li><li>All cards on a Node have the same amount of GPU Memory, so <code>gpuTotalMemory</code> and <code>gpuTotalCount</code> are enough to work out each card's GPU 
Memory</li></ul><h4 id="设计原则"><a href="#设计原则" class="headerlink" title="设计原则"></a>Design principles</h4><ul><li><p>The design defines two <code>Extended Resource</code>s:</p><ul><li><code>aliyun.com/gpu-mem</code>: the unit changes from <code>number of GPUs</code> to <code>amount of GPU memory in MiB</code>; if a Node has several GPU devices, this is the total GPU Memory across all of them</li><li><code>aliyun.com/gpu-count</code>: the number of GPU devices on the Node</li></ul></li><li>It is implemented on top of the native k8s Scheduler Extender, Extended Resource, and DevicePlugin mechanisms</li><li>This solution only implements GPU sharing; it does not isolate compute or GPU memory. If isolation is required, it can be combined with <a href="https://www.alibabacloud.com/help/zh/doc-detail/163994.htm" target="_blank" rel="external nofollow noopener noreferrer">cGPU</a> on Alibaba Cloud</li></ul><h4 id="核心组件"><a href="#核心组件" class="headerlink" title="核心组件"></a>Core components</h4><p>The figure below shows the core components of the design:</p><ul><li>GPU Share Scheduler Extender: built on the k8s scheduler extender mechanism, it takes part in the <code>Filter</code> and <code>Bind</code> phases of scheduling. It decides whether a single GPU device on a Node can provide enough GPU Memory, and records the GPU allocation result in an Annotation on the Pod Spec</li><li>GPU Share Device Plugin: built on the k8s device plugin mechanism, it performs the Allocation of the GPU device according to the Annotation that GPU Share Scheduler Extender recorded on the Pod Spec.</li></ul><p><img alt="GPU Share Design" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-21_aliyun-gpu-share.jpg"></p><h3 id="具体过程"><a href="#具体过程" class="headerlink" title="具体过程"></a>Detailed workflow</h3><h4 id="设备资源报告"><a href="#设备资源报告" class="headerlink" title="设备资源报告"></a>Device resource reporting</h4><p><code>GPU Share Device Plugin</code> uses the <code>nvml</code> library to query the number of GPU devices on each Node and the GPU Memory of each device.</p><p>These resources are reported to the Kubelet through <code>ListAndWatch()</code>, and the kubelet in turn reports them to the APIServer. At this point <code>kubectl get node</code> shows the corresponding <code>Extended Resource</code> fields under <code>status</code>:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span 
class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Node</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="number">10.0</span><span class="number">.0</span><span class="number">.4</span></span><br><span class="line">  <span class="attr">labels:</span></span><br><span class="line">    <span class="attr">gpushare:</span> <span class="string">"true"</span></span><br><span class="line">    <span class="string">...</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">podCIDR:</span> <span class="number">172.16</span><span class="number">.1</span><span class="number">.0</span><span class="string">/26</span></span><br><span class="line">  <span class="attr">podCIDRs:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="number">172.16</span><span class="number">.1</span><span class="number">.0</span><span class="string">/26</span></span><br><span class="line">  <span class="attr">providerID:</span> <span class="string">qcloud:///800002/ins-hsmsc4x9</span></span><br><span 
class="line"><span class="attr">status:</span></span><br><span class="line">  <span class="string">...</span></span><br><span class="line">  <span class="attr">allocatable:</span></span><br><span class="line">    <span class="attr">aliyun.com/gpu-count:</span> <span class="string">"1"</span></span><br><span class="line">    <span class="attr">aliyun.com/gpu-mem:</span> <span class="string">"22"</span></span><br><span class="line">    <span class="attr">cpu:</span> <span class="string">5926m</span></span><br><span class="line">    <span class="attr">ephemeral-storage:</span> <span class="string">"47438316671"</span></span><br><span class="line">    <span class="attr">hugepages-2Mi:</span> <span class="string">"0"</span></span><br><span class="line">    <span class="attr">memory:</span> <span class="string">54222084Ki</span></span><br><span class="line">    <span class="string">...</span></span><br><span class="line">  <span class="attr">capacity:</span></span><br><span class="line">    <span class="attr">aliyun.com/gpu-count:</span> <span class="string">"1"</span></span><br><span class="line">    <span class="attr">aliyun.com/gpu-mem:</span> <span class="string">"22"</span></span><br><span class="line">    <span class="attr">cpu:</span> <span class="string">"6"</span></span><br><span class="line">    <span class="attr">ephemeral-storage:</span> <span class="string">51473868Ki</span></span><br><span class="line">    <span class="attr">hugepages-2Mi:</span> <span class="string">"0"</span></span><br><span class="line">    <span class="attr">memory:</span> <span class="string">57448708Ki</span></span><br><span class="line">    <span class="string">...</span></span><br></pre></td></tr></table></figure><h4 id="调度插件扩展"><a href="#调度插件扩展" class="headerlink" title="调度插件扩展"></a>Scheduler extension</h4><p>When requesting a GPU, the user only fills in <code>gpu-mem</code> as the Extended Resource. The following deploys a single-node TensorFlow:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 
class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">apps/v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Deployment</span></span><br><span class="line"><span class="attr">metadata:</span> </span><br><span class="line">  <span class="attr">name:</span> <span class="string">tensorflow</span></span><br><span class="line">  <span class="attr">labels:</span></span><br><span class="line">    <span class="attr">k8s-app:</span> <span class="string">tensorflow</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">replicas:</span> <span 
class="number">1</span></span><br><span class="line">  <span class="attr">selector:</span></span><br><span class="line">    <span class="attr">matchLabels:</span></span><br><span class="line">      <span class="attr">k8s-app:</span> <span class="string">tensorflow</span></span><br><span class="line">  <span class="attr">template:</span></span><br><span class="line">    <span class="attr">metadata:</span></span><br><span class="line">      <span class="attr">labels:</span></span><br><span class="line">        <span class="attr">k8s-app:</span> <span class="string">tensorflow</span></span><br><span class="line">    <span class="attr">spec:</span></span><br><span class="line">      <span class="attr">containers:</span></span><br><span class="line">      <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">tensorflow</span></span><br><span class="line">        <span class="attr">image:</span> <span class="string">tensorflow/tensorflow:2.2.1-gpu-py3-jupyter</span></span><br><span class="line">        <span class="attr">ports:</span></span><br><span class="line">        <span class="bullet">-</span> <span class="attr">containerPort:</span> <span class="number">8888</span></span><br><span class="line">        <span class="attr">resources:</span></span><br><span class="line">          <span class="attr">limits:</span></span><br><span class="line">            <span class="attr">cpu:</span> <span class="number">4</span></span><br><span class="line">            <span class="attr">memory:</span> <span class="string">2Gi</span></span><br><span class="line">            <span class="attr">aliyun.com/gpu-mem:</span> <span class="number">3</span></span><br><span class="line">          <span class="attr">requests:</span></span><br><span class="line">            <span class="attr">cpu:</span> <span class="number">2</span></span><br><span class="line">            <span class="attr">memory:</span> <span class="string">1Gi</span></span><br><span 
class="line"><span class="meta">---</span></span><br><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Service</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">    <span class="attr">name:</span> <span class="string">jupyter-service</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">type:</span> <span class="string">NodePort</span></span><br><span class="line">  <span class="attr">ports:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">port:</span> <span class="number">80</span></span><br><span class="line">    <span class="attr">targetPort:</span> <span class="number">8888</span></span><br><span class="line">    <span class="attr">name:</span> <span class="string">tensorflow</span></span><br><span class="line">  <span class="attr">selector:</span></span><br><span class="line">    <span class="attr">k8s-app:</span> <span class="string">tensorflow</span></span><br></pre></td></tr></table></figure><h5 id="Filter"><a href="#Filter" class="headerlink" title="Filter"></a>Filter</h5><p>After kube-scheduler has run all of its own Filter functions, it calls the Filter function of <code>GPU Share Extender</code>. In its native filtering, kube-scheduler checks whether enough of the Extended Resource is available (counting the total GPU Memory), but it cannot tell whether any single GPU device has enough. That is the job of the scheduler extender. Take the figure below as an example:</p><ul><li>The user requests 8138 MiB of GPU Memory. For the native scheduler, node N1 has only 16276 * 2 - 16276 - 12207 = 4069 MiB of GPU resources left, which does not satisfy the Extended Resource requirement, so N1 is filtered out</li><li>Nodes N2 and N3 each have 8138 MiB left in total, so which one should be chosen?</li><li>In its own filtering, <code>GPU Share Extender</code> needs to find a single GPU that can satisfy the request. When it checks N2, it finds that although 8138 MiB of GPU Memory is free in total, each GPU device has only 4069 MiB left, which cannot satisfy a single-device request of 8138 MiB, so N2 is filtered out</li><li>When it scans N3, it finds that GPU0 can satisfy the 8138 MiB request, so N3 qualifies</li></ul><p><img alt 
data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-21_aliyun-gpu-share-filter.jpg"></p><blockquote><p><strong>This raises a question: when a Node has several cards, how does the Scheduler Extender know the currently available Capacity of each card?</strong></p></blockquote><p>Let's look at the function the Extender runs in the Filter phase. For the Pod to be created, the Node checks all of its available GPUs; as soon as one GPU has at least as much free memory as the Pod requests, the Node is schedulable.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// check if the pod can be allocated on the node</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(n *NodeInfo)</span> <span class="title">Assume</span><span class="params">(pod *v1.Pod)</span> <span class="params">(allocatable <span class="keyword">bool</span>)</span></span> &#123;</span><br><span class="line">allocatable = <span class="literal">false</span></span><br><span class="line"></span><br><span class="line">n.rwmu.RLock()</span><br><span class="line"><span class="keyword">defer</span> n.rwmu.RUnlock()</span><br><span class="line"></span><br><span class="line">availableGPUs := n.getAvailableGPUs()</span><br><span class="line">reqGPU := <span 
class="keyword">uint</span>(utils.GetGPUMemoryFromPodResource(pod))</span><br><span class="line">log.Printf(<span class="string">"debug: AvailableGPUs: %v in node %s"</span>, availableGPUs, n.name)</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> <span class="built_in">len</span>(availableGPUs) &gt; <span class="number">0</span> &#123;</span><br><span class="line"><span class="keyword">for</span> devID := <span class="number">0</span>; devID &lt; <span class="built_in">len</span>(n.devs); devID++ &#123;</span><br><span class="line">availableGPU, ok := availableGPUs[devID]</span><br><span class="line"><span class="keyword">if</span> ok &#123;</span><br><span class="line"><span class="keyword">if</span> availableGPU &gt;= reqGPU &#123;</span><br><span class="line">allocatable = <span class="literal">true</span></span><br><span class="line"><span class="keyword">break</span></span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> allocatable</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>The next question is how the available GPU memory on each Node is obtained. Let's step into <code>getAvailableGPUs</code>:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span 
class="line">19</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(n *NodeInfo)</span> <span class="title">getAvailableGPUs</span><span class="params">()</span> <span class="params">(availableGPUs <span class="keyword">map</span>[<span class="keyword">int</span>]<span class="keyword">uint</span>)</span></span> &#123;</span><br><span class="line">allGPUs := n.getAllGPUs()</span><br><span class="line">usedGPUs := n.getUsedGPUs()</span><br><span class="line">unhealthyGPUs := n.getUnhealthyGPUs()</span><br><span class="line">availableGPUs = <span class="keyword">map</span>[<span class="keyword">int</span>]<span class="keyword">uint</span>&#123;&#125;</span><br><span class="line"><span class="keyword">for</span> id, totalGPUMem := <span class="keyword">range</span> allGPUs &#123;</span><br><span class="line"><span class="keyword">if</span> usedGPUMem, found := usedGPUs[id]; found &#123;</span><br><span class="line">availableGPUs[id] = totalGPUMem - usedGPUMem</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line">log.Printf(<span class="string">"info: available GPU list %v before removing unhealty GPUs"</span>, availableGPUs)</span><br><span class="line"><span class="keyword">for</span> id, _ := <span class="keyword">range</span> unhealthyGPUs &#123;</span><br><span class="line">log.Printf(<span class="string">"info: delete dev %d from availble GPU list"</span>, id)</span><br><span class="line"><span class="built_in">delete</span>(availableGPUs, id)</span><br><span class="line">&#125;</span><br><span class="line">log.Printf(<span class="string">"info: available GPU list %v after removing unhealty GPUs"</span>, availableGPUs)</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> availableGPUs</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Here we can see that the <code>Scheduler 
Extender</code> internally keeps track of both the total GPU memory and the GPU memory already in use for every GPU on the current Node:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// device index: gpu memory</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(n *NodeInfo)</span> <span class="title">getUsedGPUs</span><span class="params">()</span> <span class="params">(usedGPUs <span class="keyword">map</span>[<span class="keyword">int</span>]<span class="keyword">uint</span>)</span></span> &#123;</span><br><span class="line">usedGPUs = <span class="keyword">map</span>[<span class="keyword">int</span>]<span class="keyword">uint</span>&#123;&#125;</span><br><span class="line"><span class="keyword">for</span> _, dev := <span class="keyword">range</span> n.devs &#123;</span><br><span class="line">usedGPUs[dev.idx] = dev.GetUsedGPUMemory()</span><br><span class="line">&#125;</span><br><span class="line">log.Printf(<span class="string">"info: getUsedGPUs: %v in node %s, and devs %v"</span>, usedGPUs, n.name, n.devs)</span><br><span class="line"><span class="keyword">return</span> usedGPUs</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// device index: gpu memory</span></span><br><span class="line"><span class="function"><span 
class="keyword">func</span> <span class="params">(n *NodeInfo)</span> <span class="title">getAllGPUs</span><span class="params">()</span> <span class="params">(allGPUs <span class="keyword">map</span>[<span class="keyword">int</span>]<span class="keyword">uint</span>)</span></span> &#123;</span><br><span class="line">allGPUs = <span class="keyword">map</span>[<span class="keyword">int</span>]<span class="keyword">uint</span>&#123;&#125;</span><br><span class="line"><span class="keyword">for</span> _, dev := <span class="keyword">range</span> n.devs &#123;</span><br><span class="line">allGPUs[dev.idx] = dev.totalGPUMem</span><br><span class="line">&#125;</span><br><span class="line">log.Printf(<span class="string">"info: getAllGPUs: %v in node %s, and dev %v"</span>, allGPUs, n.name, n.devs)</span><br><span class="line"><span class="keyword">return</span> allGPUs</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>As for <code>GetUsedGPUMemory</code>, its value comes from the <code>DeviceInfo</code> that the <code>Scheduler Extender</code> maintains internally. Each time the Extender performs a <code>Bind</code>, the Pod is added to <code>d.podMap</code> in the <code>DeviceInfo</code> of the corresponding Node:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(d *DeviceInfo)</span> <span class="title">GetUsedGPUMemory</span><span class="params">()</span> <span class="params">(gpuMem <span class="keyword">uint</span>)</span></span> 
&#123;</span><br><span class="line">log.Printf(<span class="string">"debug: GetUsedGPUMemory() podMap %v, and its address is %p"</span>, d.podMap, d)</span><br><span class="line">d.rwmu.RLock()</span><br><span class="line"><span class="keyword">defer</span> d.rwmu.RUnlock()</span><br><span class="line"><span class="keyword">for</span> _, pod := <span class="keyword">range</span> d.podMap &#123;</span><br><span class="line"><span class="keyword">if</span> pod.Status.Phase == v1.PodSucceeded || pod.Status.Phase == v1.PodFailed &#123;</span><br><span class="line">log.Printf(<span class="string">"debug: skip the pod %s in ns %s due to its status is %s"</span>, pod.Name, pod.Namespace, pod.Status.Phase)</span><br><span class="line"><span class="keyword">continue</span></span><br><span class="line">&#125;</span><br><span class="line"><span class="comment">// gpuMem += utils.GetGPUMemoryFromPodEnv(pod)</span></span><br><span class="line">gpuMem += utils.GetGPUMemoryFromPodAnnotation(pod)</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">return</span> gpuMem</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>To sum up, the <code>Scheduler Extender</code> essentially maintains a data structure called <code>devs</code>, which lets it know the memory state of every GPU device on the current Node.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// NodeInfo is node level aggregated information.</span></span><br><span class="line"><span class="keyword">type</span> NodeInfo <span class="keyword">struct</span> &#123;</span><br><span class="line">name           <span class="keyword">string</span></span><br><span class="line">node     
      *v1.Node</span><br><span class="line">devs           <span class="keyword">map</span>[<span class="keyword">int</span>]*DeviceInfo</span><br><span class="line">gpuCount       <span class="keyword">int</span></span><br><span class="line">gpuTotalMemory <span class="keyword">int</span></span><br><span class="line">rwmu           *sync.RWMutex</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>This raises another question: through the ApiServer we can only learn the Node's <code>gpuCount</code> and <code>gpuTotalMemory</code>, not the memory of each individual card. So how is <code>devs</code> initialized with per-card memory information? Let's keep reading the code:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// Create Node Level</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">NewNodeInfo</span><span class="params">(node *v1.Node)</span> *<span class="title">NodeInfo</span></span> &#123;</span><br><span class="line">log.Printf(<span class="string">"debug: NewNodeInfo() creates nodeInfo for %s"</span>, node.Name)</span><br><span class="line"></span><br><span class="line">devMap := <span class="keyword">map</span>[<span 
class="keyword">int</span>]*DeviceInfo&#123;&#125;</span><br><span class="line"><span class="keyword">for</span> i := <span class="number">0</span>; i &lt; utils.GetGPUCountInNode(node); i++ &#123;</span><br><span class="line">devMap[i] = newDeviceInfo(i, <span class="keyword">uint</span>(utils.GetTotalGPUMemory(node)/utils.GetGPUCountInNode(node)))</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> <span class="built_in">len</span>(devMap) == <span class="number">0</span> &#123;</span><br><span class="line">log.Printf(<span class="string">"warn: node %s with nodeinfo %v has no devices"</span>,</span><br><span class="line">node.Name,</span><br><span class="line">node)</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> &amp;NodeInfo&#123;</span><br><span class="line">name:           node.Name,</span><br><span class="line">node:           node,</span><br><span class="line">devs:           devMap,</span><br><span class="line">gpuCount:       utils.GetGPUCountInNode(node),</span><br><span class="line">gpuTotalMemory: utils.GetTotalGPUMemory(node),</span><br><span class="line">rwmu:           <span class="built_in">new</span>(sync.RWMutex),</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>As you can see, <strong>at initialization every GPU card is assumed to have the same amount of memory, and each card's memory is obtained by dividing the Node's total evenly across the cards.</strong></p><h5 id="Bind"><a href="#Bind" class="headerlink" title="Bind"></a>Bind</h5><ul><li>When the scheduler finds a Node that meets the requirements, it binds the Pod to that Node. The <code>GPU Share Extender</code> has to do two things:<ul><li>Following the <code>binpack</code> policy, pick the GPU device on the Node and record the GPU Device ID in the Pod annotation <code>ALIYUN_GPU_ID</code>. It also records the GPU memory used by the Pod in the Pod annotations <code>ALIYUN_COM_GPU_MEM_POD</code> and <code>ALIYUN_COM_GPU_MEM_ASSUME_TIME</code></li><li>Bind the Node and Pod with the Kubernetes 
API</li></ul></li><li>If no Node meets the requirements, no Bind is performed</li></ul><p>Take the figure below as an example. N1 has four GPUs: GPU0 (12207), GPU1 (8138), GPU2 (4069) and GPU3 (16276). GPU2 is filtered out because it does not have enough resources. Among the remaining three, the binpack policy selects GPU1, the one with the least free memory that still fits the request (the annotation in the figure is wrong: it should be 1, not 0).</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-21_aliyun-gpu-share-bind.jpg"></p><p>Let's look at how the GPU device is chosen. The comparison <code>candidateGPUMemory &gt; availableGPU</code> is what implements <code>binpack</code>: among the devices with enough free memory, the one with the least free memory wins.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br></pre></td><td class="code"><pre><span class="line"><span 
class="comment">// allocate the GPU ID to the pod</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(n *NodeInfo)</span> <span class="title">allocateGPUID</span><span class="params">(pod *v1.Pod)</span> <span class="params">(candidateDevID <span class="keyword">int</span>, found <span class="keyword">bool</span>)</span></span> &#123;</span><br><span class="line"></span><br><span class="line">reqGPU := <span class="keyword">uint</span>(<span class="number">0</span>)</span><br><span class="line">found = <span class="literal">false</span></span><br><span class="line">candidateDevID = <span class="number">-1</span></span><br><span class="line">candidateGPUMemory := <span class="keyword">uint</span>(<span class="number">0</span>)</span><br><span class="line">availableGPUs := n.getAvailableGPUs()</span><br><span class="line"></span><br><span class="line">reqGPU = <span class="keyword">uint</span>(utils.GetGPUMemoryFromPodResource(pod))</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> reqGPU &gt; <span class="keyword">uint</span>(<span class="number">0</span>) &#123;</span><br><span class="line">log.Printf(<span class="string">"info: reqGPU for pod %s in ns %s: %d"</span>, pod.Name, pod.Namespace, reqGPU)</span><br><span class="line">log.Printf(<span class="string">"info: AvailableGPUs: %v in node %s"</span>, availableGPUs, n.name)</span><br><span class="line"><span class="keyword">if</span> <span class="built_in">len</span>(availableGPUs) &gt; <span class="number">0</span> &#123;</span><br><span class="line"><span class="keyword">for</span> devID := <span class="number">0</span>; devID &lt; <span class="built_in">len</span>(n.devs); devID++ &#123;</span><br><span class="line">availableGPU, ok := availableGPUs[devID]</span><br><span class="line"><span class="keyword">if</span> ok &#123;</span><br><span class="line"><span class="keyword">if</span> availableGPU &gt;= 
reqGPU &#123;</span><br><span class="line"><span class="keyword">if</span> candidateDevID == <span class="number">-1</span> || candidateGPUMemory &gt; availableGPU &#123;</span><br><span class="line">candidateDevID = devID</span><br><span class="line">candidateGPUMemory = availableGPU</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">found = <span class="literal">true</span></span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> found &#123;</span><br><span class="line">log.Printf(<span class="string">"info: Find candidate dev id %d for pod %s in ns %s successfully."</span>,</span><br><span class="line">candidateDevID,</span><br><span class="line">pod.Name,</span><br><span class="line">pod.Namespace)</span><br><span class="line">&#125; <span class="keyword">else</span> &#123;</span><br><span class="line">log.Printf(<span class="string">"warn: Failed to find available GPUs %d for the pod %s in the namespace %s"</span>,</span><br><span class="line">reqGPU,</span><br><span class="line">pod.Name,</span><br><span class="line">pod.Namespace)</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> candidateDevID, found</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h4 id="Kubelet创建Pod"><a href="#Kubelet创建Pod" class="headerlink" title="Kubelet创建Pod"></a>Kubelet creates the Pod</h4><p>Next, before creating the container, the Kubelet calls the <code>Allocate</code> function of the <code>GPU Share Device Plugin</code>, passing the amount of GPU memory requested.</p><p>Once the Pod is running, <code>kubectl get pod</code> (with YAML output, e.g. <code>-o yaml</code>) shows:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span 
class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Pod</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">annotations:</span></span><br><span class="line">    <span class="attr">ALIYUN_COM_GPU_MEM_ASSIGNED:</span> <span class="string">"true"</span></span><br><span class="line">    <span class="attr">ALIYUN_COM_GPU_MEM_ASSUME_TIME:</span> <span class="string">"1606125285243248618"</span></span><br><span class="line">    <span class="attr">ALIYUN_COM_GPU_MEM_DEV:</span> <span class="string">"22"</span></span><br><span class="line">    <span class="attr">ALIYUN_COM_GPU_MEM_IDX:</span> <span class="string">"0"</span></span><br><span class="line">    <span class="attr">ALIYUN_COM_GPU_MEM_POD:</span> <span class="string">"3"</span></span><br><span class="line">  <span class="string">...</span></span><br></pre></td></tr></table></figure><ul><li><p>The Device Plugin fetches from the k8s apiserver all Pending Pods that are GPU Share Pods and sorts them by AssumedTimestamp</p></li><li><p>It selects the Pod whose GPU memory request matches the amount passed to Allocate; if several match, it picks the earliest one</p></li><li><p>It sets <code>ALIYUN_COM_GPU_MEM_ASSIGNED</code> to true</p></li><li><p>It passes the DeviceID to Nvidia Docker2 through the NVIDIA_VISIBLE_DEVICES environment variable, and the container is created</p></li></ul><p><img alt data-src="https://github.com/AliyunContainerService/gpushare-scheduler-extender/raw/master/docs/designs/sequence.jpg"></p><blockquote><p><strong>The open question here: what are the parameters of the device plugin's Allocate interface? Do they include Pod information, or Pod annotations?</strong></p></blockquote><p>Looking at the Device Plugin code, the way the requested GPU memory is computed puzzled me at first. Why is it calculated like this?</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">for</span> _, req := <span class="keyword">range</span> reqs.ContainerRequests &#123;</span><br><span class="line">podReqGPU += <span class="keyword">uint</span>(<span class="built_in">len</span>(req.DevicesIDs))</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Next, let's see how the <code>Device Plugin</code> generates its <code>DeviceIDs</code>. It calls the <code>nvml library</code> to discover how many GPUs the Node has and how much memory each one has. The <code>Device Plugin</code> then creates a series of <code>FakeDeviceID</code>s, one per unit of memory, and returns these DeviceIDs to the Kubelet. This explains why the requested GPU memory is computed as above: counting the device IDs in the request gives the number of memory units requested. Memory here is measured in MiB.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br></pre></td><td 
class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">getDevices</span><span class="params">()</span> <span class="params">([]*pluginapi.Device, <span class="keyword">map</span>[<span class="keyword">string</span>]<span class="keyword">uint</span>)</span></span> &#123;</span><br><span class="line">n, err := nvml.GetDeviceCount()</span><br><span class="line">check(err)</span><br><span class="line"></span><br><span class="line"><span class="keyword">var</span> devs []*pluginapi.Device</span><br><span class="line">realDevNames := <span class="keyword">map</span>[<span class="keyword">string</span>]<span class="keyword">uint</span>&#123;&#125;</span><br><span class="line"><span class="keyword">for</span> i := <span class="keyword">uint</span>(<span class="number">0</span>); i &lt; n; i++ &#123;</span><br><span class="line">d, err := nvml.NewDevice(i)</span><br><span class="line">check(err)</span><br><span class="line"><span class="comment">// realDevNames = append(realDevNames, d.UUID)</span></span><br><span class="line"><span class="keyword">var</span> id <span class="keyword">uint</span></span><br><span class="line">log.Infof(<span class="string">"Deivce %s's Path is %s"</span>, d.UUID, d.Path)</span><br><span class="line">_, err = fmt.Sscanf(d.Path, <span class="string">"/dev/nvidia%d"</span>, &amp;id)</span><br><span class="line">check(err)</span><br><span class="line">realDevNames[d.UUID] = id</span><br><span class="line"><span class="comment">// var KiB uint64 = 1024</span></span><br><span class="line">log.Infof(<span class="string">"# device Memory: %d"</span>, <span class="keyword">uint</span>(*d.Memory))</span><br><span class="line"><span class="keyword">if</span> getGPUMemory() == <span class="keyword">uint</span>(<span class="number">0</span>) &#123;</span><br><span class="line">setGPUMemory(<span class="keyword">uint</span>(*d.Memory))</span><br><span class="line">&#125;</span><br><span 
class="line"><span class="keyword">for</span> j := <span class="keyword">uint</span>(<span class="number">0</span>); j &lt; getGPUMemory(); j++ &#123;</span><br><span class="line">fakeID := generateFakeDeviceID(d.UUID, j)</span><br><span class="line"><span class="keyword">if</span> j == <span class="number">0</span> &#123;</span><br><span class="line">log.Infoln(<span class="string">"# Add first device ID: "</span> + fakeID)</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">if</span> j == getGPUMemory()<span class="number">-1</span> &#123;</span><br><span class="line">log.Infoln(<span class="string">"# Add last device ID: "</span> + fakeID)</span><br><span class="line">&#125;</span><br><span class="line">devs = <span class="built_in">append</span>(devs, &amp;pluginapi.Device&#123;</span><br><span class="line">ID:     fakeID,</span><br><span class="line">Health: pluginapi.Healthy,</span><br><span class="line">&#125;)</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> devs, realDevNames</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Now let's see how the <code>Device Plugin</code> finds the matching Pod. As soon as it encounters a Pod whose requested GPU memory equals the amount passed in by the Kubelet, it treats that Pod as the match.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td 
class="code"><pre><span class="line">pods, err := getCandidatePods()</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">   log.Infof(<span class="string">"invalid allocation requst: Failed to find candidate pods due to %v"</span>, err)</span><br><span class="line">   <span class="keyword">return</span> buildErrResponse(reqs, podReqGPU), <span class="literal">nil</span></span><br><span class="line">&#125;</span><br><span class="line">...</span><br><span class="line"></span><br><span class="line"><span class="keyword">for</span> _, pod := <span class="keyword">range</span> pods &#123;</span><br><span class="line">   <span class="keyword">if</span> getGPUMemoryFromPodResource(pod) == podReqGPU &#123;</span><br><span class="line">      log.Infof(<span class="string">"Found Assumed GPU shared Pod %s in ns %s with GPU Memory %d"</span>,</span><br><span class="line">         pod.Name,</span><br><span class="line">         pod.Namespace,</span><br><span class="line">         podReqGPU)</span><br><span class="line">      assumePod = pod</span><br><span class="line">      found = <span class="literal">true</span></span><br><span class="line">      <span class="keyword">break</span></span><br><span class="line">   &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Here <code>getCandidatePods</code> lists all Pending Pods that have assumed GPU memory and sorts them by time:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span 
class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// pick up the gpushare pod with assigned status is false, and</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">getCandidatePods</span><span class="params">()</span> <span class="params">([]*v1.Pod, error)</span></span> &#123;</span><br><span class="line">candidatePods := []*v1.Pod&#123;&#125;</span><br><span class="line">allPods, err := getPendingPodsInNode()</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line"><span class="keyword">return</span> candidatePods, err</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">for</span> _, pod := <span class="keyword">range</span> allPods &#123;</span><br><span class="line">current := pod</span><br><span class="line"><span class="keyword">if</span> isGPUMemoryAssumedPod(&amp;current) &#123;</span><br><span class="line">candidatePods = <span class="built_in">append</span>(candidatePods, &amp;current)</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">  ...</span><br><span class="line"><span class="keyword">return</span> makePodOrderdByAge(candidatePods), <span class="literal">nil</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><blockquote><p><strong>This raises a question: if two Pods on the same Node, PodA and PodB, both request the same amount of GPU memory, say 3G, how does the kubelet keep them from being mixed up when it creates the containers? Would mixing them up cause problems? How exactly does the kubelet create a Pod, and who triggers the kubelet to create the container?</strong></p></blockquote><hr><h2 id="腾讯GPUManager"><a href="#腾讯GPUManager" class="headerlink" title="腾讯GPUManager"></a>Tencent GPUManager</h2><p>GPU Manager is an all-in-one GPU manager built on the Kubernetes DevicePlugin mechanism. It provides GPU allocation and sharing, GPU metrics queries, and preparation of GPU-related devices before a container starts, so that users can use GPU devices in a Kubernetes
cluster.</p><ul><li><strong>Topology-aware allocation</strong>: allocates GPUs based on GPU topology. When an application requests more than one GPU card, the devices with the fastest interconnect can be chosen.</li><li><strong>GPU sharing</strong>: lets users submit tasks that need less than one full card, with QoS guarantees.</li><li><strong>Application GPU metrics</strong>: the <code>/metrics</code> path on the host port (5678 by default) exposes GPU metrics for Prometheus to collect, and the <code>/usage</code> path gives a human-readable view of container status.</li></ul><h3 id="架构设计-1"><a href="#架构设计-1" class="headerlink" title="架构设计"></a>Architecture</h3><h4 id="设计原则-1"><a href="#设计原则-1" class="headerlink" title="设计原则"></a>Design principles</h4><ul><li><p>The design defines two kinds of <code>Extended Resource</code>:</p><ul><li><code>tencent.com/vcuda-core</code>: <code>vcuda-core</code> represents utilization; a single card has 100 cores</li><li><code>tencent.com/vcuda-memory</code>: <code>vcuda-memory</code> represents GPU memory; each unit is 256MB of memory</li><li>To request 50% utilization and 7680MB of memory, set <code>tencent.com/vcuda-core</code> to 50 and <code>tencent.com/vcuda-memory</code> to 30 (7680 / 256 = 30)</li><li>Exclusive use of whole cards is still supported: set core to a multiple of 100 and memory to any value greater than 0</li></ul></li><li>Built on the native k8s Scheduler Extender, Extended Resource and DevicePlugin mechanisms</li><li>This solution provides GPU sharing together with isolation of compute and memory, similar to using Alibaba Cloud <a href="https://www.alibabacloud.com/help/zh/doc-detail/163994.htm" target="_blank" rel="external nofollow noopener noreferrer">cGPU</a> together with GPUShare</li></ul><h4 id="核心组件-1"><a href="#核心组件-1" class="headerlink" title="核心组件"></a>Core components</h4><p>The GaiaGPU implementation has two main parts: the Kubernetes part and the vCUDA part</p><ul><li>The Kubernetes part builds on the Kubernetes Extended Resources, Device Plugin and Scheduler Extender mechanisms and consists of the following two projects<ul><li><a href="https://github.com/tkestack/gpu-manager" target="_blank" rel="external nofollow noopener noreferrer">GPU Manager </a>: implemented as a Device Plugin. Compared with NVIDIA's <a href="https://github.com/NVIDIA/k8s-device-plugin" target="_blank" rel="external nofollow noopener noreferrer">k8s-device-plugin</a>, it needs no extra <code>nvidia-docker2</code> setup and uses the native <code>runc</code></li><li><a href="https://github.com/tkestack/gpu-admission" target="_blank" rel="external nofollow noopener noreferrer">GPU Admission</a>: implemented as a Scheduler Extender. Note that this Extender is not mentioned in the paper. The GPU Scheduler
in the figure below performs topology-based card selection; it is now part of the GPU Manager project and is unrelated to the scheduler extender described here</li></ul></li><li>The vCUDA part is implemented by <a href="https://github.com/tkestack/vcuda-controller" target="_blank" rel="external nofollow noopener noreferrer">vcuda-controller</a>, a wrapper around NVIDIA's CUDA library</li></ul><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-22_gaia-gpu-manager.png"></p><h3 id="具体过程-1"><a href="#具体过程-1" class="headerlink" title="具体过程"></a>Workflow</h3><h4 id="设备资源上报"><a href="#设备资源上报" class="headerlink" title="设备资源上报"></a>Device resource reporting</h4><ul><li>Like Alibaba's <a href="https://github.com/AliyunContainerService/gpushare-scheduler-extender" target="_blank" rel="external nofollow noopener noreferrer">GPUShare</a>, GPU Manager does not return the actual GPU devices to the Kubelet in <code>ListAndWatch</code>, but <code>a list of vGPUs</code></li><li>A GPU is virtualized along two resource dimensions, memory and computing resource<ul><li>memory: the unit is 256M of memory, and each memory unit is called a <code>vmemory</code> device</li><li>computing resource: one physical GPU is divided into 100 <code>vprocessor</code> devices, each holding 1% of the GPU's utilization</li></ul></li><li>The manifest of a Pod requesting GPU resources looks like this:</li></ul><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Pod</span></span><br><span class="line"><span 
class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">vcuda</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">restartPolicy:</span> <span class="string">Never</span></span><br><span class="line">  <span class="attr">hostNetwork:</span> <span class="literal">true</span></span><br><span class="line">  <span class="attr">containers:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">image:</span> <span class="string">tensorflow</span></span><br><span class="line">    <span class="attr">name:</span> <span class="string">vcuda-test</span></span><br><span class="line">    <span class="attr">command:</span> <span class="string">['/usr/local/nvidia/bin/nvidia-smi']</span></span><br><span class="line">    <span class="attr">resources:</span></span><br><span class="line">      <span class="attr">requests:</span></span><br><span class="line">        <span class="attr">tencent.com/vcuda-core:</span> <span class="number">50</span></span><br><span class="line">        <span class="attr">tencent.com/vcuda-memory:</span> <span class="number">30</span></span><br><span class="line">      <span class="attr">limits:</span></span><br><span class="line">        <span class="attr">tencent.com/vcuda-core:</span> <span class="number">50</span></span><br><span class="line">        <span class="attr">tencent.com/vcuda-memory:</span> <span class="number">30</span></span><br></pre></td></tr></table></figure><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-22_gaia-device-plugin.png"></p><p> Now for the actual code. First comes registration with the <code>kubelet</code>:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span 
class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(m *managerImpl)</span> <span class="title">RegisterToKubelet</span><span class="params">()</span> <span class="title">error</span></span> &#123;</span><br><span class="line">socketFile := filepath.Join(m.config.DevicePluginPath, types.KubeletSocket)</span><br><span class="line">dialOptions := []grpc.DialOption&#123;grpc.WithInsecure(), grpc.WithDialer(utils.UnixDial), grpc.WithBlock(), grpc.WithTimeout(time.Second * <span class="number">5</span>)&#125;</span><br><span class="line"></span><br><span class="line">conn, err := grpc.Dial(socketFile, dialOptions...)</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line"><span class="keyword">return</span> err</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">defer</span> conn.Close()</span><br><span class="line"></span><br><span class="line">client := pluginapi.NewRegistrationClient(conn)</span><br><span class="line"></span><br><span class="line"><span class="keyword">for</span> _, srv := <span class="keyword">range</span> m.bundleServer &#123;</span><br><span 
class="line">req := &amp;pluginapi.RegisterRequest&#123;</span><br><span class="line">Version:      pluginapi.Version,</span><br><span class="line">Endpoint:     path.Base(srv.SocketName()),</span><br><span class="line">ResourceName: srv.ResourceName(),</span><br><span class="line">Options:      &amp;pluginapi.DevicePluginOptions&#123;PreStartRequired: <span class="literal">true</span>&#125;,</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">glog.V(<span class="number">2</span>).Infof(<span class="string">"Register to kubelet with endpoint %s"</span>, req.Endpoint)</span><br><span class="line">_, err = client.Register(context.Background(), req)</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line"><span class="keyword">return</span> err</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Note <code>m.bundleServer</code>: it holds two gRPC servers, one for <code>vcore</code> and one for <code>vmemory</code>.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(m *managerImpl)</span> <span class="title">setupGRPCService</span><span class="params">()</span></span> &#123;</span><br><span class="line">vcoreServer := newVcoreServer(m)</span><br><span class="line">vmemoryServer := newVmemoryServer(m)</span><br><span 
class="line"></span><br><span class="line">m.bundleServer[types.VCoreAnnotation] = vcoreServer</span><br><span class="line">m.bundleServer[types.VMemoryAnnotation] = vmemoryServer</span><br><span class="line"></span><br><span class="line">displayapi.RegisterGPUDisplayServer(m.srv, m)</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Next is the implementation of <code>ListAndWatch</code>. For each of the two resources, it picks from <code>capacity()</code> the devices whose ID starts with the corresponding <code>resourceName</code>:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">//ListAndWatchWithResourceName send devices for request resource back to server</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(ta *NvidiaTopoAllocator)</span> <span class="title">ListAndWatchWithResourceName</span><span class="params">(resourceName <span class="keyword">string</span>, e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer)</span> <span class="title">error</span></span> &#123;</span><br><span class="line">devs := <span class="built_in">make</span>([]*pluginapi.Device, <span class="number">0</span>)</span><br><span class="line"><span class="keyword">for</span> _, dev := <span class="keyword">range</span> ta.capacity() &#123;</span><br><span 
class="line"><span class="keyword">if</span> strings.HasPrefix(dev.ID, resourceName) &#123;</span><br><span class="line">devs = <span class="built_in">append</span>(devs, dev)</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">s.Send(&amp;pluginapi.ListAndWatchResponse&#123;Devices: devs&#125;)</span><br><span class="line"></span><br><span class="line"><span class="comment">// We don't send unhealthy state</span></span><br><span class="line"><span class="keyword">for</span> &#123;</span><br><span class="line">time.Sleep(time.Second)</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">glog.V(<span class="number">2</span>).Infof(<span class="string">"ListAndWatch %s exit"</span>, resourceName)</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>So how is <code>ta.capacity()</code> computed? A topology tree is maintained whose root is the physical host and whose leaves are the physical GPUs. From the number of leaf GPUs and their total memory, it builds the <code>vcore</code> devices and the <code>vmemory</code> devices, each named with its resource name as a prefix.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span 
class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(ta *NvidiaTopoAllocator)</span> <span class="title">capacity</span><span class="params">()</span> <span class="params">(devs []*pluginapi.Device)</span></span> &#123;</span><br><span class="line"><span class="keyword">var</span> (</span><br><span class="line">gpuDevices, memoryDevices []*pluginapi.Device</span><br><span class="line">totalMemory               <span class="keyword">int64</span></span><br><span class="line">)</span><br><span class="line"></span><br><span class="line">nodes := ta.tree.Leaves()</span><br><span class="line"><span class="keyword">for</span> i := <span class="keyword">range</span> nodes &#123;</span><br><span class="line">totalMemory += <span class="keyword">int64</span>(nodes[i].Meta.TotalMemory)</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">totalCores := <span class="built_in">len</span>(nodes) * nvtree.HundredCore</span><br><span class="line">gpuDevices = <span class="built_in">make</span>([]*pluginapi.Device, totalCores)</span><br><span class="line"><span class="keyword">for</span> i := <span class="number">0</span>; i &lt; totalCores; i++ &#123;</span><br><span class="line">gpuDevices[i] = &amp;pluginapi.Device&#123;</span><br><span class="line">ID:     fmt.Sprintf(<span class="string">"%s-%d"</span>, types.VCoreAnnotation, i),</span><br><span class="line">Health: pluginapi.Healthy,</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span 
class="line">totalMemoryBlocks := totalMemory / types.MemoryBlockSize</span><br><span class="line">memoryDevices = <span class="built_in">make</span>([]*pluginapi.Device, totalMemoryBlocks)</span><br><span class="line"><span class="keyword">for</span> i := <span class="keyword">int64</span>(<span class="number">0</span>); i &lt; totalMemoryBlocks; i++ &#123;</span><br><span class="line">memoryDevices[i] = &amp;pluginapi.Device&#123;</span><br><span class="line">ID:     fmt.Sprintf(<span class="string">"%s-%d-%d"</span>, types.VMemoryAnnotation, types.MemoryBlockSize, i),</span><br><span class="line">Health: pluginapi.Healthy,</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">devs = <span class="built_in">append</span>(devs, gpuDevices...)</span><br><span class="line">devs = <span class="built_in">append</span>(devs, memoryDevices...)</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h4 id="调度插件扩展-1"><a href="#调度插件扩展-1" class="headerlink" title="调度插件扩展"></a>Scheduler Plugin Extension</h4><h5 id="细粒度Quota准入"><a href="#细粒度Quota准入" class="headerlink" title="细粒度Quota准入"></a>Fine-Grained Quota Admission</h5><p><code>GPU Quota Admission</code> is a scheduler plugin that adds a finer-grained quota admission dimension to scheduling. Users configure a <code>ConfigMap</code> that sets, for each <code>Namespace</code>, the quota of GPU cards per model and the resource pools it may use, so that scheduling can apply policies by resource pool and GPU model.</p><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span 
class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line">&#123;</span><br><span class="line">  <span class="attr">"A"</span>: &#123;</span><br><span class="line">    <span class="attr">"pool"</span>: [<span class="string">"public"</span>], <span class="comment">// Pods in namespace 'A' could use pool 'public'</span></span><br><span class="line">    <span class="attr">"quota"</span>: &#123;</span><br><span class="line">      <span class="attr">"M40"</span>: <span class="number">2</span>,</span><br><span class="line">      <span class="attr">"P100"</span>: <span class="number">3</span></span><br><span class="line">    &#125;</span><br><span class="line">  &#125;,</span><br><span class="line">  <span class="attr">"B"</span>: &#123;</span><br><span class="line">    <span class="attr">"pool"</span>: [ <span class="string">"wx"</span> ], <span class="comment">// Pods in namespace 'B' could use pool 'wx'</span></span><br><span class="line">    <span class="attr">"quota"</span>: &#123;</span><br><span class="line">      <span class="attr">"M40"</span>: <span class="number">8</span>,</span><br><span class="line">      <span class="attr">"P100"</span>: <span class="number">2</span></span><br><span class="line">    &#125;</span><br><span class="line">  &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>At scheduling time, for each Pod, the filter lists the Pods in the same Namespace that run on nodes of a given GPU model (for example P100), computes their GPU usage together with the Pod being scheduled, and compares the result with the quota defined in the <code>ConfigMap</code>. Repeating this for every model yields the set of GPU models whose quota would not be exceeded.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span 
class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">type</span> NamespaceQuota <span class="keyword">struct</span> &#123;</span><br><span class="line">Quota <span class="keyword">map</span>[<span class="keyword">string</span>]<span class="keyword">int</span> <span class="string">`json:"quota"`</span></span><br><span class="line">Pool []<span class="keyword">string</span> <span class="string">`json:"pool"`</span></span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(gpuFilter *GPUFilter)</span> <span class="title">filterGPUModel</span><span class="params">(pod *corev1.Pod, namespaceQuota NamespaceQuota)</span> <span class="params">([]<span class="keyword">string</span>, error)</span></span> &#123;</span><br><span class="line"><span class="keyword">var</span> filteredGPUModels []<span class="keyword">string</span></span><br><span class="line"><span class="keyword">for</span> gpuModel, limit := <span class="keyword">range</span> namespaceQuota.Quota &#123;</span><br><span class="line">limit = limit * VirtualGPUTimes</span><br><span class="line">  nodeSelector, err := metav1.LabelSelectorAsSelector(&amp;metav1.LabelSelector&#123;</span><br><span class="line">MatchLabels: <span class="keyword">map</span>[<span class="keyword">string</span>]<span 
class="keyword">string</span>&#123;gpuFilter.conf.GPUModelLabel: gpuModel&#125;&#125;)</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span>, err</span><br><span class="line">&#125;</span><br><span class="line">pods, err := gpuFilter.listPodsOnNodes(nodeSelector, pod.Namespace)</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span>, err</span><br><span class="line">&#125;</span><br><span class="line">gpuUsed := calculateGPUUsage(<span class="built_in">append</span>(pods, pod))</span><br><span class="line"><span class="keyword">if</span> gpuUsed &lt;= limit &#123;</span><br><span class="line">filteredGPUModels = <span class="built_in">append</span>(filteredGPUModels, gpuModel)</span><br><span class="line">&#125;</span><br><span class="line">glog.V(<span class="number">4</span>).Infof(<span class="string">"Pods in namespace %s will use %d %s GPU cards after adding this pod, quota is %d"</span>,</span><br><span class="line">pod.Namespace, gpuUsed, gpuModel, limit)</span><br><span class="line">&#125;</span><br><span class="line">glog.V(<span class="number">4</span>).Infof(<span class="string">"These GPU models could be used by pod %s: %+v"</span>, pod.Name, filteredGPUModels)</span><br><span class="line"><span class="keyword">return</span> filteredGPUModels, <span class="literal">nil</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Next, in the Filter phase, nodes are matched by label against the available <code>GPU Models</code> computed above and the configured <code>Quota Pool</code>:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span 
class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(gpuFilter *GPUFilter)</span> <span class="title">filterNodes</span><span class="params">(nodes []corev1.Node, gpuModels, pools []<span class="keyword">string</span>)</span> <span class="params">(filteredNodes []corev1.Node, failedNodesMap schedulerapi.FailedNodesMap, err error)</span></span> &#123;</span><br><span class="line"><span class="keyword">var</span> gpuModelSelector, poolSelector labels.Selector</span><br><span class="line"></span><br><span class="line">glog.V(<span class="number">4</span>).Infof(<span class="string">"Filter nodes with gpuModels(%+v) 
and pools(%+v)"</span>, gpuModels, pools)</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> <span class="built_in">len</span>(gpuModels) != <span class="number">0</span> &#123;</span><br><span class="line">gpuModelSelector, err = metav1.LabelSelectorAsSelector(&amp;metav1.LabelSelector&#123;</span><br><span class="line">MatchExpressions: []metav1.LabelSelectorRequirement&#123;&#123;</span><br><span class="line">Key:      gpuFilter.conf.GPUModelLabel,</span><br><span class="line">Operator: metav1.LabelSelectorOpIn,</span><br><span class="line">Values:   gpuModels,</span><br><span class="line">&#125;&#125;&#125;)</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span>, <span class="literal">nil</span>, err</span><br><span class="line">&#125;</span><br><span class="line">&#125; <span class="keyword">else</span> &#123;</span><br><span class="line">gpuModelSelector = labels.Nothing()</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// If pool is empty, it means that pod could use every pool, it is OK to leave it as a empty selector.</span></span><br><span class="line"><span class="keyword">if</span> <span class="built_in">len</span>(pools) != <span class="number">0</span> &#123;</span><br><span class="line">poolSelector, err = metav1.LabelSelectorAsSelector(&amp;metav1.LabelSelector&#123;</span><br><span class="line">MatchExpressions: []metav1.LabelSelectorRequirement&#123;&#123;</span><br><span class="line">Key:      gpuFilter.conf.GPUPoolLabel,</span><br><span class="line">Operator: metav1.LabelSelectorOpIn,</span><br><span class="line">Values:   pools,</span><br><span class="line">&#125;&#125;&#125;)</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> 
&#123;</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span>, <span class="literal">nil</span>, err</span><br><span class="line">&#125;</span><br><span class="line">&#125; <span class="keyword">else</span> &#123;</span><br><span class="line">poolSelector = labels.Everything()</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">failedNodesMap = schedulerapi.FailedNodesMap&#123;&#125;</span><br><span class="line"><span class="keyword">for</span> _, node := <span class="keyword">range</span> nodes &#123;</span><br><span class="line"><span class="keyword">if</span> gpuModelSelector.Matches(labels.Set(node.Labels)) &amp;&amp; poolSelector.Matches(labels.Set(node.Labels)) &#123;</span><br><span class="line">filteredNodes = <span class="built_in">append</span>(filteredNodes, node)</span><br><span class="line">glog.V(<span class="number">5</span>).Infof(<span class="string">"Add %s to filteredNodes"</span>, node.Name)</span><br><span class="line">&#125; <span class="keyword">else</span> &#123;</span><br><span class="line">failedNodesMap[node.Name] = <span class="string">"ExceedsGPUQuota"</span></span><br><span class="line">glog.V(<span class="number">5</span>).Infof(<span class="string">"Add %s to failedNodesMap"</span>, node.Name)</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">return</span> filteredNodes, failedNodesMap, <span class="literal">nil</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>With this step, fine-grained quota admission control for scheduling is in place.</p><h5 id="避免GPU碎片化"><a href="#避免GPU碎片化" class="headerlink" title="避免GPU碎片化"></a>Avoiding GPU Fragmentation</h5><p>To this end, we added a GPU predicate controller to reduce, as much as possible, the fragmentation caused by the system's default scheduling policy.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-19_gpu-manager-predicate.png"></p><p>Let's see how it is implemented, starting in 
<code>deviceFilter</code>, whose entry point fetches all Pods currently on the Node:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">pods, err := gpuFilter.ListPodsOnNode(node)</span><br><span class="line">...</span><br><span class="line">nodeInfo := device.NewNodeInfo(node, pods)</span><br><span class="line">alloc := algorithm.NewAllocator(nodeInfo)</span><br><span class="line">newPod, err := alloc.Allocate(pod)</span><br></pre></td></tr></table></figure><p>Next, a <code>NodeInfo</code> struct is built to hold the information about the current Node, including its total GPU memory and the number of GPU devices. These values are computed from the two extended resources in the Node Status. <strong>GPU Manager also assumes that all GPU cards on a machine have the same amount of memory, so the per-card memory can be derived by dividing the total by the device count</strong>.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">type</span> NodeInfo <span class="keyword">struct</span> &#123;</span><br><span class="line">name        <span class="keyword">string</span></span><br><span class="line">node        *v1.Node</span><br><span class="line">devs        <span class="keyword">map</span>[<span class="keyword">int</span>]*DeviceInfo</span><br><span class="line">deviceCount <span class="keyword">int</span></span><br><span class="line">totalMemory <span class="keyword">uint</span></span><br><span class="line">usedCore    <span class="keyword">uint</span></span><br><span class="line">usedMemory  <span class="keyword">uint</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p><code>NodeInfo</code> also holds a <code>DeviceInfo</code> 
map, which records the usage of each card. When the <code>NodeInfo</code> structure is initialized, the per-device usage in <code>DeviceInfo</code> is also updated from the <code>pods</code> passed in.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">type</span> DeviceInfo <span class="keyword">struct</span> &#123;</span><br><span class="line">id          <span class="keyword">int</span></span><br><span class="line">totalMemory <span class="keyword">uint</span></span><br><span class="line">usedMemory  <span class="keyword">uint</span></span><br><span class="line">usedCore    <span class="keyword">uint</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Next comes the implementation of <code>Allocate</code>. For every GPU container in the Pod, it obtains a list of <code>devIDs</code> and records it on the Pod as an Annotation:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(alloc 
*allocator)</span> <span class="title">Allocate</span><span class="params">(pod *v1.Pod)</span> <span class="params">(*v1.Pod, error)</span></span> &#123;</span><br><span class="line">newPod := pod.DeepCopy()</span><br><span class="line"><span class="keyword">for</span> i, c := <span class="keyword">range</span> newPod.Spec.Containers &#123;</span><br><span class="line"><span class="keyword">if</span> !util.IsGPURequiredContainer(&amp;c) &#123;</span><br><span class="line"><span class="keyword">continue</span></span><br><span class="line">&#125;</span><br><span class="line">devIDs := []<span class="keyword">string</span>&#123;&#125;</span><br><span class="line">devs, err := alloc.AllocateOne(&amp;c)</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">glog.Infof(<span class="string">"failed to allocate for pod %s(%s)"</span>, newPod.Name, c.Name)</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span>, err</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">for</span> _, dev := <span class="keyword">range</span> devs &#123;</span><br><span class="line">devIDs = <span class="built_in">append</span>(devIDs, strconv.Itoa(dev.GetID()))</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">if</span> newPod.Annotations == <span class="literal">nil</span> &#123;</span><br><span class="line">newPod.Annotations = <span class="built_in">make</span>(<span class="keyword">map</span>[<span class="keyword">string</span>]<span class="keyword">string</span>)</span><br><span class="line">&#125;</span><br><span class="line">newPod.Annotations[util.PredicateGPUIndexPrefix+strconv.Itoa(i)] = strings.Join(devIDs, <span class="string">","</span>)</span><br><span class="line">&#125;</span><br><span class="line">newPod.Annotations[util.GPUAssigned] = <span 
class="string">"false"</span></span><br><span class="line">newPod.Annotations[util.PredicateTimeAnnotation] = fmt.Sprintf(<span class="string">"%d"</span>, time.Now().UnixNano())</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> newPod, <span class="literal">nil</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>The next question is how <code>AllocateOne</code> is implemented. For each container, the requested GPU resources determine whether the GPU is used in shared or exclusive mode (a request below <code>util.HundredCore</code> means shared mode), and the matching <code>Evaluate</code> is then called to obtain <code>devs</code>.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br></pre></td><td class="code"><pre><span class="line"><span 
class="function"><span class="keyword">func</span> <span class="params">(alloc *allocator)</span> <span class="title">AllocateOne</span><span class="params">(container *v1.Container)</span> <span class="params">([]*device.DeviceInfo, error)</span></span> &#123;</span><br><span class="line"><span class="keyword">var</span> (</span><br><span class="line">devs           []*device.DeviceInfo</span><br><span class="line">sharedMode     <span class="keyword">bool</span></span><br><span class="line">vcore, vmemory <span class="keyword">uint</span></span><br><span class="line">)</span><br><span class="line">node := alloc.nodeInfo.GetNode()</span><br><span class="line">nodeTotalMemory := util.GetCapacityOfNode(node, util.VMemoryAnnotation)</span><br><span class="line">deviceCount := util.GetGPUDeviceCountOfNode(node)</span><br><span class="line">deviceTotalMemory := <span class="keyword">uint</span>(nodeTotalMemory / deviceCount)</span><br><span class="line">needCores := util.GetGPUResourceOfContainer(container, util.VCoreAnnotation)</span><br><span class="line">needMemory := util.GetGPUResourceOfContainer(container, util.VMemoryAnnotation)</span><br><span class="line"></span><br><span class="line"><span class="keyword">switch</span> &#123;</span><br><span class="line"><span class="keyword">case</span> needCores &lt; util.HundredCore:</span><br><span class="line">eval := NewShareMode(alloc.nodeInfo)</span><br><span class="line">devs = eval.Evaluate(needCores, needMemory)</span><br><span class="line">sharedMode = <span class="literal">true</span></span><br><span class="line"><span class="keyword">default</span>:</span><br><span class="line">eval := NewExclusiveMode(alloc.nodeInfo)</span><br><span class="line">devs = eval.Evaluate(needCores, needMemory)</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> <span class="built_in">len</span>(devs) == <span class="number">0</span> &#123;</span><br><span 
class="line"><span class="keyword">return</span> <span class="literal">nil</span>, fmt.Errorf(<span class="string">"failed to allocate for container %s"</span>, container.Name)</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> sharedMode &#123;</span><br><span class="line">vcore = needCores</span><br><span class="line">vmemory = needMemory</span><br><span class="line">&#125; <span class="keyword">else</span> &#123;</span><br><span class="line">vcore = util.HundredCore</span><br><span class="line">vmemory = deviceTotalMemory</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">for</span> _, dev := <span class="keyword">range</span> devs &#123;</span><br><span class="line">err := alloc.nodeInfo.AddUsedResources(dev.GetID(), vcore, vmemory)</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">glog.Infof(<span class="string">"failed to update used resource for node %s dev %d due to %v"</span>, node.Name, dev.GetID(), err)</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">return</span> devs, <span class="literal">nil</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Taking shared mode as an example: it collects every <code>Device</code> on the current Node and sorts them by allocatable <code>cores</code>, then allocatable <code>memory</code>, then ID, with the least available first. The first device that satisfies the request is added to <code>devs</code>, and this <code>list</code> is returned to the caller.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span 
class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(al *shareMode)</span> <span class="title">Evaluate</span><span class="params">(cores <span class="keyword">uint</span>, memory <span class="keyword">uint</span>)</span> []*<span class="title">device</span>.<span class="title">DeviceInfo</span></span> &#123;</span><br><span class="line"><span class="keyword">var</span> (</span><br><span class="line">devs        []*device.DeviceInfo</span><br><span class="line">deviceCount = al.node.GetDeviceCount()</span><br><span class="line">tmpStore    = <span class="built_in">make</span>([]*device.DeviceInfo, deviceCount)</span><br><span class="line">sorter      = shareModeSort(device.ByAllocatableCores, device.ByAllocatableMemory, device.ByID)</span><br><span class="line">)</span><br><span class="line"></span><br><span class="line"><span class="keyword">for</span> i := <span class="number">0</span>; i &lt; deviceCount; i++ &#123;</span><br><span class="line">tmpStore[i] = al.node.GetDeviceMap()[i]</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">sorter.Sort(tmpStore)</span><br><span class="line"></span><br><span class="line"><span class="keyword">for</span> _, dev := <span class="keyword">range</span> tmpStore &#123;</span><br><span class="line"><span class="keyword">if</span> dev.AllocatableCores() &gt;= cores &amp;&amp; dev.AllocatableMemory() &gt;= memory &#123;</span><br><span class="line">glog.V(<span class="number">4</span>).Infof(<span class="string">"Pick up %d , cores: %d, 
memory: %d"</span>, dev.GetID(), dev.AllocatableCores(), dev.AllocatableMemory())</span><br><span class="line">devs = <span class="built_in">append</span>(devs, dev)</span><br><span class="line"><span class="keyword">break</span></span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> devs</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>As the code shows, the allocator picks the first device that satisfies the request and stops as soon as one is found. Because the candidate devices are already sorted with the least available resources first, the first match is the tightest fit, which reduces fragmentation.</p><h4 id="Kubelet创建Pod-1"><a href="#Kubelet创建Pod-1" class="headerlink" title="Kubelet创建Pod"></a>Kubelet Creates the Pod</h4><p>After the user creates a Pod and scheduling assigns it to a Node, Kubelet calls the Allocate function on the DevicePlugin. Because Kubelet only sees virtual devices, a mapping from virtual devices to physical GPU devices is needed. This is the job of GPU Manager in the figure: it sends a request to the GPU Scheduler, which selects the most suitable GPU based on topology, and GPU Manager then returns the AllocateResponse to Kubelet.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-22_gaia-device-plugin.png"></p><p>Let's first look at the implementation of <code>Allocate</code>. The code is fairly long, but the logic is not hard to follow:</p><ul><li>Allocate receives a list of <code>deviceIDs</code>, <strong>which contains only <code>vcore</code> devices</strong> (this is what the code implies; the Kubelet side still needs a closer look)</li><li>A Pod may have several containers, and each call handles only one container<ul><li>If there is an unfinished Pod, handle the remaining containers of that Pod first</li><li>Otherwise, iterate over the Pods on the current Node and pick a container whose requested <code>vcore</code> count matches the request</li></ul></li></ul><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span 
class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(ta *NvidiaTopoAllocator)</span> <span class="title">Allocate</span><span 
class="params">(_ context.Context, reqs *pluginapi.AllocateRequest)</span> <span class="params">(*pluginapi.AllocateResponse, error)</span></span> &#123;</span><br><span class="line">ta.Lock()</span><br><span class="line"><span class="keyword">defer</span> ta.Unlock()</span><br><span class="line"></span><br><span class="line"><span class="keyword">var</span> (</span><br><span class="line">reqCount           <span class="keyword">uint</span></span><br><span class="line">candidatePod       *v1.Pod</span><br><span class="line">candidateContainer *v1.Container</span><br><span class="line">found              <span class="keyword">bool</span></span><br><span class="line">)</span><br><span class="line"><span class="keyword">if</span> <span class="built_in">len</span>(reqs.ContainerRequests) &lt; <span class="number">1</span> &#123;</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span>, fmt.Errorf(<span class="string">"empty container request"</span>)</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// k8s send allocate request for one container at a time</span></span><br><span class="line">req := reqs.ContainerRequests[<span class="number">0</span>]</span><br><span class="line">resps := &amp;pluginapi.AllocateResponse&#123;&#125;</span><br><span class="line">reqCount = <span class="keyword">uint</span>(<span class="built_in">len</span>(req.DevicesIDs))</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> ta.unfinishedPod != <span class="literal">nil</span> &#123;</span><br><span class="line">candidatePod = ta.unfinishedPod</span><br><span class="line">cache := ta.allocatedPod.GetCache(<span class="keyword">string</span>(candidatePod.UID))</span><br><span class="line"><span class="keyword">for</span> i, c := <span class="keyword">range</span> candidatePod.Spec.Containers &#123;</span><br><span class="line"><span 
class="keyword">if</span> _, ok := cache[c.Name]; ok &#123;</span><br><span class="line"><span class="keyword">continue</span></span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> !utils.IsGPURequiredContainer(&amp;c) &#123;</span><br><span class="line"><span class="keyword">continue</span></span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> reqCount != utils.GetGPUResourceOfContainer(&amp;candidatePod.Spec.Containers[i], types.VCoreAnnotation) &#123;</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span>, fmt.Errorf(msg)</span><br><span class="line">&#125;</span><br><span class="line">candidateContainer = &amp;candidatePod.Spec.Containers[i]</span><br><span class="line">found = <span class="literal">true</span></span><br><span class="line"><span class="keyword">break</span></span><br><span class="line">&#125;</span><br><span class="line">&#125; <span class="keyword">else</span> &#123;</span><br><span class="line">pods, err := getCandidatePods(ta.k8sClient, ta.config.Hostname)</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">msg := fmt.Sprintf(<span class="string">"Failed to find candidate pods due to %v"</span>, err)</span><br><span class="line">glog.Infof(msg)</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span>, fmt.Errorf(msg)</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">for</span> _, pod := <span class="keyword">range</span> pods &#123;</span><br><span class="line"><span class="keyword">if</span> found &#123;</span><br><span class="line"><span class="keyword">break</span></span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">for</span> i, 
c := <span class="keyword">range</span> pod.Spec.Containers &#123;</span><br><span class="line"><span class="keyword">if</span> !utils.IsGPURequiredContainer(&amp;c) &#123;</span><br><span class="line"><span class="keyword">continue</span></span><br><span class="line">&#125;</span><br><span class="line">podCache := ta.allocatedPod.GetCache(<span class="keyword">string</span>(pod.UID))</span><br><span class="line"><span class="keyword">if</span> podCache != <span class="literal">nil</span> &#123;</span><br><span class="line"><span class="keyword">if</span> _, ok := podCache[c.Name]; ok &#123;</span><br><span class="line">glog.Infof(<span class="string">"container %s of pod %s has been allocate, continue to next"</span>, c.Name, pod.UID)</span><br><span class="line"><span class="keyword">continue</span></span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"><span class="keyword">if</span> utils.GetGPUResourceOfContainer(&amp;pod.Spec.Containers[i], types.VCoreAnnotation) == reqCount &#123;</span><br><span class="line">glog.Infof(<span class="string">"Found candidate Pod %s(%s) with device count %d"</span>, pod.UID, c.Name, reqCount)</span><br><span class="line">candidatePod = pod</span><br><span class="line">candidateContainer = &amp;pod.Spec.Containers[i]</span><br><span class="line">found = <span class="literal">true</span></span><br><span class="line"><span class="keyword">break</span></span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line">  ...</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Once such a container is found, its requested <code>vmemory</code> is read, each virtual <code>vmemory</code> unit is appended to <code>req.DevicesIDs</code> as a device, and <code>allocateOne</code> is then called:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span 
class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">if</span> found &#123;</span><br><span class="line"><span class="comment">// get vmemory info from container spec</span></span><br><span class="line">vmemory := utils.GetGPUResourceOfContainer(candidateContainer, types.VMemoryAnnotation)</span><br><span class="line"><span class="keyword">for</span> i := <span class="number">0</span>; i &lt; <span class="keyword">int</span>(vmemory); i++ &#123;</span><br><span class="line">req.DevicesIDs = <span class="built_in">append</span>(req.DevicesIDs, types.VMemoryAnnotation)</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">resp, err := ta.allocateOne(candidatePod, candidateContainer, req)</span><br><span class="line"><span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">glog.Errorf(err.Error())</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span>, err</span><br><span class="line">&#125;</span><br><span class="line">resps.ContainerResponses = <span class="built_in">append</span>(resps.ContainerResponses, resp)</span><br><span class="line">&#125; <span class="keyword">else</span> &#123;</span><br><span class="line">msg := fmt.Sprintf(<span class="string">"candidate pod not found for request %v, allocation failed"</span>, reqs)</span><br><span 
class="line">glog.Infof(msg)</span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span>, fmt.Errorf(msg)</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> resps, <span class="literal">nil</span></span><br></pre></td></tr></table></figure><p>The actual allocation logic lives in <code>allocateOne</code>. After computing the <code>needCores</code> and <code>needMemory</code> requested by the Pod, it chooses one of three allocation strategies. Note that all of this still operates on the topology tree, whose root is the physical host and whose leaves are the physical GPUs.</p><ul><li>The request exceeds one card: the strategy is to minimize communication overhead between cards</li><li>The request equals exactly one card: the strategy is to avoid, as far as possible, creating leaf nodes without siblings in the topology tree</li><li>The request is less than one card: the strategy is to minimize fragmentation of card resources</li></ul><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">switch</span> &#123;</span><br><span class="line"><span class="keyword">case</span> needCores &gt; nvtree.HundredCore:</span><br><span class="line">eval, ok := ta.evaluators[<span class="string">"link"</span>]</span><br><span class="line"><span class="comment">// In this case needCores must be a multiple of nvtree.HundredCore</span></span><br><span class="line">nodes = eval.Evaluate(needCores, <span class="number">0</span>)</span><br><span class="line"><span class="keyword">case</span> needCores == nvtree.HundredCore:</span><br><span class="line">eval, ok := ta.evaluators[<span class="string">"fragment"</span>]</span><br><span class="line">nodes = eval.Evaluate(needCores, <span class="number">0</span>)</span><br><span class="line"><span 
class="keyword">default</span>:</span><br><span class="line"><span class="comment">// evaluate in share mode</span></span><br><span class="line">shareMode = <span class="literal">true</span></span><br><span class="line">eval, ok := ta.evaluators[<span class="string">"share"</span>]</span><br><span class="line">nodes = eval.Evaluate(needCores, needMemory)</span><br><span class="line">  &#125;</span><br></pre></td></tr></table></figure><p><code>Evaluate</code> returns GPU nodes of type <code>NvidiaNode</code>; this struct is what the topology tree is built from:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">//NvidiaNode represents a node of Nvidia GPU</span></span><br><span class="line"><span class="keyword">type</span> NvidiaNode <span class="keyword">struct</span> &#123;</span><br><span class="line">Meta            DeviceMeta</span><br><span class="line">AllocatableMeta SchedulerCache</span><br><span class="line"></span><br><span class="line">Parent   *NvidiaNode</span><br><span class="line">Children []*NvidiaNode</span><br><span class="line">Mask     <span class="keyword">uint32</span></span><br><span class="line"></span><br><span class="line">pendingReset <span class="keyword">bool</span></span><br><span class="line">vchildren    <span class="keyword">map</span>[<span class="keyword">int</span>]*NvidiaNode</span><br><span class="line">ntype        nvml.GpuTopologyLevel</span><br><span class="line">tree         *NvidiaTree</span><br><span 
class="line">&#125;</span><br></pre></td></tr></table></figure><p>I won't go into the details of the allocation algorithm here; the goal is to follow the main thread.</p><p>Next, a <code>pluginapi.ContainerAllocateResponse</code> is built, filling in the environment variables, the mounted directories, the selected devices, and the <code>Annotation</code> entries in turn:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">ctntResp := &amp;pluginapi.ContainerAllocateResponse&#123;</span><br><span class="line">Envs:        <span class="built_in">make</span>(<span class="keyword">map</span>[<span class="keyword">string</span>]<span class="keyword">string</span>),</span><br><span class="line">Mounts:      <span class="built_in">make</span>([]*pluginapi.Mount, <span class="number">0</span>),</span><br><span class="line">Devices:     <span class="built_in">make</span>([]*pluginapi.DeviceSpec, <span class="number">0</span>),</span><br><span class="line">Annotations: <span class="built_in">make</span>(<span class="keyword">map</span>[<span class="keyword">string</span>]<span class="keyword">string</span>),</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>First, the <code>Devices</code> field:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span 
class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line">allocatedDevices := sets.NewString()</span><br><span class="line">deviceList := <span class="built_in">make</span>([]<span class="keyword">string</span>, <span class="number">0</span>)</span><br><span class="line"><span class="keyword">for</span> _, n := <span class="keyword">range</span> nodes &#123;</span><br><span class="line">name := n.MinorName()</span><br><span class="line">glog.V(<span class="number">2</span>).Infof(<span class="string">"Allocate %s for %s(%s), Meta (%d:%d)"</span>, name, pod.UID, container.Name, n.Meta.ID, n.Meta.MinorID)</span><br><span class="line"></span><br><span class="line">ctntResp.Annotations[types.VCoreAnnotation] = fmt.Sprintf(<span class="string">"%d"</span>, needCores)</span><br><span class="line">ctntResp.Annotations[types.VMemoryAnnotation] = fmt.Sprintf(<span class="string">"%d"</span>, needMemory)</span><br><span class="line"></span><br><span class="line">ctntResp.Devices = <span class="built_in">append</span>(ctntResp.Devices, &amp;pluginapi.DeviceSpec&#123;</span><br><span class="line">ContainerPath: name,</span><br><span class="line">HostPath:      name,</span><br><span class="line">Permissions:   <span class="string">"rwm"</span>,</span><br><span class="line">&#125;)</span><br><span class="line">deviceList = <span class="built_in">append</span>(deviceList, n.Meta.UUID)</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> !allocated &#123;</span><br><span class="line">ta.tree.MarkOccupied(n, needCores, needMemory)</span><br><span class="line">&#125;</span><br><span class="line">allocatedDevices.Insert(name)</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>A few control devices are added as well:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span 
class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// Append control device</span></span><br><span class="line">ctntResp.Devices = <span class="built_in">append</span>(ctntResp.Devices, &amp;pluginapi.DeviceSpec&#123;</span><br><span class="line">ContainerPath: types.NvidiaCtlDevice,</span><br><span class="line">HostPath:      types.NvidiaCtlDevice,</span><br><span class="line">Permissions:   <span class="string">"rwm"</span>,</span><br><span class="line">&#125;)</span><br><span class="line"></span><br><span class="line">ctntResp.Devices = <span class="built_in">append</span>(ctntResp.Devices, &amp;pluginapi.DeviceSpec&#123;</span><br><span class="line">ContainerPath: types.NvidiaUVMDevice,</span><br><span class="line">HostPath:      types.NvidiaUVMDevice,</span><br><span class="line">Permissions:   <span class="string">"rwm"</span>,</span><br><span class="line">&#125;)</span><br><span class="line"></span><br><span class="line"><span class="comment">// Append default device</span></span><br><span class="line"><span class="keyword">if</span> cfg, found := ta.extraConfig[<span class="string">"default"</span>]; found &#123;</span><br><span class="line"><span class="keyword">for</span> _, dev := <span class="keyword">range</span> cfg.Devices &#123;</span><br><span class="line">ctntResp.Devices = <span 
class="built_in">append</span>(ctntResp.Devices, &amp;pluginapi.DeviceSpec&#123;</span><br><span class="line">ContainerPath: dev,</span><br><span class="line">HostPath:      dev,</span><br><span class="line">Permissions:   <span class="string">"rwm"</span>,</span><br><span class="line">&#125;)</span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Next is the <code>Annotations</code> field:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">ctntResp.Annotations[types.VDeviceAnnotation] = vDeviceAnnotationStr(nodes)</span><br><span class="line"><span class="keyword">if</span> !allocated &#123;</span><br><span class="line">ta.allocatedPod.Insert(<span class="keyword">string</span>(pod.UID), container.Name, &amp;cache.Info&#123;</span><br><span class="line">Devices: allocatedDevices.UnsortedList(),</span><br><span class="line">Cores:   needCores,</span><br><span class="line">Memory:  needMemory,</span><br><span class="line">&#125;)</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Then the <code>Envs</code> field:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// LD_LIBRARY_PATH</span></span><br><span class="line">ctntResp.Envs[<span class="string">"LD_LIBRARY_PATH"</span>] = <span 
class="string">"/usr/local/nvidia/lib64"</span></span><br><span class="line"><span class="keyword">for</span> _, env := <span class="keyword">range</span> container.Env &#123;</span><br><span class="line"><span class="keyword">if</span> env.Name == <span class="string">"compat32"</span> &amp;&amp; strings.ToLower(env.Value) == <span class="string">"true"</span> &#123;</span><br><span class="line">ctntResp.Envs[<span class="string">"LD_LIBRARY_PATH"</span>] = <span class="string">"/usr/local/nvidia/lib"</span></span><br><span class="line">&#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// NVIDIA_VISIBLE_DEVICES</span></span><br><span class="line">ctntResp.Envs[<span class="string">"NVIDIA_VISIBLE_DEVICES"</span>] = strings.Join(deviceList, <span class="string">","</span>)</span><br></pre></td></tr></table></figure><p>Finally, the <code>Mounts</code> field. A volume mount point is configured for the GPU container to provide the CUDA library, and the <code>LD_LIBRARY_PATH</code> environment variable tells the application where to find the <code>CUDA Library</code>.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">if</span> shareMode &#123;</span><br><span class="line">ctntResp.Mounts = <span class="built_in">append</span>(ctntResp.Mounts, &amp;pluginapi.Mount&#123;</span><br><span class="line">ContainerPath: <span 
class="string">"/usr/local/nvidia"</span>,</span><br><span class="line">HostPath:      types.DriverLibraryPath,</span><br><span class="line">ReadOnly:      <span class="literal">true</span>,</span><br><span class="line">&#125;)</span><br><span class="line">&#125; <span class="keyword">else</span> &#123;</span><br><span class="line">ctntResp.Mounts = <span class="built_in">append</span>(ctntResp.Mounts, &amp;pluginapi.Mount&#123;</span><br><span class="line">ContainerPath: <span class="string">"/usr/local/nvidia"</span>,</span><br><span class="line">HostPath:      types.DriverOriginLibraryPath,</span><br><span class="line">ReadOnly:      <span class="literal">true</span>,</span><br><span class="line">&#125;)</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">ctntResp.Mounts = <span class="built_in">append</span>(ctntResp.Mounts, &amp;pluginapi.Mount&#123;</span><br><span class="line">ContainerPath: types.VCUDA_MOUNTPOINT,</span><br><span class="line">HostPath:      filepath.Join(ta.config.VirtualManagerPath, <span class="keyword">string</span>(pod.UID)),</span><br><span class="line">ReadOnly:      <span class="literal">true</span>,</span><br><span class="line">&#125;)</span><br></pre></td></tr></table></figure><h4 id="vGPU-Manager"><a href="#vGPU-Manager" class="headerlink" title="vGPU Manager"></a>vGPU Manager</h4><p><code>vGPU Manager</code>, part of the <code>GPU Manager</code> <code>DaemonSet</code>, delivers container configuration and monitors the vGPUs assigned to containers. Once the topology allocator in the previous step has settled each container's resource configuration, <code>vGPU Manager</code> creates a dedicated directory on the host for each container. The directory is named after the container (in the code above, the host path is built from the Pod UID) and is included in the <code>AllocateResponse</code> returned to the kubelet, which is exactly what the code above does.</p><p><code>vGPU Manager</code> maintains a list of live containers that use GPUs and checks them periodically. When a container dies, it is removed from the list and its directory is deleted.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span 
class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">//                Host                     |                Container</span></span><br><span class="line"><span class="comment">//                                         |</span></span><br><span class="line"><span class="comment">//                                         |</span></span><br><span class="line"><span class="comment">//  .-----------.                          |</span></span><br><span class="line"><span class="comment">//  | allocator |----------.               |             ___________</span></span><br><span class="line"><span class="comment">//  '-----------'   PodUID |               |             \          \</span></span><br><span class="line"><span class="comment">//                         v               |              ) User App )--------.</span></span><br><span class="line"><span class="comment">//                .-----------------.      
|             /__________/         |</span></span><br><span class="line"><span class="comment">//     .----------| virtual-manager |      |                                  |</span></span><br><span class="line"><span class="comment">//     |          '-----------------'      |                                  |</span></span><br><span class="line"><span class="comment">// $VirtualManagerPath/PodUID              |                                  |</span></span><br><span class="line"><span class="comment">//     |                                   |       read /proc/self/cgroup     |</span></span><br><span class="line"><span class="comment">//     |  .------------------.             |       to get PodUID, ContainerID |</span></span><br><span class="line"><span class="comment">//     '-&gt;| create directory |------.      |                                  |</span></span><br><span class="line"><span class="comment">//        '------------------'      |      |                                  |</span></span><br><span class="line"><span class="comment">//                                  |      |                                  |</span></span><br><span class="line"><span class="comment">//                 .----------------'      |       .----------------------.   |</span></span><br><span class="line"><span class="comment">//                 |                       |       | fork call gpu-client |&lt;--'</span></span><br><span class="line"><span class="comment">//                 |                       |       '----------------------'</span></span><br><span class="line"><span class="comment">//                 v                       |                   |</span></span><br><span class="line"><span class="comment">//    .------------------------.           
|                   |</span></span><br><span class="line"><span class="comment">//   ( wait for client register )&lt;-------PodUID, ContainerID---'</span></span><br><span class="line"><span class="comment">//    '------------------------'           |</span></span><br><span class="line"><span class="comment">//                 |                       |</span></span><br><span class="line"><span class="comment">//                 v                       |</span></span><br><span class="line"><span class="comment">//   .--------------------------.          |</span></span><br><span class="line"><span class="comment">//   | locate pod and container |          |</span></span><br><span class="line"><span class="comment">//   '--------------------------'          |</span></span><br><span class="line"><span class="comment">//                 |                       |</span></span><br><span class="line"><span class="comment">//                 v                       |</span></span><br><span class="line"><span class="comment">//   .---------------------------.         
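 |</span></span><br><span class="line"><span class="comment">//   | write down configure and  |         |</span></span><br><span class="line"><span class="comment">//   | pid file with containerID |         |</span></span><br><span class="line"><span class="comment">//   | as name                   |         |</span></span><br><span class="line"><span class="comment">//   '---------------------------'         |</span></span><br><span class="line"><span class="comment">//                                         |</span></span><br><span class="line"><span class="comment">//                                         |</span></span><br><span class="line"><span class="comment">//                                         v</span></span><br></pre></td></tr></table></figure><h3 id="vGPU-Library"><a href="#vGPU-Library" class="headerlink" title="vGPU Library"></a>vGPU Library</h3><p>The <code>vGPU Library</code> described in the paper is implemented as <a href="https://github.com/tkestack/vcuda-controller" target="_blank" rel="external nofollow noopener noreferrer">vcuda-controller</a>. It runs inside the container and manages the GPU resources made available to that container. In essence, this <code>vGPU Library</code> is a wrapper around the <code>CUDA Library</code> that intercepts the <code>memory-related</code> and <code>computing-related</code> APIs; the table below lists the intercepted APIs.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-22_gaia-vcuda.png"></p><p><code>vCUDA</code> performs the following checks when the corresponding APIs are called:</p><ul><li>For GPU memory: if an allocation would push the task's memory usage above the limit set in the config, an error is returned.</li><li>For compute resources, there are two modes, hard isolation and soft isolation<ul><li>What they share: when the task's GPU SM utilization exceeds its resource limit, further API calls are held back.</li><li>Where they differ: if resources are idle, soft isolation lets a task exceed its configured share and computes the limit dynamically, whereas hard isolation never allows the configured amount to be exceeded.</li></ul></li></ul><p>I won't go into the implementation details here.</p><p>A puzzling question is how a user's container gets to use this dynamic library under GPU Manager. There are two parts to it:</p><ul><li>Where does the library come from?<ul><li><code>GPU Manager</code>, as a <code>DaemonSet</code>, packages the custom library into its image and then mounts it into a directory on the Node.</li></ul></li><li>How does the application in the container become aware of it?<ul><li>Mainly by setting <code>LD_LIBRARY_PATH</code> when the container is created, pointing it at the location of this custom dynamic library.</li></ul></li></ul><h3 id="资源监控统计"><a href="#资源监控统计" class="headerlink" 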
title="资源监控统计"></a>Resource Monitoring and Statistics</h3><p>I have not read this part of the code yet.</p><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>References</h2><ul><li><a href="https://github.com/AliyunContainerService/gpushare-scheduler-extender/blob/master/docs/designs/designs.md" target="_blank" rel="external nofollow noopener noreferrer">Alibaba GPUShare design document</a></li><li><a href="https://www.alibabacloud.com/help/zh/doc-detail/163994.htm" target="_blank" rel="external nofollow noopener noreferrer">Alibaba Cloud shared GPU scheduling user guide</a></li><li><a href="https://ieeexplore.ieee.org/document/8672318" target="_blank" rel="external nofollow noopener noreferrer">Gaia GPUManager paper</a></li></ul>]]></content>
    
    <summary type="html">
    
      &lt;p&gt;Native k8s supports GPUs in containers through the &lt;code&gt;Device Plugin&lt;/code&gt; and &lt;code&gt;Extended Resource&lt;/code&gt; mechanisms, but only for exclusive use: a GPU cannot be shared between Pods, which greatly lowers GPU utilization across the cluster. To share GPUs at the cluster level, GPU resources have to be isolated and scheduled. This post looks in turn at Alibaba's &lt;a href=&quot;https://github.com/AliyunContainerService/gpushare-scheduler-extender&quot; target=&quot;_blank&quot; rel=&quot;external nofollow noopener noreferrer&quot;&gt;GPUShare&lt;/a&gt; and Tencent's &lt;a href=&quot;https://github.com/tkestack/gpu-manager&quot; target=&quot;_blank&quot; rel=&quot;external nofollow noopener noreferrer&quot;&gt;GPUManager&lt;/a&gt; and analyzes how each is implemented.&lt;/p&gt;
    
    </summary>
    
    <content src="https://houmin.cc/https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-21_aliyun-gpu-share.jpg" type="image" />
    
    
      <category term="术业专攻" scheme="https://houmin.cc/categories/%E6%9C%AF%E4%B8%9A%E4%B8%93%E6%94%BB/"/>
    
    
      <category term="GPU" scheme="https://houmin.cc/tags/GPU/"/>
    
      <category term="k8s" scheme="https://houmin.cc/tags/k8s/"/>
    
      <category term="device plugin" scheme="https://houmin.cc/tags/device-plugin/"/>
    
      <category term="资源隔离" scheme="https://houmin.cc/tags/%E8%B5%84%E6%BA%90%E9%9A%94%E7%A6%BB/"/>
    
      <category term="scheduler extender" scheme="https://houmin.cc/tags/scheduler-extender/"/>
    
  </entry>
  
  <entry>
    <title>[Heterogeneous Computing] Using GPUs in Docker</title>
    <link href="https://houmin.cc/posts/574111db/"/>
    <id>https://houmin.cc/posts/574111db/</id>
    <published>2020-11-17T12:45:00.000Z</published>
    <updated>2022-11-09T15:13:45.395Z</updated>
    
    <content type="html"><![CDATA[<link rel="stylesheet" class="aplayer-secondary-style-marker" href="/assets/css/APlayer.min.css"><script src="/assets/js/APlayer.min.js" class="aplayer-secondary-script-marker"></script><script class="meting-secondary-script-marker" src="/assets/js/Meting.min.js"></script><p>In the post <a href="https://houmin.cc/posts/5004f8e5/">Introduction to GPU and CUDA Programming</a>, we took a first look at how to use GPUs on Linux. With the rapid growth of containers and k8s, demand for using GPUs inside containers keeps increasing. Building on that post, this article explains how to use GPUs in containers, then how GPUs are scheduled in Kubernetes, and finally, using TensorFlow as an example, how to set up a GPU-enabled deep learning development environment on Docker.</p><a id="more"></a><h2 id="背景介绍"><a href="#背景介绍" class="headerlink" title="背景介绍"></a>Background</h2><p>Containers were originally used to seamlessly deploy CPU-based applications, and they are agnostic to hardware and platform. That clearly does not hold for GPUs: different GPUs need different hardware drivers installed on the machine, which greatly limits GPU use in containers. The earliest workaround was to reinstall the NVIDIA driver inside the container and then pass the GPU to the container at startup as a character device such as <code>/dev/nvidia0</code>. However, this requires the driver version in the container to match the host's driver version exactly, so the same Docker image cannot be reused across machines, which severely limits container portability.</p><p>To solve this, containers must be agnostic to the NVIDIA driver. For this purpose NVIDIA released the <a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/overview.html" target="_blank" rel="external nofollow noopener noreferrer">NVIDIA Container Toolkit</a>:</p><p><img alt="nvidia-gpu-docker" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-22_nvidia-gpu-docker.png"></p><p>As shown above, NVIDIA splits the API environment that CUDA applications depend on into two parts:</p><ul><li>Driver-level API: backed by the <code>libcuda.so.major.minor</code> shared library and the kernel module, shown as CUDA Driver in the figure<ul><li>The driver-level API is the low-level API. Whenever NVIDIA releases a new driver version and you upgrade the driver on the host, the kernel module and <code>libcuda.so.major.minor</code> must both be upgraded to the same version for existing programs to keep working.</li><li>Different driver versions cannot coexist on the same host</li></ul></li><li>Non-driver-level API: made up of user-space libraries such as <code>libcublas.so</code>, shown as CUDA Toolkit in the figure<ul><li>The non-driver-level API is versioned by the Toolkit's own version number, e.g. cuda-10, cuda-11</li><li>Different Toolkit versions can run on the same host at the same time</li><li>The non-driver-level API is a higher-level wrapper over the driver-level API and ultimately calls into it</li></ul></li></ul><p>To make GPU containers more portable, NVIDIA packaged the non-driver-level APIs into the <a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/overview.html" target="_blank" rel="external nofollow 
noopener noreferrer">NVIDIA Container Toolkit</a>. Before using GPUs in containers, each machine therefore needs the NVIDIA driver installed first; once the NVIDIA Container Toolkit is configured, GPUs can be used conveniently inside containers.</p><h2 id="整体架构"><a href="#整体架构" class="headerlink" title="整体架构"></a>Overall Architecture</h2><p>At its core, NVIDIA's container toolkit creates GPU containers through an <code>nvidia-runc</code>-style runtime: it adds a few hooks to the OCI spec created for the user's container, and these hooks prepare the GPU devices for use. It consists of the following components, shown from top to bottom in the figure:</p><ul><li><code>nvidia-docker2</code></li><li><code>nvidia-container-runtime</code></li><li><code>nvidia-container-toolkit</code></li><li><code>libnvidia-container</code></li></ul><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-16_nvidia-docker-arch.png"></p><p>These components are described in turn below:</p><h3 id="libnvidia-container"><a href="#libnvidia-container" class="headerlink" title="libnvidia-container"></a><code>libnvidia-container</code></h3><p><code>libnvidia-container</code> provides a library and a client, <code>nvidia-container-cli</code>, for configuring GNU/Linux containers to use NVIDIA GPUs. Its implementation relies on <code>kernel primitives</code> and is agnostic to the container runtime.</p><p>The main job of <code>nvidia-container-cli</code> is to inject NVIDIA GPUs into the container, including operations such as mounting the /dev/nvidia0 device. The captured log below shows its main operations:</p><ul><li>Loading kernel modules, including nvidia, nvidia_uvm and nvidia_modeset</li><li>Creating character devices, including nvidia0, nvidiactl, nvidia-uvm and nvidia-modeset</li><li>Mounting GPU devices, NVIDIA libraries, and so on</li></ul><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589710</span> <span class="number">4693</span> nvc_mount.c:<span class="number">218</span>] zz: mounting /dev/nvidiactl at /var/lib/docker/overlay2/<span 
class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/dev/nvidiactl</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589733</span> <span class="number">4693</span> nvc_mount.c:<span class="number">509</span>] whitelisting device node <span class="number">195</span>:<span class="number">255</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589789</span> <span class="number">4693</span> nvc_mount.c:<span class="number">218</span>] zz: mounting /dev/nvidia-uvm at /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/dev/nvidia-uvm</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589809</span> <span class="number">4693</span> nvc_mount.c:<span class="number">509</span>] whitelisting device node <span class="number">233</span>:<span class="number">0</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589855</span> <span class="number">4693</span> nvc_mount.c:<span class="number">218</span>] zz: mounting /dev/nvidia-uvm-tools at /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/dev/nvidia-uvm-tools</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589875</span> <span class="number">4693</span> nvc_mount.c:<span class="number">509</span>] whitelisting device node <span class="number">233</span>:<span class="number">1</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589959</span> <span class="number">4693</span> nvc_mount.c:<span 
class="number">218</span>] zz: mounting /dev/nvidia0 at /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/dev/nvidia0</span><br></pre></td></tr></table></figure><p>Looking at the <code>libnvidia-container</code> source code, we find:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">if</span> (xmount(err, src, dst, <span class="literal">NULL</span>, MS_BIND, <span class="literal">NULL</span>) &lt; <span class="number">0</span>)</span><br><span class="line">   <span class="keyword">goto</span> fail;</span><br></pre></td></tr></table></figure><p>That is, the host's <code>/dev/nvidia0</code> device is bind-mounted onto <code>/dev/nvidia0</code> in the container rootfs.</p><p>The full log follows:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span 
class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span 
class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span class="line">106</span><br><span class="line">107</span><br><span class="line">108</span><br><span class="line">109</span><br><span class="line">110</span><br><span class="line">111</span><br><span class="line">112</span><br><span class="line">113</span><br><span class="line">114</span><br><span class="line">115</span><br><span class="line">116</span><br><span class="line">117</span><br><span class="line">118</span><br><span class="line">119</span><br><span class="line">120</span><br><span class="line">121</span><br><span class="line">122</span><br><span class="line">123</span><br><span class="line">124</span><br><span class="line">125</span><br><span class="line">126</span><br><span class="line">127</span><br><span class="line">128</span><br><span class="line">129</span><br><span class="line">130</span><br><span class="line">131</span><br><span class="line">132</span><br><span class="line">133</span><br><span class="line">134</span><br><span class="line">135</span><br><span class="line">136</span><br><span class="line">137</span><br><span class="line">138</span><br><span class="line">139</span><br><span class="line">140</span><br><span class="line">141</span><br><span class="line">142</span><br><span class="line">143</span><br><span class="line">144</span><br><span class="line">145</span><br><span class="line">146</span><br><span class="line">147</span><br><span class="line">148</span><br><span class="line">149</span><br><span class="line">150</span><br></pre></td><td class="code"><pre><span class="line">-- WARNING, the following logs are <span class="keyword">for</span> debugging purposes only --</span><br><span class="line"></span><br><span 
class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.570499</span> <span class="number">4693</span> nvc.c:<span class="number">374</span>] <span class="function">initializing library <span class="title">context</span> <span class="params">(version=<span class="number">1.3</span><span class="number">.3</span>, build=bd9fc3f2b642345301cb2e23de07ec5386232317)</span></span></span><br><span class="line">I0301 09:23:38.570591 4693 nvc.c:346] using root /</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.570600</span> <span class="number">4693</span> nvc.c:<span class="number">347</span>] <span class="keyword">using</span> ldcache /etc/ld.so.cache</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.570606</span> <span class="number">4693</span> nvc.c:<span class="number">348</span>] <span class="keyword">using</span> unprivileged user <span class="number">65534</span>:<span class="number">65534</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.570624</span> <span class="number">4693</span> nvc.c:<span class="number">391</span>] <span class="function">attempting to load dxcore to see <span class="keyword">if</span> we are <span class="built_in">running</span> under Windows Subsystem <span class="keyword">for</span> <span class="title">Linux</span> <span class="params">(WSL)</span></span></span><br><span class="line">I0301 09:23:38.570708 4693 nvc.c:393] dxcore initialization failed, continuing assuming a non-WSL environment</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.571856</span> <span class="number">4698</span> nvc.c:<span class="number">274</span>] loading kernel <span class="keyword">module</span> nvidia</span><br><span 
class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.572227</span> <span class="number">4698</span> nvc.c:<span class="number">278</span>] <span class="built_in">running</span> mknod <span class="keyword">for</span> /dev/nvidiactl</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.572285</span> <span class="number">4698</span> nvc.c:<span class="number">282</span>] <span class="built_in">running</span> mknod <span class="keyword">for</span> /dev/nvidia0</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.572324</span> <span class="number">4698</span> nvc.c:<span class="number">286</span>] <span class="built_in">running</span> mknod <span class="keyword">for</span> all nvcaps in /dev/nvidia-caps</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.572342</span> <span class="number">4698</span> nvc.c:<span class="number">292</span>] loading kernel <span class="keyword">module</span> nvidia_uvm</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.572382</span> <span class="number">4698</span> nvc.c:<span class="number">296</span>] <span class="built_in">running</span> mknod <span class="keyword">for</span> /dev/nvidia-uvm</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.572472</span> <span class="number">4698</span> nvc.c:<span class="number">301</span>] loading kernel <span class="keyword">module</span> nvidia_modeset</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.572606</span> <span class="number">4698</span> nvc.c:<span class="number">305</span>] <span class="built_in">running</span> mknod <span 
class="keyword">for</span> /dev/nvidia-modeset</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.572891</span> <span class="number">4699</span> driver.c:<span class="number">101</span>] starting driver service</span><br><span class="line">I0301 09:23:38.576138 4693 nvc_container.c:388] configuring container with 'compute utility video supervised'</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.576499</span> <span class="number">4693</span> nvc_container.c:<span class="number">408</span>] setting pid to <span class="number">4658</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.576510</span> <span class="number">4693</span> nvc_container.c:<span class="number">409</span>] setting rootfs to /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.576516</span> <span class="number">4693</span> nvc_container.c:<span class="number">410</span>] setting owner to <span class="number">0</span>:<span class="number">0</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.576522</span> <span class="number">4693</span> nvc_container.c:<span class="number">411</span>] setting bins directory to /usr/bin</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.576528</span> <span class="number">4693</span> nvc_container.c:<span class="number">412</span>] setting libs directory to /usr/lib/x86_64-linux-gnu</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.576534</span> 
<span class="number">4693</span> nvc_container.c:<span class="number">413</span>] setting libs32 directory to /usr/lib/i386-linux-gnu</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.576541</span> <span class="number">4693</span> nvc_container.c:<span class="number">414</span>] setting cudart directory to /usr/local/cuda</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.576547</span> <span class="number">4693</span> nvc_container.c:<span class="number">415</span>] setting ldconfig to @/sbin/ldconfig (host relative)</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.576553</span> <span class="number">4693</span> nvc_container.c:<span class="number">416</span>] setting mount <span class="keyword">namespace</span> to /proc/<span class="number">4658</span>/ns/mnt</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.576559</span> <span class="number">4693</span> nvc_container.c:<span class="number">418</span>] setting devices cgroup to /sys/fs/cgroup/devices/docker/<span class="number">5e868</span>a1fa27cc187630a4b41cbdc8cc50b29a1aa35984f69f00b298db75caf4d</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.576571</span> <span class="number">4693</span> nvc_info.c:<span class="number">680</span>] requesting driver information with <span class="string">''</span></span><br><span class="line">I0301 09:23:38.578231 4693 nvc_info.c:169] selecting /usr/lib64/vdpau/libvdpau_nvidia.so.418.126.02</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.578374</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting 
/usr/lib64/libnvoptix.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.578442</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib64/libnvidia-tls.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.578478</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib64/libnvidia-rtcore.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.578515</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib64/libnvidia-ptxjitcompiler.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.578565</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib64/libnvidia-opticalflow.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.578614</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib64/libnvidia-opencl.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span 
class="number">38.578651</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib64/libnvidia-ml.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.578699</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib64/libnvidia-ifr.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.578749</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib64/libnvidia-glvkspirv.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.578782</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib64/libnvidia-glsi.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.578815</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib64/libnvidia-glcore.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.578851</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib64/libnvidia-fbc.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span 
class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.578896</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib64/libnvidia-fatbinaryloader.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.578941</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib64/libnvidia-encode.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.578989</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib64/libnvidia-eglcore.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.579029</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib64/libnvidia-compiler.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.579071</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib64/libnvidia-cfg.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.579115</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib64/libnvidia-cbl.so<span 
class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.579150</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib64/libnvcuvid.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.579314</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib64/libcuda.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.579405</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib64/libGLX_nvidia.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.579440</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib64/libGLESv2_nvidia.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.579474</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib64/libGLESv1_CM_nvidia.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.579507</span> <span class="number">4693</span> nvc_info.c:<span 
class="number">169</span>] selecting /usr/lib64/libEGL_nvidia.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.579547</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib/vdpau/libvdpau_nvidia.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.579588</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib/libnvidia-tls.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.579626</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib/libnvidia-ptxjitcompiler.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.579673</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib/libnvidia-opticalflow.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.579721</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib/libnvidia-opencl.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span 
class="number">23</span>:<span class="number">38.579755</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib/libnvidia-ml.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.579802</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib/libnvidia-ifr.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.579850</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib/libnvidia-glvkspirv.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.579883</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib/libnvidia-glsi.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.579916</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib/libnvidia-glcore.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.579960</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib/libnvidia-fbc.so<span class="number">.418</span><span class="number">.126</span><span 
class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580007</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib/libnvidia-fatbinaryloader.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580039</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib/libnvidia-encode.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580085</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib/libnvidia-eglcore.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580121</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib/libnvidia-compiler.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580158</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib/libnvcuvid.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580209</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting 
/usr/lib/libcuda.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580258</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib/libGLX_nvidia.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580291</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib/libGLESv2_nvidia.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580324</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib/libGLESv1_CM_nvidia.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580356</span> <span class="number">4693</span> nvc_info.c:<span class="number">169</span>] selecting /usr/lib/libEGL_nvidia.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580373</span> <span class="number">4693</span> nvc_info.c:<span class="number">350</span>] missing library libnvidia-allocator.so</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580381</span> <span class="number">4693</span> nvc_info.c:<span class="number">350</span>] missing library 
libnvidia-ngx.so</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580387</span> <span class="number">4693</span> nvc_info.c:<span class="number">354</span>] missing compat32 library libnvidia-cfg.so</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580393</span> <span class="number">4693</span> nvc_info.c:<span class="number">354</span>] missing compat32 library libnvidia-allocator.so</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580405</span> <span class="number">4693</span> nvc_info.c:<span class="number">354</span>] missing compat32 library libnvidia-ngx.so</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580412</span> <span class="number">4693</span> nvc_info.c:<span class="number">354</span>] missing compat32 library libnvidia-rtcore.so</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580418</span> <span class="number">4693</span> nvc_info.c:<span class="number">354</span>] missing compat32 library libnvoptix.so</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580424</span> <span class="number">4693</span> nvc_info.c:<span class="number">354</span>] missing compat32 library libnvidia-cbl.so</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580710</span> <span class="number">4693</span> nvc_info.c:<span class="number">276</span>] selecting /usr/bin/nvidia-smi</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580731</span> <span class="number">4693</span> nvc_info.c:<span 
class="number">276</span>] selecting /usr/bin/nvidia-debugdump</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580750</span> <span class="number">4693</span> nvc_info.c:<span class="number">276</span>] selecting /usr/bin/nvidia-persistenced</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580771</span> <span class="number">4693</span> nvc_info.c:<span class="number">276</span>] selecting /usr/bin/nvidia-cuda-mps-control</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580791</span> <span class="number">4693</span> nvc_info.c:<span class="number">276</span>] selecting /usr/bin/nvidia-cuda-mps-server</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580821</span> <span class="number">4693</span> nvc_info.c:<span class="number">438</span>] listing device /dev/nvidiactl</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580827</span> <span class="number">4693</span> nvc_info.c:<span class="number">438</span>] listing device /dev/nvidia-uvm</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580833</span> <span class="number">4693</span> nvc_info.c:<span class="number">438</span>] listing device /dev/nvidia-uvm-tools</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580839</span> <span class="number">4693</span> nvc_info.c:<span class="number">438</span>] listing device /dev/nvidia-modeset</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580868</span> <span class="number">4693</span> nvc_info.c:<span 
class="number">321</span>] missing ipc /var/<span class="built_in">run</span>/nvidia-persistenced/socket</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580884</span> <span class="number">4693</span> nvc_info.c:<span class="number">321</span>] missing ipc /tmp/nvidia-mps</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.580891</span> <span class="number">4693</span> nvc_info.c:<span class="number">745</span>] requesting device information with <span class="string">''</span></span><br><span class="line">I0301 09:23:38.588416 4693 nvc_info.c:628] listing device /dev/nvidia0 (GPU-a99e5631-5bcb-b5e1-9a08-e64a25effe1e at 00000000:00:05.0)</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.588494</span> <span class="number">4693</span> nvc_mount.c:<span class="number">354</span>] mounting tmpfs at /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/proc/driver/nvidia</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.588826</span> <span class="number">4693</span> nvc_mount.c:<span class="number">112</span>] mounting /usr/bin/nvidia-smi at /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/usr/bin/nvidia-smi</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.588866</span> <span class="number">4693</span> nvc_mount.c:<span class="number">112</span>] mounting /usr/bin/nvidia-debugdump at /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/usr/bin/nvidia-debugdump</span><br><span class="line">I0301 <span 
class="number">09</span>:<span class="number">23</span>:<span class="number">38.588899</span> <span class="number">4693</span> nvc_mount.c:<span class="number">112</span>] mounting /usr/bin/nvidia-persistenced at /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/usr/bin/nvidia-persistenced</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.588945</span> <span class="number">4693</span> nvc_mount.c:<span class="number">112</span>] mounting /usr/bin/nvidia-cuda-mps-control at /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/usr/bin/nvidia-cuda-mps-control</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.588980</span> <span class="number">4693</span> nvc_mount.c:<span class="number">112</span>] mounting /usr/bin/nvidia-cuda-mps-server at /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/usr/bin/nvidia-cuda-mps-server</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589177</span> <span class="number">4693</span> nvc_mount.c:<span class="number">112</span>] mounting /usr/lib64/libnvidia-ml.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span> at /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589226</span> <span class="number">4693</span> nvc_mount.c:<span 
class="number">112</span>] mounting /usr/lib64/libnvidia-cfg.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span> at /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589268</span> <span class="number">4693</span> nvc_mount.c:<span class="number">112</span>] mounting /usr/lib64/libcuda.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span> at /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/usr/lib/x86_64-linux-gnu/libcuda.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589308</span> <span class="number">4693</span> nvc_mount.c:<span class="number">112</span>] mounting /usr/lib64/libnvidia-opencl.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span> at /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589350</span> <span class="number">4693</span> nvc_mount.c:<span class="number">112</span>] mounting /usr/lib64/libnvidia-ptxjitcompiler.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span> at 
/var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589391</span> <span class="number">4693</span> nvc_mount.c:<span class="number">112</span>] mounting /usr/lib64/libnvidia-fatbinaryloader.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span> at /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589440</span> <span class="number">4693</span> nvc_mount.c:<span class="number">112</span>] mounting /usr/lib64/libnvidia-compiler.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span> at /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589484</span> <span class="number">4693</span> nvc_mount.c:<span class="number">112</span>] mounting /usr/lib64/vdpau/libvdpau_nvidia.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span> at /var/lib/docker/overlay2/<span 
class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/usr/lib/x86_64-linux-gnu/libvdpau_nvidia.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589524</span> <span class="number">4693</span> nvc_mount.c:<span class="number">112</span>] mounting /usr/lib64/libnvidia-encode.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span> at /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/usr/lib/x86_64-linux-gnu/libnvidia-encode.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589566</span> <span class="number">4693</span> nvc_mount.c:<span class="number">112</span>] mounting /usr/lib64/libnvidia-opticalflow.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span> at /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589611</span> <span class="number">4693</span> nvc_mount.c:<span class="number">112</span>] mounting /usr/lib64/libnvcuvid.so<span class="number">.418</span><span class="number">.126</span><span class="number">.02</span> at /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/usr/lib/x86_64-linux-gnu/libnvcuvid.so<span 
class="number">.418</span><span class="number">.126</span><span class="number">.02</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589631</span> <span class="number">4693</span> nvc_mount.c:<span class="number">534</span>] creating symlink /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/usr/lib/x86_64-linux-gnu/libcuda.so -&gt; libcuda.so<span class="number">.1</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589657</span> <span class="number">4693</span> nvc_mount.c:<span class="number">534</span>] creating symlink /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so -&gt; libnvidia-opticalflow.so<span class="number">.1</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589710</span> <span class="number">4693</span> nvc_mount.c:<span class="number">218</span>] zz: mounting /dev/nvidiactl at /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/dev/nvidiactl</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589733</span> <span class="number">4693</span> nvc_mount.c:<span class="number">509</span>] whitelisting device node <span class="number">195</span>:<span class="number">255</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589789</span> <span class="number">4693</span> nvc_mount.c:<span class="number">218</span>] zz: mounting /dev/nvidia-uvm at /var/lib/docker/overlay2/<span 
class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/dev/nvidia-uvm</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589809</span> <span class="number">4693</span> nvc_mount.c:<span class="number">509</span>] whitelisting device node <span class="number">233</span>:<span class="number">0</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589855</span> <span class="number">4693</span> nvc_mount.c:<span class="number">218</span>] zz: mounting /dev/nvidia-uvm-tools at /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/dev/nvidia-uvm-tools</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589875</span> <span class="number">4693</span> nvc_mount.c:<span class="number">509</span>] whitelisting device node <span class="number">233</span>:<span class="number">1</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.589959</span> <span class="number">4693</span> nvc_mount.c:<span class="number">218</span>] zz: mounting /dev/nvidia0 at /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/dev/nvidia0</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.590046</span> <span class="number">4693</span> nvc_mount.c:<span class="number">422</span>] mounting /proc/driver/nvidia/gpus/<span class="number">0000</span>:<span class="number">00</span>:<span class="number">05.0</span> at /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged/proc/driver/nvidia/gpus/<span 
class="number">0000</span>:<span class="number">00</span>:<span class="number">05.0</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.590069</span> <span class="number">4693</span> nvc_mount.c:<span class="number">509</span>] whitelisting device node <span class="number">195</span>:<span class="number">0</span></span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.590097</span> <span class="number">4693</span> nvc_ldcache.c:<span class="number">360</span>] executing /sbin/ldconfig from host at /var/lib/docker/overlay2/<span class="number">09</span>af6a668c457545500c0bc7e152195750c2ccfe948daeae6d8a573a1e738ba0/merged</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.598613</span> <span class="number">4693</span> utils.c:<span class="number">121</span>] /sbin/ldconfig: <span class="built_in">File</span> /usr/lib/x86_64-linux-gnu/libnvidia-ml.so<span class="number">.418</span><span class="number">.67</span> is empty, <span class="keyword">not</span> checked.</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.598725</span> <span class="number">4693</span> utils.c:<span class="number">121</span>] /sbin/ldconfig: <span class="built_in">File</span> /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so<span class="number">.418</span><span class="number">.67</span> is empty, <span class="keyword">not</span> checked.</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.599063</span> <span class="number">4693</span> utils.c:<span class="number">121</span>] /sbin/ldconfig: <span class="built_in">File</span> /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so<span class="number">.1</span> is empty, <span class="keyword">not</span> 
checked.</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.599085</span> <span class="number">4693</span> utils.c:<span class="number">121</span>] /sbin/ldconfig: <span class="built_in">File</span> /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so<span class="number">.1</span> is empty, <span class="keyword">not</span> checked.</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.599139</span> <span class="number">4693</span> utils.c:<span class="number">121</span>] /sbin/ldconfig: <span class="built_in">File</span> /usr/lib/x86_64-linux-gnu/libvdpau_nvidia.so<span class="number">.1</span> is empty, <span class="keyword">not</span> checked.</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.599310</span> <span class="number">4693</span> utils.c:<span class="number">121</span>] /sbin/ldconfig: <span class="built_in">File</span> /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so<span class="number">.418</span><span class="number">.67</span> is empty, <span class="keyword">not</span> checked.</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.599355</span> <span class="number">4693</span> utils.c:<span class="number">121</span>] /sbin/ldconfig: <span class="built_in">File</span> /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so<span class="number">.418</span><span class="number">.67</span> is empty, <span class="keyword">not</span> checked.</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.599417</span> <span class="number">4693</span> utils.c:<span class="number">121</span>] /sbin/ldconfig: <span class="built_in">File</span> /usr/lib/x86_64-linux-gnu/libcuda.so<span class="number">.1</span> is empty, <span 
class="keyword">not</span> checked.</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.599468</span> <span class="number">4693</span> utils.c:<span class="number">121</span>] /sbin/ldconfig: <span class="built_in">File</span> /usr/lib/x86_64-linux-gnu/libcuda.so is empty, <span class="keyword">not</span> checked.</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.599589</span> <span class="number">4693</span> utils.c:<span class="number">121</span>] /sbin/ldconfig: <span class="built_in">File</span> /usr/lib/x86_64-linux-gnu/libnvcuvid.so<span class="number">.418</span><span class="number">.67</span> is empty, <span class="keyword">not</span> checked.</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.599609</span> <span class="number">4693</span> utils.c:<span class="number">121</span>] /sbin/ldconfig: <span class="built_in">File</span> /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so is empty, <span class="keyword">not</span> checked.</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.599628</span> <span class="number">4693</span> utils.c:<span class="number">121</span>] /sbin/ldconfig: <span class="built_in">File</span> /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so<span class="number">.1</span> is empty, <span class="keyword">not</span> checked.</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.599729</span> <span class="number">4693</span> utils.c:<span class="number">121</span>] /sbin/ldconfig: <span class="built_in">File</span> /usr/lib/x86_64-linux-gnu/libcuda.so<span class="number">.418</span><span class="number">.67</span> is empty, <span class="keyword">not</span> checked.</span><br><span 
class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.599749</span> <span class="number">4693</span> utils.c:<span class="number">121</span>] /sbin/ldconfig: <span class="built_in">File</span> /usr/lib/x86_64-linux-gnu/libnvcuvid.so<span class="number">.1</span> is empty, <span class="keyword">not</span> checked.</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.599770</span> <span class="number">4693</span> utils.c:<span class="number">121</span>] /sbin/ldconfig: <span class="built_in">File</span> /usr/lib/x86_64-linux-gnu/libnvidia-encode.so<span class="number">.1</span> is empty, <span class="keyword">not</span> checked.</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.599819</span> <span class="number">4693</span> utils.c:<span class="number">121</span>] /sbin/ldconfig: <span class="built_in">File</span> /usr/lib/x86_64-linux-gnu/libvdpau_nvidia.so<span class="number">.418</span><span class="number">.67</span> is empty, <span class="keyword">not</span> checked.</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.599839</span> <span class="number">4693</span> utils.c:<span class="number">121</span>] /sbin/ldconfig: <span class="built_in">File</span> /usr/lib/x86_64-linux-gnu/libnvidia-ml.so<span class="number">.1</span> is empty, <span class="keyword">not</span> checked.</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.599860</span> <span class="number">4693</span> utils.c:<span class="number">121</span>] /sbin/ldconfig: <span class="built_in">File</span> /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so<span class="number">.418</span><span class="number">.67</span> is empty, <span class="keyword">not</span> checked.</span><br><span 
class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.599951</span> <span class="number">4693</span> utils.c:<span class="number">121</span>] /sbin/ldconfig: <span class="built_in">File</span> /usr/lib/x86_64-linux-gnu/libnvidia-encode.so<span class="number">.418</span><span class="number">.67</span> is empty, <span class="keyword">not</span> checked.</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.600048</span> <span class="number">4693</span> utils.c:<span class="number">121</span>] /sbin/ldconfig: <span class="built_in">File</span> /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so<span class="number">.418</span><span class="number">.67</span> is empty, <span class="keyword">not</span> checked.</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.600056</span> <span class="number">4693</span> utils.c:<span class="number">121</span>] /sbin/ldconfig: <span class="built_in">File</span> /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so<span class="number">.418</span><span class="number">.67</span> is empty, <span class="keyword">not</span> checked.</span><br><span class="line">W0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.600076</span> <span class="number">4693</span> utils.c:<span class="number">121</span>] /sbin/ldconfig: <span class="built_in">File</span> /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so<span class="number">.1</span> is empty, <span class="keyword">not</span> checked.</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.622024</span> <span class="number">4693</span> nvc.c:<span class="number">429</span>] shutting down library context</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span 
class="number">38.628575</span> <span class="number">4699</span> driver.c:<span class="number">156</span>] terminating driver service</span><br><span class="line">I0301 <span class="number">09</span>:<span class="number">23</span>:<span class="number">38.629013</span> <span class="number">4693</span> driver.c:<span class="number">196</span>] driver service terminated successfully</span><br></pre></td></tr></table></figure><h3 id="nvidia-container-toolkit"><a href="#nvidia-container-toolkit" class="headerlink" title="nvidia-container-toolkit"></a><code>nvidia-container-toolkit</code></h3><p><code>runC</code> invokes <code>nvidia-container-toolkit</code> as a <code>PreStart Hook</code>, at which point the container has been created but not yet started. The main job of <code>nvidia-container-toolkit</code> is to gather information (such as the container's rootfs path) together with the settings in <a href="https://github.com/opencontainers/runtime-spec/blob/master/config.md#configuration-schema-example=" target="_blank" rel="external nofollow noopener noreferrer">config.json</a>, assemble the arguments for <code>nvidia-container-cli</code>, and then call <code>nvidia-container-cli</code> with arguments like the following:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">/usr/bin/nvidia-container-cli</span><br><span class="line">--load-kmods <span class="comment"># Load kernel modules</span></span><br><span class="line">--debug=/var/<span class="built_in">log</span>/nvidia-container-toolkit.log configure <span class="comment"># Log debug information</span></span><br><span class="line">  --ldconfig=@/sbin/ldconfig </span><br><span class="line">  --device=all <span class="comment"># Device UUID(s) or index(es) to
isolate</span></span><br><span class="line">  --compute <span class="comment"># Enable compute capability</span></span><br><span class="line">  --utility <span class="comment"># Enable utility capability</span></span><br><span class="line">  --video <span class="comment"># Enable video capability</span></span><br><span class="line">  --require=cuda&gt;=9.0</span><br><span class="line">  --pid=3409 <span class="comment"># Container PID</span></span><br><span class="line">  /var/lib/docker/overlay2/f5a884006ac0e5a1390809bf09209d9e47f2a400d305048b358d7fed735ef799/merged</span><br></pre></td></tr></table></figure><h3 id="nvidia-container-runtime"><a href="#nvidia-container-runtime" class="headerlink" title="nvidia-container-runtime"></a><code>nvidia-container-runtime</code></h3><p>Adding <code>--runtime=nvidia</code> to <code>docker run</code> switches Docker's runtime from runC to <code>nvidia-container-runtime</code>. <code>nvidia-container-runtime</code> is essentially a thin wrapper around <code>runC</code>: it takes the <code>runC</code> spec as input, adds <code>nvidia-container-toolkit</code> as a <code>PreStart</code> hook, and then calls <code>runC</code>.</p><figure class="highlight go"><figcaption><span>github.com/NVIDIA/nvidia-container-runtime/src/main.go</span></figcaption><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line"><span 
class="function"><span class="keyword">func</span> <span class="title">addNVIDIAHook</span><span class="params">(spec *specs.Spec)</span> <span class="title">error</span></span> &#123;</span><br><span class="line">path, err := exec.LookPath(<span class="string">"nvidia-container-runtime-hook"</span>)</span><br><span class="line">args := []<span class="keyword">string</span>&#123;path&#125;</span><br><span class="line">spec.Hooks.Prestart = <span class="built_in">append</span>(spec.Hooks.Prestart, specs.Hook&#123;</span><br><span class="line">Path: path,</span><br><span class="line">Args: <span class="built_in">append</span>(args, <span class="string">"prestart"</span>),</span><br><span class="line">&#125;)</span><br><span class="line"></span><br><span class="line"><span class="keyword">return</span> <span class="literal">nil</span></span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">execRunc</span><span class="params">()</span></span> &#123;</span><br><span class="line">    runcPath, err := exec.LookPath(<span class="string">"docker-runc"</span>) <span class="comment">// if not found, fall back to runc</span></span><br><span class="line">    runcPath, err = exec.LookPath(<span class="string">"runc"</span>)</span><br><span class="line">    syscall.Exec(runcPath, <span class="built_in">append</span>([]<span class="keyword">string</span>&#123;runcPath&#125;, os.Args[<span class="number">1</span>:]...), os.Environ())</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">main</span><span class="params">()</span></span> &#123;</span><br><span class="line">    addNVIDIAHook(&amp;spec)</span><br><span class="line">    execRunc()</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Note that <code>nvidia-container-runtime-hook</code> here is simply a symlink to 
<code>/usr/bin/nvidia-container-toolkit</code>.</p><p>After installing <code>nvidia-container-runtime</code>, you need to either edit Docker's <code>daemon.json</code> for it to take effect or pass the <code>--runtime</code> flag explicitly.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">/etc/docker/daemon.json</span><br><span class="line">&#123;</span><br><span class="line">    <span class="string">"default-runtime"</span>: <span class="string">"nvidia"</span>,</span><br><span class="line">    <span class="string">"runtimes"</span>: &#123;</span><br><span class="line">        <span class="string">"nvidia"</span>: &#123;</span><br><span class="line">            <span class="string">"path"</span>: <span class="string">"/usr/bin/nvidia-container-runtime"</span>,</span><br><span class="line">            <span class="string">"runtimeArgs"</span>: []</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h3 id="nvidia-docker2"><a href="#nvidia-docker2" class="headerlink" title="nvidia-docker2"></a><code>nvidia-docker2</code></h3><p><code>nvidia-docker2</code> is the only package in the NVIDIA Container Toolkit that is specific to Docker. When the user runs <code>docker run/create</code>, it adds the <code>--runtime=nvidia</code> argument and then hands off to the <code>nvidia-container-runtime</code> described above, which carries out the remaining steps and injects the GPU into the container. It also honors the <code>NV_GPU</code> variable, which selects the GPUs to inject into the container.</p><p><code>nvidia-docker</code> itself is essentially a shell script, shown below:</p><figure class="highlight bash"><figcaption><span>github.com/NVIDIA/nvidia-docker/nvidia-docker</span></figcaption><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span 
class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#! /bin/bash</span></span><br><span class="line"><span class="comment"># Copyright (c) 2017-2018, NVIDIA CORPORATION. 
All rights reserved.</span></span><br><span class="line"></span><br><span class="line">NV_DOCKER=<span class="variable">$&#123;NV_DOCKER:-"docker"&#125;</span></span><br><span class="line"></span><br><span class="line">DOCKER_ARGS=()</span><br><span class="line">NV_DOCKER_ARGS=()</span><br><span class="line"><span class="keyword">while</span> [ <span class="variable">$#</span> -gt 0 ]; <span class="keyword">do</span></span><br><span class="line">    arg=<span class="variable">$1</span></span><br><span class="line">    <span class="built_in">shift</span></span><br><span class="line">    DOCKER_ARGS+=(<span class="string">"<span class="variable">$arg</span>"</span>)</span><br><span class="line">    <span class="keyword">case</span> <span class="variable">$arg</span> <span class="keyword">in</span></span><br><span class="line">        run|create)</span><br><span class="line">            NV_DOCKER_ARGS+=(<span class="string">"--runtime=nvidia"</span>)</span><br><span class="line">            <span class="keyword">if</span> [ ! 
-z <span class="string">"<span class="variable">$&#123;NV_GPU&#125;</span>"</span> ]; <span class="keyword">then</span></span><br><span class="line">                NV_DOCKER_ARGS+=(-e NVIDIA_VISIBLE_DEVICES=<span class="string">"<span class="variable">$&#123;NV_GPU// /,&#125;</span>"</span>)</span><br><span class="line">            <span class="keyword">fi</span></span><br><span class="line">            <span class="built_in">break</span></span><br><span class="line">        ;;</span><br><span class="line">        version)</span><br><span class="line">            <span class="built_in">printf</span> <span class="string">"NVIDIA Docker: @VERSION@\n"</span></span><br><span class="line">            <span class="built_in">break</span></span><br><span class="line">        ;;</span><br><span class="line">        --)</span><br><span class="line">            <span class="built_in">break</span></span><br><span class="line">        ;;</span><br><span class="line">    <span class="keyword">esac</span></span><br><span class="line"><span class="keyword">done</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> [ ! 
-z <span class="variable">$NV_DEBUG</span> ]; <span class="keyword">then</span></span><br><span class="line">    <span class="built_in">set</span> -x</span><br><span class="line"><span class="keyword">fi</span></span><br><span class="line"></span><br><span class="line"><span class="built_in">exec</span> <span class="variable">$NV_DOCKER</span> <span class="string">"<span class="variable">$&#123;DOCKER_ARGS[@]&#125;</span>"</span> <span class="string">"<span class="variable">$&#123;NV_DOCKER_ARGS[@]&#125;</span>"</span> <span class="string">"<span class="variable">$@</span>"</span></span><br></pre></td></tr></table></figure><h2 id="部署验证"><a href="#部署验证" class="headerlink" title="部署验证"></a>Deployment and Verification</h2><p>As before, we use a Tencent Cloud CentOS 7 machine to show how to install and configure the <code>NVIDIA Container Toolkit</code>; for other platforms, see the <a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html" target="_blank" rel="external nofollow noopener noreferrer">official documentation</a>. Note that the repository and package commands below are the apt-based (Debian/Ubuntu) variants; on CentOS, use the yum equivalents from the official documentation.</p><h3 id="安装-Docker-CE"><a href="#安装-Docker-CE" class="headerlink" title="安装 Docker CE"></a>Install Docker CE</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$ curl https://get.docker.com | sh \</span><br><span class="line">  &amp;&amp; sudo systemctl start docker \</span><br><span class="line">  &amp;&amp; sudo systemctl <span class="built_in">enable</span> docker</span><br></pre></td></tr></table></figure><h3 id="安装-NVIDIA-Container-Toolkit"><a href="#安装-NVIDIA-Container-Toolkit" class="headerlink" title="安装 NVIDIA Container Toolkit"></a>Install NVIDIA Container Toolkit</h3><p>Set up the <code>stable</code> repository and the GPG key:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$ distribution=$(. 
/etc/os-release;<span class="built_in">echo</span> <span class="variable">$ID</span><span class="variable">$VERSION_ID</span>) \</span><br><span class="line">   &amp;&amp; curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \</span><br><span class="line">   &amp;&amp; curl -s -L https://nvidia.github.io/nvidia-docker/<span class="variable">$distribution</span>/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list</span><br></pre></td></tr></table></figure><p>Install the <code>nvidia-docker2</code> package (and dependencies) after updating the package listing:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">$ sudo apt-get update</span><br></pre></td></tr></table></figure><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">$ sudo apt-get install -y nvidia-docker2</span><br></pre></td></tr></table></figure><p>Restart the Docker daemon to complete the installation after setting the default runtime:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">$ sudo systemctl restart docker</span><br></pre></td></tr></table></figure><p>At this point, a working setup can be tested by running a base CUDA container:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">$ sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi</span><br></pre></td></tr></table></figure><p>This should result in a console output shown below:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span 
class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line">+-----------------------------------------------------------------------------+</span><br><span class="line">| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |</span><br><span class="line">|-------------------------------+----------------------+----------------------+</span><br><span class="line">| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |</span><br><span class="line">| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |</span><br><span class="line">|                               |                      |               MIG M. 
|</span><br><span class="line">|===============================+======================+======================|</span><br><span class="line">|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |</span><br><span class="line">| N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |</span><br><span class="line">|                               |                      |                  N/A |</span><br><span class="line">+-------------------------------+----------------------+----------------------+</span><br><span class="line"></span><br><span class="line">+-----------------------------------------------------------------------------+</span><br><span class="line">| Processes:                                                                  |</span><br><span class="line">|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |</span><br><span class="line">|        ID   ID                                                   Usage      |</span><br><span class="line">|=============================================================================|</span><br><span class="line">|  No running processes found                                                 |</span><br><span class="line">+-----------------------------------------------------------------------------+</span><br></pre></td></tr></table></figure><h3 id="配置-NVIDIA-Runtime"><a href="#配置-NVIDIA-Runtime" class="headerlink" title="配置 NVIDIA Runtime"></a>Configure the NVIDIA Runtime</h3><p>To register the <code>nvidia</code> runtime, use the method below that is best suited to your environment. You might need to merge the new argument with your existing configuration. 
Three options are available:</p><h3 id="Systemd-drop-in-file"><a href="#Systemd-drop-in-file" class="headerlink" title="Systemd drop-in file"></a>Systemd drop-in file</h3><figure class="highlight awk"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">$ sudo mkdir -p <span class="regexp">/etc/</span>systemd<span class="regexp">/system/</span>docker.service.d</span><br></pre></td></tr></table></figure><figure class="highlight groovy"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">$ sudo tee <span class="regexp">/etc/</span>systemd<span class="regexp">/system/</span>docker.service.d/override.conf &lt;&lt;EOF</span><br><span class="line">[Service]</span><br><span class="line">ExecStart=</span><br><span class="line">ExecStart=<span class="regexp">/usr/</span>bin<span class="regexp">/dockerd --host=fd:/</span><span class="regexp">/ --add-runtime=nvidia=/</span>usr<span class="regexp">/bin/</span>nvidia-container-runtime</span><br><span class="line">EOF</span><br></pre></td></tr></table></figure><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">$ sudo systemctl daemon-reload \</span><br><span class="line">  &amp;&amp; sudo systemctl restart docker</span><br></pre></td></tr></table></figure><h3 id="Daemon-configuration-file"><a href="#Daemon-configuration-file" class="headerlink" title="Daemon configuration file"></a>Daemon configuration file</h3><p>The <code>nvidia</code> runtime can also be registered with Docker using the <code>daemon.json</code> configuration file:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 
class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">$ sudo tee /etc/docker/daemon.json &lt;&lt;EOF</span><br><span class="line">&#123;</span><br><span class="line">    <span class="string">"runtimes"</span>: &#123;</span><br><span class="line">        <span class="string">"nvidia"</span>: &#123;</span><br><span class="line">            <span class="string">"path"</span>: <span class="string">"/usr/bin/nvidia-container-runtime"</span>,</span><br><span class="line">            <span class="string">"runtimeArgs"</span>: []</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br><span class="line">EOF</span><br></pre></td></tr></table></figure><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sudo pkill -SIGHUP dockerd</span><br></pre></td></tr></table></figure><p>You can optionally reconfigure the default runtime by adding the following to <code>/etc/docker/daemon.json</code>:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="string">"default-runtime"</span>: <span class="string">"nvidia"</span></span><br></pre></td></tr></table></figure><h3 id="Command-Line"><a href="#Command-Line" class="headerlink" title="Command Line"></a>Command Line</h3><p>Use <code>dockerd</code> to add the <code>nvidia</code> runtime:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">$ sudo dockerd 
--add-runtime=nvidia=/usr/bin/nvidia-container-runtime [...]</span><br></pre></td></tr></table></figure><h2 id="在k8s中管理GPU"><a href="#在k8s中管理GPU" class="headerlink" title="在k8s中管理GPU"></a>Managing GPUs in k8s</h2><p>To manage and use GPUs in k8s, configuring the <code>NVIDIA Container Toolkit</code> is not enough: we also need to install NVIDIA's <a href="https://github.com/NVIDIA/k8s-device-plugin" target="_blank" rel="external nofollow noopener noreferrer">NVIDIA/k8s-device-plugin</a>; see <a href="../3f069334">my earlier post</a> for installation details. Taken together, these steps are still somewhat tedious. If you use Tencent Cloud TKE directly, <code>NVIDIA Container Toolkit</code> and <code>NVIDIA/k8s-device-plugin</code> are installed and configured automatically whenever a GPU node is added to the cluster, which is very convenient. Next, we use TensorFlow as an example to show how to run GPU-enabled TensorFlow in a k8s environment.</p><p>For single-node TensorFlow, run <code>kubectl apply -f tensorflow.yaml</code> to start <code>Jupyter Notebook</code>.</p><figure class="highlight yaml"><figcaption><span>tensorflow.yaml</span></figcaption><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span 
class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">apps/v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Deployment</span></span><br><span class="line"><span class="attr">metadata:</span> </span><br><span class="line">  <span class="attr">name:</span> <span class="string">tensorflow</span></span><br><span class="line">  <span class="attr">labels:</span></span><br><span class="line">    <span class="attr">k8s-app:</span> <span class="string">tensorflow</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">replicas:</span> <span class="number">1</span></span><br><span class="line">  <span class="attr">selector:</span></span><br><span class="line">    <span class="attr">matchLabels:</span></span><br><span class="line">      <span class="attr">k8s-app:</span> <span class="string">tensorflow</span></span><br><span class="line">  <span class="attr">template:</span></span><br><span class="line">    <span class="attr">metadata:</span></span><br><span class="line">      <span class="attr">labels:</span></span><br><span class="line">        <span class="attr">k8s-app:</span> <span class="string">tensorflow</span></span><br><span class="line">    <span class="attr">spec:</span></span><br><span class="line">      <span class="attr">containers:</span></span><br><span class="line">      <span class="bullet">-</span> <span class="attr">name:</span> <span class="string">tensorflow</span></span><br><span class="line">        <span class="attr">image:</span> <span class="string">tensorflow/tensorflow:2.2.1-gpu-py3-jupyter</span></span><br><span class="line">        <span 
class="attr">ports:</span></span><br><span class="line">        <span class="bullet">-</span> <span class="attr">containerPort:</span> <span class="number">8888</span></span><br><span class="line">        <span class="attr">resources:</span></span><br><span class="line">          <span class="attr">limits:</span></span><br><span class="line">            <span class="attr">cpu:</span> <span class="number">4</span></span><br><span class="line">            <span class="attr">memory:</span> <span class="string">2Gi</span></span><br><span class="line">          <span class="attr">requests:</span></span><br><span class="line">            <span class="attr">cpu:</span> <span class="number">2</span></span><br><span class="line">            <span class="attr">memory:</span> <span class="string">1Gi</span></span><br><span class="line"><span class="meta">---</span></span><br><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span class="string">Service</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">    <span class="attr">name:</span> <span class="string">jupyter-service</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">type:</span> <span class="string">NodePort</span></span><br><span class="line">  <span class="attr">ports:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">port:</span> <span class="number">80</span></span><br><span class="line">    <span class="attr">targetPort:</span> <span class="number">8888</span></span><br><span class="line">    <span class="attr">name:</span> <span class="string">tensorflow</span></span><br><span class="line">  <span class="attr">selector:</span></span><br><span class="line">    <span class="attr">k8s-app:</span> <span 
class="string">tensorflow</span></span><br></pre></td></tr></table></figure><p>The container comes up quickly, and <code>Jupyter Notebook</code> is reachable at <code>http://&lt;nodeIP&gt;:&lt;nodePort&gt;</code>, but it asks for a token:</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-22_tensorflow-jupiter.png"></p><p>The <code>TensorFlow</code> pod logs contain the token: <code>aa06c9f12d80adac1a6288b97bf8030522cecc92202dbb20</code></p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br></pre></td><td class="code"><pre><span class="line">[root@VM-1-14-centos single]<span class="comment"># kubectl get pods</span></span><br><span class="line">NAME                          READY   STATUS    RESTARTS   AGE</span><br><span class="line">tensorflow-6cbc85744b-c567p   1/1     Running   0          7m37s</span><br><span class="line">[root@VM-1-14-centos single]<span class="comment"># kubectl logs 
tensorflow-6cbc85744b-c567p</span></span><br><span class="line"></span><br><span class="line">________                               _______________</span><br><span class="line">___  __/__________________________________  ____/__  /________      __</span><br><span class="line">__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /</span><br><span class="line">_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ /</span><br><span class="line">/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">WARNING: You are running this container as root, <span class="built_in">which</span> can cause new files <span class="keyword">in</span></span><br><span class="line">mounted volumes to be created as the root user on your host machine.</span><br><span class="line"></span><br><span class="line">To avoid this, run the container by specifying your user<span class="string">'s userid:</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">$ docker run -u $(id -u):$(id -g) args...</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">[I 04:47:52.083 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret</span></span><br><span class="line"><span class="string">[I 04:47:52.315 NotebookApp] Serving notebooks from local directory: /tf</span></span><br><span class="line"><span class="string">[I 04:47:52.315 NotebookApp] Jupyter Notebook 6.1.4 is running at:</span></span><br><span class="line"><span class="string">[I 04:47:52.315 NotebookApp] http://tensorflow-6cbc85744b-c567p:8888/?token=aa06c9f12d80adac1a6288b97bf8030522cecc92202dbb20</span></span><br><span class="line"><span class="string">[I 04:47:52.315 NotebookApp]  or 
http://127.0.0.1:8888/?token=aa06c9f12d80adac1a6288b97bf8030522cecc92202dbb20</span></span><br><span class="line"><span class="string">[I 04:47:52.315 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).</span></span><br><span class="line"><span class="string">[C 04:47:52.319 NotebookApp]</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">    To access the notebook, open this file in a browser:</span></span><br><span class="line"><span class="string">        file:///root/.local/share/jupyter/runtime/nbserver-1-open.html</span></span><br><span class="line"><span class="string">    Or copy and paste one of these URLs:</span></span><br><span class="line"><span class="string">        http://tensorflow-6cbc85744b-c567p:8888/?token=aa06c9f12d80adac1a6288b97bf8030522cecc92202dbb20</span></span><br><span class="line"><span class="string">     or http://127.0.0.1:8888/?token=aa06c9f12d80adac1a6288b97bf8030522cecc92202dbb20</span></span><br><span class="line"><span class="string">[I 04:49:28.692 NotebookApp] 302 GET / (172.16.0.193) 0.57ms</span></span><br><span class="line"><span class="string">[I 04:49:28.700 NotebookApp] 302 GET /tree? 
(172.16.0.193) 0.67ms</span></span><br></pre></td></tr></table></figure><p>After logging in with the token, you can see <code>Jupyter Notebook</code>:</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-22_tensorflow-jupiter.png"></p><p>Create a new notebook and run the following commands:</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-22_tensorflow-gpu.png"></p><p>As shown, TensorFlow can run computations on the GPU:</p><ul><li><code>&quot;/device:GPU:0&quot;</code>: shorthand for the first GPU on the machine that is visible to TensorFlow.</li><li><code>&quot;/job:localhost/replica:0/task:0/device:GPU:0&quot;</code>: fully qualified name of the first GPU on the machine that is visible to TensorFlow.</li></ul><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>References</h2><ul><li><a href="https://cloud.tencent.com/developer/article/1005137" target="_blank" rel="external nofollow noopener noreferrer">https://cloud.tencent.com/developer/article/1005137</a></li><li><a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/overview.html" target="_blank" rel="external nofollow noopener noreferrer">NVIDIA Container Toolkit</a></li></ul>]]></content>
    
    <summary type="html">
    
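      &lt;p&gt;In &lt;a href=&quot;https://houmin.cc/posts/5004f8e5/&quot;&gt;Introduction to GPU and CUDA Programming&lt;/a&gt; we took a first look at how to use a GPU on Linux. With the rapid growth of containers and k8s, demand for using GPUs inside containers keeps increasing. Building on that post, this article explains how to use GPUs in containers, then how Kubernetes schedules GPUs, and finally, using TensorFlow as an example, how to set up a GPU-enabled deep learning development environment with Docker.&lt;/p&gt;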
    
    </summary>
    
    <content src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-22_nvidia-gpu-docker.png" type="image" />
    
    
      <category term="术业专攻" scheme="https://houmin.cc/categories/%E6%9C%AF%E4%B8%9A%E4%B8%93%E6%94%BB/"/>
    
    
      <category term="GPU" scheme="https://houmin.cc/tags/GPU/"/>
    
      <category term="k8s" scheme="https://houmin.cc/tags/k8s/"/>
    
      <category term="docker" scheme="https://houmin.cc/tags/docker/"/>
    
      <category term="container" scheme="https://houmin.cc/tags/container/"/>
    
      <category term="Nvidia" scheme="https://houmin.cc/tags/Nvidia/"/>
    
      <category term="tensorflow" scheme="https://houmin.cc/tags/tensorflow/"/>
    
  </entry>
  
  <entry>
    <title>【Kubernetes】Device Plugin</title>
    <link href="https://houmin.cc/posts/3f069334/"/>
    <id>https://houmin.cc/posts/3f069334/</id>
    <published>2020-11-16T07:31:42.000Z</published>
    <updated>2022-11-09T15:13:45.393Z</updated>
    
    <content type="html"><![CDATA[<link rel="stylesheet" class="aplayer-secondary-style-marker" href="/assets/css/APlayer.min.css"><script src="/assets/js/APlayer.min.js" class="aplayer-secondary-script-marker"></script><script class="meting-secondary-script-marker" src="/assets/js/Meting.min.js"></script><p>Kubernetes natively discovers CPU and memory resources, but the kubelet cannot natively handle many other devices, such as GPUs, FPGAs, RDMA devices, storage devices, and similar heterogeneous computing resources. Using these devices requires device-specific initialization and setup. Following Kubernetes' <code>OutOfTree</code> philosophy, vendor-specific initialization code should not live alongside the Kubernetes core code. Instead, we need a mechanism that lets device vendors report device resources to the Kubelet without modifying the Kubernetes core. This is the origin of the <code>Device Plugin</code> mechanism. This article explains how Device Plugin works and how to use it.</p><a id="more"></a><h2 id="Device-插件原理"><a href="#Device-插件原理" class="headerlink" title="Device 插件原理"></a>How Device Plugins Work</h2><p>A Device Plugin is essentially a gRPC server. The recommended way to deploy it is as a DaemonSet, with <code>/var/lib/kubelet/device-plugins</code> mounted into the container as a Volume. It can also be started manually, but then it does not recover automatically after a failure.</p><p>Using a vendor's device generally takes two steps:</p><ul><li><code>kubectl create -f http://vendor.com/device-plugin-daemonset.yaml</code></li><li>Run <code>kubectl describe nodes</code>; the device then appears in the node status as <code>vendor-domain/vendor-device</code></li></ul><p>Once a Device Plugin has registered with the kubelet, the kubelet interacts with it over gRPC:</p><ul><li><code>ListAndWatch()</code>: lets the kubelet discover device resources and their properties, and receive notifications when device resources change</li><li><code>Allocate()</code>: called by the kubelet before it creates a container, to request the device resources</li></ul><p><img alt="Process" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-21_k8s-device-plugin.png"></p><h3 id="Registration"><a href="#Registration" class="headerlink" title="Registration"></a>Registration</h3><p>To make the kubelet aware of it, a Device Plugin must send a registration request to the kubelet; only then does the kubelet interact with the plugin over <code>gRPC</code>. The process is as follows:</p><ul><li>The Device Plugin sends a <code>RegisterRequest</code> to the Kubelet</li><li>After receiving the <code>RegisterRequest</code>, the Kubelet returns a <code>RegisterResponse</code> (declared as <code>Empty</code> in the API specification below); if the Kubelet hits any error, it returns the error with the response</li><li>If the Device Plugin receives no error, it starts its gRPC server</li></ul><p>After starting, a plugin must keep monitoring the Kubelet and re-register itself when the Kubelet restarts. For example, a freshly started Kubelet clears the <code>/var/lib/kubelet/device-plugins/</code> directory, so a plugin author can watch whether the plugin's own Unix socket has been deleted and re-register on that event.</p><h3 id="Unix-Socket"><a href="#Unix-Socket" class="headerlink" title="Unix Socket"></a>Unix Socket</h3><p>The Device Plugin and the Kubelet communicate via gRPC over a Unix socket. When it starts its gRPC server, the Device Plugin creates a Unix socket under the HostPath <code>/var/lib/kubelet/device-plugins/</code>, for example <code>/var/lib/kubelet/device-plugins/nvidiaGPU.sock</code>.</p><p>Keep the following in mind when implementing a Device Plugin:</p><ul><li>On startup, the plugin must register with the Kubelet through <code>/var/lib/kubelet/device-plugins/kubelet.sock</code>, providing the name of its Unix socket, the API version, and the resource name (in the form <code>vendor-domain/resource</code>, e.g. <code>nvidia.com/gpu</code>). The Kubelet exposes these devices in the Node status so that the scheduler can use them later</li><li>After startup, the plugin sends the device list to the Kubelet, allocates devices on request, and continuously monitors device state</li></ul><h3 id="Protocol-Overview"><a href="#Protocol-Overview" class="headerlink" title="Protocol Overview"></a>Protocol Overview</h3><p><img alt="Protocol Overview" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-21_k8s-device-plugin-protocol.png"></p><h3 id="API-specification"><a href="#API-specification" class="headerlink" title="API specification"></a>API specification</h3><figure class="highlight protobuf"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span 
class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span 
class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span class="line">106</span><br><span class="line">107</span><br><span class="line">108</span><br><span class="line">109</span><br><span class="line">110</span><br><span class="line">111</span><br><span class="line">112</span><br><span class="line">113</span><br><span class="line">114</span><br><span class="line">115</span><br><span class="line">116</span><br><span class="line">117</span><br><span class="line">118</span><br><span class="line">119</span><br><span class="line">120</span><br><span class="line">121</span><br><span class="line">122</span><br><span class="line">123</span><br><span class="line">124</span><br><span class="line">125</span><br><span class="line">126</span><br><span class="line">127</span><br><span class="line">128</span><br><span class="line">129</span><br><span class="line">130</span><br><span class="line">131</span><br><span class="line">132</span><br><span class="line">133</span><br><span class="line">134</span><br><span class="line">135</span><br><span class="line">136</span><br><span class="line">137</span><br><span class="line">138</span><br><span class="line">139</span><br><span class="line">140</span><br><span class="line">141</span><br><span class="line">142</span><br><span class="line">143</span><br><span class="line">144</span><br><span 
class="line">145</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// Registration is the service advertised by the Kubelet</span></span><br><span class="line"><span class="comment">// Only when Kubelet answers with a success code to a Register Request</span></span><br><span class="line"><span class="comment">// may Device Plugins start their service</span></span><br><span class="line"><span class="comment">// Registration may fail when device plugin version is not supported by</span></span><br><span class="line"><span class="comment">// Kubelet or the registered resourceName is already taken by another</span></span><br><span class="line"><span class="comment">// active device plugin. Device plugin is expected to terminate upon registration failure</span></span><br><span class="line"><span class="class"><span class="keyword">service</span> <span class="title">Registration</span> </span>&#123;</span><br><span class="line">    <span class="function"><span class="keyword">rpc</span> Register(RegisterRequest) <span class="keyword">returns</span> (Empty) &#123;&#125;</span></span><br><span class="line"><span class="function">&#125;</span></span><br><span class="line"><span class="function"></span></span><br><span class="line"><span class="function">message DevicePluginOptions &#123;</span></span><br><span class="line"><span class="function">  // Indicates if PreStartContainer call is required before each container start</span></span><br><span class="line"><span class="function">    bool pre_start_required = 1</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">message</span> <span class="title">RegisterRequest</span> </span>&#123;</span><br><span class="line">    <span class="comment">// Version of the API the Device Plugin was built against</span></span><br><span class="line">    <span class="built_in">string</span> version = <span 
class="number">1</span>;</span><br><span class="line">    <span class="comment">// Name of the unix socket the device plugin is listening on</span></span><br><span class="line">    <span class="comment">// PATH = path.Join(DevicePluginPath, endpoint)</span></span><br><span class="line">    <span class="built_in">string</span> endpoint = <span class="number">2</span>;</span><br><span class="line">    <span class="comment">// Schedulable resource name. As of now it's expected to be a DNS Label</span></span><br><span class="line">    <span class="built_in">string</span> resource_name = <span class="number">3</span>;</span><br><span class="line">    <span class="comment">// Options to be communicated with Device Manager</span></span><br><span class="line">    DevicePluginOptions options = <span class="number">4</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">message</span> <span class="title">Empty</span> </span>&#123;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// DevicePlugin is the service advertised by Device Plugins</span></span><br><span class="line"><span class="class"><span class="keyword">service</span> <span class="title">DevicePlugin</span> </span>&#123;</span><br><span class="line">    <span class="comment">// GetDevicePluginOptions returns options to be communicated with Device</span></span><br><span class="line">    <span class="comment">// Manager</span></span><br><span class="line">    <span class="function"><span class="keyword">rpc</span> GetDevicePluginOptions(Empty) <span class="keyword">returns</span> (DevicePluginOptions) &#123;&#125;</span></span><br><span class="line"><span class="function"></span></span><br><span class="line"><span class="function">    // ListAndWatch <span class="keyword">returns</span> a stream of List of Devices</span></span><br><span class="line"><span class="function">    // 
Whenever a Device state change or a Device disapears, ListAndWatch</span></span><br><span class="line"><span class="function">    // <span class="keyword">returns</span> the new list</span></span><br><span class="line"><span class="function">    <span class="keyword">rpc</span> ListAndWatch(Empty) <span class="keyword">returns</span> (stream ListAndWatchResponse) &#123;&#125;</span></span><br><span class="line"><span class="function"></span></span><br><span class="line"><span class="function">    // Allocate is called during container creation so that the Device</span></span><br><span class="line"><span class="function">    // Plugin can run device specific operations and instruct Kubelet</span></span><br><span class="line"><span class="function">    // of the steps to make the Device available in the container</span></span><br><span class="line"><span class="function">    <span class="keyword">rpc</span> Allocate(AllocateRequest) <span class="keyword">returns</span> (AllocateResponse) &#123;&#125;</span></span><br><span class="line"><span class="function"></span></span><br><span class="line"><span class="function">    // PreStartContainer is called, if indicated by Device Plugin during registeration phase,</span></span><br><span class="line"><span class="function">    // before each container start. 
Device plugin can run device specific operations</span></span><br><span class="line"><span class="function">    // such as reseting the device before making devices available to the container</span></span><br><span class="line"><span class="function">    <span class="keyword">rpc</span> PreStartContainer(PreStartContainerRequest) <span class="keyword">returns</span> (PreStartContainerResponse) &#123;&#125;</span></span><br><span class="line"><span class="function">&#125;</span></span><br><span class="line"><span class="function"></span></span><br><span class="line"><span class="function">// ListAndWatch <span class="keyword">returns</span> a stream of List of Devices</span></span><br><span class="line"><span class="function">// Whenever a Device state change or a Device disapears, ListAndWatch</span></span><br><span class="line"><span class="function">// <span class="keyword">returns</span> the new list</span></span><br><span class="line"><span class="function">message ListAndWatchResponse &#123;</span></span><br><span class="line"><span class="function">    repeated Device devices = 1</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">/* E.g:</span><br><span class="line">* struct Device &#123;</span><br><span class="line">*    ID: <span class="string">"GPU-fef8089b-4820-abfc-e83e-94318197576e"</span>,</span><br><span class="line">*    State: <span class="string">"Healthy"</span>,</span><br><span class="line">*&#125; */</span><br><span class="line"><span class="class"><span class="keyword">message</span> <span class="title">Device</span> </span>&#123;</span><br><span class="line">    <span class="comment">// A unique ID assigned by the device plugin used</span></span><br><span class="line">    <span class="comment">// to identify devices during the communication</span></span><br><span class="line">    <span class="comment">// Max length of this field is 63 characters</span></span><br><span class="line">    <span 
class="built_in">string</span> ID = <span class="number">1</span>;</span><br><span class="line">    <span class="comment">// Health of the device, can be healthy or unhealthy, see constants.go</span></span><br><span class="line">    <span class="built_in">string</span> health = <span class="number">2</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// - PreStartContainer is expected to be called before each container start if indicated by plugin during registration phase.</span></span><br><span class="line"><span class="comment">// - PreStartContainer allows kubelet to pass reinitialized devices to containers.</span></span><br><span class="line"><span class="comment">// - PreStartContainer allows Device Plugin to run device specific operations on</span></span><br><span class="line"><span class="comment">//   the Devices requested</span></span><br><span class="line"><span class="class"><span class="keyword">message</span> <span class="title">PreStartContainerRequest</span> </span>&#123;</span><br><span class="line">    <span class="keyword">repeated</span> <span class="built_in">string</span> devicesIDs = <span class="number">1</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// PreStartContainerResponse will be send by plugin in response to PreStartContainerRequest</span></span><br><span class="line"><span class="class"><span class="keyword">message</span> <span class="title">PreStartContainerResponse</span> </span>&#123;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// - Allocate is expected to be called during pod creation since allocation</span></span><br><span class="line"><span class="comment">//   failures for any container would result in pod startup failure.</span></span><br><span class="line"><span class="comment">// - Allocate allows 
kubelet to exposes additional artifacts in a pod's</span></span><br><span class="line"><span class="comment">//   environment as directed by the plugin.</span></span><br><span class="line"><span class="comment">// - Allocate allows Device Plugin to run device specific operations on</span></span><br><span class="line"><span class="comment">//   the Devices requested</span></span><br><span class="line"><span class="class"><span class="keyword">message</span> <span class="title">AllocateRequest</span> </span>&#123;</span><br><span class="line">    <span class="keyword">repeated</span> ContainerAllocateRequest container_requests = <span class="number">1</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">message</span> <span class="title">ContainerAllocateRequest</span> </span>&#123;</span><br><span class="line">    <span class="keyword">repeated</span> <span class="built_in">string</span> devicesIDs = <span class="number">1</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// AllocateResponse includes the artifacts that needs to be injected into</span></span><br><span class="line"><span class="comment">// a container for accessing 'deviceIDs' that were mentioned as part of</span></span><br><span class="line"><span class="comment">// 'AllocateRequest'.</span></span><br><span class="line"><span class="comment">// Failure Handling:</span></span><br><span class="line"><span class="comment">// if Kubelet sends an allocation request for dev1 and dev2.</span></span><br><span class="line"><span class="comment">// Allocation on dev1 succeeds but allocation on dev2 fails.</span></span><br><span class="line"><span class="comment">// The Device plugin should send a ListAndWatch update and fail the</span></span><br><span class="line"><span class="comment">// Allocation request</span></span><br><span 
class="line"><span class="class"><span class="keyword">message</span> <span class="title">AllocateResponse</span> </span>&#123;</span><br><span class="line">    <span class="keyword">repeated</span> ContainerAllocateResponse container_responses = <span class="number">1</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">message</span> <span class="title">ContainerAllocateResponse</span> </span>&#123;</span><br><span class="line">    <span class="comment">// List of environment variable to be set in the container to access one of more devices.</span></span><br><span class="line">    map&lt;<span class="built_in">string</span>, <span class="built_in">string</span>&gt; envs = <span class="number">1</span>;</span><br><span class="line">    <span class="comment">// Mounts for the container.</span></span><br><span class="line">    <span class="keyword">repeated</span> Mount mounts = <span class="number">2</span>;</span><br><span class="line">    <span class="comment">// Devices for the container.</span></span><br><span class="line">    <span class="keyword">repeated</span> DeviceSpec devices = <span class="number">3</span>;</span><br><span class="line">    <span class="comment">// Container annotations to pass to the container runtime</span></span><br><span class="line">    map&lt;<span class="built_in">string</span>, <span class="built_in">string</span>&gt; annotations = <span class="number">4</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// Mount specifies a host volume to mount into a container.</span></span><br><span class="line"><span class="comment">// where device library or tools are installed on host and container</span></span><br><span class="line"><span class="class"><span class="keyword">message</span> <span class="title">Mount</span> </span>&#123;</span><br><span class="line">    <span 
class="comment">// Path of the mount within the container.</span></span><br><span class="line">    <span class="built_in">string</span> container_path = <span class="number">1</span>;</span><br><span class="line">    <span class="comment">// Path of the mount on the host.</span></span><br><span class="line">    <span class="built_in">string</span> host_path = <span class="number">2</span>;</span><br><span class="line">    <span class="comment">// If set, the mount is read-only.</span></span><br><span class="line">    <span class="built_in">bool</span> read_only = <span class="number">3</span>;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// DeviceSpec specifies a host device to mount into a container.</span></span><br><span class="line"><span class="class"><span class="keyword">message</span> <span class="title">DeviceSpec</span> </span>&#123;</span><br><span class="line">  <span class="comment">// Path of the device within the container.</span></span><br><span class="line">  <span class="built_in">string</span> container_path = <span class="number">1</span>;</span><br><span class="line">  <span class="comment">// Path of the device on the host.</span></span><br><span class="line">  <span class="built_in">string</span> host_path = <span class="number">2</span>;</span><br><span class="line">  <span class="comment">// Cgroups permissions of the device, candidates are one or more of</span></span><br><span class="line">  <span class="comment">// * r - allows container to read from the specified device.</span></span><br><span class="line">  <span class="comment">// * w - allows container to write to the specified device.</span></span><br><span class="line">  <span class="comment">// * m - allows container to create device files that do not yet exist.</span></span><br><span class="line">  <span class="built_in">string</span> permissions = <span class="number">3</span>;</span><br><span 
class="line">&#125;</span><br></pre></td></tr></table></figure><h3 id="插件生命周期管理"><a href="#插件生命周期管理" class="headerlink" title="插件生命周期管理"></a>Plugin Lifecycle Management</h3><p>On startup, the plugin registers with the Kubelet via gRPC through /var/lib/kubelet/device-plugins/kubelet.sock, providing the Unix socket it listens on, the API version, and the resource name (e.g. nvidia.com/gpu). The Kubelet exposes these devices in the Node status and reports them to the API server as Extended Resources; the Scheduler later uses this information for scheduling.</p><p>After the plugin starts, the Kubelet opens a long-lived ListAndWatch stream to it. When the plugin detects that a device is unhealthy, it proactively notifies the Kubelet. If the device is idle, the Kubelet removes it from the allocatable list; if a Pod is already using the device, the Kubelet kills that Pod.</p><p>After starting, the plugin can use the Kubelet's socket to keep checking the Kubelet's state. If the Kubelet restarts, the plugin restarts as well and re-registers itself with the Kubelet.</p><h2 id="NVIDIA-Device-Plugin"><a href="#NVIDIA-Device-Plugin" class="headerlink" title="NVIDIA Device Plugin"></a>NVIDIA Device Plugin</h2><p>NVIDIA provides a GPU device plugin built on the Device Plugins interface: <a href="https://github.com/NVIDIA/k8s-device-plugin" target="_blank" rel="external nofollow noopener noreferrer">NVIDIA/k8s-device-plugin</a>.</p><p>Deploy the plugin:</p><figure class="highlight sh"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml</span><br></pre></td></tr></table></figure><p>Request GPU resources when creating a Pod:</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="attr">apiVersion:</span> <span class="string">v1</span></span><br><span class="line"><span class="attr">kind:</span> <span 
class="string">Pod</span></span><br><span class="line"><span class="attr">metadata:</span></span><br><span class="line">  <span class="attr">name:</span> <span class="string">pod1</span></span><br><span class="line"><span class="attr">spec:</span></span><br><span class="line">  <span class="attr">restartPolicy:</span> <span class="string">OnFailure</span></span><br><span class="line">  <span class="attr">containers:</span></span><br><span class="line">  <span class="bullet">-</span> <span class="attr">image:</span> <span class="string">nvidia/cuda</span></span><br><span class="line">    <span class="attr">name:</span> <span class="string">pod1-ctr</span></span><br><span class="line">    <span class="attr">command:</span> <span class="string">["sleep"]</span></span><br><span class="line">    <span class="attr">args:</span> <span class="string">["100000"]</span></span><br><span class="line"></span><br><span class="line">    <span class="attr">resources:</span></span><br><span class="line">      <span class="attr">limits:</span></span><br><span class="line">        <span class="attr">nvidia.com/gpu:</span> <span class="number">1</span></span><br></pre></td></tr></table></figure><p>Note: this plugin requires <a href="https://github.com/NVIDIA/nvidia-docker/" target="_blank" rel="external nofollow noopener noreferrer">nvidia-docker 2.0</a>, with <code>nvidia</code> configured as the default runtime (i.e. the docker daemon option <code>--default-runtime=nvidia</code>). For how to install nvidia-docker 2.0 on Ubuntu Xenial and other systems, see <a href="http://nvidia.github.io/nvidia-docker/" target="_blank" rel="external nofollow noopener noreferrer">the official instructions</a>.</p><p>The overall process by which Kubernetes schedules GPUs is as follows:</p><ul><li>The GPU Device Plugin is deployed on GPU nodes and, through the <code>ListAndWatch</code> interface, reports the node's GPU information and the corresponding DeviceIDs.</li><li>When a GPU Pod that declares <code>nvidia.com/gpu</code> is created, the scheduler takes the availability of GPU devices into account and schedules the Pod onto a node with enough free GPUs.</li><li>When the kubelet on that node starts the Pod, it calls the Allocate interface of the relevant Device Plugins according to the resources declared in the request. Because the container declares a GPU, the kubelet picks suitable devices from the Device information received earlier through <code>ListAndWatch</code> and calls the GPU Device Plugin's <code>Allocate</code> interface with their DeviceIDs as arguments.</li><li>The GPU Device Plugin receives the call, converts the DeviceIDs into the <code>NVIDIA_VISIBLE_DEVICES</code> environment variable, and returns it to the kubelet.</li><li>The kubelet injects the environment variable into the Pod and starts the container.</li><li>When the container starts, <code>gpu-container-runtime</code> invokes <code>gpu-containers-runtime-hook</code>.</li><li><code>gpu-containers-runtime-hook</code> converts the container's <code>NVIDIA_VISIBLE_DEVICES</code> environment variable into a <code>--devices</code> argument and calls <code>nvidia-container-cli prestart</code>.</li><li><code>nvidia-container-cli</code> maps the GPU devices into the container according to <code>--devices</code>, and also maps the host's NVIDIA driver libraries (.so files) into the container. The container can then call the host's NVIDIA driver through these .so files.</li></ul><p>The <code>API Specification</code> above uses <code>Protobuf</code> to define the services a <code>DevicePlugin</code> must provide. The <code>Kubelet</code> uses these services through <code>DevicePluginClient</code>, which is code generated automatically from the <code>Protobuf</code> definition.</p><figure class="highlight go"><figcaption><span>k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1/api.pb.go</span></figcaption><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">type</span> DevicePluginClient <span class="keyword">interface</span> &#123;</span><br><span class="line">    GetDevicePluginOptions(ctx context.Context, in *Empty, opts ...grpc.CallOption) (*DevicePluginOptions, error)</span><br><span class="line">    ListAndWatch(ctx context.Context, in *Empty, opts ...grpc.CallOption) (DevicePlugin_ListAndWatchClient, error)</span><br><span class="line">    Allocate(ctx context.Context, in *AllocateRequest, opts ...grpc.CallOption) (*AllocateResponse, error)</span><br><span class="line">    PreStartContainer(ctx context.Context, in *PreStartContainerRequest, opts ...grpc.CallOption) (*PreStartContainerResponse, error)</span><br><span 
class="line">&#125;</span><br></pre></td></tr></table></figure><p>In <code>NVIDIA/k8s-device-plugin</code>, we can see the concrete implementations of the services above:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(m *NvidiaDevicePlugin)</span> <span class="title">GetDevicePluginOptions</span><span class="params">(context.Context, *pluginapi.Empty)</span> <span class="params">(*pluginapi.DevicePluginOptions, error)</span></span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(m *NvidiaDevicePlugin)</span> <span class="title">ListAndWatch</span><span class="params">(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer)</span> <span class="title">error</span></span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(m *NvidiaDevicePlugin)</span> <span class="title">Allocate</span><span class="params">(ctx context.Context, reqs *pluginapi.AllocateRequest)</span> <span class="params">(*pluginapi.AllocateResponse, error)</span></span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(m *NvidiaDevicePlugin)</span> <span class="title">PreStartContainer</span><span class="params">(context.Context, *pluginapi.PreStartContainerRequest)</span> <span class="params">(*pluginapi.PreStartContainerResponse, error)</span></span></span><br></pre></td></tr></table></figure><p>For <code>NVIDIA/k8s-device-plugin</code>, the key data structure is <code>NvidiaDevicePlugin</code>, which implements the API defined by the <code>Device Plugin</code> framework:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span 
class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">type</span> NvidiaDevicePlugin <span class="keyword">struct</span> &#123;</span><br><span class="line">    ResourceManager</span><br><span class="line">    resourceName     <span class="keyword">string</span></span><br><span class="line">    deviceListEnvvar <span class="keyword">string</span></span><br><span class="line">    allocatePolicy   gpuallocator.Policy</span><br><span class="line">    socket           <span class="keyword">string</span></span><br><span class="line"></span><br><span class="line">    server        *grpc.Server</span><br><span class="line">    cachedDevices []*Device</span><br><span class="line">    health        <span class="keyword">chan</span> *Device</span><br><span class="line">    stop          <span class="keyword">chan</span> <span class="keyword">interface</span>&#123;&#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>The following sections walk through the <code>Device Plugin</code> lifecycle and examine how each stage is implemented.</p><h3 id="NVIDIA-DevicePlugin-启动"><a href="#NVIDIA-DevicePlugin-启动" class="headerlink" title="NVIDIA DevicePlugin 启动"></a>NVIDIA DevicePlugin Startup</h3><p>After it starts, <code>NVIDIA</code>'s <code>k8s-device-plugin</code> runs the logic below, which does three things:</p><ul><li>Serve: starts the <code>gRPC server</code></li><li>Register: registers the given <code>resourceName</code> with <code>Kubelet</code></li><li>CheckHealth: runs the device health checks and writes any device found to be unhealthy to the <code>unhealthy</code> channel</li></ul><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span 
class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(m *NvidiaDevicePlugin)</span> <span class="title">Start</span><span class="params">()</span> <span class="title">error</span></span> &#123;</span><br><span class="line">    m.initialize()</span><br><span class="line"></span><br><span class="line">    err := m.Serve()</span><br><span class="line">    <span class="comment">// ...</span></span><br><span class="line"></span><br><span class="line">    err = m.Register()</span><br><span class="line">    <span class="comment">// ...</span></span><br><span class="line"></span><br><span class="line">    <span class="keyword">go</span> m.CheckHealth(m.stop, m.cachedDevices, m.health)</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> <span class="literal">nil</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h4 id="Serve"><a href="#Serve" class="headerlink" title="Serve"></a>Serve</h4><p><code>Serve</code> listens on the <code>Unix Socket</code> <code>/var/lib/kubelet/device-plugins/nvidia-gpu.sock</code> and starts the <code>gRPC server</code>. The rest of the function is retry logic: it restarts the server after a crash and quits once the server has crashed repeatedly within a short period.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span 
class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(m *NvidiaDevicePlugin)</span> <span class="title">Serve</span><span class="params">()</span> <span class="title">error</span></span> &#123;</span><br><span class="line">    os.Remove(m.socket)</span><br><span class="line">    sock, err := net.Listen(<span class="string">"unix"</span>, m.socket)</span><br><span class="line">    <span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">        <span class="keyword">return</span> err</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    pluginapi.RegisterDevicePluginServer(m.server, m)</span><br><span class="line"></span><br><span class="line">    <span class="keyword">go</span> <span class="function"><span class="keyword">func</span><span 
class="params">()</span></span> &#123;</span><br><span class="line">        lastCrashTime := time.Now()</span><br><span class="line">        restartCount := <span class="number">0</span></span><br><span class="line">        <span class="keyword">for</span> &#123;</span><br><span class="line">            log.Printf(<span class="string">"Starting GRPC server for '%s'"</span>, m.resourceName)</span><br><span class="line">            err := m.server.Serve(sock)</span><br><span class="line">            <span class="keyword">if</span> err == <span class="literal">nil</span> &#123;</span><br><span class="line">                <span class="keyword">break</span></span><br><span class="line">            &#125;</span><br><span class="line"></span><br><span class="line">            log.Printf(<span class="string">"GRPC server for '%s' crashed with error: %v"</span>, m.resourceName, err)</span><br><span class="line"></span><br><span class="line">            <span class="comment">// restart if it has not been too often</span></span><br><span class="line">            <span class="comment">// i.e. if server has crashed more than 5 times and it didn't last more than one hour each time</span></span><br><span class="line">            <span class="keyword">if</span> restartCount &gt; <span class="number">5</span> &#123;</span><br><span class="line">                <span class="comment">// quit</span></span><br><span class="line">                log.Fatalf(<span class="string">"GRPC server for '%s' has repeatedly crashed recently. 
Quitting"</span>, m.resourceName)</span><br><span class="line">            &#125;</span><br><span class="line">            timeSinceLastCrash := time.Since(lastCrashTime).Seconds()</span><br><span class="line">            lastCrashTime = time.Now()</span><br><span class="line">            <span class="keyword">if</span> timeSinceLastCrash &gt; <span class="number">3600</span> &#123;</span><br><span class="line">                <span class="comment">// it has been one hour since the last crash.. reset the count</span></span><br><span class="line">                <span class="comment">// to reflect on the frequency</span></span><br><span class="line">                restartCount = <span class="number">1</span></span><br><span class="line">            &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">                restartCount++</span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;()</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Wait for server to start by launching a blocking connection</span></span><br><span class="line">    conn, err := m.dial(m.socket, <span class="number">5</span>*time.Second)</span><br><span class="line">    <span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">        <span class="keyword">return</span> err</span><br><span class="line">    &#125;</span><br><span class="line">    conn.Close()</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> <span class="literal">nil</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h4 id="Register"><a href="#Register" class="headerlink" title="Register"></a>Register</h4><p><code>Register</code> connects to <code>Kubelet</code> over the <code>Unix Socket</code> <code>/var/lib/kubelet/device-plugins/kubelet.sock</code> and registers the plugin, passing 
the endpoint of the <code>DevicePlugin</code>'s own <code>Unix Socket</code>, the resource name, the API version, and related options.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(m *NvidiaDevicePlugin)</span> <span class="title">Register</span><span class="params">()</span> <span class="title">error</span></span> &#123;</span><br><span class="line">    conn, err := m.dial(pluginapi.KubeletSocket, <span class="number">5</span>*time.Second)</span><br><span class="line">    <span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">        <span class="keyword">return</span> err</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">defer</span> conn.Close()</span><br><span class="line"></span><br><span class="line">    client := pluginapi.NewRegistrationClient(conn)</span><br><span class="line">    reqt := &amp;pluginapi.RegisterRequest&#123;</span><br><span class="line">        Version:      pluginapi.Version,</span><br><span class="line">        Endpoint:     path.Base(m.socket),</span><br><span class="line">        ResourceName: m.resourceName,</span><br><span 
class="line">        Options: &amp;pluginapi.DevicePluginOptions&#123;</span><br><span class="line">            GetPreferredAllocationAvailable: (m.allocatePolicy != <span class="literal">nil</span>),</span><br><span class="line">        &#125;,</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    _, err = client.Register(context.Background(), reqt)</span><br><span class="line">    <span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">        <span class="keyword">return</span> err</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">return</span> <span class="literal">nil</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h4 id="CheckHealth"><a href="#CheckHealth" class="headerlink" title="CheckHealth"></a>CheckHealth</h4><p>The health check calls <code>nvml.NewEventSet</code> to listen for GPU state-change events and sends each <code>unhealthy Device</code> to the <code>m.health</code> <code>channel</code>.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span 
class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">checkHealth</span><span class="params">(stop &lt;-<span class="keyword">chan</span> <span class="keyword">interface</span>&#123;&#125;, devices []*Device, unhealthy <span class="keyword">chan</span>&lt;- *Device)</span></span> &#123;</span><br><span class="line">    disableHealthChecks := strings.ToLower(os.Getenv(envDisableHealthChecks))</span><br><span class="line">    <span class="keyword">if</span> disableHealthChecks == <span 
class="string">"all"</span> &#123;</span><br><span class="line">        disableHealthChecks = allHealthChecks</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">if</span> strings.Contains(disableHealthChecks, <span class="string">"xids"</span>) &#123;</span><br><span class="line">        <span class="keyword">return</span></span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    eventSet := nvml.NewEventSet()</span><br><span class="line">    <span class="keyword">defer</span> nvml.DeleteEventSet(eventSet)</span><br><span class="line"></span><br><span class="line">    <span class="keyword">for</span> _, d := <span class="keyword">range</span> devices &#123;</span><br><span class="line">        gpu, _, _, err := nvml.ParseMigDeviceUUID(d.ID)</span><br><span class="line">        <span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">            gpu = d.ID</span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span class="line">        err = nvml.RegisterEventForDevice(eventSet, nvml.XidCriticalError, gpu)</span><br><span class="line">        <span class="keyword">if</span> err != <span class="literal">nil</span> &amp;&amp; strings.HasSuffix(err.Error(), <span class="string">"Not Supported"</span>) &#123;</span><br><span class="line">            log.Printf(<span class="string">"Warning: %s is too old to support healthchecking: %s. 
Marking it unhealthy."</span>, d.ID, err)</span><br><span class="line">            unhealthy &lt;- d</span><br><span class="line">            <span class="keyword">continue</span></span><br><span class="line">        &#125;</span><br><span class="line">        check(err)</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">for</span> &#123;</span><br><span class="line">        <span class="keyword">select</span> &#123;</span><br><span class="line">        <span class="keyword">case</span> &lt;-stop:</span><br><span class="line">            <span class="keyword">return</span></span><br><span class="line">        <span class="keyword">default</span>:</span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span class="line">        e, err := nvml.WaitForEvent(eventSet, <span class="number">5000</span>)</span><br><span class="line">        <span class="keyword">if</span> err != <span class="literal">nil</span> &amp;&amp; e.Etype != nvml.XidCriticalError &#123;</span><br><span class="line">            <span class="keyword">continue</span></span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span class="line">        <span class="comment">// <span class="doctag">FIXME:</span> formalize the full list and document it.</span></span><br><span class="line">        <span class="comment">// http://docs.nvidia.com/deploy/xid-errors/index.html#topic_4</span></span><br><span class="line">        <span class="comment">// Application errors: the GPU should still be healthy</span></span><br><span class="line">        <span class="keyword">if</span> e.Edata == <span class="number">31</span> || e.Edata == <span class="number">43</span> || e.Edata == <span class="number">45</span> &#123;</span><br><span class="line">            <span class="keyword">continue</span></span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span 
class="line">        <span class="keyword">if</span> e.UUID == <span class="literal">nil</span> || <span class="built_in">len</span>(*e.UUID) == <span class="number">0</span> &#123;</span><br><span class="line">            <span class="comment">// All devices are unhealthy</span></span><br><span class="line">            log.Printf(<span class="string">"XidCriticalError: Xid=%d, All devices will go unhealthy."</span>, e.Edata)</span><br><span class="line">            <span class="keyword">for</span> _, d := <span class="keyword">range</span> devices &#123;</span><br><span class="line">                unhealthy &lt;- d</span><br><span class="line">            &#125;</span><br><span class="line">            <span class="keyword">continue</span></span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span class="line">        <span class="keyword">for</span> _, d := <span class="keyword">range</span> devices &#123;</span><br><span class="line">            <span class="comment">// Please see https://github.com/NVIDIA/gpu-monitoring-tools/blob/148415f505c96052cb3b7fdf443b34ac853139ec/bindings/go/nvml/nvml.h#L1424</span></span><br><span class="line">            <span class="comment">// for the rationale why gi and ci can be set as such when the UUID is a full GPU UUID and not a MIG device UUID.</span></span><br><span class="line">            gpu, gi, ci, err := nvml.ParseMigDeviceUUID(d.ID)</span><br><span class="line">            <span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">                gpu = d.ID</span><br><span class="line">                gi = <span class="number">0xFFFFFFFF</span></span><br><span class="line">                ci = <span class="number">0xFFFFFFFF</span></span><br><span class="line">            &#125;</span><br><span class="line"></span><br><span class="line">            <span class="keyword">if</span> gpu == *e.UUID &amp;&amp; gi == *e.GpuInstanceId 
&amp;&amp; ci == *e.ComputeInstanceId &#123;</span><br><span class="line">                log.Printf(<span class="string">"XidCriticalError: Xid=%d on Device=%s, the device will go unhealthy."</span>, e.Edata, d.ID)</span><br><span class="line">                unhealthy &lt;- d</span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h3 id="Kubelet-DeviceManager"><a href="#Kubelet-DeviceManager" class="headerlink" title="Kubelet DeviceManager"></a>Kubelet DeviceManager</h3><h4 id="DeviceManager-启动"><a href="#DeviceManager-启动" class="headerlink" title="DeviceManager 启动"></a>DeviceManager Startup</h4><figure class="highlight go"><figcaption><span>kubernetes/pkg/kubelet/cm/devicemanager/manager.go</span></figcaption><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span 
class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">type</span> ManagerImpl <span class="keyword">struct</span> &#123;</span><br><span class="line">    socketname <span class="keyword">string</span></span><br><span class="line">    socketdir  <span class="keyword">string</span></span><br><span class="line"></span><br><span class="line">    endpoints <span class="keyword">map</span>[<span class="keyword">string</span>]endpointInfo <span class="comment">// Key is ResourceName</span></span><br><span class="line">    mutex     sync.Mutex</span><br><span class="line"></span><br><span class="line">    server *grpc.Server</span><br><span class="line">    wg     sync.WaitGroup</span><br><span class="line"></span><br><span class="line">    <span class="comment">// activePods is a method for listing active pods on the node</span></span><br><span class="line">    <span class="comment">// so the amount of pluginResources requested by existing pods</span></span><br><span class="line">    <span class="comment">// could be counted when updating allocated devices</span></span><br><span class="line">    activePods ActivePodsFunc</span><br><span class="line"></span><br><span class="line">    <span class="comment">// sourcesReady provides the readiness of kubelet configuration sources such as apiserver update readiness.</span></span><br><span class="line">    <span class="comment">// We use it to determine when we can purge inactive pods from checkpointed state.</span></span><br><span 
class="line">    sourcesReady config.SourcesReady</span><br><span class="line"></span><br><span class="line">    <span class="comment">// callback is used for updating devices' states in one time call.</span></span><br><span class="line">    <span class="comment">// e.g. a new device is advertised, two old devices are deleted and a running device fails.</span></span><br><span class="line">    callback monitorCallback</span><br><span class="line"></span><br><span class="line">    <span class="comment">// allDevices is a map by resource name of all the devices currently registered to the device manager</span></span><br><span class="line">    allDevices <span class="keyword">map</span>[<span class="keyword">string</span>]<span class="keyword">map</span>[<span class="keyword">string</span>]pluginapi.Device</span><br><span class="line"></span><br><span class="line">    <span class="comment">// healthyDevices contains all of the registered healthy resourceNames and their exported device IDs.</span></span><br><span class="line">    healthyDevices <span class="keyword">map</span>[<span class="keyword">string</span>]sets.String</span><br><span class="line"></span><br><span class="line">    <span class="comment">// unhealthyDevices contains all of the unhealthy devices and their exported device IDs.</span></span><br><span class="line">    unhealthyDevices <span class="keyword">map</span>[<span class="keyword">string</span>]sets.String</span><br><span class="line"></span><br><span class="line">    <span class="comment">// allocatedDevices contains allocated deviceIds, keyed by resourceName.</span></span><br><span class="line">    allocatedDevices <span class="keyword">map</span>[<span class="keyword">string</span>]sets.String</span><br><span class="line"></span><br><span class="line">    <span class="comment">// podDevices contains pod to allocated device mapping.</span></span><br><span class="line">    podDevices        podDevices</span><br><span class="line">    
checkpointManager checkpointmanager.CheckpointManager</span><br><span class="line"></span><br><span class="line">    <span class="comment">// List of NUMA Nodes available on the underlying machine</span></span><br><span class="line">    numaNodes []<span class="keyword">int</span></span><br><span class="line"></span><br><span class="line">    <span class="comment">// Store of Topology Affinities that the Device Manager can query.</span></span><br><span class="line">    topologyAffinityStore topologymanager.Store</span><br><span class="line"></span><br><span class="line">    <span class="comment">// devicesToReuse contains devices that can be reused as they have been allocated to</span></span><br><span class="line">    <span class="comment">// init containers.</span></span><br><span class="line">    devicesToReuse PodReusableDevices</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>The <code>Device Manager</code> is created in <code>NewContainerManager</code> when <code>kubelet</code> starts, and is a submodule of <code>containerManager</code>.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">NewContainerManager</span><span class="params">(mountUtil mount.Interface, cadvisorInterface cadvisor.Interface, nodeConfig NodeConfig, failSwapOn <span class="keyword">bool</span>, devicePluginEnabled <span class="keyword">bool</span>, recorder record.EventRecorder)</span> <span class="params">(ContainerManager, error)</span></span> 
&#123;</span><br><span class="line">    <span class="comment">// ...</span></span><br><span class="line"></span><br><span class="line">    klog.Infof(<span class="string">"Creating device plugin manager: %t"</span>, devicePluginEnabled)</span><br><span class="line">    <span class="keyword">if</span> devicePluginEnabled &#123;</span><br><span class="line">        cm.deviceManager, err = devicemanager.NewManagerImpl(numaNodeInfo, cm.topologyManager)</span><br><span class="line">        cm.topologyManager.AddHintProvider(cm.deviceManager)</span><br><span class="line">    &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">        cm.deviceManager, err = devicemanager.NewManagerStub()</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// ...</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>The code that actually constructs the <code>DeviceManager</code> is shown below:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span 
class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">newManagerImpl</span><span class="params">(socketPath <span class="keyword">string</span>, numaNodeInfo cputopology.NUMANodeInfo, topologyAffinityStore topologymanager.Store)</span> <span class="params">(*ManagerImpl, error)</span></span> &#123;</span><br><span class="line">    klog.V(<span class="number">2</span>).Infof(<span class="string">"Creating Device Plugin manager at %s"</span>, socketPath)</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> socketPath == <span class="string">""</span> || !filepath.IsAbs(socketPath) &#123;</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">nil</span>, fmt.Errorf(errBadSocket+<span class="string">" %s"</span>, socketPath)</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">var</span> numaNodes []<span class="keyword">int</span></span><br><span class="line">    <span class="keyword">for</span> node := <span class="keyword">range</span> numaNodeInfo &#123;</span><br><span class="line">        numaNodes = <span class="built_in">append</span>(numaNodes, node)</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    dir, file := filepath.Split(socketPath)</span><br><span class="line">    manager := &amp;ManagerImpl&#123;</span><br><span class="line">        endpoints: <span class="built_in">make</span>(<span class="keyword">map</span>[<span 
class="keyword">string</span>]endpointInfo),</span><br><span class="line"></span><br><span class="line">        socketname:            file,</span><br><span class="line">        socketdir:             dir,</span><br><span class="line">        allDevices:            <span class="built_in">make</span>(<span class="keyword">map</span>[<span class="keyword">string</span>]<span class="keyword">map</span>[<span class="keyword">string</span>]pluginapi.Device),</span><br><span class="line">        healthyDevices:        <span class="built_in">make</span>(<span class="keyword">map</span>[<span class="keyword">string</span>]sets.String),</span><br><span class="line">        unhealthyDevices:      <span class="built_in">make</span>(<span class="keyword">map</span>[<span class="keyword">string</span>]sets.String),</span><br><span class="line">        allocatedDevices:      <span class="built_in">make</span>(<span class="keyword">map</span>[<span class="keyword">string</span>]sets.String),</span><br><span class="line">        podDevices:            <span class="built_in">make</span>(podDevices),</span><br><span class="line">        numaNodes:             numaNodes,</span><br><span class="line">        topologyAffinityStore: topologyAffinityStore,</span><br><span class="line">        devicesToReuse:        <span class="built_in">make</span>(PodReusableDevices),</span><br><span class="line">    &#125;</span><br><span class="line">    manager.callback = manager.genericDeviceUpdateCallback</span><br><span class="line"></span><br><span class="line">    <span class="comment">// The following structures are populated with real implementations in manager.Start()</span></span><br><span class="line">    <span class="comment">// Before that, initializes them to perform no-op operations.</span></span><br><span class="line">    manager.activePods = <span class="function"><span class="keyword">func</span><span class="params">()</span> []*<span class="title">v1</span>.<span 
class="title">Pod</span></span> &#123; <span class="keyword">return</span> []*v1.Pod&#123;&#125; &#125;</span><br><span class="line">    manager.sourcesReady = &amp;sourcesReadyStub&#123;&#125;</span><br><span class="line">    checkpointManager, err := checkpointmanager.NewCheckpointManager(dir)</span><br><span class="line">    <span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">nil</span>, fmt.Errorf(<span class="string">"failed to initialize checkpoint manager: %v"</span>, err)</span><br><span class="line">    &#125;</span><br><span class="line">    manager.checkpointManager = checkpointManager</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> manager, <span class="literal">nil</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Besides building the data structures of the <code>DeviceManager</code>, the constructor also registers a <code>callback</code> that handles the <code>add</code>, <code>delete</code>, and <code>update</code> events of the corresponding <code>devices</code>.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(m *ManagerImpl)</span> <span class="title">genericDeviceUpdateCallback</span><span class="params">(resourceName 
<span class="keyword">string</span>, devices []pluginapi.Device)</span></span> &#123;</span><br><span class="line">    m.mutex.Lock()</span><br><span class="line">    m.healthyDevices[resourceName] = sets.NewString()</span><br><span class="line">    m.unhealthyDevices[resourceName] = sets.NewString()</span><br><span class="line">    m.allDevices[resourceName] = <span class="built_in">make</span>(<span class="keyword">map</span>[<span class="keyword">string</span>]pluginapi.Device)</span><br><span class="line">    <span class="keyword">for</span> _, dev := <span class="keyword">range</span> devices &#123;</span><br><span class="line">        m.allDevices[resourceName][dev.ID] = dev</span><br><span class="line">        <span class="keyword">if</span> dev.Health == pluginapi.Healthy &#123;</span><br><span class="line">            m.healthyDevices[resourceName].Insert(dev.ID)</span><br><span class="line">        &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">            m.unhealthyDevices[resourceName].Insert(dev.ID)</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">    m.mutex.Unlock()</span><br><span class="line">    <span class="keyword">if</span> err := m.writeCheckpoint(); err != <span class="literal">nil</span> &#123;</span><br><span class="line">        klog.Errorf(<span class="string">"writing checkpoint encountered %v"</span>, err)</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Next comes the method that starts the <code>DeviceManager</code>. It reads the data in the <code>checkpoint file</code> and restores the related state in <code>ManagerImpl</code>, including:</p><ul><li>podDevices</li><li>allocatedDevices</li><li>healthyDevices</li><li>unhealthyDevices</li><li>endpoints</li></ul><p>It then removes every file under <code>/var/lib/kubelet/device-plugins/</code> except the <code>checkpoint</code> file, that is, all socket files, including its own <code>kubelet.sock</code> and the socket files left behind by every previous <code>DevicePlugin</code>.
Finally it creates <code>kubelet.sock</code> and starts a <code>gRPC Server</code> to serve gRPC requests; <code>Register()</code> is the method a <code>DevicePlugin</code> calls to register itself.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(m *ManagerImpl)</span> <span class="title">Start</span><span class="params">(activePods ActivePodsFunc, sourcesReady config.SourcesReady)</span> <span 
class="title">error</span></span> &#123;</span><br><span class="line">    klog.V(<span class="number">2</span>).Infof(<span class="string">"Starting Device Plugin manager"</span>)</span><br><span class="line"></span><br><span class="line">    m.activePods = activePods</span><br><span class="line">    m.sourcesReady = sourcesReady</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Loads in allocatedDevices information from disk.</span></span><br><span class="line">    err := m.readCheckpoint()</span><br><span class="line">    <span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">        klog.Warningf(<span class="string">"Continue after failing to read checkpoint file. Device allocation info may NOT be up-to-date. Err: %v"</span>, err)</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    socketPath := filepath.Join(m.socketdir, m.socketname)</span><br><span class="line">    <span class="keyword">if</span> err = os.MkdirAll(m.socketdir, <span class="number">0750</span>); err != <span class="literal">nil</span> &#123;</span><br><span class="line">        <span class="keyword">return</span> err</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">if</span> selinux.SELinuxEnabled() &#123;</span><br><span class="line">        <span class="keyword">if</span> err := selinux.SetFileLabel(m.socketdir, config.KubeletPluginsDirSELinuxLabel); err != <span class="literal">nil</span> &#123;</span><br><span class="line">            klog.Warningf(<span class="string">"Unprivileged containerized plugins might not work. Could not set selinux context on %s: %v"</span>, m.socketdir, err)</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Removes all stale sockets in m.socketdir. 
Device plugins can monitor</span></span><br><span class="line">    <span class="comment">// this and use it as a signal to re-register with the new Kubelet.</span></span><br><span class="line">    <span class="keyword">if</span> err := m.removeContents(m.socketdir); err != <span class="literal">nil</span> &#123;</span><br><span class="line">        klog.Errorf(<span class="string">"Fail to clean up stale contents under %s: %v"</span>, m.socketdir, err)</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    s, err := net.Listen(<span class="string">"unix"</span>, socketPath)</span><br><span class="line">    <span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">        klog.Errorf(errListenSocket+<span class="string">" %v"</span>, err)</span><br><span class="line">        <span class="keyword">return</span> err</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    m.wg.Add(<span class="number">1</span>)</span><br><span class="line">    m.server = grpc.NewServer([]grpc.ServerOption&#123;&#125;...)</span><br><span class="line"></span><br><span class="line">    pluginapi.RegisterRegistrationServer(m.server, m)</span><br><span class="line">    <span class="keyword">go</span> <span class="function"><span class="keyword">func</span><span class="params">()</span></span> &#123;</span><br><span class="line">        <span class="keyword">defer</span> m.wg.Done()</span><br><span class="line">        m.server.Serve(s)</span><br><span class="line">    &#125;()</span><br><span class="line"></span><br><span class="line">    klog.V(<span class="number">2</span>).Infof(<span class="string">"Serving device plugin registration server on %q"</span>, socketPath)</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> <span class="literal">nil</span></span><br><span 
class="line">&#125;</span><br></pre></td></tr></table></figure><h4 id="DeviceManager-注册"><a href="#DeviceManager-注册" class="headerlink" title="DeviceManager 注册"></a>DeviceManager Registration</h4><p>The <code>DeviceManager</code> receives a RegisterRequest from the <code>DevicePlugin</code>; the request struct is shown below:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">type</span> RegisterRequest <span class="keyword">struct</span> &#123;</span><br><span class="line">   Version <span class="keyword">string</span></span><br><span class="line">   Endpoint <span class="keyword">string</span> </span><br><span class="line">   ResourceName <span class="keyword">string</span> </span><br><span class="line">   Options   *DevicePluginOptions </span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>It first checks that the registered device name and version follow the <code>Extended Resource</code> rules: the name must not belong to the kubernetes.io domain and must carry its own domain, for example <code>nvidia.com</code>.</p><p>It then creates an <code>endpointImpl</code> object from the <code>endpoint</code> information, that is, it opens a <code>socket</code> connection to that <code>endpoint</code>:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line"><span 
class="function"><span class="keyword">func</span> <span class="params">(m *ManagerImpl)</span> <span class="title">RegisterPlugin</span><span class="params">(pluginName <span class="keyword">string</span>, endpoint <span class="keyword">string</span>, versions []<span class="keyword">string</span>)</span> <span class="title">error</span></span> &#123;</span><br><span class="line">    klog.V(<span class="number">2</span>).Infof(<span class="string">"Registering Plugin %s at endpoint %s"</span>, pluginName, endpoint)</span><br><span class="line"></span><br><span class="line">    e, err := newEndpointImpl(endpoint, pluginName, m.callback)</span><br><span class="line">    <span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">        <span class="keyword">return</span> fmt.Errorf(<span class="string">"failed to dial device plugin with socketPath %s: %v"</span>, endpoint, err)</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    options, err := e.client.GetDevicePluginOptions(context.Background(), &amp;pluginapi.Empty&#123;&#125;)</span><br><span class="line">    <span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">        <span class="keyword">return</span> fmt.Errorf(<span class="string">"failed to get device plugin options: %v"</span>, err)</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    m.registerEndpoint(pluginName, options, e)</span><br><span class="line">    <span class="keyword">go</span> m.runEndpoint(pluginName, e)</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> <span class="literal">nil</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Here is the implementation of <code>endpointImpl</code>:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">type</span> endpointImpl <span class="keyword">struct</span> &#123;</span><br><span class="line">    client     pluginapi.DevicePluginClient</span><br><span class="line">    clientConn *grpc.ClientConn</span><br><span class="line"></span><br><span class="line">    socketPath   <span class="keyword">string</span></span><br><span class="line">    resourceName <span class="keyword">string</span></span><br><span class="line">    stopTime     time.Time</span><br><span class="line"></span><br><span class="line">    mutex sync.Mutex</span><br><span class="line">    cb    monitorCallback</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="title">newEndpointImpl</span><span class="params">(socketPath, resourceName <span class="keyword">string</span>, callback monitorCallback)</span> <span class="params">(*endpointImpl, error)</span></span> &#123;</span><br><span class="line">    client, c, err 
:= dial(socketPath)</span><br><span class="line">    <span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">        klog.Errorf(<span class="string">"Can't create new endpoint with path %s err %v"</span>, socketPath, err)</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">nil</span>, err</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> &amp;endpointImpl&#123;</span><br><span class="line">        client:     client,</span><br><span class="line">        clientConn: c,</span><br><span class="line"></span><br><span class="line">        socketPath:   socketPath,</span><br><span class="line">        resourceName: resourceName,</span><br><span class="line"></span><br><span class="line">        cb: callback,</span><br><span class="line">    &#125;, <span class="literal">nil</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Next the <code>run()</code> method of the <code>endpointImpl</code> object is executed. Inside <code>run</code>:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span 
class="line">26</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(e *endpointImpl)</span> <span class="title">run</span><span class="params">()</span></span> &#123;</span><br><span class="line">    stream, err := e.client.ListAndWatch(context.Background(), &amp;pluginapi.Empty&#123;&#125;)</span><br><span class="line">    <span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">        klog.Errorf(errListAndWatch, e.resourceName, err)</span><br><span class="line"></span><br><span class="line">        <span class="keyword">return</span></span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">for</span> &#123;</span><br><span class="line">        response, err := stream.Recv()</span><br><span class="line">        <span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">            klog.Errorf(errListAndWatch, e.resourceName, err)</span><br><span class="line">            <span class="keyword">return</span></span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span class="line">        devs := response.Devices</span><br><span class="line">        klog.V(<span class="number">2</span>).Infof(<span class="string">"State pushed for device plugin %s"</span>, e.resourceName)</span><br><span class="line"></span><br><span class="line">        <span class="keyword">var</span> newDevs []pluginapi.Device</span><br><span class="line">        <span class="keyword">for</span> _, d := <span class="keyword">range</span> devs &#123;</span><br><span class="line">            newDevs = <span class="built_in">append</span>(newDevs, *d)</span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span class="line">        e.callback(e.resourceName, newDevs)</span><br><span 
class="line">    &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><ul><li>It calls the <code>DevicePlugin</code>'s <code>ListAndWatch gRPC</code> interface and keeps receiving from the <code>ListAndWatch gRPC stream</code> over a long-lived connection.</li><li>Each device list received from the <code>stream</code> is passed to the endpoint's <code>callback</code>, that is, the <code>genericDeviceUpdateCallback</code> method registered by <code>ManagerImpl</code>, which updates the Device Manager cache and writes it to the checkpoint file.</li><li>run() is started in a goroutine, so it keeps receiving ListAndWatch results from the device server and keeps the device state up to date.</li><li>When receiving fails, the deviceManager drops the connection and marks the devices as unhealthy.</li></ul><h3 id="ListAndWatch"><a href="#ListAndWatch" class="headerlink" title="ListAndWatch"></a>ListAndWatch</h3><p>Now look at the <code>ListAndWatch</code> implemented by the <code>DevicePlugin</code>. It immediately returns the full device list and then blocks in a loop; whenever it learns that a device's health has changed, it updates the <code>device</code> list and sends it to the <code>deviceManager</code> again. Recall the health check: the <code>DevicePlugin</code>'s <code>CheckHealth</code> delivers the health status of a device through the <code>m.health</code> <code>channel</code>.</p><figure class="highlight js"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line">func (m *NvidiaDevicePlugin) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error &#123;</span><br><span class="line">    s.Send(&amp;pluginapi.ListAndWatchResponse&#123;<span class="attr">Devices</span>: m.apiDevices()&#125;)</span><br><span class="line"></span><br><span class="line">    <span class="keyword">for</span> &#123;</span><br><span class="line">        select &#123;</span><br><span class="line">        <span 
class="keyword">case</span> &lt;-m.stop:</span><br><span class="line">            return nil</span><br><span class="line">        case d := &lt;-m.health:</span><br><span class="line">            // FIXME: there is no way to recover from the Unhealthy state.</span><br><span class="line">            d.Health = pluginapi.Unhealthy</span><br><span class="line">            log.Printf("'%s' device marked unhealthy: %s", m.resourceName, d.ID)</span><br><span class="line">            s.Send(&amp;pluginapi.ListAndWatchResponse&#123;Devices: m.apiDevices()&#125;)</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>So how does the <code>DevicePlugin</code> know how many <code>Device</code>s there are? Let's look at the implementation of <code>apiDevices</code>:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(m *NvidiaDevicePlugin)</span> <span class="title">apiDevices</span><span class="params">()</span> []*<span class="title">pluginapi</span>.<span class="title">Device</span></span> &#123;</span><br><span class="line">    <span class="keyword">var</span> pdevs []*pluginapi.Device</span><br><span class="line">    <span class="keyword">for</span> _, d := <span class="keyword">range</span> m.cachedDevices &#123;</span><br><span class="line">        pdevs = <span class="built_in">append</span>(pdevs, &amp;d.Device)</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">return</span> pdevs</span><br><span 
class="line">&#125;</span><br></pre></td></tr></table></figure><p>Here <code>cachedDevices</code> holds the <code>Device</code> information obtained through a <code>ResourceManager</code>, which is implemented by the <code>GpuDeviceManager</code> struct; as the code shows, it enumerates the devices by calling the <code>nvml</code> library. There is also a <code>MigDeviceManager</code>, which works in essentially the same way and is not covered here.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(g *GpuDeviceManager)</span> <span class="title">Devices</span><span class="params">()</span> []*<span class="title">Device</span></span> &#123;</span><br><span class="line">    n, err := nvml.GetDeviceCount()</span><br><span class="line">    check(err)</span><br><span class="line"></span><br><span class="line">    <span class="keyword">var</span> devs []*Device</span><br><span class="line">    <span class="keyword">for</span> i := <span class="keyword">uint</span>(<span class="number">0</span>); i &lt; n; i++ &#123;</span><br><span class="line">        d, err := nvml.NewDeviceLite(i)</span><br><span class="line">        check(err)</span><br><span class="line"></span><br><span class="line">        migEnabled, err := d.IsMigEnabled()</span><br><span class="line">        check(err)</span><br><span class="line"></span><br><span class="line">       
 <span class="keyword">if</span> migEnabled &amp;&amp; g.skipMigEnabledGPUs &#123;</span><br><span class="line">            <span class="keyword">continue</span></span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span class="line">        devs = <span class="built_in">append</span>(devs, buildDevice(d))</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> devs</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h3 id="Allocation"><a href="#Allocation" class="headerlink" title="Allocation"></a>Allocation</h3><p>Once the <code>kubelet</code> receives the Pods scheduled onto its node, device allocation proceeds through the steps traced in this section.</p><h4 id="HandlePodAdditions"><a href="#HandlePodAdditions" class="headerlink" title="HandlePodAdditions"></a>HandlePodAdditions</h4><p>When the <code>Kubelet</code> on a Node observes that a new <code>Pod</code> has been created, it calls <code>HandlePodAdditions</code> to handle the <code>Pod</code> creation event.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(kl *Kubelet)</span> <span class="title">syncLoopIteration</span><span class="params">(configCh &lt;-<span class="keyword">chan</span> kubetypes.PodUpdate, handler SyncHandler,</span></span></span><br><span class="line"><span class="function"><span class="params">    
syncCh &lt;-<span class="keyword">chan</span> time.Time, housekeepingCh &lt;-<span class="keyword">chan</span> time.Time, plegCh &lt;-<span class="keyword">chan</span> *pleg.PodLifecycleEvent)</span> <span class="title">bool</span></span> &#123;</span><br><span class="line">    <span class="keyword">select</span> &#123;</span><br><span class="line">    <span class="keyword">case</span> u, open := &lt;-configCh:</span><br><span class="line">        <span class="keyword">switch</span> u.Op &#123;</span><br><span class="line">        <span class="keyword">case</span> kubetypes.ADD:</span><br><span class="line">            klog.V(<span class="number">2</span>).Infof(<span class="string">"SyncLoop (ADD, %q): %q"</span>, u.Source, format.Pods(u.Pods))</span><br><span class="line">            handler.HandlePodAdditions(u.Pods)</span><br><span class="line">        <span class="keyword">case</span> kubetypes.UPDATE:</span><br><span class="line">            klog.V(<span class="number">2</span>).Infof(<span class="string">"SyncLoop (UPDATE, %q): %q"</span>, u.Source, format.PodsWithDeletionTimestamps(u.Pods))</span><br><span class="line">            handler.HandlePodUpdates(u.Pods)</span><br><span class="line">    <span class="comment">// ...</span></span><br><span class="line">        &#125;</span><br><span class="line">    <span class="keyword">case</span> e := &lt;-plegCh:</span><br><span class="line">    <span class="comment">// ...</span></span><br><span class="line">  &#125;</span><br><span class="line">  <span class="keyword">return</span> <span class="literal">true</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Looking further into the implementation of <code>HandlePodAdditions</code>: for each <code>Pod</code> passed in, if it has not been <code>terminate</code>d, <code>canAdmitPod</code> checks whether the <code>Pod</code> may be admitted.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span 
class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(kl *Kubelet)</span> <span class="title">HandlePodAdditions</span><span class="params">(pods []*v1.Pod)</span></span> &#123;</span><br><span class="line">    start := kl.clock.Now()</span><br><span class="line">    sort.Sort(sliceutils.PodsByCreationTime(pods))</span><br><span class="line">    <span class="keyword">for</span> _, pod := <span class="keyword">range</span> pods &#123;</span><br><span class="line">        existingPods := kl.podManager.GetPods()</span><br><span class="line">        kl.podManager.AddPod(pod)</span><br><span class="line">   </span><br><span class="line">    <span class="comment">// ...</span></span><br><span class="line">        <span class="keyword">if</span> !kl.podIsTerminated(pod) &#123;</span><br><span class="line">            activePods := kl.filterOutTerminatedPods(existingPods)</span><br><span class="line">            <span class="keyword">if</span> ok, reason, message := kl.canAdmitPod(activePods, pod); !ok &#123;</span><br><span class="line">                kl.rejectPod(pod, reason, message)</span><br><span class="line">                <span class="keyword">continue</span></span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span class="line">        <span class="comment">// ...</span></span><br><span class="line">    &#125;</span><br><span 
class="line">&#125;</span><br></pre></td></tr></table></figure><p>Inside <code>canAdmitPod</code>, the <code>Kubelet</code> runs each <code>admit handler</code> in turn and rejects the Pod as soon as one of them refuses it.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// "pod" is new pod, while "pods" are all admitted pods</span></span><br><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(kl *Kubelet)</span> <span class="title">canAdmitPod</span><span class="params">(pods []*v1.Pod, pod *v1.Pod)</span> <span class="params">(<span class="keyword">bool</span>, <span class="keyword">string</span>, <span class="keyword">string</span>)</span></span> &#123;</span><br><span class="line">    attrs := &amp;lifecycle.PodAdmitAttributes&#123;Pod: pod, OtherPods: pods&#125;</span><br><span class="line">    <span class="keyword">for</span> _, podAdmitHandler := <span class="keyword">range</span> kl.admitHandlers &#123;</span><br><span class="line">        <span class="keyword">if</span> result := podAdmitHandler.Admit(attrs); !result.Admit &#123;</span><br><span class="line">            <span class="keyword">return</span> <span class="literal">false</span>, result.Reason, result.Message</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> <span class="literal">true</span>, <span class="string">""</span>, <span class="string">""</span></span><br><span 
class="line">&#125;</span><br></pre></td></tr></table></figure><p><code>admitHandlers</code> is a slice of <code>PodAdmitHandler</code>, an interface defined as follows:</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">type</span> PodAdmitHandler <span class="keyword">interface</span> &#123;</span><br><span class="line">    <span class="comment">// Admit evaluates if a pod can be admitted.</span></span><br><span class="line">    Admit(attrs *PodAdmitAttributes) PodAdmitResult</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>When the <code>Kubelet</code> is created, it registers a series of <code>PodAdmitHandler</code>s that run admission checks on a Pod's resources, for example:</p><ul><li><code>evictionAdmitHandler</code>: rejects best-effort Pods when the node is under memory pressure (its other conditions are skipped here).</li><li><code>TopologyPodAdmitHandler</code>: rejects Pods whose resources cannot be allocated because of a topology locality conflict.</li></ul><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">  klet.admitHandlers.AddPodAdmitHandler(evictionAdmitHandler)</span><br><span class="line">  klet.admitHandlers.AddPodAdmitHandler(klet.containerManager.GetAllocateResourcesPodAdmitHandler())</span><br><span class="line"><span class="comment">// ...</span></span><br></pre></td></tr></table></figure><p>The handler relevant to our <code>DevicePlugin</code> is the <code>resourceAllocator</code> of the <code>containerManager</code>. It calls the <code>Allocate</code> function of the <code>DeviceManager</code> and then of the <code>CpuManager</code> to see whether the requested resources can be obtained, and it does so for every <code>InitContainer</code> and <code>Container</code> of the Pod.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span 
class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(m *resourceAllocator)</span> <span class="title">Admit</span><span class="params">(attrs *lifecycle.PodAdmitAttributes)</span> <span class="title">lifecycle</span>.<span class="title">PodAdmitResult</span></span> &#123;</span><br><span class="line">    pod := attrs.Pod</span><br><span class="line"></span><br><span class="line">    <span class="keyword">for</span> _, container := <span class="keyword">range</span> <span class="built_in">append</span>(pod.Spec.InitContainers, pod.Spec.Containers...) 
&#123;</span><br><span class="line">        err := m.deviceManager.Allocate(pod, &amp;container)</span><br><span class="line">        <span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">            <span class="keyword">return</span> lifecycle.PodAdmitResult&#123;</span><br><span class="line">                Message: fmt.Sprintf(<span class="string">"Allocate failed due to %v, which is unexpected"</span>, err),</span><br><span class="line">                Reason:  <span class="string">"UnexpectedAdmissionError"</span>,</span><br><span class="line">                Admit:   <span class="literal">false</span>,</span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span class="line">        <span class="keyword">if</span> m.cpuManager != <span class="literal">nil</span> &#123;</span><br><span class="line">            err = m.cpuManager.Allocate(pod, &amp;container)</span><br><span class="line">            <span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">                <span class="keyword">return</span> lifecycle.PodAdmitResult&#123;</span><br><span class="line">                    Message: fmt.Sprintf(<span class="string">"Allocate failed due to %v, which is unexpected"</span>, err),</span><br><span class="line">                    Reason:  <span class="string">"UnexpectedAdmissionError"</span>,</span><br><span class="line">                    Admit:   <span class="literal">false</span>,</span><br><span class="line">                &#125;</span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> lifecycle.PodAdmitResult&#123;Admit: <span class="literal">true</span>&#125;</span><br><span 
class="line">&#125;</span><br></pre></td></tr></table></figure><p>Next, look at the implementation of the <code>Allocate</code> function of <code>ManagerImpl</code>.</p><h4 id="ManagerImpl-Allocate"><a href="#ManagerImpl-Allocate" class="headerlink" title="ManagerImpl.Allocate"></a>ManagerImpl.Allocate</h4><ul><li><code>allocateContainerResources</code> allocates devices for the Pod's init containers and updates the <code>podDevices</code> cache in the <code>deviceManager</code>.</li><li><code>allocateContainerResources</code> allocates devices for the Pod's regular containers and updates the <code>podDevices</code> cache in the <code>deviceManager</code>:<ul><li>Before allocating devices for a Pod, it checks the currently active Pods against the Pods in the <code>podDevices</code> cache and removes the devices of terminated Pods from <code>podDevices</code>, which garbage-collects stale device allocations.</li><li>It takes the required number of devices from <code>healthyDevices</code> in no particular order and records them in <code>allocatedDevices</code>; otherwise the same device could be handed to more than one Pod.</li><li>With the devices chosen, it calls the <code>Allocate</code> method of the <code>DevicePlugin</code> over gRPC. The <code>DevicePlugin</code> returns a <code>ContainerAllocateResponse</code> (environment variables to inject, mounts, and annotations). The <code>deviceManager</code> stores that response in the <code>podDevices</code> cache, keyed by Pod UID and container name, and then writes its cached state to the <code>checkpoint</code> file.</li></ul></li></ul><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span 
class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(m *ManagerImpl)</span> <span class="title">Allocate</span><span class="params">(pod *v1.Pod, container *v1.Container)</span> <span class="title">error</span></span> &#123;</span><br><span class="line">    <span class="keyword">if</span> _, ok := m.devicesToReuse[<span class="keyword">string</span>(pod.UID)]; !ok &#123;</span><br><span class="line">        m.devicesToReuse[<span class="keyword">string</span>(pod.UID)] = <span class="built_in">make</span>(<span class="keyword">map</span>[<span class="keyword">string</span>]sets.String)</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="comment">// If pod entries to m.devicesToReuse other than the current pod exist, delete them.</span></span><br><span class="line">    <span class="keyword">for</span> podUID := <span class="keyword">range</span> m.devicesToReuse &#123;</span><br><span class="line">        <span class="keyword">if</span> podUID != <span class="keyword">string</span>(pod.UID) &#123;</span><br><span class="line">            <span class="built_in">delete</span>(m.devicesToReuse, podUID)</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="comment">// Allocate resources for init containers first as we know the caller always loops</span></span><br><span class="line">    <span class="comment">// through init containers before looping through app containers. 
Should the caller</span></span><br><span class="line">    <span class="comment">// ever change those semantics, this logic will need to be amended.</span></span><br><span class="line">    <span class="keyword">for</span> _, initContainer := <span class="keyword">range</span> pod.Spec.InitContainers &#123;</span><br><span class="line">        <span class="keyword">if</span> container.Name == initContainer.Name &#123;</span><br><span class="line">            <span class="keyword">if</span> err := m.allocateContainerResources(pod, container, m.devicesToReuse[<span class="keyword">string</span>(pod.UID)]); err != <span class="literal">nil</span> &#123;</span><br><span class="line">                <span class="keyword">return</span> err</span><br><span class="line">            &#125;</span><br><span class="line">            m.podDevices.addContainerAllocatedResources(<span class="keyword">string</span>(pod.UID), container.Name, m.devicesToReuse[<span class="keyword">string</span>(pod.UID)])</span><br><span class="line">            <span class="keyword">return</span> <span class="literal">nil</span></span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">if</span> err := m.allocateContainerResources(pod, container, m.devicesToReuse[<span class="keyword">string</span>(pod.UID)]); err != <span class="literal">nil</span> &#123;</span><br><span class="line">        <span class="keyword">return</span> err</span><br><span class="line">    &#125;</span><br><span class="line">    m.podDevices.removeContainerAllocatedResources(<span class="keyword">string</span>(pod.UID), container.Name, m.devicesToReuse[<span class="keyword">string</span>(pod.UID)])</span><br><span class="line">    <span class="keyword">return</span> <span class="literal">nil</span></span><br><span class="line"></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Next, look at the implementation of
<code>allocateContainerResources</code>. Extended resources are advertised by the <code>DevicePlugin</code> and may not be overcommitted, so a container's <code>Requests</code> must equal its <code>Limits</code>. The device manager therefore only needs to iterate over <code>Limits</code> to find every device resource the container asks for.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span 
class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(m *ManagerImpl)</span> <span class="title">allocateContainerResources</span><span class="params">(pod *v1.Pod, container *v1.Container, devicesToReuse <span class="keyword">map</span>[<span class="keyword">string</span>]sets.String)</span> <span class="title">error</span></span> &#123;</span><br><span class="line">    podUID := <span class="keyword">string</span>(pod.UID)</span><br><span class="line">    contName := container.Name</span><br><span class="line">    allocatedDevicesUpdated := <span class="literal">false</span></span><br><span class="line">    <span class="comment">// Extended resources are not allowed to be overcommitted.</span></span><br><span class="line">    <span class="comment">// Since device plugin advertises extended resources,</span></span><br><span class="line">    <span class="comment">// therefore Requests must be equal to Limits and iterating</span></span><br><span class="line">    <span class="comment">// over the Limits should be sufficient.</span></span><br><span 
class="line">    <span class="keyword">for</span> k, v := <span class="keyword">range</span> container.Resources.Limits &#123;</span><br><span class="line">        resource := <span class="keyword">string</span>(k)</span><br><span class="line">        needed := <span class="keyword">int</span>(v.Value())</span><br><span class="line">        klog.V(<span class="number">3</span>).Infof(<span class="string">"needs %d %s"</span>, needed, resource)</span><br><span class="line">        <span class="keyword">if</span> !m.isDevicePluginResource(resource) &#123;</span><br><span class="line">            <span class="keyword">continue</span></span><br><span class="line">        &#125;</span><br><span class="line">        <span class="comment">// Updates allocatedDevices to garbage collect any stranded resources</span></span><br><span class="line">        <span class="comment">// before doing the device plugin allocation.</span></span><br><span class="line">        <span class="keyword">if</span> !allocatedDevicesUpdated &#123;</span><br><span class="line">            m.UpdateAllocatedDevices()</span><br><span class="line">            allocatedDevicesUpdated = <span class="literal">true</span></span><br><span class="line">        &#125;</span><br><span class="line">        allocDevices, err := m.devicesToAllocate(podUID, contName, resource, needed, devicesToReuse[resource])</span><br><span class="line">        <span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">            <span class="keyword">return</span> err</span><br><span class="line">        &#125;</span><br><span class="line">        <span class="keyword">if</span> allocDevices == <span class="literal">nil</span> || <span class="built_in">len</span>(allocDevices) &lt;= <span class="number">0</span> &#123;</span><br><span class="line">            <span class="keyword">continue</span></span><br><span class="line">        &#125;</span><br><span 
class="line"></span><br><span class="line">        startRPCTime := time.Now()</span><br><span class="line">        <span class="comment">// Manager.Allocate involves RPC calls to device plugin, which</span></span><br><span class="line">        <span class="comment">// could be heavy-weight. Therefore we want to perform this operation outside</span></span><br><span class="line">        <span class="comment">// mutex lock. Note if Allocate call fails, we may leave container resources</span></span><br><span class="line">        <span class="comment">// partially allocated for the failed container. We rely on UpdateAllocatedDevices()</span></span><br><span class="line">        <span class="comment">// to garbage collect these resources later. Another side effect is that if</span></span><br><span class="line">        <span class="comment">// we have X resource A and Y resource B in total, and two containers, container1</span></span><br><span class="line">        <span class="comment">// and container2 both require X resource A and Y resource B. Both allocation</span></span><br><span class="line">        <span class="comment">// requests may fail if we serve them in mixed order.</span></span><br><span class="line">        <span class="comment">// <span class="doctag">TODO:</span> may revisit this part later if we see inefficient resource allocation</span></span><br><span class="line">        <span class="comment">// in real use as the result of this. 
Should also consider to parallelize device</span></span><br><span class="line">        <span class="comment">// plugin Allocate grpc calls if it becomes common that a container may require</span></span><br><span class="line">        <span class="comment">// resources from multiple device plugins.</span></span><br><span class="line">        m.mutex.Lock()</span><br><span class="line">        eI, ok := m.endpoints[resource]</span><br><span class="line">        m.mutex.Unlock()</span><br><span class="line">        <span class="keyword">if</span> !ok &#123;</span><br><span class="line">            m.mutex.Lock()</span><br><span class="line">            m.allocatedDevices = m.podDevices.devices()</span><br><span class="line">            m.mutex.Unlock()</span><br><span class="line">            <span class="keyword">return</span> fmt.Errorf(<span class="string">"unknown Device Plugin %s"</span>, resource)</span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span class="line">        devs := allocDevices.UnsortedList()</span><br><span class="line">        <span class="comment">// <span class="doctag">TODO:</span> refactor this part of code to just append a ContainerAllocationRequest</span></span><br><span class="line">        <span class="comment">// in a passed in AllocateRequest pointer, and issues a single Allocate call per pod.</span></span><br><span class="line">        klog.V(<span class="number">3</span>).Infof(<span class="string">"Making allocation request for devices %v for device plugin %s"</span>, devs, resource)</span><br><span class="line">        resp, err := eI.e.allocate(devs)</span><br><span class="line">        metrics.DevicePluginAllocationDuration.WithLabelValues(resource).Observe(metrics.SinceInSeconds(startRPCTime))</span><br><span class="line">        <span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">            <span class="comment">// In case of allocation 
failure, we want to restore m.allocatedDevices</span></span><br><span class="line">            <span class="comment">// to the actual allocated state from m.podDevices.</span></span><br><span class="line">            m.mutex.Lock()</span><br><span class="line">            m.allocatedDevices = m.podDevices.devices()</span><br><span class="line">            m.mutex.Unlock()</span><br><span class="line">            <span class="keyword">return</span> err</span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span class="line">        <span class="keyword">if</span> <span class="built_in">len</span>(resp.ContainerResponses) == <span class="number">0</span> &#123;</span><br><span class="line">            <span class="keyword">return</span> fmt.Errorf(<span class="string">"no containers return in allocation response %v"</span>, resp)</span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span class="line">        <span class="comment">// Update internal cached podDevices state.</span></span><br><span class="line">        m.mutex.Lock()</span><br><span class="line">        m.podDevices.insert(podUID, contName, resource, allocDevices, resp.ContainerResponses[<span class="number">0</span>])</span><br><span class="line">        m.mutex.Unlock()</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Checkpoints device to container allocation information.</span></span><br><span class="line">    <span class="keyword">return</span> m.writeCheckpoint()</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>As we can see, <code>resp, err := eI.e.allocate(devs)</code> issues the RPC call and hands control to the <code>DevicePlugin</code>. That raises a question: where does the <code>deviceIDs</code> argument of this <code>RPC</code> call come from? It is produced by the earlier call to <code>devicesToAllocate</code>, whose main logic is:</p><ul><li>Look up the devices already allocated to this container of this Pod (for example after a container restart). If their number differs from the requested number, return an error, because a Pod's resource request is not expected to change after admission; if it matches, there is nothing left to allocate.</li><li>Next, take devices from <code>reusableDevices</code>
first. If those are enough, return them; otherwise continue with <code>healthyDevices</code>.</li><li>Remove the devices already in use from <code>healthyDevices</code> and check whether enough remain; if not, return an error.</li><li>If enough remain, pick the required number of devices: by default from the unsorted list of available devices, or through <code>takeByTopology</code> when the resource has topology alignment and a NUMA affinity hint exists.</li></ul><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span 
class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(m *ManagerImpl)</span> <span class="title">devicesToAllocate</span><span class="params">(podUID, contName, resource <span class="keyword">string</span>, required <span class="keyword">int</span>, reusableDevices sets.String)</span> <span class="params">(sets.String, error)</span></span> &#123;</span><br><span class="line">    m.mutex.Lock()</span><br><span class="line">    <span class="keyword">defer</span> m.mutex.Unlock()</span><br><span class="line">    needed := required</span><br><span class="line">    <span class="comment">// Gets list of devices that have already been allocated.</span></span><br><span class="line">    <span class="comment">// This can happen if a container restarts for example.</span></span><br><span class="line">    devices := m.podDevices.containerDevices(podUID, contName, resource)</span><br><span class="line">    <span class="keyword">if</span> devices != <span class="literal">nil</span> &#123;</span><br><span class="line">        klog.V(<span class="number">3</span>).Infof(<span class="string">"Found pre-allocated devices for resource %s container %q in Pod %q: %v"</span>, resource, contName, podUID, devices.List())</span><br><span class="line">        needed = needed - devices.Len()</span><br><span class="line">        <span class="comment">// A pod's resource is not expected to change once admitted by the API server,</span></span><br><span class="line">        <span class="comment">// so just fail loudly here. 
We can revisit this part if this no longer holds.</span></span><br><span class="line">        <span class="keyword">if</span> needed != <span class="number">0</span> &#123;</span><br><span class="line">            <span class="keyword">return</span> <span class="literal">nil</span>, fmt.Errorf(<span class="string">"pod %q container %q changed request for resource %q from %d to %d"</span>, podUID, contName, resource, devices.Len(), required)</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">if</span> needed == <span class="number">0</span> &#123;</span><br><span class="line">        <span class="comment">// No change, no work.</span></span><br><span class="line">        <span class="keyword">return</span> <span class="literal">nil</span>, <span class="literal">nil</span></span><br><span class="line">    &#125;</span><br><span class="line">    klog.V(<span class="number">3</span>).Infof(<span class="string">"Needs to allocate %d %q for pod %q container %q"</span>, needed, resource, podUID, contName)</span><br><span class="line">    <span class="comment">// Needs to allocate additional devices.</span></span><br><span class="line">    <span class="keyword">if</span> _, ok := m.healthyDevices[resource]; !ok &#123;</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">nil</span>, fmt.Errorf(<span class="string">"can't allocate unregistered device %s"</span>, resource)</span><br><span class="line">    &#125;</span><br><span class="line">    devices = sets.NewString()</span><br><span class="line">    <span class="comment">// Allocates from reusableDevices list first.</span></span><br><span class="line">    <span class="keyword">for</span> device := <span class="keyword">range</span> reusableDevices &#123;</span><br><span class="line">        devices.Insert(device)</span><br><span class="line">        needed--</span><br><span class="line">        
<span class="keyword">if</span> needed == <span class="number">0</span> &#123;</span><br><span class="line">            <span class="keyword">return</span> devices, <span class="literal">nil</span></span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="comment">// Needs to allocate additional devices.</span></span><br><span class="line">    <span class="keyword">if</span> m.allocatedDevices[resource] == <span class="literal">nil</span> &#123;</span><br><span class="line">        m.allocatedDevices[resource] = sets.NewString()</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="comment">// Gets Devices in use.</span></span><br><span class="line">    devicesInUse := m.allocatedDevices[resource]</span><br><span class="line">    <span class="comment">// Gets a list of available devices.</span></span><br><span class="line">    available := m.healthyDevices[resource].Difference(devicesInUse)</span><br><span class="line">    <span class="keyword">if</span> available.Len() &lt; needed &#123;</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">nil</span>, fmt.Errorf(<span class="string">"requested number of devices unavailable for %s. 
Requested: %d, Available: %d"</span>, resource, needed, available.Len())</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="comment">// By default, pull devices from the unsorted list of available devices.</span></span><br><span class="line">    allocated := available.UnsortedList()[:needed]</span><br><span class="line">    <span class="comment">// If topology alignment is desired, update allocated to the set of devices</span></span><br><span class="line">    <span class="comment">// with the best alignment.</span></span><br><span class="line">    hint := m.topologyAffinityStore.GetAffinity(podUID, contName)</span><br><span class="line">    <span class="keyword">if</span> m.deviceHasTopologyAlignment(resource) &amp;&amp; hint.NUMANodeAffinity != <span class="literal">nil</span> &#123;</span><br><span class="line">        allocated = m.takeByTopology(resource, available, hint.NUMANodeAffinity, needed)</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="comment">// Updates m.allocatedDevices with allocated devices to prevent them</span></span><br><span class="line">    <span class="comment">// from being allocated to other pods/containers, given that we are</span></span><br><span class="line">    <span class="comment">// not holding lock during the rpc call.</span></span><br><span class="line">    <span class="keyword">for</span> _, device := <span class="keyword">range</span> allocated &#123;</span><br><span class="line">        m.allocatedDevices[resource].Insert(device)</span><br><span class="line">        devices.Insert(device)</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">return</span> devices, <span class="literal">nil</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>After the <code>RPC</code> call succeeds, the corresponding <code>Response</code> is recorded in <code>m.podDevices</code>.</p><figure class="highlight go"><table><tr><td 
class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(pdev podDevices)</span> <span class="title">insert</span><span class="params">(podUID, contName, resource <span class="keyword">string</span>, devices sets.String, resp *pluginapi.ContainerAllocateResponse)</span></span> &#123;</span><br><span class="line">    <span class="keyword">if</span> _, podExists := pdev[podUID]; !podExists &#123;</span><br><span class="line">        pdev[podUID] = <span class="built_in">make</span>(containerDevices)</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">if</span> _, contExists := pdev[podUID][contName]; !contExists &#123;</span><br><span class="line">        pdev[podUID][contName] = <span class="built_in">make</span>(resourceAllocateInfo)</span><br><span class="line">    &#125;</span><br><span class="line">    pdev[podUID][contName][resource] = deviceAllocateInfo&#123;</span><br><span class="line">        deviceIds: devices,</span><br><span class="line">        allocResp: resp,</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h4 id="DevicePlugin-Allocate"><a href="#DevicePlugin-Allocate" class="headerlink" title="DevicePlugin.Allocate"></a>DevicePlugin.Allocate</h4><p>The <code>Allocate</code> interface adds the <code>NVIDIA_VISIBLE_DEVICES</code> environment variable for the container, sets the related <code>DeviceSpec</code> parameters, and returns the <code>Response</code> to the <code>Kubelet</code>.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(m *NvidiaDevicePlugin)</span> <span class="title">Allocate</span><span class="params">(ctx context.Context, reqs *pluginapi.AllocateRequest)</span> <span class="params">(*pluginapi.AllocateResponse, error)</span></span> &#123;</span><br><span class="line">    responses := pluginapi.AllocateResponse&#123;&#125;</span><br><span class="line">    <span class="keyword">for</span> _, req := <span class="keyword">range</span> reqs.ContainerRequests &#123;</span><br><span class="line">        <span class="keyword">for</span> _, id := <span class="keyword">range</span> req.DevicesIDs &#123;</span><br><span class="line">            <span class="keyword">if</span> !m.deviceExists(id) &#123;</span><br><span class="line">                <span class="keyword">return</span> <span class="literal">nil</span>, fmt.Errorf(<span class="string">"invalid allocation request for '%s': unknown device: %s"</span>, m.resourceName, id)</span><br><span class="line">            &#125;</span><br><span class="line">        
&#125;</span><br><span class="line"></span><br><span class="line">        response := pluginapi.ContainerAllocateResponse&#123;&#125;</span><br><span class="line"></span><br><span class="line">        <span class="keyword">if</span> *deviceListStrategyFlag == DeviceListStrategyEnvvar &#123;</span><br><span class="line">            response.Envs = m.apiEnvs(m.deviceListEnvvar, req.DevicesIDs)</span><br><span class="line">        &#125;</span><br><span class="line">        <span class="keyword">if</span> *deviceListStrategyFlag == DeviceListStrategyVolumeMounts &#123;</span><br><span class="line">            response.Envs = m.apiEnvs(m.deviceListEnvvar, []<span class="keyword">string</span>&#123;deviceListAsVolumeMountsContainerPathRoot&#125;)</span><br><span class="line">            response.Mounts = m.apiMounts(req.DevicesIDs)</span><br><span class="line">        &#125;</span><br><span class="line">        <span class="keyword">if</span> *passDeviceSpecs &#123;</span><br><span class="line">            response.Devices = m.apiDeviceSpecs(req.DevicesIDs)</span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span class="line">        responses.ContainerResponses = <span class="built_in">append</span>(responses.ContainerResponses, &amp;response)</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> &amp;responses, <span class="literal">nil</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>As mentioned earlier, NVIDIA's <code>gpu-container-runtime</code> reads the container's <code>NVIDIA_VISIBLE_DEVICES</code> environment variable to decide whether the container is a GPU container and which GPU devices it may use. The NVIDIA GPU device plugin, in turn, converts the GPU device IDs in the kubelet's request into the <code>NVIDIA_VISIBLE_DEVICES</code> environment variable and returns it to the kubelet, which then injects the returned environment variables into the container. When a container carrying this variable starts, <code>gpu-container-runtime</code> maps the devices declared in <code>NVIDIA_VISIBLE_DEVICES</code>, together with the corresponding NVIDIA driver libraries, into the container.</p><h3 
id="Device-的使用"><a href="#Device-的使用" class="headerlink" title="Device 的使用"></a>Using the Device</h3><p>In its <code>GetResources</code> method, the kubelet calls the <code>DeviceManager</code>'s <code>GetDeviceRunContainerOptions</code> and adds the returned <code>options</code> to <code>kubecontainer.RunContainerOptions</code>. <code>RunContainerOptions</code> carries <code>Envs</code>, <code>Mounts</code>, <code>Devices</code>, <code>PortMappings</code>, <code>Annotations</code>, and related information. The kubelet calls <code>GetResources()</code> to obtain the startup parameters, <code>runtimeapi.ContainerConfig{Args...}</code>, for launching a <code>container</code>.</p><figure class="highlight go"><figcaption><span>kubernetes/pkg/kubelet/cm/container_manager_linux.go</span></figcaption><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(cm *containerManagerImpl)</span> <span class="title">GetResources</span><span class="params">(pod *v1.Pod, container *v1.Container)</span> <span class="params">(*kubecontainer.RunContainerOptions, error)</span></span> &#123;</span><br><span class="line">    opts := &amp;kubecontainer.RunContainerOptions&#123;&#125;</span><br><span class="line">    <span class="comment">// Allocate should already be called during predicateAdmitHandler.Admit(),</span></span><br><span class="line">    <span class="comment">// just try to fetch device runtime information from cached state here</span></span><br><span class="line">    devOpts, err := 
cm.deviceManager.GetDeviceRunContainerOptions(pod, container)</span><br><span class="line">    <span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">nil</span>, err</span><br><span class="line">    &#125; <span class="keyword">else</span> <span class="keyword">if</span> devOpts == <span class="literal">nil</span> &#123;</span><br><span class="line">        <span class="keyword">return</span> opts, <span class="literal">nil</span></span><br><span class="line">    &#125;</span><br><span class="line">    opts.Devices = <span class="built_in">append</span>(opts.Devices, devOpts.Devices...)</span><br><span class="line">    opts.Mounts = <span class="built_in">append</span>(opts.Mounts, devOpts.Mounts...)</span><br><span class="line">    opts.Envs = <span class="built_in">append</span>(opts.Envs, devOpts.Envs...)</span><br><span class="line">    opts.Annotations = <span class="built_in">append</span>(opts.Annotations, devOpts.Annotations...)</span><br><span class="line">    <span class="keyword">return</span> opts, <span class="literal">nil</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p><code>GetDeviceRunContainerOptions()</code> uses the <code>pod UID</code> and <code>container name</code> to look up Envs, Mounts, Devices, PortMappings, Annotations, and related information in the <code>podDevices</code> cache (which is populated during device allocation). In addition, for any <code>DevicePlugin</code> whose PreStartRequired is true, the deviceManager must call that <code>DevicePlugin</code>'s <code>PreStartContainer</code> gRPC interface before starting the container to perform device initialization, with a timeout of 30 seconds.</p><figure class="highlight go"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span 
class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="keyword">func</span> <span class="params">(m *ManagerImpl)</span> <span class="title">GetDeviceRunContainerOptions</span><span class="params">(pod *v1.Pod, container *v1.Container)</span> <span class="params">(*DeviceRunContainerOptions, error)</span></span> &#123;</span><br><span class="line">    podUID := <span class="keyword">string</span>(pod.UID)</span><br><span class="line">    contName := container.Name</span><br><span class="line">    needsReAllocate := <span class="literal">false</span></span><br><span class="line">    <span class="keyword">for</span> k := <span class="keyword">range</span> container.Resources.Limits &#123;</span><br><span class="line">        resource := <span class="keyword">string</span>(k)</span><br><span class="line">        <span class="keyword">if</span> !m.isDevicePluginResource(resource) &#123;</span><br><span class="line">            <span class="keyword">continue</span></span><br><span class="line">        &#125;</span><br><span class="line">        err := m.callPreStartContainerIfNeeded(podUID, contName, resource)</span><br><span class="line">        <span class="keyword">if</span> err != <span class="literal">nil</span> &#123;</span><br><span class="line">            <span 
class="keyword">return</span> <span class="literal">nil</span>, err</span><br><span class="line">        &#125;</span><br><span class="line">        <span class="comment">// This is a device plugin resource yet we don't have cached</span></span><br><span class="line">        <span class="comment">// resource state. This is likely due to a race during node</span></span><br><span class="line">        <span class="comment">// restart. We re-issue allocate request to cover this race.</span></span><br><span class="line">        <span class="keyword">if</span> m.podDevices.containerDevices(podUID, contName, resource) == <span class="literal">nil</span> &#123;</span><br><span class="line">            needsReAllocate = <span class="literal">true</span></span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">if</span> needsReAllocate &#123;</span><br><span class="line">        klog.V(<span class="number">2</span>).Infof(<span class="string">"needs re-allocate device plugin resources for pod %s, container %s"</span>, podUID, container.Name)</span><br><span class="line">        <span class="keyword">if</span> err := m.Allocate(pod, container); err != <span class="literal">nil</span> &#123;</span><br><span class="line">            <span class="keyword">return</span> <span class="literal">nil</span>, err</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">    m.mutex.Lock()</span><br><span class="line">    <span class="keyword">defer</span> m.mutex.Unlock()</span><br><span class="line">    <span class="keyword">return</span> m.podDevices.deviceRunContainerOptions(<span class="keyword">string</span>(pod.UID), container.Name), <span class="literal">nil</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h3 id="Device-的状态管理"><a href="#Device-的状态管理" class="headerlink" title="Device 的状态管理"></a>Device 
State Management</h3><p>Device state management involves the following three parts:</p><ul><li>Device state on the node: when the kubelet updates the node status, it calls GetCapacity to refresh the resource information for each device plugin.</li></ul><p>kubelet_node_status.go calls the deviceManager's GetCapacity() to obtain device state, adds that state to the node info, and stores it in etcd through kube-apiserver. GetCapacity() returns all devices registered by the device plugins (capacity), the devices that can be allocated to pods (allocatable), and the resources that are no longer active and therefore unusable by pods. kubelet_node_status.go updates the node info from the returned data.</p><ul><li>Device state inside the kubelet's deviceManager: this was covered in the device registration and device allocation sections. It relies on a checkpoint mechanism that, by default, stores podDevices as PodDevicesEntry records in the file <code>/var/lib/kubelet/device-plugins/kubelet_internal_checkpoint</code>.</li></ul><figure class="highlight js"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">type PodDevicesEntry struct &#123;</span><br><span class="line">   PodUID        string</span><br><span class="line">   ContainerName string</span><br><span class="line">   ResourceName  string</span><br><span class="line">   DeviceIDs     []string</span><br><span class="line">   AllocResp     []byte     <span class="comment">// holds the Envs, Mounts, Devices, PortMappings, Annotations, etc. used when starting the container</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Whenever device state changes (for example, a new device is registered, a device is allocated, a device's health changes, or a device is removed), podDevices must be written to the <code>kubelet_internal_checkpoint</code> file. Whenever the kubelet starts or restarts, it reads the data in <code>kubelet_internal_checkpoint</code> and loads it into the podDevices cache.</p><ul><li>State reported by the <code>DevicePlugin</code>: this was covered in the device registration section. In summary:<ul><li>After the <code>deviceManager</code> registers a <code>DevicePlugin</code>, it keeps a long-lived connection to it, continuously receives the <code>DevicePlugin</code>'s ListAndWatch results, and keeps device state up to date;</li><li>When receiving fails, the <code>deviceManager</code> closes the connection and marks the devices as unhealthy;</li><li>By default, the <code>DevicePlugin</code> restarts, registers again, and reports device state again.</li></ul></li></ul><h2 id="参考资料"><a 
href="#参考资料" class="headerlink" title="参考资料"></a>References</h2><ul><li><a href="https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-management/device-plugin.md" target="_blank" rel="external nofollow noopener noreferrer">Kubernetes device plugin design proposal</a></li><li><a href="https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/plugin-watcher.md" target="_blank" rel="external nofollow noopener noreferrer">Kubernetes plugin watcher design proposal</a></li><li><a href="https://github.com/NVIDIA/k8s-device-plugin" target="_blank" rel="external nofollow noopener noreferrer">Nvidia Device Plugin</a></li><li><a href="https://cloud.tencent.com/developer/article/1592800" target="_blank" rel="external nofollow noopener noreferrer">https://cloud.tencent.com/developer/article/1592800</a></li></ul>]]></content>
    
    <summary type="html">
    
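      &lt;p&gt;Kubernetes natively discovers CPU and memory resources, but the kubelet cannot natively handle many other devices, such as GPUs, FPGAs, RDMA, storage devices, and similar heterogeneous computing resources. Using these devices requires device-specific initialization and setup. Following Kubernetes' &lt;code&gt;OutOfTree&lt;/code&gt; philosophy, vendor-specific initialization and setup code should not live alongside the Kubernetes core code. Instead, we need a mechanism that lets device vendors report device resources to the kubelet without modifying Kubernetes core code. This is the origin of the &lt;code&gt;Device Plugin&lt;/code&gt; mechanism. This article explains how Device Plugin is implemented and how to use it.&lt;/p&gt;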
    
    </summary>
    
    <content src="https://houmin.cc/https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-21_k8s-device-plugin.png" type="image" />
    
    
      <category term="术业专攻" scheme="https://houmin.cc/categories/%E6%9C%AF%E4%B8%9A%E4%B8%93%E6%94%BB/"/>
    
    
      <category term="GPU" scheme="https://houmin.cc/tags/GPU/"/>
    
      <category term="k8s" scheme="https://houmin.cc/tags/k8s/"/>
    
      <category term="device plugin" scheme="https://houmin.cc/tags/device-plugin/"/>
    
      <category term="RDMA" scheme="https://houmin.cc/tags/RDMA/"/>
    
      <category term="FPGA" scheme="https://houmin.cc/tags/FPGA/"/>
    
  </entry>
  
  <entry>
    <title>[Heterogeneous Computing] GPU and CUDA</title>
    <link href="https://houmin.cc/posts/5004f8e5/"/>
    <id>https://houmin.cc/posts/5004f8e5/</id>
    <published>2020-11-15T05:16:10.000Z</published>
    <updated>2022-11-09T15:13:45.394Z</updated>
    
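    <content type="html"><![CDATA[<link rel="stylesheet" class="aplayer-secondary-style-marker" href="/assets/css/APlayer.min.css"><script src="/assets/js/APlayer.min.js" class="aplayer-secondary-script-marker"></script><script class="meting-secondary-script-marker" src="/assets/js/Meting.min.js"></script><p>With the explosion of deep learning in recent years, GPUs, originally built for graphics rendering, are now widely used to accelerate model training in parallel. Along the way, CUDA, the general-purpose parallel computing platform and programming model that NVIDIA built on its GPUs, has also been widely adopted. You may already know <a href="../b893097a/">modern CPU architecture</a> well but still be hazy on GPUs: how GPU architecture differs from CPU architecture, what the CUDA model is, and how to use CUDA for parallel computing. This article aims to demystify these topics. All the code used here can be found on my <a href="https://github.com/SimpCosm/cuda-tutorial" target="_blank" rel="external nofollow noopener noreferrer">GitHub</a>.</p><a id="more"></a><h2 id="GPU-体系架构"><a href="#GPU-体系架构" class="headerlink" title="GPU 体系架构"></a>GPU Architecture</h2><h3 id="为什么我们需要-GPU"><a href="#为什么我们需要-GPU" class="headerlink" title="为什么我们需要 GPU"></a>Why We Need GPUs</h3><p>As noted above, the GPU (Graphics Processing Unit) started out purely as a graphics renderer for games and video, while today one of its hottest applications is accelerating deep learning. Why do we need GPUs to accelerate computation? Driven by Moore's law, CPU performance improved enormously over the past fifty years: in transistor count, in clock frequency, and later in the move from single-core processors to multi-core, multi-processor systems.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-18_moores-law-develop.jpg"></p><p>The figure below shows how transistor counts across CPU models have changed over the past fifty years. They roughly follow the rule of doubling every 18 months, although after fifty years Moore's law now shows signs of stalling for CPUs (a separate topic that is not covered here).</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-18_moores-law.png"></p><p>During those fifty years of rapid CPU growth, demand for computation grew just as fast. From the desktop internet, to the mobile internet, to the deep learning boom that began about five years ago, each wave has needed enormous computing power. CPUs alone began to fall short, and heterogeneous computing units such as GPUs, FPGAs, and DSPs came into wide use. Below, I go back to the fundamentals of computation and use the GPU as an example to analyze why we need these heterogeneous units.</p><p>For both CPUs and GPUs, the computing model can be abstracted as the figure below, which is the classic von Neumann architecture.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-20_computing.png"></p><p>Four main factors determine computing capability:</p><ul><li><strong>Parallel Processing</strong>: Amount of data processed at one time</li><li><strong>Clock Frequency</strong>: Processing speed on each data element</li><li><strong>Memory Bandwidth</strong>: Amount of data 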
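transferred at one time</li><li><strong>Memory Latency</strong>: Time for each data element to be transferred</li></ul><p>For the CPU, consider each factor in turn:</p><ul><li>To add parallel processing capability, CPUs evolved from single-core, single-processor designs to multi-core, multi-processor ones, and a CPU can also execute multiple instructions per clock cycle.</li><li>Clock frequency and power are related by $ Power = k \times ClockFrequency \times Voltage^2 $. Historically, CPUs got faster by raising the clock frequency, and the voltage had to keep dropping to hold power consumption in check. As clock rates approached 4 GHz, however, the voltage could not be lowered any further, because it had reached the limit at which transistors can still switch between high and low levels. See <a href="../">Moore's law</a> for more on this.</li><li>CPUs today use conventional DDR memory, which imposes a clear memory bandwidth limit.</li><li>Latency from the CPU to DDR memory is high, roughly 100 ns as of 2020; see <a href="../fb3d782a/">Key Numbers Every Programmer Should Know</a> for details. CPUs hide this latency by other means:<ul><li>Large on-chip low-latency caches, roughly 1 ns</li><li>Multithreading</li><li>Out-of-order execution</li></ul></li></ul><p><img alt="Credit to https://queue.acm.org/detail.cfm?id=2181798" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-20_processor-frequency-scaling.png"></p><p>CPU capability is still improving, but the problems above severely limit further gains, and CPUs alone can no longer satisfy the demand for massive computing power. We therefore need other specialized chips to compute alongside the CPU, which is where heterogeneous computing comes from. Specialized units such as GPUs run at lower clock frequencies but have far more cores and parallel computing capability, so both their performance per chip area and their performance per watt are high. With the arrival of the AI era, the GPU moved beyond gaming and into the spotlight.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-20_cpu-vs-gpu.png"></p><p>Both CPUs and GPUs use cores to perform arithmetic and logic operations. A core contains circuits such as an ALU (arithmetic logic unit) and registers, and a single core can only execute a task sequentially. As a general-purpose chip, the CPU does more than arithmetic and logic: a large part of its job is complex control logic. CPUs therefore have relatively few cores, and a data center server typically has only around 40. A GPU, by contrast, often has thousands of cores that can each perform arithmetic and logic operations independently, which greatly increases parallel processing capability.</p><p>The biggest beneficiary of the GPU era is NVIDIA. AMD also makes GPUs, but it has not built a software ecosystem comparable to CUDA, so deep learning mostly runs on NVIDIA GPUs, and the rest of this analysis is based on NVIDIA products. NVIDIA's chip design differs from generation to generation, and each generation has an architecture code name taken from a famous scientist as a tribute. From a customer's perspective, NVIDIA has two main product lines:</p><ul><li>Consumer products, the GeForce series: GeForce 2080 Ti…</li><li>High-performance computing products, the Tesla series: Tesla V100, Tesla P100, Tesla P40…</li></ul><p><img alt="NVIDIA GPU product lineup" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-20_nvidia-gpu.png"></p><h3 id="GPU-硬件模型"><a href="#GPU-硬件模型" class="headerlink" title="GPU 硬件模型"></a>GPU Hardware Model</h3><h4 id="Host-and-Device"><a href="#Host-and-Device" 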
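class="headerlink" title="Host and Device"></a>Host and Device</h4><p>A GPU is not a standalone computing platform. It works together with the CPU and can be viewed as the CPU's coprocessor, so GPU parallel computing really means a heterogeneous <code>CPU+GPU</code> architecture. Because the CPU and GPU are separate, NVIDIA calls the CPU and main memory the <strong>Host</strong> and the GPU and its memory the <strong>Device</strong>. The Host and Device concepts run through all of NVIDIA GPU programming.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-20_cpu-and-gpu.png"></p><p>A heterogeneous CPU + GPU platform lets each side play to its strengths: the CPU handles serial code with complex logic, while the GPU focuses on data-intensive parallel computation. A CUDA program contains both <strong>Host</strong> code and <strong>Device</strong> code, which run on the CPU and the GPU respectively.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-20_cuda-application.jpg"></p><p>The <strong>Host</strong> and <strong>Device</strong> copy data to each other over the PCIe bus. A typical CUDA program executes as follows:</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-20_cuda-flow.jpg"></p><ol><li>After initialization, copy the data from main memory to GPU memory</li><li>The CPU launches a CUDA kernel function</li><li>The GPU's CUDA cores execute the kernel in parallel</li><li>Copy the results from the <strong>Device</strong> back to the <strong>Host</strong></li></ol><p>While computing, GPU cores can read and write only GPU memory directly, so the programmer must state in code which data to copy between main memory and GPU memory. These transfers all cross the bus, which makes bus speed and bandwidth the bottleneck for some workloads. At the time of writing, the newest interconnect is NVLink: IBM Power CPUs and NVIDIA's high-end GPUs can communicate directly over NVLink, whereas Intel CPUs do not support NVLink and must use PCIe. Multiple NVIDIA GPUs in one machine can also communicate with each other over NVLink, which suits multi-GPU parallel computing.</p><p><img alt="NVLink can connect CPUs and GPUs" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-20_nvlink.png"></p><h4 id="Streaming-Multiprocessor"><a href="#Streaming-Multiprocessor" class="headerlink" title="Streaming Multiprocessor"></a>Streaming Multiprocessor</h4><p>In NVIDIA's design, a GPU card contains multiple Streaming Multiprocessors (<strong>SM</strong>), each SM contains many compute cores, and the SM is the basic unit of computation and scheduling. The figure below shows the Tesla V100, the flagship compute GPU of the Volta generation. The dense grid of small green cells is made up of the GPU's small cores, and a group of them forms one SM.</p><p><img alt="Tesla V100 with 84 SM Units" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-20_nvidia-tesla-v100.png"></p><p>Zooming in, a single SM is structured as shown below:</p><p><img alt="Tesla V100 Streaming Multiprocessor(SM)" 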
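data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-20_nvidia-tesla-v100-sm.png"></p><p>An SM contains both compute cores and storage. Its key components include CUDA cores, shared memory, and registers. An SM can run hundreds of threads concurrently, and how many depends on the resources the SM has.</p><ul><li>Small cores specialized for different kinds of computation (the small green cells): Tensor Cores optimized for deep learning, 32 64-bit floating-point cores (FP64), 64 integer cores (INT), and 64 32-bit floating-point cores (FP32)</li><li>Registers, which the compute cores read and write directly</li><li>Schedulers and dispatch units</li><li>L0 and L1 caches</li></ul><p>Specifically, the FP32 cores in an SM perform 32-bit floating-point add and multiply operations, the INT cores perform integer add and multiply operations, and the SFU (Special Function Unit) handles operations such as reciprocals and trigonometric functions. The Tensor Core is a mixed-precision compute core introduced in NVIDIA's newer microarchitectures. The most frequent matrix operation in today's deep neural networks is $ D = A \times B + C $, and a Tensor Core can perform it on $ 4 \times 4 $ matrices, where:</p><ul><li>A and B, the operands of the multiplication, use 16-bit FP16 floating point, which has lower precision</li><li>C and D, the addend and the result, use FP16 or FP32 precision</li></ul><p>Tensor Cores were introduced with the Volta architecture, and the Volta-based V100 far outperforms the Pascal-based P100 on deep learning.</p><p><img alt="The Tensor Core is a compute core optimized for deep learning" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-20_tensor-core.png"></p><h2 id="CUDA-编程模型"><a href="#CUDA-编程模型" class="headerlink" title="CUDA 编程模型"></a>CUDA Programming Model</h2><p>As mentioned earlier, one of NVIDIA's big advantages over AMD is its CUDA software ecosystem. The figure below shows the NVIDIA GPU software stack. At the bottom are the GPU driver and the CUDA Toolkit. On top of them sit the libraries that scientific computing depends on: cuBLAS for linear algebra, cuFFT for fast Fourier transforms, and cuDNN for accelerating deep neural networks. Popular deep learning frameworks such as TensorFlow and PyTorch are largely built on cuDNN underneath.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-20_gpu-software-stack.png"></p><h3 id="Hello-World"><a href="#Hello-World" class="headerlink" title="Hello World"></a>Hello World</h3><p>Before going further into the CUDA programming model, we first set up a CUDA environment and run <code>Hello World</code> to get a feel for CUDA programming. The machine used here is a Tencent Cloud GPU server running CentOS 7. For environment setup, see the <a href="https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html" target="_blank" rel="external nofollow noopener noreferrer">CUDA Installation Guide Linux</a>.</p><p>Following the software stack above, once we have a server with a GPU installed, we first check which GPU it has. This machine has a <code>Tesla P40</code>:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td 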
class="code"><pre><span class="line">$ lspci | grep -i nvidia</span><br><span class="line">00:08.0 3D controller: NVIDIA Corporation GP102GL [Tesla P40] (rev a1)</span><br></pre></td></tr></table></figure><p>Next, download the CUDA Toolkit from <a href="https://developer.nvidia.com/cuda-downloads" target="_blank" rel="external nofollow noopener noreferrer">the CUDA downloads page</a>. The <code>rpm local</code> installer type is used here:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">$ wget https://developer.download.nvidia.com/compute/cuda/11.1.1/local_installers/cuda-repo-rhel7-11-1-local-11.1.1_455.32.00-1.x86_64.rpm</span><br><span class="line">$ sudo rpm -i cuda-repo-rhel7-11-1-local-11.1.1_455.32.00-1.x86_64.rpm</span><br><span class="line">$ sudo yum clean all</span><br><span class="line">$ sudo yum -y install nvidia-driver-latest-dkms cuda</span><br><span class="line">$ sudo yum -y install cuda-drivers</span><br></pre></td></tr></table></figure><p>After the installation finishes, <code>libcuda.so</code> appears under <code>/usr/lib64/</code>:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">$ ls /usr/lib64 -al | grep cuda</span><br><span class="line">lrwxrwxrwx   1 root root        20 Nov 21 15:05 libcuda.so -&gt; libcuda.so.455.32.00</span><br><span class="line">lrwxrwxrwx   1 root root        20 Nov 21 15:05 libcuda.so.1 -&gt; libcuda.so.455.32.00</span><br><span class="line">-rwxr-xr-x   1 root root  21074296 Oct 15 06:58 libcuda.so.455.32.00</span><br></pre></td></tr></table></figure><p>Below are some CUDA tools we will use often. You need to set environment variables before you can use them:</p><figure class="highlight brainfuck"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">Compiler: nvcc</span> <span class="comment">(C/C</span>++<span class="comment">)</span></span><br><span class="line"><span class="comment">Debugger: cuda</span><span class="literal">-</span><span class="comment">gdb</span></span><br><span class="line"><span class="comment">Profilers: nsight</span><span class="string">,</span> <span class="comment">nvprof</span></span><br><span class="line"><span class="comment">Libraries: cublas</span><span class="string">,</span> <span class="comment">nvblas</span><span class="string">,</span> <span class="comment">cusolver</span><span class="string">,</span> <span class="comment">cufftw</span><span class="string">,</span> <span class="comment">cusparse</span><span class="string">,</span> <span class="comment">nvgraph</span></span><br></pre></td></tr></table></figure><p>Set the environment variables as follows:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">$ <span class="built_in">export</span> PATH=/usr/<span class="built_in">local</span>/cuda-11.1/bin<span class="variable">$&#123;PATH:+:$&#123;PATH&#125;</span>&#125;</span><br><span class="line">$ nvcc --version</span><br><span class="line">nvcc: NVIDIA (R) Cuda compiler driver</span><br><span class="line">Copyright (c) 2005-2020 NVIDIA Corporation</span><br><span class="line">Built on Mon_Oct_12_20:09:46_PDT_2020</span><br><span class="line">Cuda compilation tools, release 11.1, V11.1.105</span><br><span class="line">Build cuda_11.1.TC455_06.29190527_0</span><br></pre></td></tr></table></figure><p>In addition, on 64-bit systems you need to set <code>LD_LIBRARY_PATH</code>:</p><figure 
class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">$ <span class="built_in">export</span> LD_LIBRARY_PATH=/usr/<span class="built_in">local</span>/cuda-11.1/lib64<span class="variable">$&#123;LD_LIBRARY_PATH:+:$&#123;LD_LIBRARY_PATH&#125;</span>&#125;</span><br></pre></td></tr></table></figure><p>At this point you can confirm the driver version:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$ cat /proc/driver/nvidia/version</span><br><span class="line">NVRM version: NVIDIA UNIX x86_64 Kernel Module  455.32.00  Wed Oct 14 22:46:18 UTC 2020</span><br><span class="line">GCC version:  gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC)</span><br></pre></td></tr></table></figure><p>Use the <code>nvidia-smi</code> command to inspect the GPUs, for example how many GPUs the machine has, the CUDA version, and the processes running on each GPU.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line">$ nvidia-smi</span><br><span class="line">Sat Nov 21 17:09:13 2020</span><br><span class="line">+-----------------------------------------------------------------------------+</span><br><span class="line">| NVIDIA-SMI 455.32.00    Driver 
Version: 455.32.00    CUDA Version: 11.1     |</span><br><span class="line">|-------------------------------+----------------------+----------------------+</span><br><span class="line">| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |</span><br><span class="line">| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |</span><br><span class="line">|                               |                      |               MIG M. |</span><br><span class="line">|===============================+======================+======================|</span><br><span class="line">|   0  Tesla P40           Off  | 00000000:00:08.0 Off |                    0 |</span><br><span class="line">| N/A   27C    P0    49W / 250W |      0MiB / 22919MiB |      3%      Default |</span><br><span class="line">|                               |                      |                  N/A |</span><br><span class="line">+-------------------------------+----------------------+----------------------+</span><br><span class="line"></span><br><span class="line">+-----------------------------------------------------------------------------+</span><br><span class="line">| Processes:                                                                  |</span><br><span class="line">|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |</span><br><span class="line">|        ID   ID                                                   Usage      |</span><br><span class="line">|=============================================================================|</span><br><span class="line">|  No running processes found                                                 |</span><br><span class="line">+-----------------------------------------------------------------------------+</span><br></pre></td></tr></table></figure><p><code>CUDA</code> ships with a collection of code samples, which can be installed as follows:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br></pre></td><td class="code"><pre><span class="line">$ cuda-install-samples-11.1.sh &lt;dir&gt;</span><br></pre></td></tr></table></figure><p>In the target directory you can see the source code that <code>CUDA</code> provides:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$ ls NVIDIA_CUDA-11.1_Samples</span><br><span class="line">0_Simple     2_Graphics  4_Finance      6_Advanced       bin     EULA.txt  Makefile</span><br><span class="line">1_Utilities  3_Imaging   5_Simulations  7_CUDALibraries  common  LICENSE</span><br></pre></td></tr></table></figure><p>Running <code>make</code> directly in this directory builds all the samples and places the binaries under the <code>bin</code> directory. Pick <code>deviceQuery</code> from among them and run it:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span 
class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br></pre></td><td class="code"><pre><span class="line">$ ./deviceQuery</span><br><span class="line">./deviceQuery Starting...</span><br><span class="line"></span><br><span class="line"> CUDA Device Query (Runtime API) version (CUDART static linking)</span><br><span class="line"></span><br><span class="line">Detected 1 CUDA Capable device(s)</span><br><span class="line"></span><br><span class="line">Device 0: <span class="string">"Tesla P40"</span></span><br><span class="line">  CUDA Driver Version / Runtime Version          11.1 / 11.1</span><br><span class="line">  CUDA Capability Major/Minor version number:    6.1</span><br><span class="line">  Total amount of global memory:                 22919 MBytes (24032378880 bytes)</span><br><span class="line">  (30) Multiprocessors, (128) CUDA Cores/MP:     3840 CUDA Cores</span><br><span class="line">  GPU Max Clock rate:                            1531 MHz (1.53 GHz)</span><br><span class="line">  Memory Clock rate:                             3615 Mhz</span><br><span class="line">  Memory Bus Width:                              384-bit</span><br><span class="line">  L2 Cache Size:                                 3145728 bytes</span><br><span class="line">  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)</span><br><span class="line">  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers</span><br><span class="line">  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers</span><br><span class="line">  Total amount of constant memory:               65536 
bytes</span><br><span class="line">  Total amount of shared memory per block:       49152 bytes</span><br><span class="line">  Total shared memory per multiprocessor:        98304 bytes</span><br><span class="line">  Total number of registers available per block: 65536</span><br><span class="line">  Warp size:                                     32</span><br><span class="line">  Maximum number of threads per multiprocessor:  2048</span><br><span class="line">  Maximum number of threads per block:           1024</span><br><span class="line">  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)</span><br><span class="line">  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)</span><br><span class="line">  Maximum memory pitch:                          2147483647 bytes</span><br><span class="line">  Texture alignment:                             512 bytes</span><br><span class="line">  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)</span><br><span class="line">  Run time <span class="built_in">limit</span> on kernels:                     No</span><br><span class="line">  Integrated GPU sharing Host Memory:            No</span><br><span class="line">  Support host page-locked memory mapping:       Yes</span><br><span class="line">  Alignment requirement <span class="keyword">for</span> Surfaces:            Yes</span><br><span class="line">  Device has ECC support:                        Enabled</span><br><span class="line">  Device supports Unified Addressing (UVA):      Yes</span><br><span class="line">  Device supports Managed Memory:                Yes</span><br><span class="line">  Device supports Compute Preemption:            Yes</span><br><span class="line">  Supports Cooperative Kernel Launch:            Yes</span><br><span class="line">  Supports MultiDevice Co-op Kernel Launch:      Yes</span><br><span class="line">  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 8</span><br><span 
class="line">  Compute Mode:</span><br><span class="line">     &lt; Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) &gt;</span><br><span class="line"></span><br><span class="line">deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.1, CUDA Runtime Version = 11.1, NumDevs = 1</span><br><span class="line">Result = PASS</span><br></pre></td></tr></table></figure><p>The <code>CUDA Toolkit</code> is now installed. Next, let's write a simple <code>hello world</code> program to get a first feel for CUDA programming:</p><figure class="highlight c"><figcaption><span>hello.cu</span></figcaption><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;stdio.h&gt;</span></span></span><br><span class="line"></span><br><span class="line"><span class="function">__global__ <span class="keyword">void</span> <span class="title">hello_from_gpu</span><span class="params">()</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">    <span class="built_in">printf</span>( <span class="string">"\"Hello, world!\", says the GPU.\n"</span> );</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">void</span> <span 
class="title">hello_from_cpu</span><span class="params">()</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">    <span class="built_in">printf</span>( <span class="string">"\"Hello, world!\", says the CPU.\n"</span> );</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="comment">// host code entry point</span></span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">main</span><span class="params">( <span class="keyword">int</span> argc, <span class="keyword">char</span> **argv )</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">    hello_from_cpu();</span><br><span class="line">    hello_from_gpu &lt;&lt;&lt; <span class="number">2</span>, <span class="number">4</span>&gt;&gt;&gt;();</span><br><span class="line">    cudaDeviceReset();</span><br><span class="line">    <span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>As you can see, a CUDA program looks almost the same as a standard C program. The main differences are the <code>__global__</code> qualifier and the launch syntax <code>&lt;&lt;&lt;... 
&gt;&gt;&gt;</code>. The <code>__global__</code> qualifier tells the compiler that the function runs on the <strong>Device</strong> (GPU) and is called from code running on the <strong>Host</strong>. Such a function is called a kernel: a function executed in parallel by threads on the <strong>Device</strong>.</p><p>When a kernel is called, its execution configuration must be set with the <code>&lt;&lt;&lt;grid, block&gt;&gt;&gt;</code> syntax. In CUDA terminology this is called a <code>kernel launch</code>, which is covered in more depth later.</p><p>With the <code>hello world</code> program written, save it with the <code>.cu</code> extension as <code>hello.cu</code> and compile it with <code>nvcc</code>, whose usage is almost identical to that of <code>gcc</code>:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">$ nvcc hello.cu -o hello</span><br><span class="line">$ ./hello</span><br><span class="line"><span class="string">"Hello, world!"</span>, says the CPU.</span><br><span class="line"><span class="string">"Hello, world!"</span>, says the GPU.</span><br><span class="line"><span class="string">"Hello, world!"</span>, says the GPU.</span><br><span class="line"><span class="string">"Hello, world!"</span>, says the GPU.</span><br><span class="line"><span class="string">"Hello, world!"</span>, says the GPU.</span><br><span class="line"><span class="string">"Hello, world!"</span>, says the GPU.</span><br><span class="line"><span class="string">"Hello, world!"</span>, says the GPU.</span><br><span class="line"><span class="string">"Hello, world!"</span>, says the GPU.</span><br><span class="line"><span class="string">"Hello, world!"</span>, says the GPU.</span><br></pre></td></tr></table></figure><p>As you can see, the <code>Hello World</code> from the CPU was printed once, while the <code>Hello World</code> from the GPU was printed 8 times.</p><h3 id="核函数与线程模型"><a href="#核函数与线程模型" 
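class="headerlink" title="核函数与线程模型"></a>Kernels and the Thread Model</h3><p>As mentioned above, to accelerate a computation in parallel on the GPU, the <strong>Host</strong> performs a <code>kernel launch</code> so that the kernel runs concurrently in many threads on the <strong>Device</strong>. Concretely, the <code>&lt;&lt;&lt;grid, block&gt;&gt;&gt;</code> configuration given at the call site specifies the number of threads N that will execute the kernel. The GPU then runs the kernel in N threads, scheduled in parallel across its cores, and each thread is given an index that is unique within its Block and can be read inside the kernel through the built-in variable <code>threadIdx</code>.</p><p>In CUDA, one instance of the computation defined by a kernel is called a <strong>Thread</strong>; multiple threads form a <strong>Block</strong>, and multiple blocks form a <strong>Grid</strong>. A single Grid can therefore define many thousands of threads, which is how tens of thousands of operations can be executed in parallel. In <code>&lt;&lt;&lt;grid, block&gt;&gt;&gt;</code>, the first number is the number of Blocks in the whole Grid and the second is the number of Threads in each Block. The earlier <code>Hello World</code> used 2 Blocks of 4 Threads each, so the kernel ran 8 times in total.</p><p><img alt="Grid of Thread Blocks" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-21_cuda-thread-hierarchy.png"></p><p>A Thread is really a software concept from the programming model. In hardware terms, a Thread runs on one CUDA core, a Block of Threads runs on one Streaming Multiprocessor (SM), and a Grid of Blocks runs on one GPU. When a <code>kernel</code> is executed, the thread blocks of its grid are distributed to SMs, and <strong>a thread block can only be scheduled on a single SM</strong>. An SM can usually host several thread blocks at once, depending on its own capabilities. The thread blocks of one <code>kernel</code> may therefore be spread over several SMs: the grid is only the logical layer, while the SMs are the physical layer of execution.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-21_cuda-software-hardware-view.png"></p><p><code>grid</code> and <code>block</code> are both variables of type <code>dim3</code>, which can be seen as a struct with three unsigned integer members (x, y, z); any member left unspecified is initialized to 1. <code>grid</code> and <code>block</code> can therefore be flexibly defined as <code>1-dim</code>, <code>2-dim</code>, or <code>3-dim</code> structures. For the structure in the figure above (where the horizontal direction is the x-axis), <code>grid</code> and <code>block</code> are defined as shown below. When the <code>kernel</code> is called, the <a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#execution-configuration" target="_blank" rel="external nofollow noopener noreferrer">execution configuration</a> <code>&lt;&lt;&lt;grid, block&gt;&gt;&gt;</code> must also be given to specify the number and layout of the threads the <code>kernel</code> uses.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span 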
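class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="function">dim3 <span class="title">grid</span><span class="params">(<span class="number">3</span>, <span class="number">2</span>)</span></span>;</span><br><span class="line"><span class="function">dim3 <span class="title">block</span><span class="params">(<span class="number">5</span>, <span class="number">3</span>)</span></span>;</span><br><span class="line">kernel_fun&lt;&lt;&lt; grid, block &gt;&gt;&gt;(params...);</span><br></pre></td></tr></table></figure><p>A thread is therefore uniquely identified by two built-in coordinate variables, <code>(blockIdx, threadIdx)</code>. Both are three-component vectors with <code>x</code>, <code>y</code>, and <code>z</code> members, like <code>dim3</code>: <code>blockIdx</code> gives the position of the thread's block within the grid, and <code>threadIdx</code> gives the position of the thread within its block. For example, <code>Thread (1,1)</code> of Block (1,1) in the figure satisfies:</p><figure class="highlight plain"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">threadIdx.x &#x3D; 1</span><br><span class="line">threadIdx.y &#x3D; 1</span><br><span class="line">blockIdx.x &#x3D; 1</span><br><span class="line">blockIdx.y &#x3D; 1</span><br></pre></td></tr></table></figure><p>Different execution configurations affect the speed of a GPU program, and it usually takes several rounds of tuning to find a good one. In practice, choose the execution configuration <code>&lt;&lt;&lt;grid, block&gt;&gt;&gt;</code> along the following lines:</p><ul><li>Blocks run on SMs, and different hardware architectures (Turing, Volta, Pascal, ...) have different numbers of CUDA cores, so the Block size <code>block</code> (the second parameter of the execution configuration) generally needs to be chosen for the hardware at hand. The number of Threads in a Block should preferably be a multiple of 32, with 128 and 256 being common choices. Note that current hardware limits a Block to at most 1024 Threads.</li><li>The Grid size <code>grid</code> (the first parameter of the execution configuration), that is, the number of Blocks in the Grid, can be obtained by dividing the total number of operations <code>N</code> by <code>block</code> and rounding up.</li></ul><p>For example, to launch 1000 Threads in parallel we can set blockDim to 128; <code>1000 ÷ 128 ≈ 7.8</code>, which rounds up to 8. The execution configuration can then be written as <code>gpuWork&lt;&lt;&lt;8, 128&gt;&gt;&gt;()</code>. CUDA launches <code>8 * 128 = 1024</code> Threads in total; only the first 1000 do the actual computation, and the remaining 24 must skip it, typically through a bounds check inside the kernel.</p> <div class="note info">            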
<p>These variables are easy to confuse, so to restate: <code>block</code> is the number of Threads in a Block, and the <code>threadIdx</code> of a Thread is always smaller than <code>block</code>; <code>grid</code> is the number of Blocks in a Grid, and the <code>blockIdx</code> of a Block is always smaller than <code>grid</code>.</p>          </div><p>This thread organization of a <code>kernel</code> is a natural fit for vector and matrix operations, and we will implement vector addition and matrix multiplication later. For example, the 2-dim structure in the figure above can be used to add two matrices, with each thread adding the two elements at one position; the code is shown below. The thread block size is (16, 16), and the $ N*N $ matrix is divided evenly among the thread blocks that perform the addition.</p><p>An SM uses the <a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture" target="_blank" rel="external nofollow noopener noreferrer">SIMT</a> (Single-Instruction, Multiple-Thread) architecture. Its basic unit of execution is the <strong>warp</strong>, which consists of 32 threads. The threads of a warp execute the same instruction at the same time, but each thread has its own instruction address counter and register state, and its own independent execution path.</p><p>When a thread block is assigned to an SM, it is further divided into warps, since the warp is the SM's basic unit of execution. The number of warps an SM can run concurrently is limited, however, because of resource constraints: the SM has to allocate shared memory for every thread block and separate registers for every thread of every warp. The configuration of the SM therefore determines how many thread blocks and warps it can run concurrently. Because the basic unit of execution is a warp of 32 threads, the block size should normally be a multiple of 32. A two-dimensional Block of <code>(16, 16)</code>, 256 threads in total, is a common configuration. Typical choices for the number of Threads per Block are 128, 256, or 512; the best value depends heavily on the GPU hardware architecture.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// Kernel definition</span></span><br><span class="line"><span class="function">__global__ <span 
class="keyword">void</span> <span class="title">MatAdd</span><span class="params">(<span class="keyword">float</span> A[N][N], <span class="keyword">float</span> B[N][N], <span class="keyword">float</span> C[N][N])</span> </span></span><br><span class="line"><span class="function"></span>&#123; </span><br><span class="line">    <span class="keyword">int</span> i = blockIdx.x * blockDim.x + threadIdx.x; </span><br><span class="line">    <span class="keyword">int</span> j = blockIdx.y * blockDim.y + threadIdx.y; </span><br><span class="line">    <span class="keyword">if</span> (i &lt; N &amp;&amp; j &lt; N) </span><br><span class="line">        C[i][j] = A[i][j] + B[i][j]; </span><br><span class="line">&#125;</span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">main</span><span class="params">()</span> </span></span><br><span class="line"><span class="function"></span>&#123; </span><br><span class="line">    ...</span><br><span class="line">    <span class="comment">// Kernel thread configuration (block counts are rounded up so that all N x N elements are covered)</span></span><br><span class="line">    <span class="function">dim3 <span class="title">threadsPerBlock</span><span class="params">(<span class="number">16</span>, <span class="number">16</span>)</span></span>; </span><br><span class="line">    <span class="function">dim3 <span class="title">numBlocks</span><span class="params">((N + threadsPerBlock.x - <span class="number">1</span>) / threadsPerBlock.x, (N + threadsPerBlock.y - <span class="number">1</span>) / threadsPerBlock.y)</span></span>;</span><br><span class="line">    <span class="comment">// Kernel launch</span></span><br><span class="line">    MatAdd&lt;&lt;&lt;numBlocks, threadsPerBlock&gt;&gt;&gt;(A, B, C); </span><br><span class="line">    ...</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>The number of threads in a thread block is limited; on modern GPUs a block can hold up to 1024 threads. Sometimes we need the linear ID of a thread within its <code>block</code>, which requires knowing how the <code>block</code> is organized. This is available through the built-in variable <code>blockDim</code>, which gives the size of the thread block in each dimension.</p><ul><li>For a <code>2-dim</code> block of size $ (D_x, D_y) $, the thread at $ (x, y) $ has ID $ (x + y * D_x) $ 
</li><li>For a <code>3-dim</code> block of size $ (D_x, D_y, D_z) $, the thread at $ (x, y, z) $ has ID $ (x + y * D_x + z * D_x * D_y) $  </li></ul><p>Threads also have the built-in variable <code>gridDim</code>, which gives the size of the grid in each dimension.</p><h3 id="内存模型与管理"><a href="#内存模型与管理" class="headerlink" title="内存模型与管理"></a>Memory Model and Management</h3><p>Here is a brief introduction to the CUDA memory model, shown in the figure below. As you can see:</p><ul><li>Each <strong>Thread</strong> has its own private local memory (Local Memory)</li><li>Each <strong>Block</strong> has shared memory (Shared Memory), which is visible to all threads in the block and has the same lifetime as the block</li><li>All <strong>Threads</strong> can access global memory (Global Memory)</li><li>All threads can also access two read-only memory spaces: constant memory (Constant Memory) and texture memory (Texture Memory)</li><li>L1 Cache and L2 Cache</li></ul><div class="group-picture"><div class="group-picture-container"><div class="group-picture-row"><div class="group-picture-column" style="width: 50%;"><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-21_cuda-memory-sm.png"></div><div class="group-picture-column" style="width: 50%;"><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-21_cuda-memory-model.jpg"></div></div></div></div><p>Below are the memory management APIs most commonly used in CUDA programming. The basic ones are <code>cudaMalloc</code>, <code>cudaFree</code>, and <code>cudaMemcpy</code>, which manage memory on the <strong>Device</strong> and correspond to <code>malloc</code>, <code>free</code>, and <code>memcpy</code> in C:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// Allocate size bytes of device memory; `devPtr` points to the pointer that receives the address of the allocation</span></span><br><span class="line"><span class="function">cudaError_t <span class="title">cudaMalloc</span><span 
class="params">(<span class="keyword">void</span>** devPtr, <span class="keyword">size_t</span> <span class="built_in">size</span>)</span></span>;</span><br><span class="line"></span><br><span class="line"><span class="comment">// Free memory on the Device; `devPtr` points to the memory to be freed</span></span><br><span class="line"><span class="function">cudaError_t <span class="title">cudaFree</span><span class="params">(<span class="keyword">void</span>* devPtr)</span></span>;</span><br><span class="line"></span><br><span class="line"><span class="comment">// Copy data between Host and Device: src points to the source, dst to the destination, count is the number of bytes to copy, and kind controls the direction of the copy</span></span><br><span class="line"><span class="comment">// The commonly used values of kind are:</span></span><br><span class="line"><span class="comment">// - cudaMemcpyHostToHost</span></span><br><span class="line"><span class="comment">// - cudaMemcpyHostToDevice</span></span><br><span class="line"><span class="comment">// - cudaMemcpyDeviceToHost</span></span><br><span class="line"><span class="comment">// - cudaMemcpyDeviceToDevice</span></span><br><span class="line"><span class="function">cudaError_t <span class="title">cudaMemcpy</span><span class="params">(<span class="keyword">void</span>* dst, <span class="keyword">const</span> <span class="keyword">void</span>* src, <span class="keyword">size_t</span> count, cudaMemcpyKind kind)</span></span>;</span><br></pre></td></tr></table></figure><h2 id="CUDA-编程实战"><a href="#CUDA-编程实战" class="headerlink" title="CUDA 编程实战"></a>CUDA Programming in Practice</h2><p>With the basics of CUDA programming covered, we now use the addition of two vectors as an example to show how to use CUDA to accelerate a computation on the GPU.</p><h3 id="CPU-向量加法：传统计算方法"><a href="#CPU-向量加法：传统计算方法" class="headerlink" title="CPU 向量加法：传统计算方法"></a>CPU Vector Addition: The Traditional Approach</h3><p>Let's first look at how vector addition is programmed on the CPU:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span 
class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;stdio.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;stdlib.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;math.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;assert.h&gt;</span></span></span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="meta-keyword">define</span> N 10000000</span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">define</span> MAX_ERR 1e-6</span></span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">void</span> <span 
class="title">vector_add</span><span class="params">(<span class="keyword">float</span> *out, <span class="keyword">float</span> *a, <span class="keyword">float</span> *b, <span class="keyword">int</span> n)</span> </span>&#123;</span><br><span class="line">    <span class="keyword">for</span>(<span class="keyword">int</span> i = <span class="number">0</span>; i &lt; n; i++)&#123;</span><br><span class="line">        out[i] = a[i] + b[i];</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">main</span><span class="params">()</span></span>&#123;</span><br><span class="line">    <span class="keyword">float</span> *a, *b, *out; </span><br><span class="line"></span><br><span class="line">    <span class="comment">// Allocate memory</span></span><br><span class="line">    a   = (<span class="keyword">float</span>*)<span class="built_in">malloc</span>(<span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N);</span><br><span class="line">    b   = (<span class="keyword">float</span>*)<span class="built_in">malloc</span>(<span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N);</span><br><span class="line">    out = (<span class="keyword">float</span>*)<span class="built_in">malloc</span>(<span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Initialize array</span></span><br><span class="line">    <span class="keyword">for</span>(<span class="keyword">int</span> i = <span class="number">0</span>; i &lt; N; i++)&#123;</span><br><span class="line">        a[i] = <span class="number">1.0f</span>;</span><br><span class="line">        b[i] = <span class="number">2.0f</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">  
  <span class="comment">// Main function</span></span><br><span class="line">    vector_add(out, a, b, N);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Verification</span></span><br><span class="line">    <span class="keyword">for</span>(<span class="keyword">int</span> i = <span class="number">0</span>; i &lt; N; i++)&#123;</span><br><span class="line">        assert(<span class="built_in">fabs</span>(out[i] - a[i] - b[i]) &lt; MAX_ERR);</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="built_in">printf</span>(<span class="string">"out[0] = %f\n"</span>, out[<span class="number">0</span>]);</span><br><span class="line">    <span class="built_in">printf</span>(<span class="string">"PASSED\n"</span>);</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h3 id="GPU-向量加法：一个Block一个Thread"><a href="#GPU-向量加法：一个Block一个Thread" class="headerlink" title="GPU 向量加法：一个Block一个Thread"></a>GPU Vector Addition: One Block, One Thread</h3><p>We now convert the CPU vector addition into a CUDA program that runs on the GPU. The code below shows how a program is written following the CUDA programming conventions. It still uses only a single <code>core</code> for the computation, so it gains no parallelism and also pays the extra cost of copying data. It is clearly slower than the original version and serves mainly as a demonstration.</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span 
class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;stdio.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;stdlib.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;math.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;assert.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span 
class="meta-keyword">include</span> <span class="meta-string">&lt;cuda.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;cuda_runtime.h&gt;</span></span></span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="meta-keyword">define</span> N 10000000</span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">define</span> MAX_ERR 1e-6</span></span><br><span class="line"></span><br><span class="line"><span class="function">__global__ <span class="keyword">void</span> <span class="title">vector_add</span><span class="params">(<span class="keyword">float</span> *out, <span class="keyword">float</span> *a, <span class="keyword">float</span> *b, <span class="keyword">int</span> n)</span> </span>&#123;</span><br><span class="line">    <span class="keyword">for</span>(<span class="keyword">int</span> i = <span class="number">0</span>; i &lt; n; i ++)&#123;</span><br><span class="line">        out[i] = a[i] + b[i];</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">main</span><span class="params">()</span></span>&#123;</span><br><span class="line">    <span class="keyword">float</span> *a, *b, *out;</span><br><span class="line">    <span class="keyword">float</span> *d_a, *d_b, *d_out; </span><br><span class="line"></span><br><span class="line">    <span class="comment">// Allocate host memory</span></span><br><span class="line">    a   = (<span class="keyword">float</span>*)<span class="built_in">malloc</span>(<span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N);</span><br><span class="line">    b   = (<span class="keyword">float</span>*)<span class="built_in">malloc</span>(<span class="keyword">sizeof</span>(<span class="keyword">float</span>) 
* N);</span><br><span class="line">    out = (<span class="keyword">float</span>*)<span class="built_in">malloc</span>(<span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Initialize host arrays</span></span><br><span class="line">    <span class="keyword">for</span>(<span class="keyword">int</span> i = <span class="number">0</span>; i &lt; N; i++)&#123;</span><br><span class="line">        a[i] = <span class="number">1.0f</span>;</span><br><span class="line">        b[i] = <span class="number">2.0f</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Allocate device memory</span></span><br><span class="line">    cudaMalloc((<span class="keyword">void</span>**)&amp;d_a, <span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N);</span><br><span class="line">    cudaMalloc((<span class="keyword">void</span>**)&amp;d_b, <span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N);</span><br><span class="line">    cudaMalloc((<span class="keyword">void</span>**)&amp;d_out, <span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Transfer data from host to device memory</span></span><br><span class="line">    cudaMemcpy(d_a, a, <span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N, cudaMemcpyHostToDevice);</span><br><span class="line">    cudaMemcpy(d_b, b, <span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N, cudaMemcpyHostToDevice);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Executing kernel </span></span><br><span class="line">    vector_add&lt;&lt;&lt;<span class="number">1</span>,<span class="number">1</span>&gt;&gt;&gt;(d_out, d_a, d_b, 
N);</span><br><span class="line">    </span><br><span class="line">    <span class="comment">// Transfer data back to host memory</span></span><br><span class="line">    cudaMemcpy(out, d_out, <span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N, cudaMemcpyDeviceToHost);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Verification</span></span><br><span class="line">    <span class="keyword">for</span>(<span class="keyword">int</span> i = <span class="number">0</span>; i &lt; N; i++)&#123;</span><br><span class="line">        assert(<span class="built_in">fabs</span>(out[i] - a[i] - b[i]) &lt; MAX_ERR);</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="built_in">printf</span>(<span class="string">"PASSED\n"</span>);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Deallocate device memory</span></span><br><span class="line">    cudaFree(d_a);</span><br><span class="line">    cudaFree(d_b);</span><br><span class="line">    cudaFree(d_out);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Deallocate host memory</span></span><br><span class="line">    <span class="built_in">free</span>(a); </span><br><span class="line">    <span class="built_in">free</span>(b); </span><br><span class="line">    <span class="built_in">free</span>(out);</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h3 id="GPU-向量加法：一个Block多个Thread"><a href="#GPU-向量加法：一个Block多个Thread" class="headerlink" title="GPU 向量加法：一个Block多个Thread"></a>GPU Vector Addition: One Block, Multiple Threads</h3><p>To increase parallelism, we launch one <code>Block</code> whose many <code>Thread</code>s compute at the same time. As the figure below shows, there are 256 <code>Thread</code>s in total, and each one handles part of the vector. In each iteration the 256 threads compute 256 consecutive elements of the vector; in the next iteration every thread advances by 256 elements and continues. Thread 0, for example, handles elements 0, 256, 512, and so on.</p><p><img alt 
data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-21_cuda-parallel_thread.png"></p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span 
class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;stdio.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;stdlib.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;math.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;assert.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;cuda.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;cuda_runtime.h&gt;</span></span></span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="meta-keyword">define</span> N 10000000</span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">define</span> MAX_ERR 1e-6</span></span><br><span class="line"></span><br><span class="line"><span class="function">__global__ <span class="keyword">void</span> <span class="title">vector_add</span><span class="params">(<span class="keyword">float</span> *out, <span class="keyword">float</span> *a, <span class="keyword">float</span> *b, <span class="keyword">int</span> n)</span> </span>&#123;</span><br><span class="line">    <span class="keyword">int</span> index = threadIdx.x;</span><br><span class="line">    <span class="keyword">int</span> stride = blockDim.x;</span><br><span 
class="line"></span><br><span class="line">    <span class="keyword">for</span>(<span class="keyword">int</span> i = index; i &lt; n; i += stride)&#123;</span><br><span class="line">        out[i] = a[i] + b[i];</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">main</span><span class="params">()</span></span>&#123;</span><br><span class="line">    <span class="keyword">float</span> *a, *b, *out;</span><br><span class="line">    <span class="keyword">float</span> *d_a, *d_b, *d_out; </span><br><span class="line"></span><br><span class="line">    <span class="comment">// Allocate host memory</span></span><br><span class="line">    a   = (<span class="keyword">float</span>*)<span class="built_in">malloc</span>(<span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N);</span><br><span class="line">    b   = (<span class="keyword">float</span>*)<span class="built_in">malloc</span>(<span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N);</span><br><span class="line">    out = (<span class="keyword">float</span>*)<span class="built_in">malloc</span>(<span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Initialize host arrays</span></span><br><span class="line">    <span class="keyword">for</span>(<span class="keyword">int</span> i = <span class="number">0</span>; i &lt; N; i++)&#123;</span><br><span class="line">        a[i] = <span class="number">1.0f</span>;</span><br><span class="line">        b[i] = <span class="number">2.0f</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Allocate device memory </span></span><br><span class="line">    cudaMalloc((<span 
class="keyword">void</span>**)&amp;d_a, <span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N);</span><br><span class="line">    cudaMalloc((<span class="keyword">void</span>**)&amp;d_b, <span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N);</span><br><span class="line">    cudaMalloc((<span class="keyword">void</span>**)&amp;d_out, <span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Transfer data from host to device memory</span></span><br><span class="line">    cudaMemcpy(d_a, a, <span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N, cudaMemcpyHostToDevice);</span><br><span class="line">    cudaMemcpy(d_b, b, <span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N, cudaMemcpyHostToDevice);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Executing kernel </span></span><br><span class="line">    vector_add&lt;&lt;&lt;<span class="number">1</span>,<span class="number">256</span>&gt;&gt;&gt;(d_out, d_a, d_b, N);</span><br><span class="line">    </span><br><span class="line">    <span class="comment">// Transfer data back to host memory</span></span><br><span class="line">    cudaMemcpy(out, d_out, <span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N, cudaMemcpyDeviceToHost);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Verification</span></span><br><span class="line">    <span class="keyword">for</span>(<span class="keyword">int</span> i = <span class="number">0</span>; i &lt; N; i++)&#123;</span><br><span class="line">        assert(<span class="built_in">fabs</span>(out[i] - a[i] - b[i]) &lt; MAX_ERR);</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="built_in">printf</span>(<span 
class="string">"PASSED\n"</span>);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Deallocate device memory</span></span><br><span class="line">    cudaFree(d_a);</span><br><span class="line">    cudaFree(d_b);</span><br><span class="line">    cudaFree(d_out);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Deallocate host memory</span></span><br><span class="line">    <span class="built_in">free</span>(a); </span><br><span class="line">    <span class="built_in">free</span>(b); </span><br><span class="line">    <span class="built_in">free</span>(out);</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Compared with the CPU program, parallelism is significantly higher, and the GPU computation time drops sharply relative to the single-thread kernel.</p><h3 id="GPU-向量加法：多个Block多个Thread"><a href="#GPU-向量加法：多个Block多个Thread" class="headerlink" title="GPU 向量加法：多个Block多个Thread"></a>GPU Vector Addition: Multiple Blocks, Multiple Threads</h3><p>In the previous approach, each of the 256 threads still has to compute many elements. If we push the parallelism further so that every thread computes exactly one element of the vector, the computation should take even less time. As the figure below shows, we use multiple blocks with multiple threads: each block still has 256 threads, but the grid now contains multiple blocks. The number of blocks is the vector length divided by the block size, rounded up, so that a length that is not a multiple of 256 is still fully covered.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-21_cuda-parallel_block.png"></p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span 
class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;stdio.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;stdlib.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;math.h&gt;</span></span></span><br><span class="line"><span 
class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;assert.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;cuda.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;cuda_runtime.h&gt;</span></span></span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="meta-keyword">define</span> N 10000000</span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">define</span> MAX_ERR 1e-6</span></span><br><span class="line"></span><br><span class="line"><span class="function">__global__ <span class="keyword">void</span> <span class="title">vector_add</span><span class="params">(<span class="keyword">float</span> *out, <span class="keyword">float</span> *a, <span class="keyword">float</span> *b, <span class="keyword">int</span> n)</span> </span>&#123;</span><br><span class="line">    <span class="keyword">int</span> tid = blockIdx.x * blockDim.x + threadIdx.x;</span><br><span class="line">    </span><br><span class="line">    <span class="comment">// Handling arbitrary vector size</span></span><br><span class="line">    <span class="keyword">if</span> (tid &lt; n)&#123;</span><br><span class="line">        out[tid] = a[tid] + b[tid];</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">main</span><span class="params">()</span></span>&#123;</span><br><span class="line">    <span class="keyword">float</span> *a, *b, *out;</span><br><span class="line">    <span class="keyword">float</span> *d_a, *d_b, *d_out; </span><br><span class="line"></span><br><span class="line">    <span class="comment">// Allocate host memory</span></span><br><span class="line">    
a   = (<span class="keyword">float</span>*)<span class="built_in">malloc</span>(<span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N);</span><br><span class="line">    b   = (<span class="keyword">float</span>*)<span class="built_in">malloc</span>(<span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N);</span><br><span class="line">    out = (<span class="keyword">float</span>*)<span class="built_in">malloc</span>(<span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Initialize host arrays</span></span><br><span class="line">    <span class="keyword">for</span>(<span class="keyword">int</span> i = <span class="number">0</span>; i &lt; N; i++)&#123;</span><br><span class="line">        a[i] = <span class="number">1.0f</span>;</span><br><span class="line">        b[i] = <span class="number">2.0f</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Allocate device memory </span></span><br><span class="line">    cudaMalloc((<span class="keyword">void</span>**)&amp;d_a, <span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N);</span><br><span class="line">    cudaMalloc((<span class="keyword">void</span>**)&amp;d_b, <span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N);</span><br><span class="line">    cudaMalloc((<span class="keyword">void</span>**)&amp;d_out, <span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Transfer data from host to device memory</span></span><br><span class="line">    cudaMemcpy(d_a, a, <span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N, cudaMemcpyHostToDevice);</span><br><span class="line">    cudaMemcpy(d_b, b, <span 
class="keyword">sizeof</span>(<span class="keyword">float</span>) * N, cudaMemcpyHostToDevice);</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">    <span class="comment">// Executing kernel </span></span><br><span class="line">    <span class="keyword">int</span> block_size = <span class="number">256</span>;</span><br><span class="line">    <span class="keyword">int</span> grid_size = ((N + block_size - <span class="number">1</span>) / block_size);</span><br><span class="line">    vector_add&lt;&lt;&lt;grid_size,block_size&gt;&gt;&gt;(d_out, d_a, d_b, N);</span><br><span class="line">    </span><br><span class="line">    <span class="comment">// Transfer data back to host memory</span></span><br><span class="line">    cudaMemcpy(out, d_out, <span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N, cudaMemcpyDeviceToHost);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Verification</span></span><br><span class="line">    <span class="keyword">for</span>(<span class="keyword">int</span> i = <span class="number">0</span>; i &lt; N; i++)&#123;</span><br><span class="line">        assert(<span class="built_in">fabs</span>(out[i] - a[i] - b[i]) &lt; MAX_ERR);</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="built_in">printf</span>(<span class="string">"PASSED\n"</span>);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Deallocate device memory</span></span><br><span class="line">    cudaFree(d_a);</span><br><span class="line">    cudaFree(d_b);</span><br><span class="line">    cudaFree(d_out);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Deallocate host memory</span></span><br><span class="line">    <span class="built_in">free</span>(a); </span><br><span class="line">    <span class="built_in">free</span>(b); </span><br><span 
class="line">    <span class="built_in">free</span>(out);</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h3 id="GPU-向量加法：Unified-Memory"><a href="#GPU-向量加法：Unified-Memory" class="headerlink" title="GPU 向量加法：Unified Memory"></a>GPU Vector Addition: Unified Memory</h3><p>In the implementations above we have to allocate memory separately on the <strong>Host</strong> and the <strong>Device</strong> and copy data between them explicitly, which is error-prone. Fortunately, CUDA 6.0 introduced <a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd" target="_blank" rel="external nofollow noopener noreferrer">Unified Memory</a> to avoid this trouble. Put simply, Unified Memory provides managed memory that is shared between the <strong>Host</strong> and the <strong>Device</strong>, and data is transferred between the <strong>Host</strong> and the <strong>Device</strong> automatically. In CUDA, managed memory is allocated with the <code>cudaMallocManaged</code> function:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="function">cudaError_t <span class="title">cudaMallocManaged</span><span class="params">(<span class="keyword">void</span> **devPtr, <span class="keyword">size_t</span> <span class="built_in">size</span>, <span class="keyword">unsigned</span> <span class="keyword">int</span> flags = cudaMemAttachGlobal)</span></span>;</span><br></pre></td></tr></table></figure><p>With Unified Memory, the program above simplifies to:</p><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span 
class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;stdio.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;stdlib.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;math.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;assert.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;cuda.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;cuda_runtime.h&gt;</span></span></span><br><span 
class="line"></span><br><span class="line"><span class="meta">#<span class="meta-keyword">define</span> N 10000000</span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">define</span> MAX_ERR 1e-6</span></span><br><span class="line"></span><br><span class="line"><span class="function">__global__ <span class="keyword">void</span> <span class="title">vector_add</span><span class="params">(<span class="keyword">float</span> *out, <span class="keyword">float</span> *a, <span class="keyword">float</span> *b, <span class="keyword">int</span> n)</span> </span>&#123;</span><br><span class="line">    <span class="keyword">int</span> tid = blockIdx.x * blockDim.x + threadIdx.x;</span><br><span class="line">    </span><br><span class="line">    <span class="comment">// Handling arbitrary vector size</span></span><br><span class="line">    <span class="keyword">if</span> (tid &lt; n)&#123;</span><br><span class="line">        out[tid] = a[tid] + b[tid];</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">main</span><span class="params">()</span></span>&#123;</span><br><span class="line">    <span class="comment">// Allocate managed memory</span></span><br><span class="line">    <span class="keyword">float</span> *x, *y, *z;</span><br><span class="line">    cudaMallocManaged((<span class="keyword">void</span>**)&amp;x, <span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N);</span><br><span class="line">    cudaMallocManaged((<span class="keyword">void</span>**)&amp;y, <span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N);</span><br><span class="line">    cudaMallocManaged((<span class="keyword">void</span>**)&amp;z, <span class="keyword">sizeof</span>(<span class="keyword">float</span>) * N);</span><br><span class="line"></span><br><span 
class="line">    <span class="comment">// Initialize the managed arrays on the host</span></span><br><span class="line">    <span class="keyword">for</span>(<span class="keyword">int</span> i = <span class="number">0</span>; i &lt; N; i++)&#123;</span><br><span class="line">        x[i] = <span class="number">1.0f</span>;</span><br><span class="line">        y[i] = <span class="number">2.0f</span>;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Executing kernel </span></span><br><span class="line">    <span class="keyword">int</span> block_size = <span class="number">256</span>;</span><br><span class="line">    <span class="keyword">int</span> grid_size = ((N + block_size - <span class="number">1</span>) / block_size);</span><br><span class="line">    vector_add&lt;&lt;&lt;grid_size,block_size&gt;&gt;&gt;(z, x, y, N);</span><br><span class="line">    </span><br><span class="line">    <span class="comment">// Synchronize with the device so the host can safely read the results</span></span><br><span class="line">    cudaDeviceSynchronize();</span><br><span class="line">  </span><br><span class="line">    <span class="comment">// Verification</span></span><br><span class="line">    <span class="keyword">for</span>(<span class="keyword">int</span> i = <span class="number">0</span>; i &lt; N; i++)&#123;</span><br><span class="line">        assert(<span class="built_in">fabs</span>(z[i] - x[i] - y[i]) &lt; MAX_ERR);</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="built_in">printf</span>(<span class="string">"PASSED\n"</span>);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// Deallocate managed memory</span></span><br><span class="line">    cudaFree(x);</span><br><span class="line">    cudaFree(y);</span><br><span class="line">    cudaFree(z);</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>Compared with the earlier code, the Unified Memory version is more concise. Note that <code>kernel</code> execution is asynchronous with respect to the <strong>Host</strong>. Because managed memory transfers data automatically, there is no explicit <code>cudaMemcpy</code> call for the host to wait on, so <code>cudaDeviceSynchronize()</code> must be called to synchronize the <strong>Device</strong> with the <strong>Host</strong>; only then can the host correctly access the results computed by the <code>kernel</code>.</p><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>References</h2><ul><li><a href="http://download.nvidia.com/developer/cuda/seminar/TDCI_Arch.pdf" target="_blank" rel="external nofollow noopener noreferrer">An Introduction to Modern GPU Architecture</a></li><li><a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html" target="_blank" rel="external nofollow noopener noreferrer">NVIDIA CUDA C Programming Guide (official documentation)</a></li><li><a href="https://github.com/huiscliu/Tutorials/tree/master/CUDA编程入门" target="_blank" rel="external nofollow noopener noreferrer">CUDA编程入门 (Introduction to CUDA Programming)</a></li><li><a href="http://www.mat.unimi.it/users/sansotte/cuda/CUDA_by_Example.pdf" target="_blank" rel="external nofollow noopener noreferrer">CUDA By Example</a></li></ul>]]></content>
    
    <summary type="html">
    
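      &lt;p&gt;With the surge of deep learning in recent years, GPUs, originally built for graphics rendering, are now widely used to accelerate model training in parallel. Along the way CUDA, the general-purpose parallel computing platform and programming model that NVIDIA built for its GPUs, has also come into wide use. You may already know &lt;a href=&quot;../b893097a/&quot;&gt;modern CPU architecture&lt;/a&gt; well but still find GPUs unclear: how does GPU architecture differ from CPU architecture, what is the CUDA model, and how do you use CUDA for parallel computing? This post answers those questions. All the code used in this post is available on my &lt;a href=&quot;https://github.com/SimpCosm/cuda-tutorial&quot; target=&quot;_blank&quot; rel=&quot;external nofollow noopener noreferrer&quot;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;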
    
    </summary>
    
    <content src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-11-20_nvidia-tesla-v100.png" type="image" />
    
    
      <category term="术业专攻" scheme="https://houmin.cc/categories/%E6%9C%AF%E4%B8%9A%E4%B8%93%E6%94%BB/"/>
    
    
      <category term="GPU" scheme="https://houmin.cc/tags/GPU/"/>
    
      <category term="CUDA" scheme="https://houmin.cc/tags/CUDA/"/>
    
      <category term="异构计算" scheme="https://houmin.cc/tags/%E5%BC%82%E6%9E%84%E8%AE%A1%E7%AE%97/"/>
    
  </entry>
  
  <entry>
    <title>Political Coordinates</title>
    <link href="https://houmin.cc/posts/125bc0e5/"/>
    <id>https://houmin.cc/posts/125bc0e5/</id>
    <published>2020-04-09T09:16:26.000Z</published>
    <updated>2022-11-09T15:13:45.389Z</updated>
    
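    <content type="html"><![CDATA[<link rel="stylesheet" class="aplayer-secondary-style-marker" href="/assets/css/APlayer.min.css"><script src="/assets/js/APlayer.min.js" class="aplayer-secondary-script-marker"></script><script class="meting-secondary-script-marker" src="/assets/js/Meting.min.js"></script><script src="//cdn.jsdelivr.net/npm/jquery@3/dist/jquery.min.js"></script><p>The idea of "political coordinates" comes from the well-known <code>political compass</code>, which is used to describe a person's political leanings. This post records my own political coordinates tests. The "Chinese Political Coordinates Test" was first put together in 2007 through discussions among students on Peking University's Weiming BBS and was later revised to better fit conditions in China; the current version is available <a href="http://www.zuobiao.me/zuobiao2015/index.php/66331?lang=zh-Hans" target="_blank" rel="external nofollow noopener noreferrer">here</a>. To my surprise, I found traces of <a href="http://blog.farmostwood.net/" target="_blank" rel="external nofollow noopener noreferrer">木遥</a> under this <a href="https://bbs.pku.edu.cn/v2/post-read-single.php?bid=1004&amp;type=3&amp;postid=5656284" target="_blank" rel="external nofollow noopener noreferrer">thread</a>. It is a small world.</p><blockquote><p>It should be stressed that <strong>the original and only goal of this test is to give its users a prompt for self-reflection and self-identification.</strong></p><p>"The absence of public discussion of political issues, together with a long-standing style of endlessly exaggerated political propaganda, leads many people to make their choices almost entirely from the slogans that surface in their minds, without ever having rationally confirmed where they stand." This is my pessimistic reading of reality. That this questionnaire became so popular shows, conversely, how far disagreement over political views, and debate about them at the level of ideas (rather than policy), has become a taboo in public life. Many online comments on the test reflect that <strong>many people are not used to holding views of their own, let alone across so wide a range of issues. I believe this is not innate but simply the result of long-standing inertia.</strong></p></blockquote><p>I have also included the Western political coordinates test from the English-language <a href="https://www.politicalcompass.org/test" target="_blank" rel="external nofollow noopener noreferrer">Political Compass</a> website. That test is built on Western political values, and <strong>some of its questions depend heavily on a specifically Western social context and may not fully reflect conditions in China.</strong> Either way, it can still serve as a prompt for self-reflection.</p><a id="more"></a><p><img alt="Political Compass" data-src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/9c/Political_chart.svg/941px-Political_chart.svg.png"></p><h2 id="中国政治坐标"><a href="#中国政治坐标" class="headerlink" title="中国政治坐标"></a>Chinese Political Coordinates</h2><h3 id="测试试题"><a href="#测试试题" class="headerlink" title="测试试题"></a>Test Questions</h3><p>The test has 50 questions covering three dimensions: politics, economics, and culture. My choices as of today are listed below; to get an actual score, take the test on the original site.</p><p><form id="chinese-questions">    <ol>        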
<li>If the people have not been educated in democracy, they should not have universal suffrage.<br><input name="c1" type="radio" data-x="2" data-y="0" data-z="0">Strongly disagree<br><input name="c1" type="radio" data-x="1" data-y="0" data-z="0">Disagree<br><input name="c1" type="radio" data-x="-1" data-y="0" data-z="0" checked>Agree<br><input name="c1" type="radio" data-x="-2" data-y="0" data-z="0">Strongly agree<br></li>        <li>Human rights take precedence over sovereignty.<br><input name="c2" type="radio" data-x="-2" data-y="0" data-z="0">Strongly disagree<br><input name="c2" type="radio" data-x="-1" data-y="0" data-z="0" checked>Disagree<br><input name="c2" type="radio" data-x="1" data-y="0" data-z="0">Agree<br><input name="c2" type="radio" data-x="2" data-y="0" data-z="0">Strongly agree<br></li>        <li>The Western multi-party system does not suit China's national conditions.<br><input name="c3" type="radio" data-x="2" data-y="0" data-z="0">Strongly disagree<br><input name="c3" type="radio" data-x="1" data-y="0" data-z="0">Disagree<br><input name="c3" type="radio" data-x="-1" data-y="0" data-z="0" checked>Agree<br><input name="c3" type="radio" data-x="-2" data-y="0" data-z="0">Strongly agree<br></li>        <li>Letting universities run their own admissions exams is better than a unified national entrance exam.<br><input name="c4" type="radio" data-x="-2" data-y="0" data-z="0">Strongly disagree<br><input name="c4" type="radio" data-x="-1" data-y="0" data-z="0" checked>Disagree<br><input name="c4" type="radio" data-x="1" data-y="0" data-z="0">Agree<br><input name="c4" type="radio" data-x="2" data-y="0" data-z="0">Strongly agree<br></li>        <li>Copying Western-style freedom of speech in China would lead to social disorder.<br><input name="c5" type="radio" data-x="2" data-y="0" data-z="0">Strongly disagree<br><input name="c5" type="radio" data-x="1" data-y="0" data-z="0" checked>Disagree<br><input name="c5" type="radio" data-x="-1" data-y="0" data-z="0">Agree<br><input name="c5" type="radio" data-x="-2" data-y="0" data-z="0">Strongly agree<br></li>        <li>Religious figures should be allowed to preach publicly outside religious venues.<br><input name="c6" type="radio" data-x="-2" data-y="0" data-z="0" checked>Strongly disagree<br><input name="c6" type="radio" data-x="-1" data-y="0" data-z="0">Disagree<br><input name="c6" type="radio" data-x="1" data-y="0" data-z="0">Agree<br><input name="c6" type="radio" data-x="2" data-y="0" data-z="0">Strongly agree<br></li>        
<li>All students, whether in primary school, secondary school, or university, should take part in state-organized military training.<br><input name="c7" type="radio" data-x="2" data-y="0" data-z="0">Strongly disagree<br><input name="c7" type="radio" data-x="1" data-y="0" data-z="0" checked>Disagree<br><input name="c7" type="radio" data-x="-1" data-y="0" data-z="0">Agree<br><input name="c7" type="radio" data-x="-2" data-y="0" data-z="0">Strongly agree<br></li>        <li>National unity and territorial integrity are society's highest interest.<br><input name="c8" type="radio" data-x="2" data-y="0" data-z="0">Strongly disagree<br><input name="c8" type="radio" data-x="1" data-y="0" data-z="0" checked>Disagree<br><input name="c8" type="radio" data-x="-1" data-y="0" data-z="0">Agree<br><input name="c8" type="radio" data-x="-2" data-y="0" data-z="0">Strongly agree<br></li>        <li>The state has no obligation to provide foreign aid.<br><input name="c9" type="radio" data-x="-2" data-y="0" data-z="0">Strongly disagree<br><input name="c9" type="radio" data-x="-1" data-y="0" data-z="0">Disagree<br><input name="c9" type="radio" data-x="1" data-y="0" data-z="0" checked>Agree<br><input name="c9" type="radio" data-x="2" data-y="0" data-z="0">Strongly agree<br></li>        <li>Even if interrogation and evidence gathering violated procedural rules, a criminal who is truly guilty should still be sentenced to death.<br><input name="c10" type="radio" data-x="2" data-y="0" data-z="0">Strongly disagree<br><input name="c10" type="radio" data-x="1" data-y="0" data-z="0" checked>Disagree<br><input name="c10" type="radio" data-x="-1" data-y="0" data-z="0">Agree<br><input name="c10" type="radio" data-x="-2" data-y="0" data-z="0">Strongly agree<br></li>        <li>The images of national leaders and founding leaders may be caricatured in works of art and literature.<br><input name="c11" type="radio" data-x="-2" data-y="0" data-z="0">Strongly disagree<br><input name="c11" type="radio" data-x="-1" data-y="0" data-z="0" checked>Disagree<br><input name="c11" type="radio" data-x="1" data-y="0" data-z="0">Agree<br><input name="c11" type="radio" data-x="2" data-y="0" data-z="0">Strongly agree<br></li>        <li>When the law fails to adequately stop wrongdoing, punishing crime through extreme means is tolerable.<br><input name="c12" type="radio" data-x="2" data-y="0" data-z="0" checked>Strongly disagree<br><input name="c12" type="radio" data-x="1" data-y="0" data-z="0">Disagree<br><input name="c12" type="radio" data-x="-1" data-y="0" data-z="0">Agree<br><input name="c12" type="radio" data-x="-2" 
data-y="0" data-z="0">Strongly agree<br></li>        <li>The media should be allowed to speak on behalf of a particular class or interest group.<br><input name="c13" type="radio" data-x="2" data-y="0" data-z="0">Strongly disagree<br><input name="c13" type="radio" data-x="1" data-y="0" data-z="0" checked>Disagree<br><input name="c13" type="radio" data-x="-1" data-y="0" data-z="0">Agree<br><input name="c13" type="radio" data-x="-2" data-y="0" data-z="0">Strongly agree<br></li>        <li>If its overall national strength permits, China has the right to take any action to protect its own interests.<br><input name="c14" type="radio" data-x="2" data-y="0" data-z="0">Strongly disagree<br><input name="c14" type="radio" data-x="1" data-y="0" data-z="0" checked>Disagree<br><input name="c14" type="radio" data-x="-1" data-y="0" data-z="0">Agree<br><input name="c14" type="radio" data-x="-2" data-y="0" data-z="0">Strongly agree<br></li>        <li>If conditions permit, Taiwan should be unified by force.<br><input name="c15" type="radio" data-x="2" data-y="0" data-z="0">Strongly disagree<br><input name="c15" type="radio" data-x="1" data-y="0" data-z="0" checked>Disagree<br><input name="c15" type="radio" data-x="-1" data-y="0" data-z="0">Agree<br><input name="c15" type="radio" data-x="-2" data-y="0" data-z="0">Strongly agree<br></li>        <li>The state should take measures to train and support athletes to win glory for the country in international competitions.<br><input name="c16" type="radio" data-x="2" data-y="0" data-z="0">Strongly disagree<br><input name="c16" type="radio" data-x="1" data-y="0" data-z="0" checked>Disagree<br><input name="c16" type="radio" data-x="-1" data-y="0" data-z="0">Agree<br><input name="c16" type="radio" data-x="-2" data-y="0" data-z="0">Strongly agree<br></li>        <li>A lawyer should do their best to defend a client even when they know the client committed the crime.<br><input name="c17" type="radio" data-x="2" data-y="0" data-z="0">Strongly disagree<br><input name="c17" type="radio" data-x="1" data-y="0" data-z="0">Disagree<br><input name="c17" type="radio" data-x="-1" data-y="0" data-z="0" checked>Agree<br><input name="c17" type="radio" data-x="-2" data-y="0" data-z="0">Strongly agree<br></li>        <li>Western countries led by the United States will never truly allow China to rise to become a first-rate power.<br><input name="c18" type="radio" data-x="2" data-y="0" data-z="0">Strongly disagree<br><input name="c18" type="radio" data-x="1" data-y="0" data-z="0" checked>Disagree<br><input name="c18" type="radio" data-x="-1" data-y="0" 
data-z="0">Agree<br><input name="c18" type="radio" data-x="-2" data-y="0" data-z="0">Strongly agree<br></li>        <li>Consensual sex between two adults is their own freedom, whatever their marital status.<br><input name="c19" type="radio" data-x="0" data-y="0" data-z="-2">Strongly disagree<br><input name="c19" type="radio" data-x="0" data-y="0" data-z="-1">Disagree<br><input name="c19" type="radio" data-x="0" data-y="0" data-z="1" checked>Agree<br><input name="c19" type="radio" data-x="0" data-y="0" data-z="2">Strongly agree<br></li>        <li>One should not talk publicly about the shortcomings of one's elders.<br><input name="c20" type="radio" data-x="0" data-y="0" data-z="2">Strongly disagree<br><input name="c20" type="radio" data-x="0" data-y="0" data-z="1">Disagree<br><input name="c20" type="radio" data-x="0" data-y="0" data-z="-1" checked>Agree<br><input name="c20" type="radio" data-x="0" data-y="0" data-z="-2">Strongly agree<br></li>        <li>Modern Chinese society needs Confucianism.<br><input name="c21" type="radio" data-x="0" data-y="0" data-z="2">Strongly disagree<br><input name="c21" type="radio" data-x="0" data-y="0" data-z="1">Disagree<br><input name="c21" type="radio" data-x="0" data-y="0" data-z="-1" checked>Agree<br><input name="c21" type="radio" data-x="0" data-y="0" data-z="-2">Strongly agree<br></li>        <li>The fundamental standard for judging the value of a work of art is whether it is loved by the general public.<br><input name="c22" type="radio" data-x="0" data-y="0" data-z="-2">Strongly disagree<br><input name="c22" type="radio" data-x="0" data-y="0" data-z="-1" checked>Disagree<br><input name="c22" type="radio" data-x="0" data-y="0" data-z="1">Agree<br><input name="c22" type="radio" data-x="0" data-y="0" data-z="2">Strongly agree<br></li>        <li>Even under population pressure, the state and society have no right to interfere with whether individuals have children or how many.<br><input name="c23" type="radio" data-x="0" data-y="0" data-z="-2">Strongly disagree<br><input name="c23" type="radio" data-x="0" data-y="0" data-z="-1">Disagree<br><input name="c23" type="radio" data-x="0" data-y="0" data-z="1" checked>Agree<br><input name="c23" type="radio" data-x="0" data-y="0" data-z="2">Strongly agree<br></li>        <li>The I Ching and the Eight Trigrams can effectively explain many things.<br><input name="c24" type="radio" data-x="0" data-y="0" data-z="2" checked>Strongly disagree<br><input name="c24" type="radio" data-x="0" data-y="0" data-z="1">Disagree<br><input name="c24" type="radio" 
data-x="0" data-y="0" data-z="-1">Agree<br><input name="c24" type="radio" data-x="0" data-y="0" data-z="-2">Strongly agree<br></li>        <li>Traditional Chinese medicine's view of human health is superior to that of modern mainstream medicine.<br><input name="c25" type="radio" data-x="0" data-y="0" data-z="2">Strongly disagree<br><input name="c25" type="radio" data-x="0" data-y="0" data-z="1" checked>Disagree<br><input name="c25" type="radio" data-x="0" data-y="0" data-z="-1">Agree<br><input name="c25" type="radio" data-x="0" data-y="0" data-z="-2">Strongly agree<br></li>        <li>There is no need to deliberately promote the simplification of Chinese characters.<br><input name="c26" type="radio" data-x="0" data-y="0" data-z="2">Strongly disagree<br><input name="c26" type="radio" data-x="0" data-y="0" data-z="1" checked>Disagree<br><input name="c26" type="radio" data-x="0" data-y="0" data-z="-1">Agree<br><input name="c26" type="radio" data-x="0" data-y="0" data-z="-2">Strongly agree<br></li>        <li>Classics of traditional Chinese culture should be used as reading material in children's basic education.<br><input name="c27" type="radio" data-x="0" data-y="0" data-z="2" checked>Strongly disagree<br><input name="c27" type="radio" data-x="0" data-y="0" data-z="1">Disagree<br><input name="c27" type="radio" data-x="0" data-y="0" data-z="-1">Agree<br><input name="c27" type="radio" data-x="0" data-y="0" data-z="-2">Strongly agree<br></li>        <li>If it is voluntary, I would accept my child entering a same-sex partnership.<br><input name="c28" type="radio" data-x="0" data-y="0" data-z="-2">Strongly disagree<br><input name="c28" type="radio" data-x="0" data-y="0" data-z="-1">Disagree<br><input name="c28" type="radio" data-x="0" data-y="0" data-z="1" checked>Agree<br><input name="c28" type="radio" data-x="0" data-y="0" data-z="2">Strongly agree<br></li>        <li>The minimum wage should be set by the state.<br><input name="c29" type="radio" data-x="0" data-y="2" data-z="0">Strongly disagree<br><input name="c29" type="radio" data-x="0" data-y="1" data-z="0">Disagree<br><input name="c29" type="radio" data-x="0" data-y="-1" data-z="0" checked>Agree<br><input name="c29" type="radio" data-x="0" data-y="-2" data-z="0">Strongly agree<br></li>        <li>Much of the gain from China's economic development since reform and opening up has been captured by a small group, and most people have seen little benefit.<br><input name="c30" type="radio" data-x="0" data-y="2" data-z="0" checked>Strongly disagree<br><input name="c30" type="radio" data-x="0" data-y="1" 
data-z="0">Disagree<br><input name="c30" type="radio" data-x="0" data-y="-1" data-z="0">Agree<br><input name="c30" type="radio" data-x="0" data-y="-2" data-z="0">Strongly agree<br></li>        <li>In decisions on major engineering projects, individual interests should give way to the interests of society.<br><input name="c31" type="radio" data-x="0" data-y="2" data-z="0">Strongly disagree<br><input name="c31" type="radio" data-x="0" data-y="1" data-z="0">Disagree<br><input name="c31" type="radio" data-x="0" data-y="-1" data-z="0" checked>Agree<br><input name="c31" type="radio" data-x="0" data-y="-2" data-z="0">Strongly agree<br></li>        <li>Wasting food is also a personal freedom.<br><input name="c32" type="radio" data-x="0" data-y="-2" data-z="0">Strongly disagree<br><input name="c32" type="radio" data-x="0" data-y="-1" data-z="0">Disagree<br><input name="c32" type="radio" data-x="0" data-y="1" data-z="0" checked>Agree<br><input name="c32" type="radio" data-x="0" data-y="2" data-z="0">Strongly agree<br></li>        <li>If pork prices are too high, the government should intervene.<br><input name="c33" type="radio" data-x="0" data-y="-2" data-z="0">Strongly disagree<br><input name="c33" type="radio" data-x="0" data-y="-1" data-z="0">Disagree<br><input name="c33" type="radio" data-x="0" data-y="1" data-z="0" checked>Agree<br><input name="c33" type="radio" data-x="0" data-y="2" data-z="0">Strongly agree<br></li>        <li>High tariffs should be imposed on comparable foreign products to protect domestic industry.<br><input name="c34" type="radio" data-x="0" data-y="2" data-z="0">Strongly disagree<br><input name="c34" type="radio" data-x="0" data-y="1" data-z="0" checked>Disagree<br><input name="c34" type="radio" data-x="0" data-y="-1" data-z="0">Agree<br><input name="c34" type="radio" data-x="0" data-y="-2" data-z="0">Strongly agree<br></li>        <li>The interests of state-owned enterprises are national interests.<br><input name="c35" type="radio" data-x="0" data-y="-2" data-z="0">Strongly disagree<br><input name="c35" type="radio" data-x="0" data-y="-1" data-z="0">Disagree<br><input name="c35" type="radio" data-x="0" data-y="1" data-z="0">Agree<br><input name="c35" type="radio" data-x="0" data-y="2" data-z="0" checked>Strongly agree<br></li>        <li>Attempting to control real estate prices will damage economic development.<br><input name="c36" type="radio" data-x="0" data-y="-2" data-z="0">Strongly disagree<br><input name="c36" type="radio" data-x="0" data-y="-1" 
data-z="0" checked>Disagree<br><input name="c36" type="radio" data-x="0" data-y="1" data-z="0">Agree<br><input name="c36" type="radio" data-x="0" data-y="2" data-z="0">Strongly agree<br></li>        <li>Education should be public wherever possible.<br><input name="c37" type="radio" data-x="0" data-y="2" data-z="0">Strongly disagree<br><input name="c37" type="radio" data-x="0" data-y="1" data-z="0">Disagree<br><input name="c37" type="radio" data-x="0" data-y="-1" data-z="0">Agree<br><input name="c37" type="radio" data-x="0" data-y="-2" data-z="0" checked>Strongly agree<br></li>        <li>The primary means of improving the lives of low-income people is state subsidies and support.<br><input name="c38" type="radio" data-x="0" data-y="2" data-z="0">Strongly disagree<br><input name="c38" type="radio" data-x="0" data-y="1" data-z="0" checked>Disagree<br><input name="c38" type="radio" data-x="0" data-y="-1" data-z="0">Agree<br><input name="c38" type="radio" data-x="0" data-y="-2" data-z="0">Strongly agree<br></li>        <li>The wealthy deserve better medical care.<br><input name="c39" type="radio" data-x="0" data-y="2" data-z="0">Strongly disagree<br><input name="c39" type="radio" data-x="0" data-y="1" data-z="0" checked>Disagree<br><input name="c39" type="radio" data-x="0" data-y="-1" data-z="0">Agree<br><input name="c39" type="radio" data-x="0" data-y="-2" data-z="0">Strongly agree<br></li>        <li>High earners should disclose their sources of income.<br><input name="c40" type="radio" data-x="0" data-y="2" data-z="0">Strongly disagree<br><input name="c40" type="radio" data-x="0" data-y="1" data-z="0">Disagree<br><input name="c40" type="radio" data-x="0" data-y="-1" data-z="0" checked>Agree<br><input name="c40" type="radio" data-x="0" data-y="-2" data-z="0">Strongly agree<br></li>        <li>Rather than letting state-owned enterprises lose money and go bankrupt, it is better to sell them to capitalists.<br><input name="c41" type="radio" data-x="0" data-y="2" data-z="0">Strongly disagree<br><input name="c41" type="radio" data-x="0" data-y="1" data-z="0" checked>Disagree<br><input name="c41" type="radio" data-x="0" data-y="-1" data-z="0">Agree<br><input name="c41" type="radio" data-x="0" data-y="-2" data-z="0">Strongly agree<br></li>        <li>Sectors that bear on national security, and other areas vital to the national economy and people's livelihood, must be entirely controlled by state-owned enterprises.<br><input name="c42" type="radio" data-x="0" data-y="2" data-z="0">Strongly disagree<br><input name="c42" type="radio" 
data-x="0" data-y="1" data-z="0" checked>Disagree<br><input name="c42" type="radio" data-x="0" data-y="-1" data-z="0">Agree<br><input name="c42" type="radio" data-x="0" data-y="-2" data-z="0">Strongly agree<br></li>        <li>A monopoly that arises naturally from market competition is harmless.<br><input name="c43" type="radio" data-x="0" data-y="2" data-z="0">Strongly disagree<br><input name="c43" type="radio" data-x="0" data-y="1" data-z="0" checked>Disagree<br><input name="c43" type="radio" data-x="0" data-y="-1" data-z="0">Agree<br><input name="c43" type="radio" data-x="0" data-y="-2" data-z="0">Strongly agree<br></li>        <li>The process of capital accumulation always comes with harm to the interests of ordinary working people.<br><input name="c44" type="radio" data-x="0" data-y="2" data-z="0">Strongly disagree<br><input name="c44" type="radio" data-x="0" data-y="1" data-z="0">Disagree<br><input name="c44" type="radio" data-x="0" data-y="-1" data-z="0" checked>Agree<br><input name="c44" type="radio" data-x="0" data-y="-2" data-z="0">Strongly agree<br></li>        <li>Chinese citizens should be allowed to hold a foreign nationality at the same time.<br><input name="c45" type="radio" data-x="-2" data-y="0" data-z="0">Strongly disagree<br><input name="c45" type="radio" data-x="-1" data-y="0" data-z="0">Disagree<br><input name="c45" type="radio" data-x="1" data-y="0" data-z="0" checked>Agree<br><input name="c45" type="radio" data-x="2" data-y="0" data-z="0">Strongly agree<br></li>        <li>The government should raise grain purchase prices to increase farmers' incomes.<br><input name="c46" type="radio" data-x="0" data-y="2" data-z="0">Strongly disagree<br><input name="c46" type="radio" data-x="0" data-y="1" data-z="0">Disagree<br><input name="c46" type="radio" data-x="0" data-y="-1" data-z="0" checked>Agree<br><input name="c46" type="radio" data-x="0" data-y="-2" data-z="0">Strongly agree<br></li>        <li>To ensure social fairness, the rich should be taxed at higher rates.<br><input name="c47" type="radio" data-x="0" data-y="2" data-z="0" checked>Strongly disagree<br><input name="c47" type="radio" data-x="0" data-y="1" data-z="0">Disagree<br><input name="c47" type="radio" data-x="0" data-y="-1" data-z="0">Agree<br><input name="c47" type="radio" data-x="0" data-y="-2" data-z="0">Strongly agree<br></li>        <li>Foreign capital in China should receive the same treatment as domestic capital.<br><input name="c48" type="radio" data-x="0" data-y="2" data-z="0">Strongly disagree<br><input name="c48" 
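type="radio" data-x="0" data-y="1" data-z="0">Disagree<br><input name="c48" type="radio" data-x="0" data-y="-1" data-z="0" checked>Agree<br><input name="c48" type="radio" data-x="0" data-y="-2" data-z="0">Strongly agree<br></li>        <li>Private individuals should be able to own, buy, and sell land.<br><input name="c49" type="radio" data-x="0" data-y="2" data-z="0">Strongly disagree<br><input name="c49" type="radio" data-x="0" data-y="1" data-z="0">Disagree<br><input name="c49" type="radio" data-x="0" data-y="-1" data-z="0" checked>Agree<br><input name="c49" type="radio" data-x="0" data-y="-2" data-z="0">Strongly agree<br></li>        <li>When a major public safety incident occurs, the government should still allow information to circulate freely, even if it believes disclosure risks causing unrest.<br><input name="c50" type="radio" data-x="-2" data-y="0" data-z="0">Strongly disagree<br><input name="c50" type="radio" data-x="-1" data-y="0" data-z="0">Disagree<br><input name="c50" type="radio" data-x="1" data-y="0" data-z="0" checked>Agree<br><input name="c50" type="radio" data-x="2" data-y="0" data-z="0">Strongly agree<br></li>    </ol></form><br><div class="note info">            <ul><li>Political axis: negative values are left, i.e. authoritarianism; positive values are right, i.e. libertarianism.</li><li>Social and cultural axis: negative values are conservative and traditionalist (conservatism); positive values are liberal and radical (liberalism).</li><li>Economic axis: negative values are left, i.e. collectivism and welfarism; positive values are right, i.e. neoliberalism.</li></ul><p>Each of the three axes ranges over at most [-2,2].</p><p>This test is built on China's system of political values and tries to fully reflect China's particular national conditions and political culture. Note that many questions reflect "left" and "right" as understood in the Chinese context, not in the strict sense of Western political vocabulary.</p>          </div></p><h3 id="测试反思"><a href="#测试反思" class="headerlink" title="测试反思"></a>Reflections on the Test</h3><p>After finishing the whole test, my scores were:</p><figure class="highlight angelscript"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">Political axis: <span class="number">0.4</span></span><br><span class="line">Cultural axis: <span class="number">0.6</span></span><br><span class="line">Economic axis: <span class="number">-0.3</span></span><br></pre></td></tr></table></figure><p>What does that mean? My political views lean libertarian, my social and cultural views lean liberal, and my economic views lean collectivist. This result is broadly similar to what I got on <a href="https://www.idrlabs.com/" target="_blank" rel="external nofollow noopener 
noreferrer">IDRlabs</a> in its <a href="https://www.idrlabs.com/cn/political-coordinates/test.php" target="_blank" rel="external nofollow noopener noreferrer">political coordinates test</a>: overall I lean liberal on politics and culture, but my uncertainty on economics is clearly too large, and overall I come out as a moderate centrist.</p><p><img alt="IDRlabs Political Coordinate" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-04-09_political-compass.png"></p><p>That is not to say this test sorts me into left or right. The entanglement of left and right ideologies has torn and divided human society for more than a century, to the point that "left" and "right" are not even defined the same way inside and outside China. My view has always been to set ideological disputes aside and discuss practical problems in a down-to-earth way. But setting disputes aside does not mean having no views of my own or not thinking about these questions, and this test offers exactly that opportunity.</p><p>I was not one hundred percent sure about some of these fifty questions when I answered them, and many of them involve economics. The economic base determines the superstructure, and economic questions can be explained with mathematics. Once I understand economics better I will take the test again, and my answers may well be different.</p><ul><li>Human rights versus sovereignty. Textbooks tell us sovereignty comes before human rights. Is that really so?</li><li>Should all information really be made public? I have always believed that disclosure makes things more transparent, but now I am hesitant.</li><li>On independent university admissions: they are indeed an important way to recruit excellent students, but we also see a lot of unfairness arising from them. I choose fairness.</li><li>Are national unity and territorial integrity really society's highest interest? Shouldn't that be the well-being of the people? Undecided.</li><li>Does the state really have no obligation to give foreign aid? What about the responsibilities of a major power? Undecided.</li><li>A guilty criminal convicted on improperly gathered evidence: this is a question of procedural justice. I am now firmly against it. Even if he really is guilty, if we gather evidence improperly, how are we any different from him?</li><li>May national leaders and founding leaders be caricatured? No. I cannot articulate why; it just feels wrong.</li><li>Shouldn't the media be as fair and objective as possible? So my gut feeling is that the media certainly should not speak for a particular interest group. But on second thought, isn't that exactly today's reality? Which media outlet does not represent somebody's interests?</li><li>Nationality is a blind spot for me. I do not really understand what concrete effects dual nationality has.</li><li>Unifying Taiwan by force: we already have the capability, but it is the last thing we want to see. Is there really no other way? Chinese people are so resourceful.</li><li>Does modern society need Confucianism? Of course it does, but it depends on which part. The "ruler as ruler, subject as subject" hierarchy can be left behind.</li><li>Is the value of a work of art really determined by whether the general public likes it? Not necessarily.</li><li>On the minimum wage, I do not really understand the economics behind it.</li><li>Should individual interests give way to the interests of society? Undecided.</li><li>Is wasting food also a personal freedom? How should freedom be delimited?</li><li>If pork prices are too high, should the government intervene? What does economics say?</li><li>Should we try to control real estate? What does economics say?</li><li>Should education be entirely public? From my point of view, I am in favor.</li><li>Should the wealthy disclose their sources of income? My intuition says yes.</li><li>Should sectors vital to the national economy and people's livelihood really all be controlled by state-owned enterprises?</li><li>Capital accumulation does indeed appear to come with harm to the interests of ordinary working people.</li><li>Should the government buy grain at high prices?</li><li>Should private individuals be free to buy and sell land? Would the concentration of land ownership become a problem again?</li></ul><h2 id="西方政治坐标"><a href="#西方政治坐标" class="headerlink" title="西方政治坐标"></a>Western Political Coordinates</h2><h3 id="测试试题-1"><a href="#测试试题-1" class="headerlink" title="测试试题"></a>Test Questions</h3><p>Here are my results: I lean liberal both economically and politically.</p><p><form id="western-questions">    <p><strong>Part 1</strong>: How you see the country and the world.</p>    <ol>        <li>If globalization is inevitable, it should primarily serve people rather than the interests of multinational corporations.<br><input name="c1" type="radio" data-x="0.5" 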
data-y="0">Strongly disagree<br><input name="c1" type="radio" data-x="0.25" data-y="0">Disagree<br><input name="c1" type="radio" data-x="-0.375" data-y="0" checked>Agree<br><input name="c1" type="radio" data-x="-0.625" data-y="0">Strongly agree<br></li>        <li>I would always support my country, whether it was right or wrong.<br><input name="c2" type="radio" data-x="0" data-y="-0.26">Strongly disagree<br><input name="c2" type="radio" data-x="0" data-y="-0.155" checked>Disagree<br><input name="c2" type="radio" data-x="0" data-y="0.15">Agree<br><input name="c2" type="radio" data-x="0" data-y="0.255">Strongly agree<br></li>        <li>No one chooses their country of birth, so it is foolish to be proud of it.<br><input name="c3" type="radio" data-x="0" data-y="0.23">Strongly disagree<br><input name="c3" type="radio" data-x="0" data-y="0.13" checked>Disagree<br><input name="c3" type="radio" data-x="0" data-y="-0.13">Agree<br><input name="c3" type="radio" data-x="0" data-y="-0.23">Strongly agree<br></li>        <li>My race has many superior qualities compared with other races.<br><input name="c4" type="radio" data-x="0" data-y="-0.23">Strongly disagree<br><input name="c4" type="radio" data-x="0" data-y="-0.13" checked>Disagree<br><input name="c4" type="radio" data-x="0" data-y="0.13">Agree<br><input name="c4" type="radio" data-x="0" data-y="0.23">Strongly agree<br></li>        <li>The enemy of my enemy is my friend.<br><input name="c5" type="radio" data-x="0" data-y="-0.23">Strongly disagree<br><input name="c5" type="radio" data-x="0" data-y="-0.13">Disagree<br><input name="c5" type="radio" data-x="0" data-y="0.13" checked>Agree<br><input name="c5" type="radio" data-x="0" data-y="0.23">Strongly agree<br></li>        <li>Military action that violates international law is sometimes justified.<br><input name="c6" type="radio" data-x="0" data-y="-0.2">Strongly disagree<br><input name="c6" type="radio" data-x="0" data-y="-0.1" checked>Disagree<br><input name="c6" type="radio" data-x="0" data-y="0.11">Agree<br><input name="c6" type="radio" data-x="0" data-y="0.21">Strongly agree<br></li>        <li>Information and entertainment have now merged to a worrying degree.<br><input name="c7" type="radio" data-x="0" data-y="0.23">Strongly disagree<br><input name="c7" type="radio" data-x="0" data-y="0.13">Disagree<br><input name="c7" type="radio" data-x="0" data-y="-0.13">Agree<br><input name="c7" type="radio" data-x="0" data-y="-0.23" 
checked>Strongly agree<br></li>    </ol>    <p><strong>Part 2</strong>: The economy.</p>    <ol>        <li>People are ultimately divided by class rather than by nationality.<br><input name="c8" type="radio" data-x="0.5" data-y="0">Strongly disagree<br><input name="c8" type="radio" data-x="0.25" data-y="0">Disagree<br><input name="c8" type="radio" data-x="-0.375" data-y="0" checked>Agree<br><input name="c8" type="radio" data-x="-0.625" data-y="0">Strongly agree<br></li>        <li>Controlling inflation is more important than controlling unemployment.<br><input name="c9" type="radio" data-x="-0.5" data-y="0">Strongly disagree<br><input name="c9" type="radio" data-x="-0.25" data-y="0" checked>Disagree<br><input name="c9" type="radio" data-x="0.375" data-y="0">Agree<br><input name="c9" type="radio" data-x="0.625" data-y="0">Strongly agree<br></li>        <li>Because corporations cannot be trusted to protect the environment voluntarily, they need to be regulated.<br><input name="c10" type="radio" data-x="0.5" data-y="0">Strongly disagree<br><input name="c10" type="radio" data-x="0.25" data-y="0">Disagree<br><input name="c10" type="radio" data-x="-0.25" data-y="0">Agree<br><input name="c10" type="radio" data-x="-0.5" data-y="0" checked>Strongly agree<br></li>        <li>"From each according to his ability, to each according to his needs" is a fundamentally good idea.<br><input name="c11" type="radio" data-x="0.5" data-y="0">Strongly disagree<br><input name="c11" type="radio" data-x="0.25" data-y="0" checked>Disagree<br><input name="c11" type="radio" data-x="-0.375" data-y="0">Agree<br><input name="c11" type="radio" data-x="-0.625" data-y="0">Strongly agree<br></li>        <li>It is disheartening that in our society something as basic as drinking water has become a bottled, branded commodity.<br><input name="c12" type="radio" data-x="0.5" data-y="0">Strongly disagree<br><input name="c12" type="radio" data-x="0.25" data-y="0" checked>Disagree<br><input name="c12" type="radio" data-x="-0.5" data-y="0">Agree<br><input name="c12" type="radio" data-x="-0.75" data-y="0">Strongly agree<br></li>        <li>Land should not be bought and sold as a commodity.<br><input name="c13" type="radio" data-x="0.5" data-y="0">Strongly disagree<br><input name="c13" type="radio" data-x="0.25" data-y="0" checked>Disagree<br><input name="c13" type="radio" data-x="-0.5" data-y="0">Agree<br><input name="c13" type="radio" data-x="-0.75" data-y="0">Strongly agree<br></li>        <li>People who make money by manipulating capital contribute less to society than those who earn it through labor.<br><input name="c14" type="radio" data-x="0.5" 
data-y="0">Strongly disagree<br><input name="c14" type="radio" data-x="0.25" data-y="0" checked>Disagree<br><input name="c14" type="radio" data-x="-0.375" data-y="0">Agree<br><input name="c14" type="radio" data-x="-0.625" data-y="0">Strongly agree<br></li>        <li>Protectionism is sometimes necessary in trade.<br><input name="c15" type="radio" data-x="0.5" data-y="0">Strongly disagree<br><input name="c15" type="radio" data-x="0.25" data-y="0" checked>Disagree<br><input name="c15" type="radio" data-x="-0.5" data-y="0">Agree<br><input name="c15" type="radio" data-x="-0.625" data-y="0">Strongly agree<br></li>        <li>A company's only social responsibility is to earn a profit for its shareholders.<br><input name="c16" type="radio" data-x="-0.5" data-y="0" checked>Strongly disagree<br><input name="c16" type="radio" data-x="-0.25" data-y="0">Disagree<br><input name="c16" type="radio" data-x="0.375" data-y="0">Agree<br><input name="c16" type="radio" data-x="0.625" data-y="0">Strongly agree<br></li>        <li>The rich are taxed too heavily.<br><input name="c17" type="radio" data-x="-0.5" data-y="0" checked>Strongly disagree<br><input name="c17" type="radio" data-x="-0.25" data-y="0">Disagree<br><input name="c17" type="radio" data-x="0.375" data-y="0">Agree<br><input name="c17" type="radio" data-x="0.5" data-y="0">Strongly agree<br></li>        <li>Those who can afford to pay should have the right to better medical care.<br><input name="c18" type="radio" data-x="-0.5" data-y="0">Strongly disagree<br><input name="c18" type="radio" data-x="-0.25" data-y="0" checked>Disagree<br><input name="c18" type="radio" data-x="0.25" data-y="0">Agree<br><input name="c18" type="radio" data-x="0.5" data-y="0">Strongly agree<br></li>        <li>Governments should penalize business practices that mislead the public.<br><input name="c19" type="radio" data-x="0.5" data-y="0">Strongly disagree<br><input name="c19" type="radio" data-x="0.25" data-y="0">Disagree<br><input name="c19" type="radio" data-x="-0.25" data-y="0">Agree<br><input name="c19" type="radio" data-x="-0.375" data-y="0" checked>Strongly agree<br></li>        <li>A genuinely free market requires restrictions on monopolies by large multinational corporations.<br><input name="c20" type="radio" data-x="0" data-y="0">Strongly disagree<br><input name="c20" type="radio" data-x="0" data-y="0">Disagree<br><input name="c20" type="radio" data-x="0" data-y="0">Agree<br><input name="c20" type="radio" data-x="0" data-y="0" 
checked>Strongly agree<br></li>        <li>The freer the market, the freer the people.<br><input name="c21" type="radio" data-x="-0.5" data-y="0" checked>Strongly disagree<br><input name="c21" type="radio" data-x="-0.25" data-y="0">Disagree<br><input name="c21" type="radio" data-x="0.5" data-y="0">Agree<br><input name="c21" type="radio" data-x="0.75" data-y="0">Strongly agree<br></li>    </ol>    <p><strong>Part 3</strong>: Social values.</p>    <ol>        <li>Abortion should always be banned unless the woman's life is in danger.<br><input name="c22" type="radio" data-x="0" data-y="-0.2">Strongly disagree<br><input name="c22" type="radio" data-x="0" data-y="-0.1" checked>Disagree<br><input name="c22" type="radio" data-x="0" data-y="0.11">Agree<br><input name="c22" type="radio" data-x="0" data-y="0.21">Strongly agree<br></li>        <li>All those in authority should be questioned.<br><input name="c23" type="radio" data-x="0" data-y="0.23">Strongly disagree<br><input name="c23" type="radio" data-x="0" data-y="0.175">Disagree<br><input name="c23" type="radio" data-x="0" data-y="-0.13" checked>Agree<br><input name="c23" type="radio" data-x="0" data-y="-0.23">Strongly agree<br></li>        <li>An eye for an eye, a tooth for a tooth.<br><input name="c24" type="radio" data-x="0" data-y="-0.18">Strongly disagree<br><input name="c24" type="radio" data-x="0" data-y="-0.125" checked>Disagree<br><input name="c24" type="radio" data-x="0" data-y="0.08">Agree<br><input name="c24" type="radio" data-x="0" data-y="0.18">Strongly agree<br></li>        <li>Tax money should not support theaters and museums that cannot survive on a commercial basis.<br><input name="c25" type="radio" data-x="-0.5" data-y="0" checked>Strongly disagree<br><input name="c25" type="radio" data-x="-0.25" data-y="0">Disagree<br><input name="c25" type="radio" data-x="0.5" data-y="0">Agree<br><input name="c25" type="radio" data-x="0.625" data-y="0">Strongly agree<br></li>        <li>Schools should not make attendance compulsory.<br><input name="c26" type="radio" data-x="0" data-y="0.26">Strongly disagree<br><input name="c26" type="radio" data-x="0" data-y="0.055">Disagree<br><input name="c26" type="radio" data-x="0" data-y="-0.15" checked>Agree<br><input name="c26" type="radio" data-x="0" data-y="-0.255">Strongly agree<br></li>        <li>Everyone has their rights, but it is better for all of us if different kinds of people keep their own individuality.<br><input name="c27" type="radio" data-x="0" data-y="-0.23">Strongly disagree<br><input 
name="c27" type="radio" data-x="0" data-y="-0.13">Disagree<br><input name="c27" type="radio" data-x="0" data-y="0.13" checked>Agree<br><input name="c27" type="radio" data-x="0" data-y="0.23">Strongly agree<br></li>        <li>Good parents sometimes have to spank their children.<br><input name="c28" type="radio" data-x="0" data-y="-0.26">Strongly disagree<br><input name="c28" type="radio" data-x="0" data-y="-0.155">Disagree<br><input name="c28" type="radio" data-x="0" data-y="0.1" checked>Agree<br><input name="c28" type="radio" data-x="0" data-y="0.255">Strongly agree<br></li>        <li>It is normal for children to keep secrets from their parents.<br><input name="c29" type="radio" data-x="0" data-y="0.23">Strongly disagree<br><input name="c29" type="radio" data-x="0" data-y="0.13">Disagree<br><input name="c29" type="radio" data-x="0" data-y="-0.08" checked>Agree<br><input name="c29" type="radio" data-x="0" data-y="-0.23">Strongly agree<br></li>        <li>Marijuana should be legal.<br><input name="c30" type="radio" data-x="0" data-y="0.2">Strongly disagree<br><input name="c30" type="radio" data-x="0" data-y="0.045" checked>Disagree<br><input name="c30" type="radio" data-x="0" data-y="-0.11">Agree<br><input name="c30" type="radio" data-x="0" data-y="-0.21">Strongly agree<br></li>        <li>The primary function of schooling is to enable the next generation to find jobs.<br><input name="c31" type="radio" data-x="0" data-y="-0.26">Strongly disagree<br><input name="c31" type="radio" data-x="0" data-y="-0.155" checked>Disagree<br><input name="c31" type="radio" data-x="0" data-y="0.1">Agree<br><input name="c31" type="radio" data-x="0" data-y="0.255">Strongly agree<br></li>        <li>People with disabilities caused by serious hereditary diseases should not be allowed to have children.<br><input name="c32" type="radio" data-x="0" data-y="-0.29">Strongly disagree<br><input name="c32" type="radio" data-x="0" data-y="-0.185">Disagree<br><input name="c32" type="radio" data-x="0" data-y="0.17" checked>Agree<br><input name="c32" type="radio" data-x="0" data-y="0.275">Strongly agree<br></li>        <li>The most important thing for children is to learn to accept discipline.<br><input name="c33" type="radio" data-x="0" data-y="-0.26">Strongly disagree<br><input name="c33" type="radio" data-x="0" data-y="-0.155" checked>Disagree<br><input name="c33" type="radio" data-x="0" data-y="0.15">Agree<br><input name="c33" type="radio" data-x="0" data-y="0.255">Strongly agree<br></li>        
<li>没有野蛮人和文明人，只有不同的文化。<br><input name="c34" type="radio" data-x="0" data-y="0.23">强烈反对<br><input name="c34" type="radio" data-x="0" data-y="0.175">反对<br><input name="c34" type="radio" data-x="0" data-y="-0.13" checked>同意<br><input name="c34" type="radio" data-x="0" data-y="-0.23">强烈同意<br></li>        <li>那些有能力工作却拒绝工作机会的人，不应该接受社会的资助。<br><input name="c35" type="radio" data-x="0" data-y="-0.23">强烈反对<br><input name="c35" type="radio" data-x="0" data-y="-0.13">反对<br><input name="c35" type="radio" data-x="0" data-y="0.13" checked>同意<br><input name="c35" type="radio" data-x="0" data-y="0.23">强烈同意<br></li>        <li>当你遇到困难时，最好不要去想它，而是不断地做令人高兴的事。<br><input name="c36" type="radio" data-x="0" data-y="-0.2">强烈反对<br><input name="c36" type="radio" data-x="0" data-y="-0.1" checked>反对<br><input name="c36" type="radio" data-x="0" data-y="0.11">同意<br><input name="c36" type="radio" data-x="0" data-y="0.21">强烈同意<br></li>        <li>第一代移民永远无法完全融入他们的新国家。<br><input name="c37" type="radio" data-x="0" data-y="-0.23">强烈反对<br><input name="c37" type="radio" data-x="0" data-y="-0.075">反对<br><input name="c37" type="radio" data-x="0" data-y="0.13" checked>同意<br><input name="c37" type="radio" data-x="0" data-y="0.23">强烈同意<br></li>        <li>有利于最成功的企业的事物，最终也总是有利于我们大家的。<br><input name="c38" type="radio" data-x="-0.5" data-y="0">强烈反对<br><input name="c38" type="radio" data-x="-0.25" data-y="0" checked>反对<br><input name="c38" type="radio" data-x="0.75" data-y="0">同意<br><input name="c38" type="radio" data-x="0.875" data-y="0">强烈同意<br></li>        <li>任何广播电视机构，无论它的内容有多独立，都不应该接受公共资金的支持。<br><input name="c39" type="radio" data-x="-0.5" data-y="0">强烈反对<br><input name="c39" type="radio" data-x="-0.375" data-y="0" checked>反对<br><input name="c39" type="radio" data-x="0.125" data-y="0">同意<br><input name="c39" type="radio" data-x="0.25" data-y="0">强烈同意<br></li>    </ol>    <p><strong>第四部分</strong>：你如何看待更广阔的社会。</p>    <p></p>    <ol>        <li>在反恐的名义下，公民自由被过度限制了。<br><input name="c40" type="radio" data-x="0" 
data-y="0.26">强烈反对<br><input name="c40" type="radio" data-x="0" data-y="0.155" checked>反对<br><input name="c40" type="radio" data-x="0" data-y="-0.1">同意<br><input name="c40" type="radio" data-x="0" data-y="-0.255">强烈同意<br></li>        <li>一党制国家的一个显著优点是它避免了在民主政体中耽误发展的所有那些争论。<br><input name="c41" type="radio" data-x="0" data-y="-0.29">强烈反对<br><input name="c41" type="radio" data-x="0" data-y="-0.135">反对<br><input name="c41" type="radio" data-x="0" data-y="0.17" checked>同意<br><input name="c41" type="radio" data-x="0" data-y="0.275">强烈同意<br></li>        <li>尽管在电子时代官方的监听更容易了，但只有坏人才需要对此担忧。<br><input name="c42" type="radio" data-x="0" data-y="-0.26">强烈反对<br><input name="c42" type="radio" data-x="0" data-y="-0.155">反对<br><input name="c42" type="radio" data-x="0" data-y="0.15" checked>同意<br><input name="c42" type="radio" data-x="0" data-y="0.255">强烈同意<br></li>        <li>对罪大恶极的犯罪分子，死刑不失为一种选项。<br><input name="c43" type="radio" data-x="0" data-y="-0.26">强烈反对<br><input name="c43" type="radio" data-x="0" data-y="-0.155">反对<br><input name="c43" type="radio" data-x="0" data-y="0.15" checked>同意<br><input name="c43" type="radio" data-x="0" data-y="0.255">强烈同意<br></li>        <li>在一个文明社会，一个人必须遵从上级、命令下级。<br><input name="c44" type="radio" data-x="0" data-y="-0.2">强烈反对<br><input name="c44" type="radio" data-x="0" data-y="-0.1" checked>反对<br><input name="c44" type="radio" data-x="0" data-y="0.11">同意<br><input name="c44" type="radio" data-x="0" data-y="0.21">强烈同意<br></li>        <li>那些什么也没有表达的抽象艺术根本就不应该被称为艺术。<br><input name="c45" type="radio" data-x="0" data-y="-0.26">强烈反对<br><input name="c45" type="radio" data-x="0" data-y="-0.155" checked>反对<br><input name="c45" type="radio" data-x="0" data-y="0.15">同意<br><input name="c45" type="radio" data-x="0" data-y="0.255">强烈同意<br></li>        <li>在刑事审判中，惩罚比改造更重要。<br><input name="c46" type="radio" data-x="0" data-y="-0.23" checked>强烈反对<br><input name="c46" type="radio" data-x="0" data-y="-0.13">反对<br><input name="c46" type="radio" data-x="0" 
data-y="0.13">同意<br><input name="c46" type="radio" data-x="0" data-y="0.23">强烈同意<br></li>        <li>改造有些犯罪分子就是浪费时间。<br><input name="c47" type="radio" data-x="0" data-y="-0.26" checked>强烈反对<br><input name="c47" type="radio" data-x="0" data-y="-0.155">反对<br><input name="c47" type="radio" data-x="0" data-y="0.15">同意<br><input name="c47" type="radio" data-x="0" data-y="0.255">强烈同意<br></li>        <li>商人和制造业者比作家和艺术家更重要。<br><input name="c48" type="radio" data-x="0" data-y="-0.18">强烈反对<br><input name="c48" type="radio" data-x="0" data-y="-0.08" checked>反对<br><input name="c48" type="radio" data-x="0" data-y="0.08">同意<br><input name="c48" type="radio" data-x="0" data-y="0.18">强烈同意<br></li>        <li>母亲们可以有职业，但她们的首要职责是家庭主妇。<br><input name="c49" type="radio" data-x="0" data-y="-0.23">强烈反对<br><input name="c49" type="radio" data-x="0" data-y="-0.13" checked>反对<br><input name="c49" type="radio" data-x="0" data-y="0.13">同意<br><input name="c49" type="radio" data-x="0" data-y="0.23">强烈同意<br></li>        <li>跨国公司正在不道德地开发发展中国家的植物基因资源。<br><input name="c50" type="radio" data-x="0" data-y="0.23">强烈反对<br><input name="c50" type="radio" data-x="0" data-y="0.13">反对<br><input name="c50" type="radio" data-x="0" data-y="-0.13" checked>同意<br><input name="c50" type="radio" data-x="0" data-y="-0.23">强烈同意<br></li>        <li>同现有体制和谐相处是成熟的重要一方面。<br><input name="c51" type="radio" data-x="0" data-y="-0.2">强烈反对<br><input name="c51" type="radio" data-x="0" data-y="-0.1">反对<br><input name="c51" type="radio" data-x="0" data-y="0.11" checked>同意<br><input name="c51" type="radio" data-x="0" data-y="0.21">强烈同意<br></li>    </ol>    <p></p>    <p><strong>第五部分</strong>：关于宗教。</p>    <p></p>    <ol>        <li>占星术精确地解释了很多东西。<br><input name="c52" type="radio" data-x="0" data-y="-0.23" checked>强烈反对<br><input name="c52" type="radio" data-x="0" data-y="-0.13">反对<br><input name="c52" type="radio" data-x="0" data-y="0.13">同意<br><input name="c52" type="radio" data-x="0" data-y="0.23">强烈同意<br></li>        
<li>你如果不信宗教就不可能道德。<br><input name="c53" type="radio" data-x="0" data-y="-0.2" checked>强烈反对<br><input name="c53" type="radio" data-x="0" data-y="-0.1">反对<br><input name="c53" type="radio" data-x="0" data-y="0.11">同意<br><input name="c53" type="radio" data-x="0" data-y="0.21">强烈同意<br></li>        <li>慈善捐助在帮助真正的弱势群体时做得比社会保障要好。<br><input name="c54" type="radio" data-x="-0.5" data-y="0">强烈反对<br><input name="c54" type="radio" data-x="-0.375" data-y="0" checked>反对<br><input name="c54" type="radio" data-x="0.625" data-y="0">同意<br><input name="c54" type="radio" data-x="0.75" data-y="0">强烈同意<br></li>        <li>有些人天生不走运。<br><input name="c55" type="radio" data-x="0" data-y="-0.23">强烈反对<br><input name="c55" type="radio" data-x="0" data-y="-0.13" checked>反对<br><input name="c55" type="radio" data-x="0" data-y="0.13">同意<br><input name="c55" type="radio" data-x="0" data-y="0.23">强烈同意<br></li>        <li>我孩子的学校向他传授宗教价值观，这点非常重要。<br><input name="c56" type="radio" data-x="0" data-y="-0.2" checked>强烈反对<br><input name="c56" type="radio" data-x="0" data-y="-0.1">反对<br><input name="c56" type="radio" data-x="0" data-y="0.11">同意<br><input name="c56" type="radio" data-x="0" data-y="0.21">强烈同意<br></li>    </ol>    <p></p>    <p><strong>第六部分</strong>：关于性。</p>    <ol>        <li>婚姻之外的性是不道德的。<br><input name="c57" type="radio" data-x="0" data-y="-0.23">强烈反对<br><input name="c57" type="radio" data-x="0" data-y="-0.175">反对<br><input name="c57" type="radio" data-x="0" data-y="0.13" checked>同意<br><input name="c57" type="radio" data-x="0" data-y="0.23">强烈同意<br></li>        <li>一对稳定、相爱的同性伴侣，应有收养孩子的权利。<br><input name="c58" type="radio" data-x="0" data-y="0.23">强烈反对<br><input name="c58" type="radio" data-x="0" data-y="0.175">反对<br><input name="c58" type="radio" data-x="0" data-y="-0.13" checked>同意<br><input name="c58" type="radio" data-x="0" data-y="-0.23">强烈同意<br></li>        <li>由成年人自愿演出的色情影视应该对成人合法化。<br><input name="c59" type="radio" data-x="0" data-y="0.23">强烈反对<br><input name="c59" type="radio" 
data-x="0" data-y="0.13">反对<br><input name="c59" type="radio" data-x="0" data-y="-0.13" checked>同意<br><input name="c59" type="radio" data-x="0" data-y="-0.23">强烈同意<br></li>        <li>在私人卧室里两个成年人之间不管做什么，只要是自愿的，国家就管不着。<br><input name="c60" type="radio" data-x="0" data-y="0.26">强烈反对<br><input name="c60" type="radio" data-x="0" data-y="0.155">反对<br><input name="c60" type="radio" data-x="0" data-y="-0.15" checked>同意<br><input name="c60" type="radio" data-x="0" data-y="-0.255">强烈同意<br></li>        <li>没有人会天生同性恋。<br><input name="c61" type="radio" data-x="0" data-y="-0.26">强烈反对<br><input name="c61" type="radio" data-x="0" data-y="-0.155" checked>反对<br><input name="c61" type="radio" data-x="0" data-y="0.15">同意<br><input name="c61" type="radio" data-x="0" data-y="0.255">强烈同意<br></li>        <li>社会对性开放并没错，但现在已经开放得过分了。<br><input name="c62" type="radio" data-x="0" data-y="-0.2">强烈反对<br><input name="c62" type="radio" data-x="0" data-y="-0.1" checked>反对<br><input name="c62" type="radio" data-x="0" data-y="0.11">同意<br><input name="c62" type="radio" data-x="0" data-y="0.21">强烈同意<br></li>    </ol></form></p><div id="western-wrapper">    <button id="western-submit" type="button" class="button button-inverse button-rounded">提交</button>    <br>    <span id="western-answer" class="red">    政治立场坐标（自由&lt;-&gt;专制）-3.87，经济立场坐标（左翼&lt;-&gt;右翼）-2.42    </span></div><script type="text/javascript">  $(function() {    $("#western-submit").click(function(){      var t=0, a=0;      $('#western-questions input[name^="c"]:checked').each(function(){        t += parseFloat($(this).attr("data-x")),        a += parseFloat($(this).attr("data-y"))      }),      t = Math.round(100*t)/100,      a = Math.round(100*a)/100,      $("#western-answer").html("经济立场坐标（左翼<->右翼）"+ t + "，政治立场坐标（自由<->专制）" + a)    })  });</script><div class="note info">            <p>横坐标反映经济观念，负值为左（Communism, Collectivism），正值为右（Neo-Liberalism, Libertarianism）。纵坐标反映政治社会观念，负值为自由（Anarchism, Libertarian），正值为专制或保守（Fascism, 
Authoritarian）。</p><p>本测试系统建立于西方政治价值体系基础之上，某些问题强烈地依赖于具体的西方社会环境，未必能够充分反映中国国情。根据周围人群的实验结果，中国人的测试结果普遍位于第三象限（即两坐标均为负值），平均值位于(-2,-2)附近。为了区分中国人习惯意义上的「左与右」，可以以(-2,-2)为坐标原点重新划分坐标平面，即经济坐标小于-2为左，反之为右。政治坐标小于-2为自由，反之为保守或专制。</p><p>下面是著名政治人物的坐标位置以供参考：</p><ul><li>第一象限（经济右，政治保守）：希特勒，撒切尔夫人，布什，布莱尔，希拉克。</li><li>第二象限（经济左，政治保守）：斯大林，萨达姆，教皇本笃十六世。</li><li>第三象限（经济左，政治自由）：甘地，达赖喇嘛，曼德拉。</li><li>第四象限（经济右，政治自由）：弗里德曼，哈耶克。</li></ul>          </div><h3 id="测试反思-1"><a href="#测试反思-1" class="headerlink" title="测试反思"></a>测试反思</h3><ul><li>我的种族和其他种族相比有很多出众的优点？我下意识想选择同意。但是真的是这样吗？</li><li>各尽所能，各取所需？人类的惰性</li><li>土地应该自由买卖吗？</li><li>以眼还眼以牙还牙，对吗？主观上会这么做。理性上为了更好的共处，应该放下。</li><li>学校应当强制学生签到吗？对于大学生，学不学是你的主观意愿。但是如果是义务教育，需要。</li><li>堕胎应当被允许吗？这应该是个人选择吧。 </li><li>大麻应该合法吗？介于毒品和香烟之间，但是更偏毒品，偏向于禁止大麻。</li><li>应当允许有严重遗传疾病的残疾人生育吗？不应该，遗传疾病生下来对于孩子也是痛苦，领养不好吗？</li><li>对于电子监听，我们需要担忧吗？ 对于绝大多数普通人，这应该不是问题。 </li></ul><h2 id="写在最后"><a href="#写在最后" class="headerlink" title="写在最后"></a>写在最后</h2><p>还是原来的观点，这个测试结果并不一定代表什么，但是可以作为参考。最重要的是，给自己提供了一个思考的机会。很多问题选择不够坚决，说明很多时候对这方面的思考欠缺。这个测试不应该是一次性的测试，随着人的动态变化，观点也在发生改变。在以后的时间，可以回头再看这些问题。</p>]]></content>
    
    <summary type="html">
    
      &lt;script src=&quot;//cdn.jsdelivr.net/npm/jquery@3/dist/jquery.min.js&quot;&gt;&lt;/script&gt;

&lt;p&gt;「政治坐标系」的概念来源于著名的&lt;code&gt;political compass&lt;/code&gt;，用于表明一个人的政治倾向。这里是我的政治坐标测试，其中「中国政治坐标测试」最早是 2007 年北大未名 BBS 的同学们讨论制作的，并在后期根据中国实际情况进行了订正和修改，在 &lt;a href=&quot;http://www.zuobiao.me/zuobiao2015/index.php/66331?lang=zh-Hans&quot; target=&quot;_blank&quot; rel=&quot;external nofollow noopener noreferrer&quot;&gt;这里&lt;/a&gt; 可以看到目前的版本。令我感到惊讶的是，居然在这个&lt;a href=&quot;https://bbs.pku.edu.cn/v2/post-read-single.php?bid=1004&amp;amp;type=3&amp;amp;postid=5656284&quot; target=&quot;_blank&quot; rel=&quot;external nofollow noopener noreferrer&quot;&gt;帖子&lt;/a&gt;下面看到了&lt;a href=&quot;http://blog.farmostwood.net/&quot; target=&quot;_blank&quot; rel=&quot;external nofollow noopener noreferrer&quot;&gt;木遥&lt;/a&gt;的踪迹，世界真小。&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;需要强调说明的是，&lt;strong&gt;这个测试初始并且唯一的目标在于给使用者提供一个自我思考和认同的提示器。&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;「公共政治议题讨论的阙失和长期的无限夸大式的政治宣传方式，使得很多人几乎是凭着脑海中浮现的口号来作出自己的选择，而完全不曾在理性上真正确认过自己的立场。」这是我对现实的悲观理解。这个问卷如此流行，足以反过来说明政治观点的分歧和相关观点在意识层面上（而非政策层面上）的讨论和争锋如何构成了公众生活的禁忌。网上关于这个测试的很多评论都反映出&lt;strong&gt;很多人并不习惯于拥有自己的观点，更不用说是在如此广泛的层面上。我相信这并非出自天性，而只是长期的怠惰使然。&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;与此同时，我也附上了来自英文「&lt;a href=&quot;https://www.politicalcompass.org/test&quot; target=&quot;_blank&quot; rel=&quot;external nofollow noopener noreferrer&quot;&gt;政治指南针&lt;/a&gt;」网站的西方政治坐标测试，这份测试系统建立于西方政治价值体系基础之上，&lt;strong&gt;某些问题强烈地依赖于具体的西方社会环境，未必能够充分反映中国国情。&lt;/strong&gt; 不管怎样，倒也可以提供一个自我思考的提示器。&lt;/p&gt;
    
    </summary>
    
    <content src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-04-09_political-compass.png" type="image" />
    
    
      <category term="朝花夕拾" scheme="https://houmin.cc/categories/%E6%9C%9D%E8%8A%B1%E5%A4%95%E6%8B%BE/"/>
    
    
      <category term="politics" scheme="https://houmin.cc/tags/politics/"/>
    
      <category term="价值观" scheme="https://houmin.cc/tags/%E4%BB%B7%E5%80%BC%E8%A7%82/"/>
    
  </entry>
  
  <entry>
    <title>虚拟化技术概览</title>
    <link href="https://houmin.cc/posts/65866329/"/>
    <id>https://houmin.cc/posts/65866329/</id>
    <published>2020-04-07T01:04:08.000Z</published>
    <updated>2022-11-09T15:13:45.393Z</updated>
    
    <content type="html"><![CDATA[<link rel="stylesheet" class="aplayer-secondary-style-marker" href="/assets/css/APlayer.min.css"><script src="/assets/js/APlayer.min.js" class="aplayer-secondary-script-marker"></script><script class="meting-secondary-script-marker" src="/assets/js/Meting.min.js"></script><p>虚拟化的本质是<strong>抽象</strong>，虚拟化技术本质就是<strong>资源管理与优化</strong>技术。通过将计算机的各种物理资源，比如 <strong>CPU</strong>、<strong>内存</strong>以及磁盘空间、网络适配器等其他 <strong>I/O</strong> 设备，进行抽象转换，呈现出一个可供分割并且可以任意组合的多个计算机的配置环境。通过虚拟化技术，计算、网络、存储等计算机硬件资源得到更好的利用，而这些资源的虚拟形式将不受现有架设方式、地域或物理配置所限制。</p><a id="more"></a><h2 id="计算虚拟化"><a href="#计算虚拟化" class="headerlink" title="计算虚拟化"></a>计算虚拟化</h2><h3 id="理想数学模型-Turing-Machine"><a href="#理想数学模型-Turing-Machine" class="headerlink" title="理想数学模型 Turing Machine"></a>理想数学模型 Turing Machine</h3><p>在计算机领域，研究的一切问题都是 <strong>可计算问题（Computational Problem）</strong>。</p><blockquote><p><em>A computational problem</em> is collection of questions that computers might be able to solve.</p></blockquote><p>通过对问题可计算的判定，我们知道不管计算机的存储和计算能力有多强，有些问题总是不能够被解决的。对于那些可计算的问题，怎么解决呢？1936年，图灵在现代计算领域奠基性论文 「论可计算数及其在判定性问题上的应用」<a href="https://en.wikipedia.org/wiki/On_Computable_Numbers,_with_an_Application_to_the_Entscheidungsproblem" target="_blank" rel="external nofollow noopener noreferrer">On Computable Numbers, with an Application to the Entscheidungsproblem</a> 中提出 <a href="https://en.wikipedia.org/wiki/Turing_machine" target="_blank" rel="external nofollow noopener noreferrer">图灵机</a> 这一纸带和读写头表示的数学模型，并且证明了<strong>假设</strong>上述模型里所说的功能都能被以某种形式物理实现，<strong>那么</strong> <code>任意可计算问题都可以被解决</code>。</p><p><img alt="Turing Machine" data-src="https://upload.wikimedia.org/wikipedia/en/thumb/b/bb/Turing_machine_1.JPG/1024px-Turing_machine_1.JPG"></p><h3 id="二战产物-ENIAC"><a href="#二战产物-ENIAC" class="headerlink" title="二战产物 ENIAC"></a>二战产物 ENIAC</h3><p>二战极大促进了电子计算机的诞生，为了帮助美国陆军的弹道研究实验室（BRL）计算火炮的火力表， ENIAC 在 1946 年被设计了出来。ENIAC 
并不是二战中第一个被设计出来的计算机，机械和电子计算机器从19世纪就开始出现了，但是20世纪40年代被看作是现代计算机时代的开端。</p><ul><li>德国 <a href="https://zh.wikipedia.org/w/index.php?title=Z3_(计算机)&amp;action=edit&amp;redlink=1" target="_blank" rel="external nofollow noopener noreferrer">Z3</a> 计算机于1941年5月公布，这是第一台通用的数字计算机<ul><li>使用<a href="https://zh.wikipedia.org/wiki/继电器" target="_blank" rel="external nofollow noopener noreferrer">继电器</a>，机电计算机，不是电子计算机</li><li>使用二进制进行逻辑计算</li><li>可用打孔纸带编程，但是没有逻辑分支</li></ul></li><li>美国<a href="https://en.wikipedia.org/wiki/Atanasoff–Berry_Computer" target="_blank" rel="external nofollow noopener noreferrer">ABC</a>，1941年夏天公布，是第一台电子计算设备<ul><li>使用电子管，电子计算机</li><li>使用二进制进行逻辑计算</li><li>不是通用的，仅用于求解线性方程组</li><li>没有利用电子计算的速度优势，旋转电容鼓存储器，输入输出系统要把中间结果写出到纸片</li><li>手动控制的，不可编程</li></ul></li><li>英国的<a href="https://zh.wikipedia.org/wiki/巨人计算机" target="_blank" rel="external nofollow noopener noreferrer">巨人计算机</a> Colossus computer，1943年用于密码分析<ul><li>使用电子管，电子计算机</li><li>可用插板和开关编程</li><li>不是通用的，仅用于密码破译</li></ul></li></ul><p>对比这些几乎同时期独立的计算机，ENIAC有以下特点：</p><ul><li>使用电子管，电子计算机</li><li>采用十进制计算</li><li>计算速度高，具备逻辑分支能力</li><li>符合<strong>图灵完全性</strong>，<strong>能够重新编程</strong>，<strong>解决各种计算问题</strong></li><li>缺乏存储程序能力，<strong>冯诺依曼结构</strong>在下一代计算机<a href="https://zh.wikipedia.org/wiki/EDVAC" target="_blank" rel="external nofollow noopener noreferrer">EDVAC</a>上实现</li></ul><p><img alt="ENIAC, 美国弹道研究实验室" data-src="https://upload.wikimedia.org/wikipedia/commons/4/4e/Eniac.jpg"></p><h3 id="多道程序设计-Multiprogramming"><a href="#多道程序设计-Multiprogramming" class="headerlink" title="多道程序设计 Multiprogramming"></a>多道程序设计 Multiprogramming</h3><p>最初的计算机都是串行运行的，一次只能录入并执行一个程序，当程序进行缓慢的 IO 操作时，CPU 只好空转等待。这不仅造成了 CPU 的浪费，也造成了其他计算机硬件资源的浪费。那时的计算机科学家们都在思考着要如何能够提高 CPU 的利用率，直到有人提出了多道程序设计（Multiprogramming，多任务处理的前身）。</p><p>在整个上世纪 50-60 年代，多道程序设计的讨论非常流行。它令 CPU 一次性读取多个程序到内存，先运行第一个程序直到它出现了 IO 操作，此时 CPU 切换到运行第二个程序。</p><blockquote><p>即，<strong>第 n+1 个程序得以执行的条件是第 n 个程序进行 IO 操作或已经运行完毕</strong>。</p></blockquote><p><img alt 
data-src="https://media.geeksforgeeks.org/wp-content/cdn-uploads/multiprogramming.jpg"></p><p>多道程序设计的特征就是：<strong>多道程序、宏观上并行、微观上串行</strong>。它有效地提高了 CPU 的利用率，也充分发挥了其他计算机系统部件的并行性。</p><h3 id="分时-Time-Sharing"><a href="#分时-Time-Sharing" class="headerlink" title="分时 Time Sharing"></a>分时 Time Sharing</h3><p>但多道程序设计存在一个问题，就是<strong>它并不会去考虑分配给各个程序的时间是否均等，很可能第一个程序运行了几个小时而不出现 IO 操作，故第二个程序一直得不到运行</strong>。最初，这个问题是可以接受的，那时人们比起多个程序之间的执行顺序，更加关心程序的执行结果。直到有人提出了新的需求：多用户同时使用计算机。应运而生的正是时间共享，或者称之为 “分时”（Time Sharing）的概念。</p><p>所谓 “分时” 的含义是将 CPU 占用切分为多个极短（1/100 秒）的时间片，每个时间片都执行着不同的任务。分时系统中允许几个、几十个甚至几百个用户通过终端机连接到同一台主机，将处理机时间与内存空间按一定的时间间隔，轮流地切换给各终端用户的程序使用。由于时间间隔很短，每个用户感觉就像他独占了计算机一样。<strong>分时系统达到了多个程序分时共享计算机硬件和软件资源的效果</strong>，本质就是一个多用户交互式操作系统。</p><p>分时系统与多道程序设计虽然类似，却也有着底层实现细节的不同：</p><ul><li>分时系统是为了给不同用户提供程序的使用，而多道程序则是为了不同程序间的穿插运行。</li></ul><p>1959 年，牛津大学的计算机教授 Christopher Strachey 发表了一篇名为 <a href="https://archive.org/details/large-fast-computers" target="_blank" rel="external nofollow noopener noreferrer">Time sharing in large fast computers</a> 的学术报告，他在文中首次提出了 “虚拟化” 的基本概念，还论述了什么是虚拟化技术。</p><blockquote><p><strong>Time sharing</strong>, in the sense of causing the main computer to interrupt its program to perform the arithmetic and control operations required by external or peripheral equipment, has been used on a limited scale for a long time. 
This paper explores the possibility of applying time sharing to a large fast computer on a very extensive scale.</p></blockquote><p>本质上，Strachey 是在讨论如何将分时的概念融入到多道程序设计当中，从而实现一个可多用户操作（CPU 执行时间切片），又具有多程序设计效益（CPU 主动让出）的虚拟化系统。可见，<strong>虚拟化概念最初的提出就是为了满足多用户同时操作大型计算机，并充分利用大型计算机各部件资源的现实需求</strong>。而对这一需求的实现与演进，贯穿了整个大型机与小型机虚拟化技术的发展历程。</p><p>1961 年，MIT 在 IBM 7094 型机器上实现了首个分时系统 CTSS（Compatible Time-Sharing System，相容分时系统）。</p><h3 id="超级计算机-Altas"><a href="#超级计算机-Altas" class="headerlink" title="超级计算机 Altas"></a>超级计算机 Atlas</h3><p>1962 年 12 月 7 日，第一台 Atlas 超级计算机在英国诞生，Atlas 是第二代晶体管计算机，被认为是当时世界上最强大的计算机。Atlas 开创了许多沿用至今的软件概念：</p><ul><li>第一次实现名为 Atlas Supervisor 的底层资源管理组件，<strong>Supervisor</strong> 通过特殊的指令或代码来管理主机的硬件资源</li><li>第一次实现分页技术（<strong>Paging Techniques</strong>）</li><li>第一次实现虚拟内存（<strong>Virtual Memory</strong>），当时被称为一级存储（One-Level Store）</li></ul><h3 id="第一个支持虚拟化-IBM-M44-44X"><a href="#第一个支持虚拟化-IBM-M44-44X" class="headerlink" title="第一个支持虚拟化 IBM M44/44X"></a>第一个支持虚拟化 IBM M44/44X</h3><p>1964 年的 IBM M44/44X 被认为是世界上第一个支持虚拟化的系统。它采用专门的硬件和软件，能够在一台物理机器上虚拟多个当时流行的 IBM 7044 大型机。它使用的虚拟化方法是非常原始的：像分时系统一样，在每个时间片，一个 IBM 7044 大型机独占所有硬件资源来运行。</p><p>值得一提的是，这个研究用的原型系统不仅开启了虚拟化技术的时代，还实现了多个具有突破性的虚拟化概念，包括部分<strong>硬件共享（Partial Hardware Sharing）</strong>、<strong>分时（Time Sharing）</strong>、<strong>内存分页（Memory Paging）</strong>以及<strong>虚拟内存（Virtual Memory）</strong>。M44/44X 项目首次使用了 “<strong>Virtual Machine</strong>” 这一术语，所以被认为是世界上第一个支持虚拟机的计算机系统。虽然 M44/44X 只实现了部分的虚拟化功能，但其最大的成功在于证明了虚拟机的运行效率并不一定比传统的方式更低。</p><p>在那个 “进程” 概念尚未被发明的年代，多任务操作系统和虚拟化技术事实上是难以分开的，因为 “虚拟机” 就是一个任务，而且当时还没有 Intel x86 这种占据霸主地位的体系结构，各家的大型机各自为政，也谈不上兼容别家的体系结构。这种 “任务级” 或者说 “进程级” 虚拟化，从概念上延续到今天，就是以 LXC 和 OpenVZ 为代表的操作系统级虚拟化。</p><h3 id="IBM的豪赌-System-360"><a href="#IBM的豪赌-System-360" class="headerlink" title="IBM的豪赌 System/360"></a>IBM的豪赌 System/360</h3><p>1964 年，IBM推出了著名的 System/360 大型计算机系统，整个研发过程投资巨大，其出货时间也不断延迟，但最终取得了巨大的商业成功。当时的项目经理 <code>Frederick P. 
Brooks</code>事后根据这项计划的开发经验写出了同样著名的《人月神话：软件项目管理之道》（“The Mythical Man-Month: Essays on Software Engineering”），记述了人类工程史上一项里程碑式的大型复杂软件系统的开发经验。</p><ul><li>System/360 实现了基于全硬件的虚拟化解决方案（<strong>Full Hardware Virtualization</strong>）</li><li>System/360 实现了 TSS（Time Sharing System）分时系统，TSS 被认为是最原始的 <strong>CPU 虚拟化技术</strong>，它可以让低端电脑连接大型主机，上传和下载程序或资料，将电子数据处理的 “松散终端” 连接起来。</li></ul><blockquote><p><strong>虚拟化技术的应用和发展源于大型机对分时系统的需求</strong>。这种通过硬件的方式来生成多个可以运行独立操作系统软件的虚拟机实例，解决了早期大型计算机只能单任务处理而不能分时多任务处理的问题。由于这种虚拟化技术是基于硬件设备来实现的，故被称为<strong>硬件虚拟化（Hardware virtualization）</strong>。但需要注意的是，这一定义在后来被进一步细分为了狭义的硬件虚拟化技术，<strong>现今更加被公认的硬件虚拟化定义是：一种对计算机或操作系统的虚拟化，能够对用户隐藏真实的计算机硬件，表现出另一个抽象的计算平台。</strong></p></blockquote><h3 id="伟大实验-MULTICS"><a href="#伟大实验-MULTICS" class="headerlink" title="伟大实验 MULTICS"></a>伟大实验 MULTICS</h3><p>MULTICS，全名 <code>MULTiplexed Information and Computing System</code>，是 1964 年由贝尔实验室、麻省理工学院及美国通用电气公司共同参与研发的一套安装在大型主机上的多人多任务操作系统，可连接 1000 部终端机，支持 300 位用户同时上线。</p><p>MULTICS 是一个伟大的实验，得益于第一代分时系统 CTSS 的成功，它在开发之初就提出了很高的要求：</p><ul><li>首次在大型软件中采用结构化的程序设计方法，使得开发周期大大缩短</li><li>首次采用高级语言编写操作系统，使得系统程序在功能上独立于机器</li><li>首次采用成熟软件作为工具，MULTICS中的很大一部分程序是用CTSS来编写</li><li>首次引入动态链接和分层文件系统的概念</li></ul><p>然而，当时用于编写 MULTICS 的 PL/I 语言还不够成熟，无力肩负编写操作系统这样的重担；而且整个开发过程求大求全，多个单位参与，进展过慢，贝尔实验室最终退出了此计划。</p><h3 id="玩具而已-UNIX"><a href="#玩具而已-UNIX" class="headerlink" title="玩具而已 UNIX"></a>玩具而已 UNIX</h3><p>1969年，在 AT&amp;T 的 Bell Labs，<code>Ken Thompson</code>为了运行一款名为<code>Space Travel</code>的游戏，需要一个操作系统。他找了一台闲置的PDP-7 小型机，独自经过 4 个星期的奋斗，以汇编语言写出了一组内核程序，同时包括一些内核工具程序，以及一个小的文件系统，这就是伟大的 UNIX 操作系统的原型。</p><p>UNIX 系统本质上是对 MULTICS 系统的简化，当时开发者 <code>Brian Kernighan</code> 开玩笑地戏称这个不完善的系统其实是 <code>UNiplexed Information and Computing System</code>，缩写为<code>UNICS</code>。后来，大家取其谐音，把这个名字改为<code>UNIX</code>。</p><p>1973 年，贝尔实验室的<code>Dennis Ritchie</code> 以 B 语言为基础开发了一种称为 C 的编程语言。C 语言的设计原则就是好用，非常自由、弹性很大。<code>Ken Thompson</code>和<code>Dennis Ritchie</code>使用 C 语言完全重写了 UNIX，此后 UNIX 就真正成为了可移植的操作系统，那时已是 1977 年。</p><p>1979 年，Unix 的第 
7 个版本引入了 chroot 机制，意味着第一个<strong>操作系统虚拟化（OS-level virtualization）</strong>诞生了。chroot 是直到现在我们依然在使用的一个系统调用，这个系统调用会让一个进程把指定的目录作为根目录，它的所有文件系统操作都只能在这个指定目录中进行，本质是一种文件系统层的隔离。</p><h3 id="虚拟化准则-VMM"><a href="#虚拟化准则-VMM" class="headerlink" title="虚拟化准则 VMM"></a>虚拟化准则 VMM</h3><p>1974 年，<code>Gerald J. Popek</code> 和 <code>Robert P. Goldberg</code>在合作论文《可虚拟第三代架构的规范化条件》（“Formal Requirements for Virtualizable Third Generation Architectures”）中提出了一组称为虚拟化准则的充分条件，又称波佩克与戈德堡虚拟化需求（<strong>Popek and Goldberg virtualization requirements</strong>），即：虚拟化系统结构的三个基本条件。满足这些条件的控制程序才可以被称为<strong>虚拟机监控器（Virtual Machine Monitor，简称 VMM）</strong>：</p><ul><li><strong>资源控制（Resource Control）</strong>，控制程序必须能够管理所有的系统资源。</li><li><strong>等价性（Equivalence）</strong>，在控制程序管理下运行的程序（包括操作系统），除时序和资源可用性之外的行为应该与没有控制程序时的完全一致，且预先编写的特权指令可以自由地执行。</li><li><strong>效率性（Efficiency）</strong>，绝大多数的客户机指令应该由主机硬件直接执行而无需控制程序的参与。</li></ul><p>该论文尽管基于简化的假设，但上述条件仍为评判一个计算机体系结构是否能够有效支持虚拟化提供了一个便利方法，也为设计可虚拟化的计算机架构给出了指导原则。同时，Gerald J. Popek 和 Robert P. Goldberg 还在论文中介绍了两种 Hypervisor 类型。</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-18_hypervisor.png"></p><ul><li>类型 I (<strong>Bare-metal Hypervisors</strong>)<ul><li>这些虚拟机管理程序直接运行在宿主机（Host）的硬件上来控制硬件和管理虚拟机。</li><li>需要硬件支持</li><li>VMM 作为宿主机操作系统（Host OS）</li><li>运行效率高</li></ul></li></ul><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-18_hypervisor2.png"></p><ul><li>类型 II（<strong>Hosted Hypervisors</strong>）<ul><li>VMM 运行在传统的宿主机操作系统（Host OS）上，就像其他应用程序那样运行。</li><li>VMM 作为应用程序运行在宿主机操作系统之上</li><li>运行效率一般较类型 I 低</li></ul></li></ul><p>由于技术的原因，早期的 VMM 产品大多实现的是寄居式，例如：VMware 5.5 以前的版本、Xen 3.0 以前的版本。随着技术的成熟，主要是硬件虚拟化技术的诞生，几乎所有的 VMM 产品都转向了裸金属 Hypervisor 实现。例如：VMware 5.5 及以后版本、Xen 3.0 及以后版本以及 KVM。</p><h3 id="接棒开源-GNU-Linux"><a href="#接棒开源-GNU-Linux" class="headerlink" title="接棒开源 GNU/Linux"></a>接棒开源 GNU/Linux</h3><p><img alt="GNU/Linux" 
data-src="https://i1.wp.com/www.linuxandubuntu.com/wp-content/uploads/2019/07/What-is-GNU-in-GNULinux.jpg"></p><h3 id="软件辅助虚拟化-QEMU"><a href="#软件辅助虚拟化-QEMU" class="headerlink" title="软件辅助虚拟化 QEMU"></a>软件辅助虚拟化 QEMU</h3><p>2001，Fabrice Bellard 发布了目前最流行的、采用了<strong>动态二进制翻译（Binary Translation）</strong>技术的开源虚拟化软件 QEMU（Quick EMUlator）。QEMU 可以模拟 x86、x86_64、ARM、MIPS、SPARC、PowerPC 等多种处理器架构，无修改地运行这些架构上的操作系统。</p><p><strong>软件辅助虚拟化</strong> 是通过 <strong>优先级压缩（Ring Compression）</strong>和 <strong>二进制代码翻译（Binary Translation）</strong>这两个技术来完成的。RC 基于 CPU 特权级的原理。也就是 guest、VMM 和 host 分别处于不同的特权级上，guest 要访问 host 就属于越级访问，会抛异常，这时 VMM 会截获这个异常，并模拟出其可能的行为，从而进行相应处理。</p><p>以我们最熟悉的 Intel x86 架构为例，分为四个特权级 0~3。一般情况下，操作系统内核（特权代码）运行在 ring 0（最高特权级），而用户进程（非特权代码）运行在 ring 3（最低特权级）。</p><p><img alt data-src="https://ring0.me/images/2014/12/9c37a75e8e2164f50ffe76681c6d4522.png"></p><p>使用了虚拟机之后，Guest OS 运行在 ring 1，VMM 运行在 ring 0。比如在 Windows 上装个 Linux 虚拟机，Windows 内核运行在 ring 0，而被虚拟的 Linux 内核运行在 ring 1，Linux 系统里的应用程序则运行在 ring 3。当虚拟机系统需要执行特权指令时，VMM 就会立即捕获它（谁让 ring 0 比 ring 1 的特权级高呢！）并模拟执行这条特权指令，再返回到虚拟机系统。</p><p>为了提高系统调用、中断处理的性能，有时会利用动态二进制翻译的技术，在运行前把这些特权指令替换成调用虚拟机管理器 API 的指令。如果所有特权指令都模拟得天衣无缝，虚拟机系统就像运行在物理机器上一样，完全不能发现自己运行在虚拟机里。</p><h3 id="半虚拟化-Xen"><a href="#半虚拟化-Xen" class="headerlink" title="半虚拟化 Xen"></a>半虚拟化 Xen</h3><p>2003 年，英国剑桥大学的一位讲师发布了开源虚拟化项目 Xen 1.0，通过<strong>半虚拟化技术</strong>为 x86-64 提供虚拟化支持。</p><p>既然<strong>动态二进制翻译的难点和性能瓶颈在于模拟执行那些杂七杂八的特权指令</strong>，我们能不能修改虚拟机系统的内核，把那些特权指令改得好看些？毕竟在多数情况下，我们并不需要对虚拟机刻意 “隐瞒” 虚拟化层的存在，而是要在虚拟机之间提供必要的隔离，同时又不造成太多性能开销。</p><p>Paravirtualization 这个单词的前缀是 para-，即 “with” “alongside” 之意。也就是虚拟机系统与虚拟化层（主机系统）不再是严格的上下级关系，而是互信合作的关系，<strong>虚拟化层要在一定程度上信任虚拟机系统。在 x86 架构中，虚拟化层（Virtualization Layer）和虚拟机系统的内核（Guest OS）都运行在 ring 0。</strong></p><p><img alt data-src="https://ring0.me/images/2014/12/98ce27bf0640053df5db977f6c41cc3e.png"></p><p><strong>虚拟机系统的内核需要经过特殊修改，把特权指令改成对虚拟化层 API 的调用</strong>。在现代操作系统中，由于这些体系结构相关的特权操作都被封装起来了（例如 Linux 内核源码中的 arch/ 
目录），比起二进制翻译需要考虑各种边角情况，这种对虚拟机内核源码的修改就简单一些了。</p><p><strong>相比使用二进制翻译的全虚拟化（full virtualization），半虚拟化是牺牲了通用性来换取性能，因为任何操作系统都可以无修改地运行在全虚拟化平台上，而每个半虚拟化的操作系统内核都要经过人肉修改。</strong></p><h3 id="硬件辅助虚拟化-Intel-VT-x"><a href="#硬件辅助虚拟化-Intel-VT-x" class="headerlink" title="硬件辅助虚拟化 Intel VT-x"></a>硬件辅助虚拟化 Intel VT-x</h3><p>2006 年，Intel 和 AMD 等厂商相继将对虚拟化技术的支持加入到 x86 体系结构的 CPU 中（AMD-V，Intel VT-x/d），使原来纯软件实现的各项功能可以借助硬件的力量实现提速，此即 <strong>硬件辅助的虚拟化</strong>。</p><p>Xen 这种<strong>将 Guest OS 中的特权指令改成对虚拟化层 API 的调用</strong>的方式<strong>并不通用</strong>，要去改 Guest OS 的代码，只能看作是一种定制。为了能够通用，又能够提高性能，就只能从硬件上去做文章了。通过对硬件本身加入更多的虚拟化功能，就可以截获更多的敏感指令，填补上漏洞。所以后来，以 Intel 的 VT-x 和 AMD 的 AMD-V 为主的硬件辅助的 CPU 虚拟化就被提出来了（Intel VT 包括 VT-x（支持 CPU 虚拟化）、EPT（支持内存虚拟化）和 VT-d（支持 I/O 虚拟化））。</p><p><img alt data-src="https://ring0.me/images/2014/12/05112238fed78cb9df19c07ec82544cf.png"></p><p>CPU 硬件辅助虚拟化在 Ring 模式的基础上引入了一种新的模式，叫 VMX 模式。它包括根操作模式（VMX Root Operation）和非根操作模式（VMX Non-Root Operation）。</p><p>引入这种模式的好处就在于，Guest OS 运行在 Ring 0 上，就意味着它的核心指令可以直接下达到硬件层去执行，而特权指令等敏感指令的执行则是由硬件辅助，直接切换到 VMM 执行，这是自动执行的，应用程序是感知不到的，性能自然就提高了。</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-18_vmx.png"></p><p>对于这种切换，VT-x 定义了一套机制，称为 VM-entry 和 VM-exit。从非根模式切换到根模式，也就是从 Guest 切换到 Host VMM，称为 VM-exit，反之称为 VM-entry。</p><ul><li>VM-exit ： 如果 Guest OS 运行过程中遇到需要 VMM 处理的事件，比如中断或缺页异常，或者主动调用 <code>VMCALL</code> 指令调用 VMM 服务的时候（类似于系统调用），硬件自动挂起 Guest OS，切换到根模式，VMM 开始执行。</li><li>VM-entry： VMM 通过显式调用 <code>VMLAUNCH</code> 或 <code>VMRESUME</code> 指令切换到非根模式，硬件自动加载 Guest OS 的上下文，Guest OS 开始执行。</li></ul><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-18_vm-entry-exit.png"></p><h3 id="基于内核的虚拟化-KVM"><a href="#基于内核的虚拟化-KVM" class="headerlink" title="基于内核的虚拟化 KVM"></a>基于内核的虚拟化 KVM</h3><p>2007 年 2 月，Linux Kernel 2.6.20 合入了 KVM 内核模块，使用 KVM 的前提是 CPU 必须要支持虚拟化技术。</p><p>一般 KVM 只负责 CPU 和内存的虚拟化，I/O 的虚拟化则由另外一个技术来完成，即 QEMU。</p><p><img alt 
data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-18_kvm.png"></p><p>KVM 是一种硬件辅助的虚拟化技术，支持 Intel VT-x 和 AMD-V 技术。怎么知道 CPU 是否支持 KVM 虚拟化呢？可以通过如下命令查看：</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># grep -E '(vmx|svm)' /proc/cpuinfo</span></span><br></pre></td></tr></table></figure><p>如果输出中包含 vmx 或 svm，则表明当前 CPU 支持 KVM，Intel 是 vmx，AMD 是 svm。</p><p>从本质上看，一个 KVM 虚拟机对应 Host 上的一个 qemu-kvm 进程，它和其他 Linux 进程一样被调度，而 qemu-kvm 进程中的一个线程就对应虚拟机的虚拟 CPU （vCPU），虚拟机中的任务线程就被 vCPU 所调度。</p><p>比如下面这个例子，Host 机有两个物理 CPU，上面起了两个虚拟机 VM1 和 VM2，VM1 有两个 vCPU，VM2 有 3 个 vCPU，VM1 和 VM2 分别有 2 个 和 3 个线程在 2 个物理 CPU 上调度。VM1 和 VM2 中又分别有 3 个任务线程在被 vCPU 调度。</p><p>所以，这里有两级的 CPU 调度，Guest OS 中的 vCPU 负责一级调度，Host VMM 负责另一级调度，即 vCPU 在物理 CPU 上的调度。</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-18_kvm-schedule.png"></p><p>我们也可以看到，vCPU 的个数，可以超过物理 CPU 的个数，这个叫 CPU 「超配」，这正是 CPU 虚拟化的优势所在，这表明了虚拟机能够充分利用 Host 的 CPU 资源，进行相应的业务处理，运维人员也可以据此控制 CPU 资源使用，达到灵活调度。</p><h3 id="大数据时代-GFS-MapReduce-BigTable"><a href="#大数据时代-GFS-MapReduce-BigTable" class="headerlink" title="大数据时代 GFS/MapReduce/BigTable"></a>大数据时代 GFS/MapReduce/BigTable</h3><ul><li>2003 年，Google 发布 <code>The Google File System</code>，讲述了一种可扩展的分布式文件系统</li><li>2004 年，Google 发布 <code>MapReduce: Simplified Data Processing on Large Clusters</code>，讲述了大数据的分布式计算方式，即将任务分解然后在多台处理能力较弱的计算节点中同时处理，然后将结果合并从而完成大数据处理。</li><li>2006 年，Google 发布 <code>Bigtable: A Distributed Storage System for Structured Data</code>，讲述了用于存储和管理结构化数据的分布式存储系统，其建立在 GFS、MapReduce 等基础之上。该论文启发了后期的很多的 NoSQL 数据库，包括 Cassandra、HBase 等。</li></ul><p>在 Google 的三篇论文发布之后，大数据时代宣告到来，与此同时，Hadoop 生态开始建立。</p><h3 id="云计算吃螃蟹的人-AWS"><a href="#云计算吃螃蟹的人-AWS" class="headerlink" title="云计算吃螃蟹的人 AWS"></a>云计算吃螃蟹的人 AWS</h3><p>2006 年，<strong>Amazon Web Services</strong> 开始以 Web 服务的形式向企业提供 IT 
基础设施服务，包括弹性计算云（EC2）、简单储存服务（S3）、简单数据库（SimpleDB）等，现在通常称为云计算。尽管云计算的概念最早是由谷歌 CEO <code>Eric Schmidt</code> 提出的，真正第一个吃螃蟹的人却是 Amazon。</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-18_aws.png"></p><h3 id="操作系统级虚拟化-LXC"><a href="#操作系统级虚拟化-LXC" class="headerlink" title="操作系统级虚拟化 LXC"></a>操作系统级虚拟化 LXC</h3><p><strong>2008 年 6 月</strong>，Linux Container（LXC） 发布 0.1.0 版本，其可以提供轻量级的虚拟化，用来隔离进程和资源，是 Docker 最初使用的容器技术支撑。</p><p>很多时候，我们并不是想在虚拟机里运行任意的操作系统，而是希望在不同的任务间实现一定程度的隔离。前面提到的虚拟化技术，每个虚拟机都是一个独立的操作系统，有自己的任务调度、内存管理、文件系统、设备驱动程序等，还会运行一定数量的系统服务（如刷新磁盘缓冲区、日志记录器、定时任务、ssh 服务器、时间同步服务），这些东西都会消耗系统资源（主要是内存），而且虚拟机和虚拟机管理器的两层任务调度、设备驱动等也会增加时间开销。能不能让虚拟机共享操作系统内核，又保持一定的隔离性呢？</p><p><img alt data-src="https://ring0.me/images/2014/12/30cc029394b7687320fbca8c654b7671.png"></p><p>chroot 的文件系统隔离给我们带来部分的思路，但是要成为一个真正的虚拟化解决方案，只有文件系统隔离是不够的。另外两个重要的方面是：</p><ul><li>进程、网络、IPC（进程间通信）、用户等<strong>命名空间的隔离</strong>。使得虚拟机内部只能看到自己的进程，只能使用自己的虚拟网卡，进程间通信时不会干扰到虚拟机外面，虚拟机内的 UID/GID 与外面的独立。</li><li>资源的限制和审计。不能因为虚拟机内的程序 “跑飞了”，就占掉物理机器的所有 CPU、内存、硬盘等资源。必须要能统计虚拟机占了多少资源，并能够对资源进行限制。</li></ul><p>上述两件事情就是 BSD 和 Linux 社区在进入 21 世纪以来逐步在做的。在 Linux 中，命名空间的隔离由 namespaces 机制提供（包括 PID、网络、IPC、挂载、UTS 和用户命名空间），在创建进程时，通过指定 clone 系统调用的参数来创建新的命名空间；资源的限制和审计是 cgroups 做的，它的 API 位于 cgroup 虚拟文件系统中（通常挂载在 /sys/fs/cgroup）。</p><p>这种虚拟机里运行一个或多个进程、虚拟机与主机共享一个内核的虚拟化方案，被称为 <strong>操作系统级虚拟化</strong> 或 <strong>任务级虚拟化</strong>。由于 Linux Containers（LXC）从 Linux 3.8 版本开始被纳入内核主线，操作系统级虚拟化又被称为 “容器”（container）。为了与虚拟机是一个完整的操作系统的虚拟化方案相区分，被隔离执行的进程（进程组）往往不称为 “虚拟机”，而称为 “容器”。由于没有多余的一层操作系统内核，容器比虚拟机更加轻量，启动更快，内存开销、调度开销也更小，更重要的是访问磁盘等 I/O 设备不需要经过虚拟化层，没有性能损失。</p><h3 id="云计算操作系统-OpenStack"><a href="#云计算操作系统-OpenStack" class="headerlink" title="云计算操作系统 OpenStack"></a>云计算操作系统 OpenStack</h3><p>2010 年 7 月，NASA 和 Rackspace 联合发起了 OpenStack 云操作系统开源项目。</p><p>OpenStack 要对云上的各种资源进行虚拟化：</p><ul><li><strong>计算</strong>：OpenStack 可以使用多种多样的虚拟化解决方案，如 Xen、KVM、QEMU、Docker。管理组件 Nova 根据各物理节点的负载决定把虚拟机调度到哪台物理机，再调用这些虚拟化解决方案的 API 
来创建、删除、开机、关机等。</li><li><strong>存储</strong>：虚拟机镜像如果只能存储在计算节点本地，那么不仅不利于数据的冗余，也不利于虚拟机的迁移。因此在云中，一般采用逻辑上集中、物理上分布式的存储系统，独立于计算节点，也就是计算节点对数据磁盘的访问一般是通过网络访问。</li><li><strong>网络</strong>：每个客户要有自己的虚拟网络，如何让不同客户的虚拟网络在物理网络上互不干扰，就是网络虚拟化的事情。</li></ul><p>除了最核心的虚拟化管理器 Nova，OpenStack 还有虚拟机镜像管理器 Glance、对象存储 Swift、块存储 Cinder、虚拟网络 Neutron、身份认证服务 Keystone、控制面板 Horizon 等众多组件。</p><p><img alt="OpenStack Architecture" data-src="https://docs.openstack.org/install-guide/_images/openstack_kilo_conceptual_arch.png"></p><h3 id="容器的好管家-Docker"><a href="#容器的好管家-Docker" class="headerlink" title="容器的好管家 Docker"></a>容器的好管家 Docker</h3><p><strong>2014 年 6 月</strong>，Docker 基于 LXC 发布了第一个正式版本 v1.0。</p><p>Docker 是为系统运维而生，它大大降低了软件安装、部署的成本。软件的安装之所以是个麻烦事，是因为</p><ul><li><p><strong>软件之间存在依赖关系</strong>。比如，Linux 上依赖标准 C 库 glibc，依赖密码学库 OpenSSL，依赖 Java 运行环境；Windows 上依赖 .NET Framework，依赖 Flash 播放器。如果每个软件都带上它所有的依赖，那就太臃肿了，如何找到并安装软件的依赖，是一门大学问，也是各个 Linux 发行版的特色所在。</p></li><li><p><strong>软件之间存在冲突</strong>。比如，程序 A 依赖 glibc 2.13，而程序 B 依赖 glibc 2.14；甲脚本需要 Python 3，乙脚本需要 Python 2；Apache 和 Nginx 两个 Web 服务器都想要监听 80 端口。互相冲突的软件安装在同一个系统里，总是容易带来一些混乱，比如 Windows 早期的 DLL Hell。解决软件冲突之道就是隔离，让多个版本在系统里共存，并提供方法来找到匹配的版本。</p></li></ul><p>我们看看 Docker 如何解决这两个问题：</p><ol><li>把软件的所有依赖关系和运行环境打包在一个镜像里，而不是使用复杂的脚本来在未知的环境里 “安装” 软件；</li><li>这个包含了所有依赖的包一定很大，因此 Docker 的镜像是层次化的，即应用程序的镜像一般是基于基本系统镜像，只需要传输和存储增量部分就行了，这依赖于Linux 的 AUFS（Another Union File System）。<br><img alt data-src="https://ring0.me/images/2014/12/f584cb21ff9e39a0164bfc6e7b54900a.png"></li><li>Docker 使用基于容器的虚拟化，把每个软件运行在独立的容器里，避免了不同软件的文件系统路径冲突和运行时的资源冲突。<br><img alt data-src="https://ring0.me/images/2014/12/bd0b38dbee5e3dd50e89367a440fc6bf.png"></li></ol><p>Docker 最开始基于 LXC 实现，后来则是基于 libcontainer。libcontainer 和 LXC 事实上都是基于 Linux 内核提供的 cgroups 资源审计、chroot 文件系统隔离、命名空间隔离等机制。</p><h3 id="云原生时代-Kubernetes"><a href="#云原生时代-Kubernetes" class="headerlink" title="云原生时代 Kubernetes"></a>云原生时代 Kubernetes</h3><p><strong>2015 年 7 月 21 日</strong>：Kubernetes v1.0 发布！进入云原生时代。</p><p><img alt 
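data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-18_k8s-arch.jpg"></p><hr><p>The history above, starting in the 1940s, is mostly about compute virtualization, that is, CPU virtualization. CPU virtualization is the core of the core, but the virtualization of the other parts of a computer matters too: memory virtualization, and I/O virtualization, which includes storage and networking.</p><h2 id="内存虚拟化"><a href="#内存虚拟化" class="headerlink" title="内存虚拟化"></a>Memory Virtualization</h2><h3 id="Virtual-Memory"><a href="#Virtual-Memory" class="headerlink" title="Virtual Memory"></a>Virtual Memory</h3><p>When discussing the IBM M44/44X, the ancestor of virtualization, we mentioned that it introduced the concept of "paging". Each task (virtual machine) appears to own the entire memory space, and the paging mechanism maps the memory addresses of different tasks onto physical memory. When physical memory runs short, the operating system swaps the memory of rarely used tasks out to external storage such as disk and loads it back when the task needs to run again (this mechanism was invented later). As a result, developers need not care how large physical memory is or whether the memory addresses of different tasks collide.</p><p>Every computer we use today has paging. An application (a user-space process) sees a vast expanse of virtual memory, as if it owned the whole machine. The operating system sets up the mapping from each user-space process's virtual memory to physical memory. While a user-space program runs, the MMU (Memory Management Unit) in the CPU looks up that mapping (the page table) and translates the virtual addresses in instructions into physical addresses.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-18_virtual-memory.png"></p><p>What we want to discuss here is not this kind of virtual memory but <strong>memory virtualization for virtual machines</strong>. The two are the same in nature, and once virtual memory is understood, memory virtualization is much easier to follow.</p><p>Memory virtualization is likewise divided into <strong>software-based memory virtualization</strong> and <strong>hardware-assisted memory virtualization</strong>. The common software-based technique is the <strong>shadow page table</strong>, and the hardware-assisted technique is Intel's <strong>EPT (Extended Page Table)</strong>.</p><h3 id="Shadow-Page-Table"><a href="#Shadow-Page-Table" class="headerlink" title="Shadow Page Table"></a>Shadow Page Table</h3><p>The goal of software memory virtualization is to translate a guest virtual address (Guest Virtual Address, GVA) into a host physical address (Host Physical Address, HPA). Along the way the address passes through a guest physical address (Guest Physical Address, GPA) and a host virtual address (Host Virtual Address, HVA), that is:</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-18_memory-virtualization.png"></p><p>The first step, GVA to GPA, is done by the guest's page tables. The middle step, GPA to HVA, is done by a mapping table defined by the VMM (recorded in the kvm_memory_slot data structure), which can map contiguous guest physical addresses onto non-contiguous host virtual addresses. The last step, HVA to HPA, is done by the host's page tables. See the figure below.</p><p><img alt="Shadow Page Table" 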
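data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-04-07_shadow-page-table.png"></p><p>This serves two purposes:</p><ol><li>To give each virtual machine a contiguous physical memory space that starts at zero.</li><li>To isolate, schedule and share memory effectively among virtual machines.</li></ol><p>As we can see, with this traditional approach every memory access by the virtual machine requires the VMM to step in and perform several address translations in software, which is very inefficient. This is why shadow page tables and EPT were introduced.</p><p><strong>Shadow page tables simplify address translation by mapping the guest virtual address space directly onto the host physical address space.</strong></p><p>To achieve this mapping, a set of shadow page tables is built to mirror the guest's page tables and is loaded into the host MMU. When the guest accesses memory, the MMU uses the shadow page tables to translate GVA directly to HPA. Maintaining the shadow page tables is the VMM's job.</p><p>Because every process in the guest has its own virtual address space, the VMM has to maintain a separate set of shadow page tables for the page tables of each guest process. Only when a guest process accesses memory is its shadow page table loaded into the host MMU to perform the translation.</p><p>Although this reduces the number of translation steps, it is still a pure software implementation and still not efficient, and the VMM carries too much of the burden of maintaining shadow page tables, which is not a good design.</p><p>To improve on this, hardware-based memory virtualization was introduced. It hands the tedious work to the hardware and greatly improves efficiency.</p><h3 id="Extended-Page-Table"><a href="#Extended-Page-Table" class="headerlink" title="Extended Page Table"></a>Extended Page Table</h3><p>The figure below shows the basic principle of EPT. On top of the original address mapping through the CR3 page tables, EPT adds EPT page tables that implement a second level of mapping, so that both translations in GVA-&gt;GPA-&gt;HPA are performed by hardware.</p><p><img alt="Extended Page Table" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-04-07_ept.png"></p><p>A small example illustrates the whole translation. Suppose a process in the guest needs to access memory. The CPU first walks the guest page tables pointed to by CR3 to translate GVA to GPA. If a GPA is found, the CPU then uses the EPT page tables to translate GPA to HPA (in practice the CPU first checks the hardware EPT TLB or caches and walks the EPT page tables only if no translation is cached). If no HPA is found, the CPU raises an EPT Violation, which is handled by the VMM.</p><p>If no GPA is found, that is, on a page fault in the guest, the CPU raises a page fault exception. Note that the software approach would trigger a VM-exit here, whereas the hardware approach does not: the fault is handled like an ordinary page fault, which means it is passed to the guest kernel's fault handler.</p><p>If the guest then touches a guest physical address that has no EPT mapping yet, an EXIT_REASON_EPT_VIOLATION is raised and the guest exits. The VMM intercepts it, allocates physical memory, establishes the GPA to HPA mapping and stores it in the EPT, so that the next access can complete the translation from GVA to HPA.</p><p>Some may worry that the extra level of mapping slows memory access down. In fact, whether or not second level address translation (SLAT) is enabled, the Translation Lookaside Buffer (TLB) caches the mapping from virtual address (VA) to machine address (MA). If the TLB hit rate is high, the extra level of translation has no significant impact on memory access performance.</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-18_memory-visual-tlb.png"></p><p>Compared with shadow page tables (SPT), EPT brings two improvements:</p><ul><li>Page 
Faults and similar events inside the guest no longer cause a VM-Exit, which greatly reduces the number of VM-Exits and improves performance.</li><li>EPT needs to maintain only one EPT page table, instead of one shadow page table for the page table of every guest process, which also reduces memory overhead.</li></ul><h2 id="I-O虚拟化"><a href="#I-O虚拟化" class="headerlink" title="I/O虚拟化"></a>I/O Virtualization</h2><p>Let us first review the I/O model:</p><p><img alt="Interactions With I/O Devices" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-04-07_interaction-with-io-devices.png"> </p><h3 id="全虚拟化-QEMU"><a href="#全虚拟化-QEMU" class="headerlink" title="全虚拟化 QEMU"></a>Full Virtualization: QEMU</h3><p>The figure below shows how QEMU <strong>emulates I/O devices purely in software</strong>:</p><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-18_io-qemu.png"></p><ul><li>When a device driver in the guest <strong>issues an I/O request</strong>, the <strong>I/O Trap Code in the KVM module intercepts the request</strong>, processes it, places the information about the request in the <strong>I/O sharing page</strong>, and notifies QEMU in user space.</li><li>QEMU reads the details of the I/O operation from the I/O sharing page and hands them to the <strong>hardware emulation code (QEMU I/O Emulation Code)</strong> to emulate the operation.</li><li>The emulation code <strong>interacts with the real device driver to carry out the I/O operation</strong>, obtains the result and puts it back in the I/O sharing page.</li><li>Finally, the I/O Trap Code in KVM reads the result from the I/O sharing page and <strong>returns it to the guest</strong>.</li></ul><p>Points to note:</p><ul><li>Since the <strong>guest</strong> runs as a QEMU process, it <strong>may also block while waiting for I/O</strong>.</li><li>When the <strong>guest accesses large blocks of I/O through DMA</strong>, QEMU does not place the result in the I/O sharing page. Instead it writes the result directly into guest memory through <strong>memory mapping</strong> and then tells the guest, via the KVM module, that the DMA operation has completed.</li></ul><p><strong>Pros and cons</strong></p><ul><li>Pros: all kinds of hardware devices can be emulated in software <strong>without modifying the guest operating system</strong>.</li><li>Cons: each <strong>I/O operation takes a long path</strong>, with many <code>VM-Entry</code> and <code>VM-Exit</code> events, <strong>multiple context switches</strong> and <strong>multiple data copies</strong>, so performance is poor.</li></ul><h3 id="半虚拟化-VirtIO"><a href="#半虚拟化-VirtIO" class="headerlink" title="半虚拟化 VirtIO"></a>Paravirtualization: VirtIO</h3><p>Paravirtualization is implemented with <code>virtio</code>. Front-end drivers (block device driver, network device driver, PCI device driver and so on) are installed in the guest OS, QEMU hosts the back-end drivers, and the two sides communicate through the virtio-ring. This approach avoids frequent context switches and reduces the number of memory copies, so I/O is efficient. It is currently the mainstream choice for virtual machines in public clouds.</p><p><img alt 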
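data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-18_virtio-vring.png"></p><p>Virtio is divided into <strong>front-end drivers</strong> and <strong>back-end drivers</strong>:</p><ul><li><strong>Front-end driver</strong>: a <strong>driver module</strong> in the <strong>guest kernel</strong>, such as <code>virtio_blk</code> or <code>virtio_net</code>.</li><li><strong>Back-end driver</strong>: implemented in <strong>QEMU</strong>, in <strong>host user space</strong>.</li></ul><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-18_virtio-overview.jpg"></p><p>Between the front-end and back-end drivers, two more layers are defined to support communication between the guest and QEMU:</p><ul><li><strong>The virtio layer</strong>: the <strong>virtual queue interface</strong>, which conceptually attaches a front-end driver to a back-end handler. <strong>A front-end driver can use zero or more queues</strong>, depending on its needs.</li></ul><blockquote><p>For example, the <code>virtio_net</code> network driver uses <strong>two virtual queues (receive and transmit)</strong>, while the <code>virtio_blk</code> driver uses only <strong>one virtual queue</strong>.<br>A virtual queue is in effect the junction between the guest operating system and the hypervisor. It can be implemented in any way, provided that the guest operating system and the virtio back end both follow the same standard and implement it in matching ways.</p></blockquote><ul><li><strong>The virtio-ring layer</strong>: implements a <strong>ring buffer</strong> that holds the information exchanged between the front-end driver and the back-end handler. It can <strong>store several I/O requests from the front-end driver at once and hand them to the back-end driver for batch processing</strong>, which finally calls the device driver on the host to perform the physical I/O.</li></ul><blockquote><p>This allows requests to be <strong>processed in batches</strong> by agreement, instead of one at a time for every I/O request in the guest, which <strong>makes the exchange of information between the guest and the hypervisor more efficient</strong>.</p></blockquote><p>Pros and cons</p><ul><li>Pros: very good I/O performance, close to native. When using KVM, Virtio is therefore generally recommended for higher I/O performance whenever both the host and the guest support it.</li><li>Cons: a front-end driver must be installed in the guest, and data must be transferred in the format Virtio specifies.</li></ul><p>In its pursuit of performance, virtio-based paravirtualization has gone through three successive designs: virtio-net, vhost-net and vhost-user.</p><h4 id="virtio-net"><a href="#virtio-net" class="headerlink" title="virtio-net"></a>virtio-net</h4><p>As shown in the figure below, KVM is the kernel module that provides virtualized hardware to programs; QEMU uses KVM to emulate the VM's runtime environment, including processors and peripherals; Tap is a virtual Ethernet device in the kernel, which can loosely be thought of as the VM's attachment point to the kernel bridge.</p><p><img alt="图2 virtio-net.png" data-src="https://ictyangye.github.io/assets/picture/virtio1.png"></p><p>When the guest sends a packet, it uses the notification mechanism to notify KVM and exits to the QEMU process in user space, and QEMU then reads from or writes to the Tap device (note that QEMU is the main process in which the VM runs, which is why we speak of an exit). 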
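In this model, <strong>there are many context switches among the host, the guest and QEMU, along with frequent data copies and CPU privilege level switches</strong>, so performance is unsatisfactory. The function call path is as follows:</p><p><img alt="图3 virtio-net数据包处理调用流程.png" data-src="https://ictyangye.github.io/assets/picture/virtio2.jpg"></p><p><strong>Two packet copies</strong> create a performance bottleneck, and the notification path is too long: when a packet reaches Tap the kernel notifies QEMU, QEMU asks KVM for an interrupt through an ioctl, and KVM delivers the interrupt to the guest.</p><h4 id="vhost-net"><a href="#vhost-net" class="headerlink" title="vhost-net"></a>vhost-net</h4><p>The optimization over virtio-net is to free QEMU from handling the queues. A vhost-net kernel module is implemented directly on the host to act as the dedicated virtio back end, which reduces context switches and packet copies. Its structure is shown in the figure below, taking packet reception as an example. On the data path, packets are received directly from the Tap device and copied by the vhost-net kernel module into the data area of the virtual queue, from which the guest receives them. On the notification path, when a packet arrives at vhost-net from the Tap device, an interrupt is sent to the guest through KVM to tell it to receive the packet.</p><p><img alt="图4 vhost-net.png" data-src="https://ictyangye.github.io/assets/picture/virtio3.png"></p><p><strong>On the data path, vhost-net reduces memory copies, but because its back end runs in kernel mode, a performance bottleneck remains.</strong></p><h4 id="vhost-user"><a href="#vhost-user" class="headerlink" title="vhost-user"></a>vhost-user</h4><p>vhost-user is high-performance paravirtualized network I/O implemented with a DPDK user-space back end. It works much like vhost-net, but the whole back end, including the OVS (Open vSwitch) datapath, is placed in user space so that it can make better use of DPDK acceleration. However, because the OVS process is a user-space process and has no permission to access guest memory, shared memory has to be used: when the guest starts, its memory layout and the virtual queue information of its virtio devices are passed to OVS in advance over a socket. OVS then sets up shared memory for each VM and can implement in user space what the vhost-net kernel module does.</p><p><img alt="图5 vhost-user.png" data-src="https://ictyangye.github.io/assets/picture/virtio4.png"></p><h4 id="vDPA加速的vhost-user"><a href="#vDPA加速的vhost-user" class="headerlink" title="vDPA加速的vhost-user"></a>vhost-user Accelerated by vDPA</h4><p>The DPDK-accelerated vhost-user design still has one memory copy, and that copy is the last remaining bottleneck of paravirtualization. Intel introduced a hardware solution that lets the NIC interact directly with the virtio virtual queues inside the guest and DMA packets into guest buffers, achieving true <strong>zero copy</strong> while still conforming to the virtio standard.</p><p><img alt="图6 vDPA.png" data-src="https://ictyangye.github.io/assets/picture/virtio5.png"></p><p>DPDK releases from 18.05 onward include a vDPA feature that can be enabled.</p><h3 id="PCI-Pass-through"><a href="#PCI-Pass-through" class="headerlink" title="PCI Pass-through"></a>PCI Pass-through</h3><p>Besides full virtualization and paravirtualization, there is a third way: letting the guest operate the hardware directly, with the hypervisor largely out of the I/O path. This relies on IOMMU support such as Intel VT-d or AMD-Vi. QEMU/KVM running on a VT-d platform can assign NICs, disk controllers, USB controllers, VGA graphics cards and other devices to a guest for direct use.</p><ul><li>Pros: I/O operations cause far fewer VM-Exits into the hypervisor, 
or none at all, which greatly improves performance.</li><li>Cons: space on the motherboard is limited, so only a limited number of PCI and PCIe devices can be added, and cost rises as hardware is added.</li></ul><p><img alt data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2021-03-19_pci-pass-through.png"></p><p>The pursuit of performance never ends. Beyond the full virtualization and paravirtualization approaches to I/O described above, there is this rather extreme approach: let the physical device pass through the host and the virtualization layer and be used directly by the guest, which usually yields near-native performance.</p><p>Its main drawbacks are: <strong>1. Hardware resources are expensive and limited.</strong> <strong>2. Live migration is a problem: the host does not know the device's internal runtime state, so that state cannot be migrated or restored.</strong></p><p>DPDK addresses both problems to some degree. In addition, there is a hardware-based way of turning a PF (Physical Function) into VFs (Virtual Functions), which means virtualization already exists at the NIC level: one NIC PF is virtualized into dozens or hundreds of VFs, and different VFs can be passed through to different virtual machines. This is the familiar SR-IOV.</p><p>For I/O pass-through in a virtualized environment, the most serious problem is no longer performance but flexibility. There is no software layer between the guest and the NIC, which means there is no I/O stack to do switching and forwarding, and therefore no software switch. How, then, can switching be done inside a single server? The industry's main answer is to push switching entirely down to the NIC and implement virtual switching on a SmartNIC. That raises another issue, the trade-off between cost and performance.</p><p><img alt="图7 SR-IOV.png" data-src="https://ictyangye.github.io/assets/picture/virtio6.png"></p><p>DPDK releases from 18.05 onward also seem to address this flexibility problem. To make full use of the flow-level capabilities of standard NICs (as opposed to SmartNICs), DPDK introduced the VF representer. Flow table rules from OVS can be pushed directly down to the NIC so that the NIC switches traffic between VFs, giving an efficient and flexible virtualized network configuration.</p><h2 id="写在最后"><a href="#写在最后" class="headerlink" title="写在最后"></a>Closing Thoughts</h2><p>This post is an overview of virtualization and the first in the <a href="../../tags/虚拟化">virtualization technology series</a>. An opening overview gives a basic picture of the whole field; the technical details involved are, without question, many and varied. With the general direction in hand, later posts in this series can expand into network virtualization, storage virtualization, GPU virtualization and more. Whatever the details, what we are doing is always abstraction.</p><p>Looking across the history of virtualization, its constant goal has been to make full use of IT resources. Virtualization is in essence an abstraction of IT resources. Continuing along that road, we saw cloud computing come to fruition as a higher-level abstraction of business capabilities. Beyond abstraction, we also keep seeing software and hardware combining with and replacing each other along the way. Software and hardware are just different paths to the same thing, and which path to take depends on where we want to go.</p><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>References</h2><ul><li><a href="http://www.kernelthread.com/publications/virtualization" target="_blank" rel="external nofollow noopener noreferrer">http://www.kernelthread.com/publications/virtualization</a></li><li><a href="https://blog.csdn.net/Jmilk/article/details/99675664" target="_blank" rel="external nofollow noopener noreferrer">https://blog.csdn.net/Jmilk/article/details/99675664</a></li><li><a href="https://ring0.me/2014/12/virtualization-overview" target="_blank" rel="external nofollow noopener 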
noreferrer">https://ring0.me/2014/12/virtualization-overview</a></li><li><a href="https://developer.ibm.com/tutorials/l-pci-passthrough" target="_blank" rel="external nofollow noopener noreferrer">https://developer.ibm.com/tutorials/l-pci-passthrough</a></li><li><a href="https://developer.ibm.com/technologies/linux/articles/l-virtio" target="_blank" rel="external nofollow noopener noreferrer">https://developer.ibm.com/technologies/linux/articles/l-virtio</a></li><li><a href="https://developer.ibm.com/tutorials/l-hypervisor" target="_blank" rel="external nofollow noopener noreferrer">https://developer.ibm.com/tutorials/l-hypervisor</a></li><li><a href="https://compas.cs.stonybrook.edu/~nhonarmand/courses/sp17/cse506/slides/io_virtualization.pdf" target="_blank" rel="external nofollow noopener noreferrer">https://compas.cs.stonybrook.edu/~nhonarmand/courses/sp17/cse506/slides/io_virtualization.pdf</a></li></ul>]]></content>
    
    <summary type="html">
    
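      &lt;p&gt;The essence of virtualization is &lt;strong&gt;abstraction&lt;/strong&gt;, and virtualization technology is at heart a technology for &lt;strong&gt;resource management and optimization&lt;/strong&gt;. It abstracts a computer's physical resources, such as the &lt;strong&gt;CPU&lt;/strong&gt;, &lt;strong&gt;memory&lt;/strong&gt;, and disk space, network adapters and other &lt;strong&gt;I/O&lt;/strong&gt; devices, and presents them as a configuration environment of multiple computers that can be partitioned and freely recombined. With virtualization, compute, network, storage and other hardware resources are put to better use, and the virtual forms of these resources are no longer bound by how they are deployed, where they are located, or how they are physically configured.&lt;/p&gt;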
    
    </summary>
    
    <content src="https://houmin.cc/https://upload.wikimedia.org/wikipedia/commons/e/e1/Hyperviseur.png" type="image" />
    
    
      <category term="术业专攻" scheme="https://houmin.cc/categories/%E6%9C%AF%E4%B8%9A%E4%B8%93%E6%94%BB/"/>
    
    
      <category term="虚拟化" scheme="https://houmin.cc/tags/%E8%99%9A%E6%8B%9F%E5%8C%96/"/>
    
      <category term="hypervisor" scheme="https://houmin.cc/tags/hypervisor/"/>
    
      <category term="VMM" scheme="https://houmin.cc/tags/VMM/"/>
    
      <category term="云计算" scheme="https://houmin.cc/tags/%E4%BA%91%E8%AE%A1%E7%AE%97/"/>
    
  </entry>
  
  <entry>
    <title>Key Numbers Every Programmer Should Know</title>
    <link href="https://houmin.cc/posts/fb3d782a/"/>
    <id>https://houmin.cc/posts/fb3d782a/</id>
    <published>2020-03-10T07:36:30.000Z</published>
    <updated>2022-11-09T15:13:45.390Z</updated>
    
    <content type="html"><![CDATA[<link rel="stylesheet" class="aplayer-secondary-style-marker" href="/assets/css/APlayer.min.css"><script src="/assets/js/APlayer.min.js" class="aplayer-secondary-script-marker"></script><script class="meting-secondary-script-marker" src="/assets/js/Meting.min.js"></script><p>This post collects the key numbers every programmer should know. The cover image comes from the <a href="https://colin-scott.github.io/personal_website/research/interactive_latency.html" target="_blank" rel="external nofollow noopener noreferrer">interactive chart from Berkeley, updated by year</a>, which vividly visualizes how the latency of various operations has changed over the years.</p><p><img alt="Key Numbers" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-03-10_key-numbers.png"></p><a id="more"></a><h2 id="数据变化"><a href="#数据变化" class="headerlink" title="数据变化"></a>How the Numbers Have Changed</h2><p>Here are the figures for 2020:</p><figure class="highlight angelscript"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line">           <span class="number">1</span>   ns - CPU L1 CACHE <span class="built_in">ref</span>erence</span><br><span class="line">           <span class="number">1</span>   ns - speed-of-light (a photon) travel a <span class="number">1</span> ft (<span class="number">30.5</span>cm) distance</span><br><span class="line">           <span class="number">3</span>   ns - CPU L1 CACHE Branch 
mispredict</span><br><span class="line">           <span class="number">4</span>   ns - CPU L2 CACHE <span class="built_in">ref</span>erence</span><br><span class="line">          <span class="number">17</span>   ns - MUTEX lock/unlock</span><br><span class="line">          <span class="number">44</span>   ns - Send <span class="number">2</span>K bytes over Commodity NETWORK</span><br><span class="line">          <span class="number">71</span>   ns - CPU cross-QPI/NUMA best  <span class="keyword">case</span> on XEON E5<span class="number">-46</span>*</span><br><span class="line">         <span class="number">100</span>   ns - own DDR MEMORY <span class="built_in">ref</span>erence</span><br><span class="line">         <span class="number">135</span>   ns - CPU cross-QPI/NUMA best  <span class="keyword">case</span> on XEON E7-*</span><br><span class="line">         <span class="number">202</span>   ns - CPU cross-QPI/NUMA worst <span class="keyword">case</span> on XEON E7-*</span><br><span class="line">         <span class="number">325</span>   ns - CPU cross-QPI/NUMA worst <span class="keyword">case</span> on XEON E5<span class="number">-46</span>*</span><br><span class="line">       <span class="number">2</span>,<span class="number">000</span>   ns - Compress <span class="number">1</span>K bytes with Zippy PROCESS</span><br><span class="line">       <span class="number">3</span>,<span class="number">000</span>   ns - Read <span class="number">1</span> MB sequentially <span class="keyword">from</span> MEMORY</span><br><span class="line">      <span class="number">49</span>,<span class="number">000</span>   ns - Read <span class="number">1</span> MB sequentially <span class="keyword">from</span> SSD</span><br><span class="line">     <span class="number">825</span>,<span class="number">000</span>   ns - Read <span class="number">1</span> MB sequentially <span class="keyword">from</span> DISK</span><br><span class="line">     <span class="number">500</span>,<span 
class="number">000</span>   ns - Round trip within a same DataCenter</span><br><span class="line">   <span class="number">2</span>,<span class="number">000</span>,<span class="number">000</span>   ns - DISK seek</span><br><span class="line"> <span class="number">150</span>,<span class="number">000</span>,<span class="number">000</span>   ns - Send a NETWORK packet CA -&gt; Netherlands</span><br><span class="line">|   |   |   |</span><br><span class="line">|   |   | ns|</span><br><span class="line">|   | us|</span><br><span class="line">| ms|</span><br></pre></td></tr></table></figure><p>From Berkeley's year-by-year data we can conclude the following:</p><ul><li>Since 2005 the numbers have come down somewhat, but these have stayed essentially stable:<ul><li>L1 and L2 cache accesses remain on the order of nanoseconds.</li><li>The cost of a mutex remains around 17 ns (the cost of mutexes deserves a separate discussion later).</li><li>The cost of accessing local memory remains around 100 ns.</li><li>The RTT within one data center remains at 500 us.</li><li>The RTT from California to the Netherlands remains at 150 ms.</li></ul></li></ul><figure class="highlight angelscript"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">    <span class="number">1</span> ns        L1 cache</span><br><span class="line">    <span class="number">3</span> ns        Branch mispredict</span><br><span class="line">    <span class="number">4</span> ns        L2 cache</span><br><span class="line">   <span class="number">17</span> ns        Mutex lock/unlock</span><br><span class="line">  <span class="number">100</span> ns        Main memory (RAM)</span><br><span class="line"><span class="number">2</span> <span class="number">000</span> ns (<span class="number">2</span>µs)  <span class="number">1</span>KB Zippy-compress</span><br></pre></td></tr></table></figure><ul><li>Many other figures have improved enormously:<ul><li>Sending 2 KB over the network improved from 8,000 ns in 2005 to 44 ns today.</li><li>Reading 1 MB sequentially from memory improved from 95,000 ns in 2005 to 3,000 ns today.</li><li>Reading 1 MB sequentially from SSD improved from 2,000,000 ns in 2005 to 49,000 ns today, that is, from 2 ms to the order of 50 us 
.</li><li>Reading 1 MB sequentially from disk improved from 7,000,000 ns in 2005 to 825,000 ns today, that is, from 7 ms to the order of 800 us.</li></ul></li></ul><h2 id="数据理解"><a href="#数据理解" class="headerlink" title="数据理解"></a>Making Sense of the Numbers</h2><p>Let us now look at these numbers qualitatively.</p><p>The speed gaps between <strong>memory</strong>, <strong>SSD</strong>, <strong>disk</strong> and <strong>network</strong> are huge. Roughly speaking, using the 2020 sequential 1 MB read figures above:</p><ul><li>SSD is about 16 times slower than memory (49,000 ns versus 3,000 ns).</li><li>Disk is about 300 times slower than memory and about 17 times slower than SSD (825,000 ns versus 3,000 ns and 49,000 ns).</li><li>By the author's rough estimate, the network is about 100,000 times slower than memory and about 200 times slower than disk.</li></ul><h3 id="Clock"><a href="#Clock" class="headerlink" title="Clock"></a>Clock</h3><p>At first, computers were made faster by raising the CPU clock frequency. Once frequencies reached about 3 GHz they became hard to push any higher, so the latency of accessing caches and memory has barely changed since.</p><figure class="highlight angelscript"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line">Core i7 Xeon <span class="number">5500</span> Series Data Source Latency (approximate)               [Pg. 
<span class="number">22</span>]</span><br><span class="line"></span><br><span class="line">local  L1 CACHE hit,                              ~<span class="number">4</span> cycles (   <span class="number">2.1</span> -  <span class="number">1.2</span> ns )</span><br><span class="line">local  L2 CACHE hit,                             ~<span class="number">10</span> cycles (   <span class="number">5.3</span> -  <span class="number">3.0</span> ns )</span><br><span class="line">local  L3 CACHE hit, line unshared               ~<span class="number">40</span> cycles (  <span class="number">21.4</span> - <span class="number">12.0</span> ns )</span><br><span class="line">local  L3 CACHE hit, <span class="keyword">shared</span> line <span class="keyword">in</span> another core ~<span class="number">65</span> cycles (  <span class="number">34.8</span> - <span class="number">19.5</span> ns )</span><br><span class="line">local  L3 CACHE hit, modified <span class="keyword">in</span> another core    ~<span class="number">75</span> cycles (  <span class="number">40.2</span> - <span class="number">22.5</span> ns )</span><br><span class="line"></span><br><span class="line">remote L3 CACHE (Ref: Fig<span class="number">.1</span> [Pg. 
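<span class="number">5</span>])        ~<span class="number">100</span><span class="number">-300</span> cycles ( <span class="number">160.7</span> - <span class="number">30.0</span> ns )</span><br><span class="line"></span><br><span class="line">local  DRAM                                                   ~<span class="number">60</span> ns</span><br><span class="line">remote DRAM                                                  ~<span class="number">100</span> ns</span><br></pre></td></tr></table></figure><h3 id="NIC"><a href="#NIC" class="headerlink" title="NIC"></a>NIC</h3><p>NICs keep getting faster, from the early 10 Gb NICs to today's 100 Gb NICs.</p><p>The higher the network bandwidth, the lower the <strong>transmission delay</strong>.</p><h3 id="RTT"><a href="#RTT" class="headerlink" title="RTT"></a>RTT</h3><p>The round trip within the same data center and the packet round trip from CA to the Netherlands have not changed at all, staying at 500 us and 150 ms. The reason is easy to see: signals travel through fiber at close to the speed of light, so this time is set by physics. What we are talking about here is <strong>propagation delay</strong>.</p><h3 id="SSD"><a href="#SSD" class="headerlink" title="SSD"></a>SSD</h3><p>SSD random read latency changed little from 1990 to 2019, improving only from 19 us to 16 us, whereas sequential read time improved enormously, from 50 ms to 49 us.</p><h3 id="DISK"><a href="#DISK" class="headerlink" title="DISK"></a>DISK</h3><p>Since 2006, the values in the first two columns of the chart have stopped changing and only the last two columns still change, which shows that storage media have become considerably faster over the past decade.</p><h3 id="Mutex"><a href="#Mutex" class="headerlink" title="Mutex"></a>Mutex</h3><p>A mutex lock or unlock costs 17 ns. (So locking and unlocking by themselves are cheap; it is heavy lock contention that is expensive. The remedy is to reduce lock granularity so that each lock protects only a small portion of the data.)</p><h3 id="写的代价是很昂贵的"><a href="#写的代价是很昂贵的" class="headerlink" title="写的代价是很昂贵的"></a>Writes Are Expensive</h3><ul><li>Data stores are transactional: a write requires disk access.</li><li>Disk access means a disk seek.</li><li>Rule of thumb: one disk seek typically costs about 10 ms.</li><li>A quick calculation: 1 s / 10 ms = 100 seeks / sec, so a disk can perform at most about 100 seeks per second.</li></ul><p>Given this rule, always keep your data size and data structures in mind, and think in batches: batch your writes and batch your reads.</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><ul><li><p><a href="https://stackoverflow.com/questions/4087280/approximate-cost-to-access-various-caches-and-main-memory" target="_blank" rel="external nofollow noopener 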
noreferrer">https://stackoverflow.com/questions/4087280/approximate-cost-to-access-various-caches-and-main-memory</a></p></li><li><p><a href="https://colin-scott.github.io/personal_website/research/interactive_latency.html" target="_blank" rel="external nofollow noopener noreferrer">https://colin-scott.github.io/personal_website/research/interactive_latency.html</a></p></li><li><p><a href="http://www.eecs.berkeley.edu/~rcs/research/hw_trends.xlsx" target="_blank" rel="external nofollow noopener noreferrer">http://www.eecs.berkeley.edu/~rcs/research/hw_trends.xlsx</a></p></li><li><p><a href="https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf" target="_blank" rel="external nofollow noopener noreferrer">https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf</a></p></li></ul>]]></content>
    
    <summary type="html">
    
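      &lt;p&gt;This post collects the key numbers every programmer should know. The cover image comes from the &lt;a href=&quot;https://colin-scott.github.io/personal_website/research/interactive_latency.html&quot; target=&quot;_blank&quot; rel=&quot;external nofollow noopener noreferrer&quot;&gt;interactive chart from Berkeley, updated by year&lt;/a&gt;, which vividly visualizes how the latency of various operations has changed over the years.&lt;/p&gt;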
&lt;p&gt;&lt;img src=&quot;https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-03-10_key-numbers.png&quot; alt=&quot;Key Numbers&quot;&gt;&lt;/p&gt;
    
    </summary>
    
    
      <category term="术业专攻" scheme="https://houmin.cc/categories/%E6%9C%AF%E4%B8%9A%E4%B8%93%E6%94%BB/"/>
    
    
      <category term="计算机" scheme="https://houmin.cc/tags/%E8%AE%A1%E7%AE%97%E6%9C%BA/"/>
    
      <category term="latency" scheme="https://houmin.cc/tags/latency/"/>
    
  </entry>
  
  <entry>
    <title>The Big Short</title>
    <link href="https://houmin.cc/posts/787197ce/"/>
    <id>https://houmin.cc/posts/787197ce/</id>
    <published>2020-02-20T09:28:22.000Z</published>
    <updated>2022-11-09T15:13:45.390Z</updated>
    
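    <content type="html"><![CDATA[<link rel="stylesheet" class="aplayer-secondary-style-marker" href="/assets/css/APlayer.min.css"><script src="/assets/js/APlayer.min.js" class="aplayer-secondary-script-marker"></script><script class="meting-secondary-script-marker" src="/assets/js/Meting.min.js"></script><p>Yes, this is another newly opened column, "Capital Never Sleeps". It will collect trading notes on the stock market, such as <code>MACD</code> from the first issue, and it will also cover all kinds of interesting things in the world of capital. This issue, for example, uses the film <code>The Big Short</code> to review the global economic crisis of 2007 to 2008 that was triggered by the <code>subprime mortgage crisis</code>. If I get the chance, I will use this column to review my own daily trades in the stock market (I would very much like to start that, provided I can explain the reasoning behind every trade I make).</p><p>Oh, by the way, semiconductor, new energy and other tech stocks on the A-share market have been going wild lately. Heh, foolish humans.</p><a id="more"></a><h2 id="大崩盘"><a href="#大崩盘" class="headerlink" title="大崩盘"></a>The Great Crash</h2><p>Let us look back at the great crash of that year.</p><div class="video-container"><iframe src="//www.youtube.com/embed/oyiCuAVcQDs" frameborder="0" allowfullscreen></iframe></div><p>It was a financial storm that swept the globe. Building on everything that had happened over the previous hundred years, it peaked in 2008 and changed the course of countless lives around the world. Twelve years on, we can still feel its aftershocks: Brexit and the election of Trump are cases in point, and you can sense the impact of that crisis even in <code>Parasite</code>, the recent Oscar winner for Best Picture.</p><p>So today I will go through it from the beginning. What actually happened in that crisis? What caused it? And what effects has it had on us?</p><h3 id="主角"><a href="#主角" class="headerlink" title="主角"></a>The Cast</h3><p>By the 2000s, after several rounds of mergers and restructuring, the US financial industry was dominated by a handful of giant firms. Below are the main players in this financial storm as I have mapped them out:</p><embed src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-02-20_america-financial-industry.svg" style="display:block;width:100%;height:100%" onclick><p>Broadly, they fall into the following groups: the five largest US investment banks and the four largest commercial banks, managing trillions of dollars in assets; the three big insurers, affecting the pensions and insurance of hundreds of millions of people; the Federal Reserve, the Treasury and the Securities and Exchange Commission, which set the rules of the game; and the rating agencies, which as third parties were supposed to be independent.</p><h3 id="剧本"><a href="#剧本" class="headerlink" title="剧本"></a>The Script</h3><p>In this crisis they each took the stage, behaved differently, and met very different ends:</p><ul><li>April 2007: companies engaged in subprime mortgage lending begin filing for bankruptcy.</li><li>October 9, 2007: the Dow Jones index reaches its highest close, <code>14164</code> points.</li><li>March 17, 2008: <code>Bear Stearns</code>, the fifth-largest US investment bank, runs out of cash, and the Federal Reserve helps <code>JPMorgan Chase</code> acquire it.</li><li>September 7, 2008: the two largest US mortgage finance companies, <code>Freddie Mac</code> and <code>Fannie Mae</code>, are taken over by the federal government.</li><li>September 15, 2008<ul><li>After the Federal Reserve refuses to guarantee its loans, <code>Lehman Brothers</code>, then ranked among the largest US investment banks, goes bankrupt, and the Dow Jones index falls 504 points, its largest drop in seven years.</li><li><code>Merrill 
Lynch</code>, another of the largest US investment banks, is acquired by Bank of America.</li></ul></li><li>September 16, 2008: <code>AIG</code>, the largest US insurer, becomes insolvent. The Federal Reserve announces an <code>$85 billion</code> short-term emergency loan and takes over AIG.</li><li>September 19, 2008: Treasury Secretary <code>Paulson</code> and Federal Reserve Chairman <code>Ben Bernanke</code> propose the Troubled Assets Relief Program (<code>TARP</code>), asking Congress to authorize <code>$700 billion</code> so that the Treasury can buy troubled assets from financial institutions. Investors react positively to the rescue plan, but the funding still has to be approved by the House of Representatives.</li><li>September 21, 2008: the two surviving investment banks, <code>Goldman Sachs</code> and <code>Morgan Stanley</code>, apply to convert from <code>investment banks</code> into <code>bank holding companies</code> in order to gain more access to Federal Reserve support.</li><li><p>September 29, 2008: the US Congress votes down the <code>Emergency Economic Stabilization Act</code>, which contained the $700 billion <code>TARP</code>, and the Dow Jones index falls about 700 points that day, its largest single-day point drop up to that time.</p></li><li><p>October 3, 2008: Congress passes the Emergency Economic Stabilization Act, under which the Treasury buys preferred shares in nine major institutions to bail them out:</p><ul><li>Citigroup: $45 billion</li><li>Bank of America: $45 billion</li><li>AIG: $40 billion</li><li>JPMorgan Chase: $25 billion</li><li>Wells Fargo: $25 billion</li><li>Goldman Sachs: $10 billion</li><li>Morgan Stanley: $10 billion</li><li>Bank of New York Mellon: $3 billion</li><li>State Street: $2 billion</li></ul></li><li>December 16, 2008: the Federal Reserve cuts the federal funds rate to a range of 0 to 0.25%.</li><li>January 2009: the three major US automakers, <code>General Motors</code>, <code>Chrysler</code> and <code>Ford</code>, seek relief under <code>TARP</code>.</li><li>March 6, 2009: the Dow Jones index touches its low of <code>6443</code> points.</li></ul><p>As I typed out the timeline of this crisis word by word, only one phrase came to mind: <code>a complete mess</code>. The crisis had clearly been sending signals as early as 2007, yet the vast majority of people chose to ignore them, and the stock market stayed feverish and even reached its peak. In the end the gray rhino arrived right on schedule and left the whole market in a mess.</p><h2 id="专业术语"><a href="#专业术语" class="headerlink" title="专业术语"></a>Terminology</h2><p>As everyone knows, the <code>subprime mortgage crisis</code> grew out of the housing bubble. But what is a <code>subprime mortgage</code>? 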
为什么房地产的泡沫会导致全球的金融海啸呢？有人说这归咎于华尔街精英们发明的各种神奇的<code>金融衍生品</code>，扔出了一堆的专有名词。绝大部分人看到这里就开始晕了，WTF，什么是 <code>MBS</code>，什么是 <code>CDO</code>，什么是 <code>CDS</code>。别怕，所有的这些就是华尔街故意发明出来的看起来高大上的一些词。相信我，所有接受过基本数学教育的人都能够弄懂他们到底在玩些什么花样。接下来，让我们来学一些术语 ：）</p><h3 id="MBS"><a href="#MBS" class="headerlink" title="MBS"></a>MBS</h3><p><strong>Mortgage Backed Securities</strong>，也即<strong>房贷抵押证券</strong></p><p>首先忽略这个看起来不明所以的名词，我们知道它和房贷有关。房贷我们很清楚，你为了买房向银行申请贷款，银行对你进行信贷审核，符合条件的人可以获得贷款。</p><p><img alt="住房贷款基本模型" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-02-20_mortgage-old-system.png"></p><p>在这种模型下，住房贷款需要花费几十年的时间才能偿还，如果借贷者信用不够，银行面临着违约风险，所以银行在发放住房贷款的时候都十分谨慎。所以，当时的银行业完全是垃圾：</p><blockquote><p>In the late 70s, banking wasn’t a job you went into to make large sums of money. It was a fucking snooze. Filled with losers, like selling insurance or accounting. And if the banking was boring, then the bond department at the bank was straight-up comatose.</p></blockquote><p>然而当时的现状是，由于婴儿潮（1946-1964年）带来的人口膨胀，导致了住房短缺。房贷需求始终存在，有需求就有供给，<code>Leiws Ranieri</code>在 1977 年发明了MBS。</p><p>MBS是什么？一句话说，就是银行或者房产中介把不同的<strong>房贷抵押</strong>卖给投资银行，投行将这些<strong>房贷抵押</strong>打包起来，发起 <code>房贷抵押债券</code> 来融资，把借款人偿还的分期本金和利息传递给投资者（债券持有人）。</p><p><img alt="MBS基本模型" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-02-20_mbs-system.png"></p><p>这种方式对银行来说当然是一笔好生意了，不仅仅把风险都转嫁给投资者，自己还回笼了资金赚了息差和服务费。</p><p>问题是，投资者真的这么傻吗？前面说的银行面临的风险此时依然存在，而且这些证券是 <code>non-agency</code>的，没有国家担保，完全不值得信赖。</p><p>问题就两个，一是谁担风险，二是谁来担保。</p><p>第一个问题：<strong>谁来担风险？</strong></p><p>投行把从商业银行、贷款公司、中介公司收集来的房贷抵押，形成一个资产池。然后找来评级机构，将这个资产池里面的每一份 MBS 做一个评级。</p><p>在这个资产池里，按照信用评分高低，将MBS分为<code>Prime</code>, <code>Alt-A,</code> <code>Subprime</code>三档。这里的 <code>Subprime</code>，说的就是次一级别的贷款，也即次级贷，我们说的<strong>次贷危机(Subprime Crisis)</strong>就是来源于此。</p><p><img alt="具有不同等级风险和回报的MBS" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-02-20_subprime-rating.jpg"></p><p>由于 Prime 和 Alt-A 的评级高，很快就被抢光了，问题是怎么对 
Subprime 这档好好设计，让它能够更好地卖出去。这里就是对 Subprime 这一档进行<strong>分层(tranching)</strong>，然后再对这些层次进行评级：AAA, AA, A, BBB, BB, B…，形成优先级、夹层、劣后级等不同风险和收益率的结构化份额。</p><p>评级越高的风险越低，期望回报也越低。反之评级越低的风险越高，期望回报也就越高（因为风险越高，银行的贷款利率也就越高，相应的期望回报也就越高）</p><p><img alt="Subprime不同评级对应的风险与回报" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-02-20_mbs-risk-return-for-investors.png"></p><p>到这一步没有任何问题，只是将不同的房贷抵押进行风险评级，以满足不同风险偏好的投资者。</p><p>第二个问题：<strong>谁来担保？</strong></p><p>这就引入了 CDO。</p><h3 id="CDO"><a href="#CDO" class="headerlink" title="CDO"></a>CDO</h3><p><strong>Collateralized Debt Obligation，担保债务凭证</strong></p><p>话说回来，评级好的 MBS ，比如 Prime 和 Alt-A 的 MBS 都很快被卖出去了，毕竟风险低且收益可观。那么 Subprime 的 MBS 怎么办呢？于是华尔街投行的精英们发明了 CDO。</p><p>CDO 是什么？它是一种带有担保性质的抵押债券，这里的担保方可以是第三方，比如AIG, Ambac, MBIA这些大型保险公司。</p><p>那么问题来了，这么高风险的tranche（贷款中的一部分，华尔街发明的晦涩难懂的词，具体就是 MBS 中的B级部分），保险公司为什么要给他们担保啊？</p><blockquote><p>投行说，数学证明了这样的资产配置风险极低，更何况现在房价蹭蹭往上涨，即便是有人违约断供了，我们还是可以收回房产来拍卖，全国的房子集体断供是不可能的。</p><p>这时候评级机构又跳出来了，“咳咳，经过我们的评估，这些整合的MBS风险极低，我们给评个 double-A, 不给triple-A的原因是怕他们骄傲。” 事实上，大部分BB/BBB级的tranches被直接提升为AAA。</p></blockquote><p>对于保险公司来说，这个数学模型看起来好像很厉害，而且房价也一直在涨，即使收不回贷款，拿房产抵押投行也可以盈利。再加上评级机构都这么捧场，自己还能拿保费，不干白不干。</p><blockquote><p>So mortgage bonds are dogshits, CDOs are dogshit wrapped in catshit.</p></blockquote><p>正如剧中人所说，CDO就像一个垃圾场，把各种垃圾倒进去，重新回收、打包、分类。经过这么一折腾，B 级评级直接被提升为 AAA 级。</p><p>厉害的是，CDO A可以包含CDO B，CDO B也可以包含CDO A，他们又可以组合成CDO C（包含CDO的CDO，CDO的平方…… $CDO^2$）。在实际的金融市场中，不仅有$CDO^2$，甚至还有$CDO^3$（包含$CDO^2$的CDO）、$CDO^4$（包含$CDO^3$的CDO）。</p><p>在那几年，华尔街精英们不断地用 CDO 打包再打包，将其作为证券卖到市场上，因为一堆风险高的资产组合收益率普遍大于一堆风险低的资产组合的收益率，房地产继续繁荣，CDO 得到了市场的广泛追捧。</p><h3 id="CDS"><a href="#CDS" class="headerlink" title="CDS"></a>CDS</h3><p><strong>Credit Default Swap，信用违约互换</strong></p><p>我们继续聊，刚才我们提到了投行让保险公司给他们担保，这实际上就是一种保险。</p><p>保险是什么？投保人给保险公司保费，如果出现了意外事故，保险公司则需要给投保人赔偿。对应到上面的 CDO 产品，投资银行为上面的 MBS 投保，如果<strong>标的资产</strong>（这里的标的资产就是住房贷款抵押）没有违约，投保人（也就是投资银行，Protection Buyer）需要按季度给承保人（也就是保险公司，Protection 
Seller）支付保费。这对保险公司来说听起来是一件很棒的事情，毕竟房地产行情持续火热，住房贷款在同一时间集体违约的概率又那么小，CDO 怎么会违约呢？这种保费不赚白不赚。</p><p>到现在为止，我们还只是在谈保险，如果仅仅是这样，房地产的次贷危机还不至于引发整个金融海啸。</p><p>对，CDS！</p><p>对于持有 CDO 的投资者来说，CDS 相当于保险机制。购买了 CDS 的投资者，按季度向保险公司支付保费。如果 CDO 发生违约，保险公司承诺会补偿投资者的损失。是的，这个和保险听起来没有什么区别。</p><p><img alt="除了资产拥有者，CDS允许所有人都可以进行投保" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-02-20_cds-bonus.png"></p><p>但是问题在于，CDS 不是简单的保险。在保险领域，你只能对拥有的东西投保， 比如我拥有一处房地产，我只能对它投保一次。 但是对于 CDS，它可以允许任何人对这个房子进行投保，换句话说，可能有上百个人对同一个标的资产投保。</p><p><img alt="资产违约，AIG需要向购买CDS的所有买家支付赔偿" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-02-20_cds-fail.png"></p><p>那么，如果这个房子烧毁了呢？保险公司（比如这里的 AIG）除了要向作为房地产拥有者的我赔偿损失，还要向成千上万购买了 CDS 的投机者赔偿。这样一来，整个系统的损失就会成倍地扩大，Boom！</p><p>再来回顾下 CDS 的特点：</p><ul><li>CDS 所投保的资产可以跟投资者没有任何关系，这是区别于保险的最大特点</li><li>CDS 可以面向各类投资者，个人、银行、对冲基金、社保、养老金、保险公司等等</li><li>CDS 和 CDO 一样，也可以在一级、二级市场交易，可以使投行转移风险。</li></ul><p>到这一步，CDS 已经变成了<strong>双方对违约事件的对赌</strong>，承保方相当于博彩行业里对某事件发生概率自行计算并开出赔率的庄家。买家交付一定比例的保费，产品违约即由承保方赔付，承保方不一定是保险公司，也可以是金融机构，例如高盛。华尔街的人还用 CDS 构造出了 <code>Synthetic CDO</code>（合成 CDO）。利用 CDS 的灵活性，可以制造出无数个赌场。我们可以赌MBS，可以赌cash CDO，也可以赌某一层tranche，还可以赌价格的跌幅。</p><p>关于这里的 CDS 交易，还要再仔细说一说。对于一般的投资者，他们可以作为 CDS 的卖方 （Protection Seller），来获取固定收益，相当于卖出<strong>看跌期权</strong>（Sell CDS），认为后市看涨。而对于 <code>Michael Burry</code>来说，他是 CDS 的买方（Protection Buyer），定期支付保费，相当于买入<strong>看跌期权</strong> （Buy CDS），赌房地产市场下跌，博高杠杆的违约保金收益。原来这个买方角色一直是由发行bonds的投行担任，是为了卖产品。</p><p>对应到 2008 年次贷危机，我们来看一看各个玩家到底都做了些什么事情：</p><ul><li>AIG销售了价值至少 5000 亿美元的 CDS。当房地产泡沫破裂时，账上资金耗尽，不得不被财政部接管，用纳税人的钱为他们买单。</li><li>高盛从 AIG 购买了价值至少 200 亿美元的CDS，数额大到连高盛也担心 AIG 会破产。与此同时，高盛还在大量售卖 CDO，所以他们的客户损失得越多，他们也就赚得越多。</li><li>不光是高盛，摩根士丹利、摩根大通、美林、雷曼兄弟也买入了数十亿美元的 CDS。</li><li>那么，为什么这些 CDO 能够被大众投资者买入呢？评级机构，他们的 AAA 评级。在整个房地产泡沫期间，评级机构们发放了大量的 AAA 评级，并从中获利众多。</li><li>救市计划开始实行，拯救 AIG 花费了纳税人1500 亿美元，其中 610 亿美元被支付给高盛。与此同时，美国财政部长 <code>Paulson</code> 要求 AIG 放弃起诉高盛和其他银行的权利。</li></ul><h2 id="大而不倒"><a href="#大而不倒" class="headerlink" 
title="大而不倒"></a>大而不倒</h2><p>在回顾这场金融危机的时候，脑海中一直有个问题。</p><blockquote><p>为什么政府一定要出手呢？为什么要用纳税人的钱来给他们收拾烂摊子呢？就让他们倒闭不行吗？</p></blockquote><p>是的，政府担保了贝尔斯登使其被摩根大通收购，政府接管了房利美和房地美，但是政府不想传递出一种信息说自己会收拾所有华尔街的烂摊子，于是没有帮助雷曼兄弟，但是后面事态的发展让财政部和美联储无法再旁观下去。</p><p>整个市场失去了信心，银行挤兑，市场失去了流动性，就连仅存的两家投行高盛和摩根士丹利都面临着巨大的撤资和空头压力。与此同时，AIG 深陷困境，作为全美最大的保险公司，影响着成千上万人的退休金。市场的恐慌让信贷停滞，就连像通用汽车和克莱斯勒这种汽车制造业公司也出现了财务危机。危机持续扩散，世界上成千上万的公司出现衰退，消费者收紧钱包，成千上万的人失去了他们的工作、住房，整个金融系统停摆，濒临崩溃。</p><p>如果 AIG 倒下，一切不堪设想，政府必须出手。之后呢？</p><blockquote><p>In the years that followed, hundreds of bankers and rating-agency executives went to jail. The SEC was completely overhauled and Congress had no choice but to break up the big banks and regulate the mortgage and derivative industries.</p><p>Just kidding!</p><p>Banks took the money the American people gave them and used it to pay themselves huge bonuses and lobby the Congress to kill the reform, and then they blamed immigrants and poor people. And this time, even teachers.</p></blockquote><h2 id="历史循环"><a href="#历史循环" class="headerlink" title="历史循环"></a>历史循环</h2><p><strong>Break up the big banks</strong></p><p>这件事情我们不是没有做过。你不觉得在看前面主角的时候，<code>Morgan Stanley</code> 和 <code>JPMorgan Chase</code>两家听起来好像有点关联吗？是的，在那次著名的<a href="https://en.wikipedia.org/wiki/Great_Depression" target="_blank" rel="external nofollow noopener noreferrer">大萧条</a>发生之前，他们俩是一家公司。（大萧条也是一个很值得研究的问题，这里先挖个坑）</p><p>当时美国政府认为超级银行（Universal bank）是造成大萧条的原因，于是出台法案禁止商业银行同时从事投资银行业务。在此背景下，摩根银行一分为三，JP摩根成为纯商业银行，摩根士丹利成为投资银行，还有一个摩根负责海外业务，于1990年被德意志银行收购。当时的JP摩根，就是摩根大通的前身，2000年并入摩根大通，不再独立存在。</p><p>然而，正如中国上千年农业社会的治乱历史循环一样：王朝建立土地重新分配 -&gt; 小农经济土地兼并 -&gt; 农民失地矛盾加重 -&gt; 天灾触发造成叛乱 -&gt; 新的王朝建立开始下一轮循环。看这次金融危机的历史，突然也有了这种感觉。拆分大银行，我们历史上并不是没有做过，但是资本的惯性就是倾向于兼并垄断，拆分之后小公司还是会不停地并购，再一次变成更大的公司，就像这一次发生的一样。</p><p>我们这一次没有拆分大的公司，只是因为矛盾还没有那么尖锐而已，真正的矛盾还在酝酿。有人说，科学技术的快速发展使得我们可以脱离这种历史循环。真的是这样吗？我持悲观态度，下一次大萧条并不是没有可能。</p><h2 id="写在最后"><a href="#写在最后" class="headerlink" title="写在最后"></a>写在最后</h2><p>这一篇 「资本不眠」是我在看了 <code>The Big 
Short</code>这部电影的时候一时兴起，尝试去理解那次金融海啸到底发生了什么。在这个过程中，我去搜索那些名词到底意味着什么，这场海啸中到底是谁在参与这次游戏，为什么它会造成那么大的后果，它对于我们又意味着什么。为此，我又看了 <code>Inside Job</code> 和 <code>Too Big To Fail</code>两部纪录片，分别从不同的角度去看这场危机。</p><p>看的越多，我越来越意识到，这次 <code>房地产泡沫</code> 绝不是偶然。它和 2000 年的那场 <code>互联网泡沫</code>密切相关，它和 90 年代冷战后美国霸主地位的建立密切相关，它和 80 年代的<code>里根大循环</code>密切相关，它和 70 年代<code>布雷顿森林体系</code>的解散密切相关，它和那两次世界大战密切相关……</p><p>到这里就结束了吗？哦不，上面这么多的坑都还没有填呢，还有 <code>Margin Call</code> 和 <code>Panic: The Untold Story of the 2008 Financial Crisis</code>还没有看呢。资本永不眠，这里只是一个程序猿尝试用自己的方式去理解这个世界，下次再见：）</p><p>P.S. 本来想把所有的这些内容做成一个视频传到 B 站上的，但是发现视频剪辑真的是一个深坑，下次一定？</p><h2 id="参考资料"><a href="#参考资料" class="headerlink" title="参考资料"></a>参考资料</h2><ul><li><a href="https://movie.douban.com/subject/26303622/" target="_blank" rel="external nofollow noopener noreferrer">The Big Short</a></li><li><a href="https://movie.douban.com/subject/4843480/" target="_blank" rel="external nofollow noopener noreferrer">Inside Job</a></li><li><a href="https://movie.douban.com/subject/6013501/" target="_blank" rel="external nofollow noopener noreferrer">Too Big To Fail</a></li><li><a href="https://www.zhihu.com/question/39012069" target="_blank" rel="external nofollow noopener noreferrer">如何评价电影《大空头》的专业性</a></li></ul>]]></content>
    
    <summary type="html">
    
      &lt;p&gt;没错，这里是最近新开的另一个专栏「资本不眠」，这个专栏会总结股票市场的交易笔记，比如第一期聊到的 &lt;code&gt;MACD&lt;/code&gt;；也会聊在资本世界里面各种有意思的事情，比如这一期就是通过 &lt;code&gt;The Big Short&lt;/code&gt;这部电影对 2007 年 到 2008 年那次由 &lt;code&gt;次贷危机&lt;/code&gt;引发的全球经济危机进行的梳理复盘。如果以后有机会的话，我会专门在这个专栏复盘自己在股市中的每日操作（当然，我是十分期待自己能够开这个坑的，如果我对自己每次操作都能够知其所以然的话）。&lt;/p&gt;
&lt;p&gt;哦对了，最近 A 股的半导体和新能源等科技股都炒疯了。呵，愚蠢的人类。&lt;/p&gt;
    
    </summary>
    
    <content src="https://houmin.cc/https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-02-20_the-big-short.jpg" type="image" />
    
    
      <category term="资本不眠" scheme="https://houmin.cc/categories/%E8%B5%84%E6%9C%AC%E4%B8%8D%E7%9C%A0/"/>
    
    
      <category term="经济危机" scheme="https://houmin.cc/tags/%E7%BB%8F%E6%B5%8E%E5%8D%B1%E6%9C%BA/"/>
    
      <category term="电影评论" scheme="https://houmin.cc/tags/%E7%94%B5%E5%BD%B1%E8%AF%84%E8%AE%BA/"/>
    
      <category term="金融" scheme="https://houmin.cc/tags/%E9%87%91%E8%9E%8D/"/>
    
  </entry>
  
  <entry>
    <title>【计算机体系结构】NUMA架构详解</title>
    <link href="https://houmin.cc/posts/b893097a/"/>
    <id>https://houmin.cc/posts/b893097a/</id>
    <published>2020-01-08T09:07:40.000Z</published>
    <updated>2022-11-09T15:13:45.391Z</updated>
    
    <content type="html"><![CDATA[<link rel="stylesheet" class="aplayer-secondary-style-marker" href="/assets/css/APlayer.min.css"><script src="/assets/js/APlayer.min.js" class="aplayer-secondary-script-marker"></script><script class="meting-secondary-script-marker" src="/assets/js/Meting.min.js"></script><p>本博文是我对计算机系统中的NUMA 架构做的备忘笔记，参考资料来自于互联网。</p><a id="more"></a><h2 id="基本概念"><a href="#基本概念" class="headerlink" title="基本概念"></a>基本概念</h2><h3 id="SMP-VS-AMP"><a href="#SMP-VS-AMP" class="headerlink" title="SMP VS. AMP"></a>SMP VS. AMP</h3><ul><li><a href="https://en.wikipedia.org/wiki/Symmetric_multiprocessing" target="_blank" rel="external nofollow noopener noreferrer">SMP(Symmetric Multiprocessing)</a>， 即对称多处理器架构，是目前最常见的多处理器计算机架构。</li><li><a href="https://en.wikipedia.org/wiki/Asymmetric_multiprocessing" target="_blank" rel="external nofollow noopener noreferrer">AMP(Asymmetric Multiprocessing)</a>， 即非对称多处理器架构，则是与SMP相对的概念。</li></ul><p>那么两者之间的主要区别是什么呢？ 总结下来有这么几点：</p><ol><li>SMP的多个处理器都是同构的，使用相同架构的CPU；而AMP的多个处理器则可能是异构的。</li><li>SMP的多个处理器共享同一内存地址空间；而AMP的每个处理器则拥有自己独立的地址空间。</li><li>SMP的多个处理器通常共享一个操作系统的实例；而AMP的每个处理器可以运行也可以不运行操作系统， 运行操作系统的CPU运行的也是多个相互独立的实例。</li><li>SMP的多处理器之间可以通过共享内存来协同通信；而AMP则需要提供一种处理器间的通信机制。</li></ol><p>关于SMP和AMP的深入介绍，有很多经典文章和书籍可参考，此处不再赘述。现今主流的x86多处理器服务器都是SMP架构的， 而很多嵌入式系统则是AMP架构的。</p><h3 id="NUMA-VS-UMA"><a href="#NUMA-VS-UMA" class="headerlink" title="NUMA VS. UMA"></a>NUMA VS. 
UMA</h3><p><a href="https://en.wikipedia.org/wiki/Non-uniform_memory_access" target="_blank" rel="external nofollow noopener noreferrer">NUMA(Non-Uniform Memory Access)</a> 非均匀内存访问架构是指多处理器系统中，内存的访问时间依赖于处理器和内存之间的相对位置。 这种设计里存在和处理器相对近的内存，通常被称作本地内存；还有和处理器相对远的内存， 通常被称为非本地内存。</p><p><a href="https://en.wikipedia.org/wiki/Uniform_memory_access" target="_blank" rel="external nofollow noopener noreferrer">UMA(Uniform Memory Access)</a> 均匀内存访问架构则与NUMA相反，所有处理器对共享内存的访问距离和时间是相同的。</p><p>由此可知，不论是NUMA还是UMA都是SMP架构的一种设计和实现上的选择。</p><p>阅读文档时，也常常能看到<strong>ccNUMA(Cache Coherent NUMA)</strong>，即缓存一致性NUMA架构。 这种架构主要是在NUMA架构之上保证了多处理器之间的缓存一致性，降低了系统程序的编写难度。</p><p>x86多处理器发展历史上，早期的多核和多处理器系统都是UMA架构的。这种架构下， 多个CPU通过同一个北桥(North Bridge)芯片与内存连接。北桥芯片里集成了内存控制器(Memory Controller)。</p><p>下图是一个典型的早期 x86 UMA 系统，四路处理器通过 FSB (前端系统总线, Front Side Bus) 和主板上的内存控制器芯片 (MCH, Memory Controller Hub) 相连，DRAM 是以 UMA 方式组织的，访问延迟并无差异。</p><p><img alt="x86 UMA" data-src="http://oliveryang.net/media/images/2018/numa-fsb-3.png"></p><blockquote><p>注：</p><ul><li><a href="https://en.wikipedia.org/wiki/Platform_Controller_Hub" target="_blank" rel="external nofollow noopener noreferrer">PCH(Platform Controller Hub)</a>，Intel 于 2008 年起推出的一系列芯片组，用于取代以往的 I/O Controller Hub（ICH)</li></ul></blockquote><p>在 UMA 架构下，CPU 和内存控制器之间的前端总线 (FSB) 在系统 CPU 数量不断增加的前提下， 成为了系统性能的瓶颈。因此，AMD 在引入 64 位 x86 架构时，实现了 NUMA 架构。之后， Intel 也推出了 x64 的 Nehalem 架构，x86 终于全面进入到 NUMA 时代。x86 NUMA 目前的实现属于 ccNUMA。</p><p>从 Nehalem 架构开始，x86 开始转向 NUMA 架构，内存控制器芯片被集成到处理器内部，多个处理器通过 QPI 链路相连，从此 DRAM 有了远近之分。 而 Sandybridge 架构则更进一步，将片外的 IOH 芯片也集成到了处理器内部，至此，内存控制器和 PCIe Root Complex 全部在处理器内部了。 下图就是一个典型的 x86 的 NUMA 架构：</p><p><img alt="x86 典型 NUMA 架构" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-01-09_numa-imc-iio-smb.png"></p><div class="note info">            <p><img alt="Intel 处理器微架构路线" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-01-09-Intel-Processor-Roadmap.png"></p>          </div><h2 id="NUMA-Hierarchy"><a 
href="#NUMA-Hierarchy" class="headerlink" title="NUMA Hierarchy"></a>NUMA Hierarchy</h2><h3 id="NUMA-Node-内部"><a href="#NUMA-Node-内部" class="headerlink" title="NUMA Node 内部"></a>NUMA Node 内部</h3><p>一个NUMA Node内部是由一个<strong>物理CPU</strong>和它所有的<strong>本地内存(Local Memory)</strong>组成的。广义得讲， 一个NUMA Node内部还包含<strong>本地IO资源</strong>，对大多数Intel x86 NUMA平台来说，主要是PCIe总线资源。 ACPI规范就是这么抽象一个NUMA Node的。</p><h4 id="物理-CPU"><a href="#物理-CPU" class="headerlink" title="物理 CPU"></a>物理 CPU</h4><p>一个CPU Socket里可以由多个CPU Core和一个Uncore部分组成。每个CPU Core内部又可以由两个CPU Thread组成。 每个CPU thread都是一个操作系统可见的逻辑CPU。对大多数操作系统来说，一个八核HT打开的CPU会被识别为16个CPU。 下面就说一说这里面相关的概念，</p><ul><li><p>Socket</p><p>一个Socket对应一个物理CPU。 这个词大概是从CPU在主板上的物理连接方式上来的，可以理解为 Socket 就是主板上的 CPU 插槽。处理器通过主板的Socket来插到主板上。 尤其是有了多核(Multi-core)系统以后，Multi-socket系统被用来指明系统到底存在多少个物理CPU。</p></li><li><p>Node</p><p>NUMA体系结构中多了Node的概念，这个概念其实是用来解决core的分组的问题。每个node有自己的内部CPU，总线和内存，同时还可以访问其他node内的内存，NUMA的最大的优势就是可以方便的增加CPU的数量。通常一个 Socket 有一个 Node，也有可能一个 Socket 有多个 Node。</p></li><li><p>Core</p><p>CPU的运算核心。 x86的核包含了CPU运算的基本部件，如逻辑运算单元(ALU), 浮点运算单元(FPU), L1和L2缓存。 一个Socket里可以有多个Core。如今的多核时代，即使是Single Socket的系统， 也是逻辑上的SMP系统。但是，一个物理CPU的系统不存在非本地内存，因此相当于UMA系统。</p></li><li><p>Uncore</p><p>Intel x86物理CPU里没有放在Core里的部件都被叫做Uncore。Uncore里集成了过去x86 UMA架构时代北桥芯片的基本功能。 在Nehalem时代，内存控制器被集成到CPU里，叫做iMC(Integrated Memory Controller)。 而PCIe Root Complex还做为独立部件在IO Hub芯片里。到了SandyBridge时代，PCIe Root Complex也被集成到了CPU里。 现今的Uncore部分，除了iMC，PCIe Root Complex，还有QPI(QuickPath Interconnect)控制器， L3缓存，CBox(负责缓存一致性)，及其它外设控制器。</p></li><li><p>Threads</p><p>这里特指CPU的多线程技术。在Intel x86架构下，CPU的多线程技术被称作超线程(Hyper-Threading)技术。 Intel的超线程技术在一个处理器Core内部引入了额外的硬件设计模拟了两个逻辑处理器(Logical Processor)， 每个逻辑处理器都有独立的处理器状态，但共享Core内部的计算资源，如ALU，FPU，L1，L2缓存。 这样在最小的硬件投入下提高了CPU在多线程软件工作负载下的性能，提高了硬件使用效率。 x86的超线程技术出现早于NUMA架构。</p></li></ul><h3 id="本地内存"><a href="#本地内存" class="headerlink" title="本地内存"></a>本地内存</h3><p>在Intel x86平台上，所谓本地内存，就是CPU可以经过Uncore部件里的iMC访问到的内存。而那些非本地的， 远程内存(Remote Memory)，则需要经过QPI的链路到该内存所在的本地CPU的iMC来访问。 曾经在Intel 
IvyBridge的NUMA平台上做的内存访问性能测试显示，远程内存访问的延时约为本地内存的两倍。</p><p>可以假设，操作系统应该尽量利用本地内存的低访问延迟特性来优化应用和系统的性能。</p><h3 id="本地-IO-资源"><a href="#本地-IO-资源" class="headerlink" title="本地 IO 资源"></a>本地 IO 资源</h3><p>如前所述，Intel自从SandyBridge处理器开始，已经把PCIe Root Complex集成到CPU里了。 正因为如此，从CPU直接引出PCIe Root Port的PCIe 3.0的链路可以直接与PCIe Switch或者PCIe Endpoint相连。 一个PCIe Endpoint就是一个PCIe外设。这就意味着，对某个PCIe外设来说，如果它直接与哪个CPU相连， 它就属于哪个CPU所在的NUMA Node。</p><p>与本地内存一样，所谓本地IO资源，就是CPU可以经过Uncore部件里的PCIe Root Complex直接访问到的IO资源。 如果是非本地IO资源，则需要经过QPI链路到该IO资源所属的CPU，再通过该CPU的PCIe Root Complex访问。 如果同一个NUMA Node内的CPU和内存与另外一个NUMA Node的IO资源发生互操作，因为要跨越QPI链路， 会存在额外的访问延迟问题。</p><p>其它体系结构里，为降低外设访问延迟，也有将IB(Infiniband)总线集成到CPU里的。 这样IB设备也属于NUMA Node的一部分了。</p><p>可以假设，操作系统如果是NUMA Aware的话，应该会尽量针对本地IO资源低延迟的优点进行优化。</p><p><img alt="PCIe Root Complex Location" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-01-09_pcie-root-complex.png"></p><h3 id="NUMA-Node-互联"><a href="#NUMA-Node-互联" class="headerlink" title="NUMA Node 互联"></a>NUMA Node 互联</h3><p>在Intel x86上，NUMA Node之间的互联是通过 <a href="https://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect" target="_blank" rel="external nofollow noopener noreferrer">QPI(QuickPath Interconnect) Link</a> 实现的。 CPU的Uncore部分有QPI的控制器来控制CPU到QPI的数据访问。</p><p>下图就是一个利用 QPI Switch 互联的 8 NUMA Node 的 x86 系统：</p><p><img alt="img" data-src="http://oliveryang.net/media/images/2018/numa-imc-iio-qpi-switch-3.png"></p><h2 id="NUMA-Affinity"><a href="#NUMA-Affinity" class="headerlink" title="NUMA Affinity"></a>NUMA Affinity</h2><p>NUMA Affinity(亲和性)是和NUMA Hierarchy(层级结构)直接相关的。对系统软件来说， 以下两个概念至关重要：</p><ul><li><p><strong>CPU NUMA Affinity</strong></p><p>CPU NUMA的亲和性是指从CPU角度看，哪些内存访问更快，有更低的延迟。如前所述， 和该CPU直接相连的本地内存是更快的。操作系统如果可以根据任务所在CPU去分配本地内存， 就是基于CPU NUMA亲和性的考虑。因此，CPU NUMA亲和性就是要尽量让任务运行在本地的NUMA Node里。</p></li><li><p><strong>Device NUMA Affinity</strong></p><p>设备NUMA亲和性是指从PCIe外设的角度看，如果和CPU和内存相关的IO活动都发生在外设所属的NUMA Node， 将会有更低延迟。这里有两种设备NUMA亲和性的问题：</p><ol><li><p><strong>DMA Buffer NUMA 
Affinity</strong></p><p>大部分PCIe设备是支持DMA功能的。也就是说，设备可以直接把数据写入到位于内存中的DMA缓冲区。 显然，如果DMA缓冲区在PCIe外设所属的NUMA Node里分配，那么将会有最低的延迟。 否则，外设的DMA操作要跨越QPI链路去读写另外一个NUMA Node里的DMA缓冲区。 因此，操作系统如果可以根据PCIe设备所属的NUMA node分配DMA缓冲区， 将会有最好的DMA操作的性能。</p></li><li><p><strong>Interrupt NUMA Affinity</strong></p><p>设备DMA操作完成后，需要在CPU上触发中断来通知驱动程序的中断处理例程(ISR)来读写DMA缓冲区。 很多时候，ISR触发下半部机制(SoftIRQ)来进入到协议栈相关(Network，Storage)的代码路径来传送数据。 对大部分操作系统来说，硬件中断(HardIRQ)和下半部机制的代码在同一个CPU上发生。 因此，DMA缓冲区的读写操作发生的位置和设备硬件中断(HardIRQ)密切相关。假设操作系统可以把设备的硬件中断绑定到设备所属的NUMA node， 那之后中断处理函数和协议栈代码对DMA缓冲区的读写将会有更低的延迟。</p></li></ol></li></ul><h2 id="Firmware-接口"><a href="#Firmware-接口" class="headerlink" title="Firmware 接口"></a>Firmware 接口</h2><p>由于NUMA的亲和性对应用的性能非常重要，那么硬件平台就需要给操作系统提供接口机制来感知硬件的NUMA层级结构。 在x86平台，<a href="http://acpi.info/" target="_blank" rel="external nofollow noopener noreferrer">ACPI规范</a>提供了以下接口来让操作系统检测系统的NUMA层级结构。</p><p>ACPI 5.0a规范的第17章是有关NUMA的章节。ACPI规范里，NUMA Node被第9章定义的Module Device所描述。 ACPI规范里用<strong>Proximity Domain</strong>对NUMA Node做了抽象，两者的概念大多时候等同。</p><ul><li><p><strong>SRAT(System Resource Affinity Table)</strong></p><p>主要描述了系统boot时的CPU和内存都属于哪个Proximity Domain(NUMA Node)。 这个表格里的信息是静态的，如果是启动后热插拔，需要用OSPM的_PXM方法去获得相关信息。</p></li><li><p><strong>SLIT(System Locality Information Table)</strong></p><p>提供CPU和内存之间的位置远近信息。SRAT表格只能告诉我们给定的CPU和内存是否在同一个NUMA Node。 对某个CPU来说，不在本NUMA Node里的内存，即各个远程内存的访问延迟是否都一样，取决于NUMA的拓扑有多复杂(QPI的跳数)。 总之，对于不能简单用<strong>远近</strong>来描述的NUMA系统(QPI存在0，1，2等不同跳数)， 需要SLIT表格给出进一步的说明。同样的，这个表格也是静态表格，热插拔需要使用OSPM的_SLI方法。</p></li><li><p><strong>DSDT(Differentiated System Description Table)</strong></p><p>从Device NUMA角度看，这个表格给出了系统boot时的外设都属于哪个Proximity Domain(NUMA Node)。</p></li></ul><p>ACPI规范里的OSPM(Operating System-directed configuration and Power Management) 和OSPM的各种方法就是操作系统里的ACPI驱动和ACPI firmware之间的一个互动的接口。 x86启动OS后，在没有ACPI之前，firmware(BIOS)的代码是无法再被执行的，除非通过SMI中断处理程序。 但有了ACPI，BIOS提前把ACPI的一些静态表格和AML的bytecode代码装载到内存， 然后ACPI驱动就会加载AML的解释器，这样OS就可以通过ACPI驱动调用预先装载的AML代码。 AML(ACPI Machine 
Language)是和Java类似的一种虚拟机解释型语言，所以不同操作系统的ACPI驱动， 只要有相同的虚拟机解释器，就可以直接从操作系统调用ACPI写好的AML的代码了。 所以，前文所述的所有热插拔的OSPM方法，其实就是对应ACPI firmware的AML的一段函数代码而已。 (关于ACPI的简单介绍，这里给出两篇延伸阅读：<a href="http://rdist.root.org/2008/10/17/all-about-acpi/" target="_blank" rel="external nofollow noopener noreferrer">1</a> 和<a href="https://www.usenix.org/legacy/events/usenix02/tech/freenix/full_papers/watanabe/watanabe_html/index.html" target="_blank" rel="external nofollow noopener noreferrer">2</a>。)</p><h2 id="NUMA-Optimization"><a href="#NUMA-Optimization" class="headerlink" title="NUMA Optimization"></a>NUMA Optimization</h2><ul><li><a href="https://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/" target="_blank" rel="external nofollow noopener noreferrer">https://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/</a></li><li><a href="https://frankdenneman.nl/2016/07/08/numa-deep-dive-part-2-system-architecture/" target="_blank" rel="external nofollow noopener noreferrer">https://frankdenneman.nl/2016/07/08/numa-deep-dive-part-2-system-architecture/</a></li><li><a href="http://oliveryang.net/2016/02/linux-numa-optimization-1/" target="_blank" rel="external nofollow noopener noreferrer">http://oliveryang.net/2016/02/linux-numa-optimization-1/</a></li></ul>]]></content>
    
    <summary type="html">
    
      &lt;p&gt;本博文是我对计算机系统中的NUMA 架构做的备忘笔记，参考资料来自于互联网。&lt;/p&gt;
    
    </summary>
    
    <content src="https://houmin.cc/https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-01-09_numa-imc-iio-smb.png" type="image" />
    
    
      <category term="术业专攻" scheme="https://houmin.cc/categories/%E6%9C%AF%E4%B8%9A%E4%B8%93%E6%94%BB/"/>
    
    
      <category term="计算机" scheme="https://houmin.cc/tags/%E8%AE%A1%E7%AE%97%E6%9C%BA/"/>
    
      <category term="numa" scheme="https://houmin.cc/tags/numa/"/>
    
  </entry>
  
  <entry>
    <title>【计算机体系结构】Cache Memory</title>
    <link href="https://houmin.cc/posts/9bccd097/"/>
    <id>https://houmin.cc/posts/9bccd097/</id>
    <published>2020-01-06T09:05:27.000Z</published>
    <updated>2023-03-26T04:23:43.878Z</updated>
    
    <content type="html"><![CDATA[<link rel="stylesheet" class="aplayer-secondary-style-marker" href="/assets/css/APlayer.min.css"><script src="/assets/js/APlayer.min.js" class="aplayer-secondary-script-marker"></script><script class="meting-secondary-script-marker" src="/assets/js/Meting.min.js"></script><p>本博文是我对计算机系统中的缓存做的备忘笔记，参考资料来自于互联网。</p><a id="more"></a><h2 id="Background"><a href="#Background" class="headerlink" title="Background"></a>Background</h2><h3 id="Memory-Hierarchy"><a href="#Memory-Hierarchy" class="headerlink" title="Memory Hierarchy"></a>Memory Hierarchy</h3><p>众所周知，对于不同的存储设备，更高的性能意味着更高的成本和更小的容量。随着 CPU 越做越快，CPU 和主存之间的速度差距正在不断扩大。好在，<strong>软件的局部性原理</strong> 拯救了这一切，在现代计算机体系中通过 <code>Memory Hierarchy</code> 的设计，使得系统在性能、成本和制造工艺之间作出取舍，从而达到了一个平衡。</p><p>下图是现在可以看到的常见的存储器层次结构：</p><p><img alt="Memory Hierarchy" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-01-06_memory-hierarchy.png"></p><h3 id="Principle-of-Locality"><a href="#Principle-of-Locality" class="headerlink" title="Principle of Locality"></a>Principle of Locality</h3><p><code>程序访问的局部性原理</code>指的是，<strong>内存中某个地址被访问后，短时间内还有可能继续访问这块地址。内存中的某个地址被访问后，它相邻的内存单元被访问的概率也很大。</strong></p><p>程序访问的局部性包含两种：</p><ul><li>时间局部性：某个内存单元在较短时间内很可能被再次访问</li><li>空间局部性：某个内存单元被访问后相邻的内存单元较短时间内很可能被访问</li></ul><p>出现这种情况的原因很简单，因为程序是由指令和数据组成的，指令在内存中按顺序存放且地址连续，如果运行一段循环程序或调用一个方法，又或者在程序中遍历一个数组，都有可能符合上面提到的局部性原理。</p><p>那既然在执行程序时，内存的某些单元很可能会被频繁地访问或写入，那可否在CPU和内存之间加一个缓存，CPU在访问数据时，先看一下缓存中是否存在，如果有，直接读取缓存中的数据即可；如果缓存中不存在，再从内存中读取数据。</p><p>事实证明利用这种方式，程序的运行效率会提高90%以上，这个缓存也叫做<strong>高速缓存Cache</strong>。</p><h2 id="Big-Picture"><a href="#Big-Picture" class="headerlink" title="Big Picture"></a>Big Picture</h2><p>在深入了解 Cache 的技术细节之前，我们可以先看看关于 Cache 在现代计算机系统中的 big picture。如下图所示，ALU 不是直接和主存相连，所有的 load 和 store 通过 Cache 完成，Cache 是 CPU 芯片组成的一部分。</p><p><img alt="Cache System Structure" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-01-07_cache-system-structure.png"></p><p>随着计算机系统的发展，Cache 
不仅仅只有一层，可能被分为多层。与此同时，人们发现，将指令的 Cache 和数据的 Cache 分开可以获得更大的系统增益。而且，CPU也从单核单处理器逐渐发展到多核多处理器，所以一个现代的计算机系统中，Cache 的组成方式可能如下图所示：</p><p><img alt="Intel Core i7 Cache Hierarchy" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-01-06_intel-core-i7-cache-hierarchy.png"></p><p>在这个图中，只显示了一个处理器(Processor)，处理器中有四个核(Core)，每个 Core 会有自己的L1数据缓存和L1指令缓存，也有自己统一的 L2 缓存。四个核之间会共享 L3 缓存，L3 缓存和主存直接沟通。</p><h2 id="General-Cache-Organization"><a href="#General-Cache-Organization" class="headerlink" title="General Cache Organization"></a>General Cache Organization</h2><p>Cache 是以 <code>缓存行(Cache Line)</code> 为基本组织单位的，下图是一个通用的缓存组织结构。</p><p><img alt="General Cache Organization (S, E, B, m)" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-01-07_cache-organization.png"></p><p>假设内存容量是 M，内存物理地址为 <code>m</code> 个 bit。CPU 在访问缓存时，物理地址会被解析成如下的格式：</p><p><img alt="Cache Line 地址解析" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-01-07_cache-line-address.png"></p><p>这里，有如下的关系：</p><script type="math/tex; mode=display">\begin{gather}S = 2^s \\B = 2^b \\m = t + s + b\end{gather}</script><p>参数具体意义如下：一个 Cache 被分成 S 个 <code>set</code>，每个 set 有 E 路 Cache Line。在一个 Cache Line 中，有 B 个字节的存储单元。所以，在一个内存地址中，中间的 s 个 bit 决定了该寻址单元被映射到哪个 set，而最低的 b 个 bit 决定了该单元在一个缓存行中的偏移量。<code>tag</code> 是内存地址的高 t 个 bit，因为可能有多个内存地址映射到同一个 Cache Line 中，所以用 tag 来校验该 Cache Line 是否是 CPU 要访问的内存单元。</p><p>上面是从内存地址的角度看访问 Cache 时候的地址参数解析，对应到实际的 Cache Line 的组成，可以看第一张图。可以看到，对于每一个 Cache Line，除了 tag 用来校验是否是 CPU 要访问的内存单元，还有一个 <code>valid bit</code> 来确认该缓存行是否有效，然后就是一个含有 $B = 2^b$ 个字节的 <code>Cache Block</code>。目前 x86 CPU 的 Cache Line 一般都是 64 字节的。</p><p>当 tag 和 valid 校验成功时，我们称为 <code>Cache Hit</code>，这时就可以将 cache 中的单元取出，放入到 CPU 中的寄存器。</p><p>当 tag 或 valid 校验失败时，说明要访问的内存单元并不在 cache 中，需要去内存中或者下一级的 Cache 中取出，这就是 <code>Cache Miss</code>。当不命中的情况发生时，系统就会从内存中或者下一级缓存中取得该单元，将其装入 cache 中，与此同时也放入 CPU 寄存器中，等待下一步处理。</p><p>下图是一个典型的 Cache Read 的流程：</p><p><img alt="Cache Read" 
data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-01-07_cache-read.png"></p><h2 id="Cache-Implement-Details"><a href="#Cache-Implement-Details" class="headerlink" title="Cache Implement Details"></a>Cache Implement Details</h2><p>根据上面参数 E 和 s 的不同选择，可以把 Cache 到 Memory 的映射分为以下几种类型：</p><ul><li>全相联(Fully Associative)<ul><li>$s = 0$，每个内存块数据可以映射到任意缓存行中</li></ul></li><li>直接映射(Direct Mapped)，也称单路组相联(Single Way Set Associative)<ul><li>$E = 1$，每个内存块数据只能映射到固定的缓存行中</li></ul></li><li>多路组相联(N Way Set Associative)<ul><li>$E = N$, 每个内存块数据可以映射到固定 set 的任意缓存行中</li></ul></li></ul><h3 id="Fully-Associative-Cache"><a href="#Fully-Associative-Cache" class="headerlink" title="Fully Associative Cache"></a>Fully Associative Cache</h3><p>全相联把内存地址分为两个字段，<code>tag</code> 和 <code>offset</code>，没有了 set index 的字段。</p><p>在访问数据时，直接根据内存地址中的 <code>tag</code>，去遍历对比每一个缓存行，直到找到 <code>tag</code>一致的缓存行，然后访问其中的数据。</p><p>如果遍历完所有的缓存行之后，没有找到一致的<code>tag</code>, 那么就会从内存中获取数据，然后找到空闲的缓存行，直接写入<code>tag</code>和数据即可。</p><p>全相联意味着主存中的数据块可以出现在任意一个缓存行中。这种方式下替换算法(Replacement Policy)有最大的灵活度，也意味着可以有最低的 <code>Cache Miss Rate</code>。但是因为没有索引可以使用，检查一个缓存行是否命中需要在整个 Cache 范围内搜索，这带来了查找电路的大量延时。因此只有在缓存极小的情况下才有可能使用这种映射方式。</p><p><img alt="Fully Associative" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-01-07_cache-fully-associative.png"></p><h3 id="Direct-Mapped-Cache"><a href="#Direct-Mapped-Cache" class="headerlink" title="Direct Mapped Cache"></a>Direct Mapped Cache</h3><p>直接映射是一种<code>多对一</code>的映射关系。在这种映射下，主存中的每一个数据块只能有一个缓存行与之对应。可能有多个主存中的数据块被映射到同一个缓存行，但是每一个数据块只能被映射到确定的缓存行。</p><p>在 1990 年代初期，直接映射是当时最流行的机制。但是随着 CPU 主频的提高，直接映射机制正在逐渐退出舞台。</p><p>直接映射最大的问题在于，每个数据块在哪个缓存行是确定的，没有替换策略(Replacement Policy)可供选择。如果两个数据块被映射到同一个缓存行，它们会不停地把对方替换出去。由于严重的冲突，频繁刷新 Cache 会造成大量的延时，而且未能有效利用程序运行期所具有的时间局部性。这样导致缓存缺失率(cache miss rate)明显提高。</p><p>下图是一个 Memory 为 16Kbytes，Cache Line 为 4bytes 的直接映射缓存例子。</p><p><img alt="Direct Mapped" 
data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-01-07_cache-direct-mapped.png"></p><h3 id="Set-Associative-Cache"><a href="#Set-Associative-Cache" class="headerlink" title="Set Associative Cache"></a>Set Associative Cache</h3><p>组相联映射结合了以上两种映射方式的优点。具体的方法就是：</p><ul><li>首先通过 set index 来确认数据块应该放在哪一个 set 中</li><li>确认到 set 之后，通过 cache 替换策略(Replacement Policy)来确定到底放在组中的哪个缓存行</li></ul><p><img alt="Set Associative" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-01-07_cache-set-associative.png"></p><h2 id="Cache-Replacement-Policy"><a href="#Cache-Replacement-Policy" class="headerlink" title="Cache Replacement Policy"></a>Cache Replacement Policy</h2><p>Cache容量比内存小，所以内存数据映射到Cache时，必然会出现Cache满的情况，那之后的内存映射要替换Cache中的哪些行呢？这就需要制定一种策略。</p><p>常见的替换算法有如下几种：</p><ul><li>先进先出算法（FIFO)：总是把最早装入Cache的行替换掉，这种算法实现简单，但不能正确反映程序的访问局部性，命中率不高</li><li>最近最少使用算法（LRU)：总是选择最近最少使用的Cache行替换，这种算法稍微复杂一些，但可以较好地反映程序访问的局部性，命中率通常较高</li><li>最不经常使用算法（LFU）：总是替换掉Cache中引用次数最少的行，与LRU类似，但不如LRU替换方式精准</li><li>随机替换算法（Random）：随机替换掉Cache中的行，与使用情况无关，命中率不高</li></ul><p>现实中使用最多的是最近最少使用算法（LRU)或其近似实现来进行Cache行的替换，这类方案通常能获得较高的缓存命中率。</p><h2 id="Cache-in-Real-World"><a href="#Cache-in-Real-World" class="headerlink" title="Cache in Real World"></a>Cache in Real World</h2><p>可以通过如下的方式查看系统中 Cache 的信息。</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line">----- 如果启动时间比较短，可以通过如下方式查看</span><br><span class="line"><span class="meta">#</span><span class="bash"> dmesg | grep cache</span></span><br><span class="line"></span><br><span class="line">----- 比较详细的硬件信息，包括了Cache的详细信息</span><br><span class="line"><span class="meta">#</span><span class="bash"> dmidecode</span></span><br><span class="line"></span><br><span class="line">----- 查看硬件信息，例如CPU、内存等</span><br><span class="line"><span class="meta">#</span><span class="bash"> lshw</span></span><br><span class="line"></span><br><span class="line">----- 查看CPU相关信息，两者比较类似</span><br><span class="line"><span class="meta">#</span><span class="bash"> lscpu</span></span><br><span class="line"><span class="meta">#</span><span class="bash"> cat /proc/cpuinfo</span></span><br></pre></td></tr></table></figure><p>另外，有专门针对 x86 的信息查看程序，也就是 <code>x86info</code> ，可以直接安装对应的包。</p><p>注意，现在多数的 CPU 采用的是超线程，也就是说对于一个物理核来说，内核看到的是两个逻辑 CPU，而实际的物理核只有一个。</p><p>另外，在 <code>/sys/devices/system/cpu/</code> 中包含了一些相关的指标，例如：</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">----- 查看cpu0的一级缓存中的有多少组</span><br><span class="line"><span class="meta">$</span><span class="bash"> cat /sys/devices/system/cpu/cpu0/cache/index0/number_of_sets</span></span><br><span class="line">64</span><br><span class="line">----- 查看cpu0的一级缓存中一组中的行数</span><br><span class="line"><span class="meta">$</span><span class="bash"> cat /sys/devices/system/cpu/cpu0/cache/index0/ways_of_associativity</span></span><br><span class="line">8</span><br></pre></td></tr></table></figure><p>通过类似 lscpu 的命令查看到对应 CPU0 的一级缓存大小是 32K ，包含了 64 个组 (sets)，每组有 8 ways，则可以算出每一个 way (也就是 Cache Line) 的大小是 <code>32*1024/(64*8)=64</code> 字节。</p><p>Cache Line 的大小也可以通过如下方式直接查看。</p><figure class="highlight shell"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">$</span><span class="bash"> cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size</span></span><br><span class="line">64</span><br></pre></td></tr></table></figure><p>这里是 Intel Core i7 L1数据缓存的实际例子：</p><p><img alt="Intel Core i7 L1 Data Cache" data-src="https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-01-07_intel-i7-l1-data-cache.png"></p><h2 id="Cache-Write"><a href="#Cache-Write" class="headerlink" title="Cache Write"></a>Cache Write</h2><p>试想，如果CPU想要修改某个内存的数据，这块内存的数据刚好在Cache中存在，那么是不是要同时更新Cache中的数据？对于写入的数据，如何保证Cache和内存数据的一致性？</p><h3 id="Cache-Write-Hit"><a href="#Cache-Write-Hit" class="headerlink" title="Cache Write Hit"></a>Cache Write Hit</h3><ul><li><strong>Write through</strong><ul><li>在写操作时，如果Cache命中，则同时写Cache和内存。</li></ul></li><li><strong>Write back</strong><ul><li>在写操作时，如果Cache命中，则只更新Cache而不更新内存，等到该 Cache Line 被替换时再写回内存。</li><li>所以每一个 Cache Line 需要有一个 dirty bit 来标记它是否被修改过</li></ul></li></ul><h3 id="Cache-Write-Miss"><a href="#Cache-Write-Miss" class="headerlink" title="Cache Write Miss"></a>Cache Write Miss</h3><ul><li>Write Allocate<ul><li>先把对应的数据块从内存装入Cache行，再在Cache中完成写入，保证Cache有数据，提高了缓存命中率，但增加了装入Cache的开销</li></ul></li><li>No Write Allocate<ul><li>只更新内存数据，不写入Cache，只有等之后读访问不命中时，再把该数据块装入Cache</li></ul></li></ul><p>关于缓存一致性的问题，可以参考我的另一篇博文 <a href="https://houmin.cc">Cache Coherency</a>，此处不再赘述。</p><h2 id="Cache-Friendly-Code"><a href="#Cache-Friendly-Code" class="headerlink" title="Cache Friendly Code"></a>Cache Friendly Code</h2><p>针对缓存的这种特殊的结构，作为程序猿，如果一不小心，可能会带来重大的性能问题。</p><p>一些典型的案例可以参考我的另一篇博文 <a href="https://houmin.cc">Cache Friendly Code</a>，此处不再赘述。</p><h2 id="Reference"><a href="#Reference" class="headerlink" title="Reference"></a>Reference</h2><ul><li><a href="https://lwn.net/Articles/252125/" target="_blank" rel="external nofollow noopener noreferrer">https://lwn.net/Articles/252125/</a></li><li><a href="http://hedengcheng.com/?p=648" target="_blank" rel="external nofollow noopener noreferrer">http://hedengcheng.com/?p=648</a></li><li>CMU 15213 Cache Memories Slide</li></ul>]]></content>
    
    <summary type="html">
    
      &lt;p&gt;本博文是我对计算机系统中的缓存做的备忘笔记，参考资料来自于互联网。&lt;/p&gt;
    
    </summary>
    
    <content src="https://houmin.cc/https://cosmos-1251905798.cos.ap-beijing.myqcloud.com/blog/2020-01-07_cache-read.png" type="image" />
    
    
      <category term="术业专攻" scheme="https://houmin.cc/categories/%E6%9C%AF%E4%B8%9A%E4%B8%93%E6%94%BB/"/>
    
    
      <category term="计算机" scheme="https://houmin.cc/tags/%E8%AE%A1%E7%AE%97%E6%9C%BA/"/>
    
      <category term="cache" scheme="https://houmin.cc/tags/cache/"/>
    
  </entry>
  
</feed>
