
Antrea Architecture and Implementation

Antrea is a Kubernetes networking solution from VMware that uses Open vSwitch as its network data plane. Open vSwitch is a widely adopted, high-performance, programmable virtual software switch that is used extensively in public cloud underlay networks; you can refer to my earlier article for how it works, which this post does not repeat. Building on Open vSwitch, Antrea efficiently implements Networking and Security Services at L3/L4. This post walks through its architecture and implementation.

Architecture

Within a Kubernetes cluster, Antrea runs a central controller, antrea-controller, as a Deployment. It is responsible for parsing and converting NetworkPolicies and computing their span, and it distributes the computed policies to worker Nodes on demand and incrementally, eliminating the compute, storage, and transmission overhead of performing this processing on every worker Node.

On each worker Node, Antrea runs antrea-agent and the OVS userspace daemons as a DaemonSet, and uses an init container to install the antrea-cni plugin on the host and load the OVS kernel module. antrea-agent manages the OVS bridge and the Pod network interfaces, and programs OVS via OpenFlow to implement the various networking and security features.

In addition, Antrea ships with the antctl command-line tool and an Octant-based UI plugin, which provide strong troubleshooting capabilities.

Antrea Controller

The Antrea Controller watches NetworkPolicy, Pod, and Namespace resources from the Kubernetes API, computes NetworkPolicies, and distributes the computed policies to the relevant Antrea Agents. Today the Controller's main function is implementing network security policies.

The Antrea Controller leverages the Kubernetes apiserver library to interact with the Antrea Agents. Each Antrea Agent connects to the Controller API server and watches the computed NetworkPolicy objects. The Controller also exposes a REST API for debugging, which antctl can call to display the computed policy information.

The Antrea Controller customizes and optimizes how NetworkPolicies are computed and published:

  • The Antrea Controller keeps all state in memory and does not need to persist any data.
  • A NetworkPolicy object is sent only to the Nodes that need it: a Node receives the NetworkPolicy if and only if the policy is applied to at least one Pod on that Node.
  • Incremental updates to NetworkPolicy objects are supported.
  • Messages between the Controller and the Agents are serialized using Protobuf for efficiency.

The Antrea Controller API server also leverages a Kubernetes Service for service discovery, authentication, and authorization. The Controller API is exposed through a Kubernetes ClusterIP Service. The Antrea Agent obtains the Service's ClusterIP from an environment variable and connects to the Controller API server. The Controller API server delegates authentication and authorization to the Kubernetes API: the Antrea Agent authenticates to the Controller with a Kubernetes ServiceAccount token, and the Controller API server validates the token and checks whether the ServiceAccount is authorized to access the requested API.

The Antrea Controller also exposes a REST API to the antctl command-line tool through its API server. It uses Kubernetes API aggregation so that antctl can reach the Antrea Controller API through the Kubernetes API: antctl connects and authenticates to the Kubernetes API, which forwards the request to the Antrea Controller. This way antctl can run on any machine that can access the Kubernetes API, and it can reuse the kubectl configuration (kubeconfig file) to discover the Kubernetes API endpoint and authentication credentials.
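As a quick, hedged illustration of this path (the commands below assume a default Antrea install; the exact APIService names vary across Antrea versions):

# List the aggregated APIServices that Antrea registers with the Kubernetes API
kubectl get apiservices | grep antrea

# Query the Controller through the aggregated API, from any machine with a kubeconfig
antctl version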

Antrea Agent

The Antrea Agent runs on every Kubernetes Node. It receives policy configuration from the Antrea Controller, manages the OVS bridge and the Pod interfaces, and implements Pod connectivity and security features.

The Antrea Agent exposes a gRPC service (the CNI service) to antrea-cni to manage Pod network resources. For every new Pod created on the Node, antrea-cni forwards the container runtime's CNI ADD request to the Antrea Agent, which creates the Pod's network interface, allocates an IP address, connects the interface to the OVS bridge, and installs the necessary flows in OVS. For more about the OVS flows, see the OVS flow documentation.

The Antrea Agent includes two local controllers:

  • Node Controller: watches the Kubernetes API server for new Nodes and creates an OVS (Geneve/VXLAN/GRE/STT) tunnel to each remote Node.
  • NetworkPolicy Controller: watches computed NetworkPolicies from the Antrea Controller API and installs the corresponding OVS flows for the local Pods.

The Antrea Agent also exposes a REST API that antctl can call to display internal data and state for troubleshooting.

OVS Daemon

The two OVS daemons, ovsdb-server and ovs-vswitchd, run in a separate container, named antrea-ovs, of the Antrea Agent DaemonSet.

Antrea CNI

antrea-cni is Antrea's CNI plugin. It is a simple gRPC client that sends CNI requests to the Antrea Agent over RPC. The Antrea Agent performs the actual work (setting up networking for the Pod) and returns the result or an error to antrea-cni.

Antctl

antctl is Antrea's command-line tool. It shows basic runtime information of the Antrea Controller and the Antrea Agent, for debugging and troubleshooting.

When accessing the Controller, antctl calls the Controller API to query the requested information. As described above, antctl can reach the Controller API through the Kubernetes API, which authenticates and authorizes the request before forwarding it to the Antrea Controller. antctl can also run as a kubectl plugin.

When accessing the Agent, antctl connects to the Agent's local REST API and can only be executed inside the Agent container.
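For example, assuming the default installation in the kube-system Namespace (the Pod name below is a placeholder), the Agent's local API can be queried like this:

# "antrea-agent-xxxxx" is a placeholder for a real antrea-agent Pod name on the Node
kubectl -n kube-system exec antrea-agent-xxxxx -c antrea-agent -- antctl get podinterface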

Pod Networking

Pod interface configuration and IPAM

On every Node, the Antrea Agent creates an OVS bridge (named br-int by default) and a veth pair for each Pod, with one end in the Pod's network namespace and the other attached to the OVS bridge. On the bridge, the Antrea Agent also creates an internal port (antrea-gw0 by default) that acts as the gateway of the Node's Pod subnet, and a tunnel port, antrea-tun0, used for the overlay tunnels to other Nodes.
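One way to see the resulting wiring, from inside the antrea-ovs container of the Node's antrea-agent Pod (br-int is the default bridge name; the port names shown will be cluster-specific), is:

# List all ports attached to the Antrea OVS bridge: antrea-gw0, antrea-tun0 and one per local Pod
ovs-vsctl list-ports br-int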


Each Node is allocated a subnet, and all Pods on the Node get their IPs from that subnet. Antrea relies on Kubernetes' NodeIPAMController for Node subnet allocation, which writes the allocated subnet to the podCIDR field of the Kubernetes Node API object. The Antrea Agent reads the Node's subnet from the podCIDR field, reserves the first IP of the subnet as the local Node's gateway IP and assigns it to the antrea-gw0 port, and invokes the host-local IPAM plugin [7] to allocate IPs to local Pods from the subnet.
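The allocated subnet can be checked directly on the Node object; for example (with a placeholder Node name):

# Print the Pod CIDR that NodeIPAMController allocated to the Node "node-1"
kubectl get node node-1 -o jsonpath='{.spec.podCIDR}'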

For each remote Node, the Antrea Agent adds an OpenFlow flow that sends traffic to that Node through the corresponding tunnel. The flow selects the tunnel endpoint by matching the packet's destination IP.

Traffic Walk

  • Intra-Node traffic: packets between two local Pods are forwarded directly by the OVS bridge.
  • Inter-Node traffic: packets destined to a Pod on another Node are first forwarded to the antrea-tun0 port, encapsulated, and sent to the destination Node through the tunnel; there they are decapsulated, enter the OVS bridge through its antrea-tun0 port, and are finally forwarded to the destination Pod.
  • Pod-to-external traffic: packets sent to the external network are forwarded to the antrea-gw0 port (since it is the gateway of the local Pod subnet), routed by the host network to the appropriate network interface of the Node (e.g., a physical NIC on a bare-metal Node), and sent out to the external network. The Antrea Agent creates iptables (MASQUERADE) rules to SNAT Pod packets, so their source IP is rewritten to the Node's IP before they leave (a sketch of such a rule follows this list).
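A minimal sketch of such an SNAT rule is shown below; the Pod CIDR is illustrative, and the real rules installed by antrea-agent live in dedicated chains and are more selective:

# Masquerade traffic leaving the local Pod subnet (10.10.0.0/24 here is an example value)
# unless it is sent back into the overlay through the Antrea gateway port
iptables -t nat -A POSTROUTING -s 10.10.0.0/24 ! -o antrea-gw0 -j MASQUERADE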

ClusterIP Service

Antrea supports two ways of implementing ClusterIP Services: kube-proxy or AntreaProxy, where AntreaProxy load-balances ClusterIP Service traffic with OVS.

With kube-proxy, the Antrea Agent adds OVS flows that forward packets sent from Pods to ClusterIP Services to the antrea-gw0 port. kube-proxy intercepts the packets, selects a Service Endpoint as the destination of the connection, and DNATs the packets to the Endpoint's IP and port. If the Endpoint is a local Pod, the packets are forwarded to the Pod directly; if it is on another Node, they are sent to that Node through the tunnel.

kube-proxy can be used in any of its supported modes: userspace, iptables, or IPVS. See the Kubernetes Service documentation [8] for more details.

With AntreaProxy enabled, the Antrea Agent adds OVS flows that implement load balancing and DNAT for ClusterIP Service traffic. Service load balancing is done entirely within OVS; since there is no extra detour to the host network and no iptables processing, this performs better than kube-proxy. AntreaProxy in the Antrea Agent reuses part of kube-proxy's implementation to watch and process Service Endpoints.

NetworkPolicy

A key design choice in Antrea's NetworkPolicy implementation is centralized policy computation. The Antrea Controller watches NetworkPolicy, Pod, and Namespace resources from the Kubernetes API, and processes podSelectors, namespaceSelectors, and ipBlocks as follows:

  • The podSelector defined directly under the NetworkPolicy spec (which selects the Pods the NetworkPolicy applies to) is translated into Pod members.
  • The selectors (podSelectors and namespaceSelectors) and ipBlocks (which define the ingress and egress traffic allowed by the policy) are mapped to Pod IP addresses or IP address ranges.

The Antrea Controller also computes which Nodes need to receive a NetworkPolicy. Each Antrea Agent receives only the policies needed by its local Pods, and directly uses the IP addresses computed by the Controller to create the OVS flows that implement a specific NetworkPolicy.

Centralized computation is more efficient, mainly for the following reasons:

  • Only a single Antrea Controller instance needs to receive and process NetworkPolicy, Pod, and Namespace updates and compute the members selected by podSelectors and namespaceSelectors. Compared to watching these updates and performing the same, fairly heavy computation on every worker Node, the overall cost is much lower.
  • It is easy to scale the Controller horizontally for more compute capacity, running multiple Controllers that compute NetworkPolicies in parallel, each responsible for a subset of them (since a single Controller can already handle very large clusters of 2000 Nodes, Antrea currently supports only one Controller instance).
  • The Antrea Controller is the single source of truth for NetworkPolicy computation, which makes it easier to achieve consistency across Nodes and greatly reduces the difficulty of debugging.

Hybrid, NoEncap, NetworkPolicyOnly TrafficEncapMode

The default Encap mode encapsulates all inter-Node Pod traffic. Besides Encap, Antrea supports other traffic encapsulation modes: NoEncap, Hybrid, and NetworkPolicyOnly.

  • NoEncap mode: Pod traffic is not encapsulated. Antrea assumes the Node network can route Pod traffic across Nodes, which usually requires adding routes for the Pod subnets to the routers of the Node network, typically done by a Kubernetes Cloud Provider. The Antrea Agent still creates static routes on each Node for the remote Nodes in the same subnet, so inter-Pod traffic is forwarded to the destination Node directly instead of going through the Node network's routers, saving a hop (a route sketch follows this list). The Antrea Agent also creates iptables (MASQUERADE) rules to SNAT Pod-to-external traffic.
  • Hybrid mode: Pod traffic between two Nodes in different subnets is encapsulated, while Pod traffic between two Nodes in the same subnet is not encapsulated but routed from one Node to the other. For every other Node in the same subnet, the Agent adds a static route using that Node's IP as the next hop for its Pod subnet. (Hybrid mode requires the Node network to allow packets with Pod IPs to be sent out of the Node's NIC.)
  • NetworkPolicyOnly mode: inter-Node Pod traffic is neither tunnelled nor routed by Antrea. Antrea only enforces NetworkPolicies for Pod traffic; an additional CNI and the cloud network take care of Pod IP allocation and cross-Node traffic forwarding. See the NetworkPolicyOnly mode design document for more information. Antrea supports this policy-only mode on AKS and EKS.
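As a sketch (with made-up addresses), the per-peer static route installed for a same-subnet Node in NoEncap/Hybrid mode looks like this:

# Reach the peer Node's Pod subnet via the peer Node's own IP (illustrative values)
ip route add 10.10.2.0/24 via 192.168.77.102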

OVS Pipeline

Registers

Throughout the OVS pipeline, two 32-bit OVS registers are used to carry information (a short cheat-sheet for decoding the register matches seen in the later flow dumps follows the list):

  • reg0 (NXM_NX_REG0)
    • bits [0..15] store the traffic source, set in ClassifierTable:
      • from tunnel: 0
      • from local gateway: 1
      • from local Pod: 2
    • bit 16 indicates whether the packet's destination MAC address is known
    • bit 19 indicates whether the packet's destination and source MACs should be rewritten in L3ForwardingTable
      • this bit is set in ClassifierTable when a packet is received from the tunnel port
      • such packets have the Global Virtual MAC as their destination MAC, which should be rewritten to the destination port's MAC before output to that port
      • when such a packet is destined to a Pod, its source MAC should be rewritten to the local gateway port's MAC
  • reg1 (NXM_NX_REG1)
    • stores the packet's egress OF port
    • it is set by DNATTable for packets destined to Services and by L2ForwardingCalcTable otherwise
    • its value is consumed by L2ForwardingOutTable to output each packet to the correct port
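For convenience when reading the flow dumps later in this post, here is how the register matches that appear there decode (all values are taken from the dumps themselves); flows can also be filtered on them directly from the antrea-ovs container:

# reg0=0x1/0xffff       bits [0..15] == 1 : packet entered from the local gateway
# reg0=0x2/0xffff       bits [0..15] == 2 : packet entered from a local Pod
# reg0=0/0xffff         bits [0..15] == 0 : packet entered from the tunnel
# reg0=0x80000/0x80000  bit 19 set        : MACs to be rewritten in L3ForwardingTable
# reg0=0x10000/0x10000  bit 16 set        : destination MAC known, packet may be output
ovs-ofctl dump-flows br-int 'reg0=0x1/0xffff'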

NetworkPolicy

Antrea implements K8s NetworkPolicies with several OVS tables: EgressRuleTable, EgressDefaultTable, IngressRuleTable, and IngressDefaultTable. The concrete implementation involves the communication between the Antrea Controller and Agents and the translation of NetworkPolicies into the corresponding OVS flows, which is not detailed further here. The following example NetworkPolicy is used throughout the rest of this walkthrough:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-network-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: nginx
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: nginx
    ports:
    - protocol: TCP
      port: 80
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: nginx
    ports:
    - protocol: TCP
      port: 80

Antrea-native Policies

In addition to native K8s NetworkPolicies, Antrea also supports its own Antrea-native policies, implemented in the AntreaPolicyEgressRuleTable and AntreaPolicyIngressRuleTable. The example ClusterNetworkPolicy (ACNP) used below is:

apiVersion: crd.antrea.io/v1alpha1
kind: ClusterNetworkPolicy
metadata:
  name: cnp0
spec:
  priority: 10
  tier: application # defaults to application tier if not specified
  appliedTo:
  - podSelector:
      matchLabels:
        app: server
  ingress:
  - action: Drop
    from:
    - podSelector:
        matchLabels:
          app: notClient
    ports:
    - protocol: TCP
      port: 80
  egress:
  - action: Allow
    to:
    - podSelector:
        matchLabels:
          app: dns
    ports:
    - protocol: UDP
      port: 53

Tables

ClassifierTable (0)

ClassifierTable determines which type of traffic a packet belongs to (tunnel, local gateway, or local Pod) by matching the packet's ingress port. The classification is written to bits [0..15] of the NXM_NX_REG0 register:

  • from tunnel: 0
  • from local gateway: 1
  • from local Pod: 2

For a packet coming from the tunnel port, bit 19 of NXM_NX_REG0 is set to 1, indicating that MAC rewriting should be performed in L3ForwardingTable.

After Antrea starts, the OVS flows include:

1. table=0, priority=200,in_port=antrea-gw0 actions=load:0x1->NXM_NX_REG0[0..15],goto_table:10
2. table=0, priority=200,in_port=antrea-tun0 actions=load:0->NXM_NX_REG0[0..15],load:0x1->NXM_NX_REG0[19],goto_table:30
3. table=0, priority=190,in_port="coredns5-8ec607" actions=load:0x2->NXM_NX_REG0[0..15],goto_table:10
4. table=0, priority=190,in_port="coredns5-9d9530" actions=load:0x2->NXM_NX_REG0[0..15],goto_table:10
5. table=0, priority=0 actions=drop
  • Flow 1 matches packets coming from the local gateway.
  • Flow 2 matches packets coming from the overlay tunnel, i.e. from another Node.
  • Flows 3 and 4 match packets coming from local Pods, in this case the CoreDNS Pods.

Local traffic goes straight to SpoofGuardTable (table 10), traffic from the tunnel goes to ConntrackTable, and packets that match none of these flows are dropped.
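Such per-table dumps can be reproduced from inside the antrea-ovs container (assuming the default bridge name br-int); for example, for the ClassifierTable:

# Dump only table 0 (ClassifierTable) of the Antrea bridge
ovs-ofctl dump-flows br-int table=0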

SpoofGuardTable (10)

SpoofGuardTable prevents IP and ARP spoofing from local Pods. For every Pod it ensures that:

  • for IP traffic, the source IP and source MAC addresses are correct, i.e. they match what Antrea configured for the Pod's network interface;
  • for ARP traffic, the advertised IP and MAC addresses are correct, i.e. they match what Antrea configured for the Pod's network interface.

Because Antrea currently relies on kube-proxy to load-balance Service traffic, traffic from local Pods destined to Services goes through the gateway first, is load-balanced and DNAT'ed by kube-proxy, and then comes back in through the gateway. This means legitimate traffic can be received on the gateway port with a source IP that belongs to a local Pod.

Antrea therefore currently lets all IP traffic through the gateway, i.e. it performs no IP spoof check there, but it still checks ARP traffic:

1. table=10, priority=200,ip,in_port=antrea-gw0 actions=goto_table:30
2. table=10, priority=200,arp,in_port=antrea-gw0,arp_spa=10.10.0.1,arp_sha=e2:e5:a4:9b:1c:b1 actions=goto_table:20
3. table=10, priority=200,ip,in_port="coredns5-8ec607",dl_src=12:9e:a6:47:d0:70,nw_src=10.10.0.2 actions=goto_table:30
4. table=10, priority=200,ip,in_port="coredns5-9d9530",dl_src=ba:a8:13:ca:ed:cf,nw_src=10.10.0.3 actions=goto_table:30
5. table=10, priority=200,arp,in_port="coredns5-8ec607",arp_spa=10.10.0.2,arp_sha=12:9e:a6:47:d0:70 actions=goto_table:20
6. table=10, priority=200,arp,in_port="coredns5-9d9530",arp_spa=10.10.0.3,arp_sha=ba:a8:13:ca:ed:cf actions=goto_table:20
7. table=10, priority=0 actions=drop

After this table, ARP traffic goes to ARPResponderTable and IP traffic goes to ConntrackTable; traffic that matches none of these flows is dropped.

ARPResponderTable (20)

The main purpose of this table is to reply to ARP requests from the local gateway asking for the MAC address of a remote peer gateway (another Node’s gateway). This ensures that the local Node can reach any remote Pod, which in particular is required for Service traffic which has been load-balanced to a remote Pod backend by kube-proxy. Note that the table is programmed to reply to such ARP requests with a “Global Virtual MAC” (“Global” because it is used by all Antrea OVS bridges), and not with the actual MAC address of the remote gateway. This ensures that once the traffic is received by the remote OVS bridge, it can be directly forwarded to the appropriate Pod without actually going through the gateway. The Virtual MAC is used as the destination MAC address for all the traffic being tunnelled.

If you dump the flows for this table, you may see the following:

1. table=20, priority=200,arp,arp_tpa=10.10.1.1,arp_op=1 actions=move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],mod_dl_src:aa:bb:cc:dd:ee:ff,load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],load:0xaabbccddeeff->NXM_NX_ARP_SHA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],load:0xa0a0101->NXM_OF_ARP_SPA[],IN_PORT
2. table=20, priority=190,arp actions=NORMAL
3. table=20, priority=0 actions=drop

Flow 1 is the “ARP responder” for the peer Node whose local Pod subnet is 10.10.1.0/24. If we were to look at the routing table for the local Node, we would see the following “onlink” route:

10.10.1.0/24 via 10.10.1.1 dev antrea-gw0 onlink

A similar route is installed on the gateway (antrea-gw0) interface every time the Antrea Node Route Controller is notified that a new Node has joined the cluster. The route must be marked as “onlink” since the kernel does not have a route to the peer gateway 10.10.1.1: we trick the kernel into believing that 10.10.1.1 is directly connected to the local Node, even though it is on the other side of the tunnel.

Flow 2 ensures that OVS handles the remainder of ARP traffic as a regular L2 learning switch (using the normal action). In particular, this takes care of forwarding ARP requests and replies between local Pods.

The table-miss flow entry (flow 3) will drop all other packets. This flow should never be used because only ARP traffic should go to this table, and ARP traffic will either match flow 1 or flow 2.

ConntrackTable (30)

The sole purpose of this table is to invoke the ct action on all packets and set ct_zone to a hard-coded value, before forwarding them to ConntrackStateTable:

1. table=30, priority=200,ip actions=ct(table=31,zone=65520)

Here ct_zone is used to isolate connection-tracking state, similar to a network namespace in Linux, but ct_zone is dedicated to conntrack and has lower overhead. After the ct action, the packet is in the tracked (trk) state and all connection-tracking fields are set to the correct values. The packet then moves on to ConntrackStateTable.

See OVS Conntrack for more details about connection tracking in OVS.
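To inspect the entries created in this dedicated zone, something like the following can be run from the antrea-ovs container (the zone value is the one hard-coded in the flow above):

# List conntrack entries in Antrea's connection-tracking zone
ovs-appctl dpctl/dump-conntrack zone=65520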

ConntrackStateTable (31)

This table handles all packets that are in the tracked state, for the following purposes:

  • Keep track of connections initiated through the gateway port, i.e. connections whose first packet was received on the gateway. For all reply packets of such connections, the destination MAC address is rewritten to the local gateway MAC so that they are forwarded out of the gateway port. This is needed in the following two cases:
    • Reply traffic for connections from a local Pod to a ClusterIP Service.
      • As described earlier, traffic from a local Pod to a ClusterIP Service first goes to the gateway port and is then DNAT'ed by kube-proxy.
      • If replies from the Service's backend Pod went straight from the backend Pod to the local Pod instead of having their destination MAC rewritten toward the gateway port, the local Pod would see the wrong source IP (the backend IP rather than the Service IP).
      • To make the replies correct, their destination MAC must be rewritten to the local gateway MAC so they are forwarded to the gateway, where the source IP is rewritten back to the ClusterIP, i.e. the DNAT is undone.
    • Handle hairpin traffic, i.e. connections between two local Pods for which NAT is performed.
      • One example is a Pod accessing a NodePort Service with externalTrafficPolicy set to Local using the local Node's IP address, since there will be no SNAT for such traffic.
      • Another example is hostPort support.
  • Drop packets reported as invalid by conntrack.

If you dump the flows for this table, you can see:

1. table=31, priority=210,ct_state=-new+trk,ct_mark=0x20,ip,reg0=0x1/0xffff actions=goto_table:40
2. table=31, priority=200,ct_state=+inv+trk,ip actions=drop
3. table=31, priority=200,ct_state=-new+trk,ct_mark=0x20,ip actions=mod_dl_dst:e2:e5:a4:9b:1c:b1,goto_table:40
4. table=31, priority=0 actions=goto_table:40
  • Flows 1 and 3 implement the destination MAC rewrite described above. Note that at this stage no connection has been committed yet; connections are only committed, after all NetworkPolicies have been enforced, in ConntrackCommitTable. That is also where all connections initiated through the gateway get ct_mark 0x20, which these flows match on.

  • Flow 2 drops all invalid traffic.

  • All traffic that is not dropped eventually goes to DNATTable.

DNATTable (40)

The only job of DNATTable is to send traffic destined to Services through the local gateway, without any modification. kube-proxy then takes care of load-balancing the Service traffic across the backend Pods.

If you dump the flows for this table, you can see:

1. table=40, priority=200,ip,nw_dst=10.96.0.0/12 actions=load:0x2->NXM_NX_REG1[],load:0x1->NXM_NX_REG0[16],goto_table:105
2. table=40, priority=0 actions=goto_table:45

In the example above, 10.96.0.0/12 is the Service CIDR (this is the default value used by kubeadm init). This flow is not actually required for forwarding, but to bypass EgressRuleTable and EgressDefaultTable for Service traffic on its way to kube-proxy through the gateway. If we omitted this flow, such traffic would be unconditionally dropped if a Network Policy is applied on the originating Pod. For such traffic, we instead enforce Network Policy egress rules when packets come back through the gateway and the destination IP has been rewritten by kube-proxy (DNAT to a backend for the Service). We cannot output the Service traffic to the gateway port directly as we haven’t committed the connection yet; instead we store the port in NXM_NX_REG1 - similarly to how we process non-Service traffic in L2ForwardingCalcTable - and forward it to ConntrackCommitTable. By committing the connection we ensure that reply traffic (traffic from the Service backend which has already gone through kube-proxy for source IP rewrite) will not be dropped because of Network Policies.

The table-miss flow entry (flow 2) forwards all non-Service traffic to the next table, AntreaPolicyEgressRuleTable (45).

In the future, DNATTable may take over kube-proxy's functionality and handle load balancing / DNAT for Service-bound traffic itself.

AntreaPolicyEgressRuleTable (45)

For this table, you will need to keep in mind the ACNP specification that we are using.

This table is used to implement the egress rules across all Antrea-native policies, except for policies that are created in the Baseline Tier. Antrea-native policies created in the Baseline Tier will be enforced after K8s NetworkPolicies, and their egress rules are installed in the EgressDefaultTable and EgressRuleTable respectively, i.e.

Baseline Tier     ->  EgressDefaultTable(60)
K8s NetworkPolicy ->  EgressRuleTable(50)
All other Tiers   ->  AntreaPolicyEgressRuleTable(45)

Since the example ACNP resides in the Application tier, if you dump the flows for table 45, you should see something like this:

1. table=45, priority=64990,ct_state=-new+est,ip actions=resubmit(,61)
2. table=45, priority=14000,conj_id=1,ip actions=load:0x1->NXM_NX_REG5[],ct(commit,table=61,zone=65520,exec(load:0x1->NXM_NX_CT_LABEL[32..63]))
3. table=45, priority=14000,ip,nw_src=10.10.1.6 actions=conjunction(1,1/3)
4. table=45, priority=14000,ip,nw_dst=10.10.1.8 actions=conjunction(1,2/3)
5. table=45, priority=14000,udp,tp_dst=53 actions=conjunction(1,3/3)
6. table=45, priority=0 actions=resubmit(,50)

Similar to K8s NetworkPolicy implementation, AntreaPolicyEgressRuleTable also relies on the OVS built-in conjunction action to implement policies efficiently.

The above example flows read as follows: if the source IP address is in the set {10.10.1.6}, and the destination IP address is in the set {10.10.1.8}, and the destination UDP port is in the set {53}, then use the conjunction action with id 1, which stores the conj_id 1 in ct_label[32..63] for egress metrics collection purposes, and forwards the packet to EgressMetricsTable, then L3ForwardingTable. Otherwise, go to EgressRuleTable if no conjunctive flow above priority 0 is matched. This corresponds to the case where the packet is not matched by any of the Antrea-native policy egress rules in any tier (except for the “baseline” tier).

If the conjunction action is matched, packets are “allowed” or “dropped” based on the action field of the policy rule. If allowed, they follow a similar path as described in the following EgressRuleTable section.

Unlike K8s NetworkPolicies, Antrea-native policies have no implicit default isolation rules. Hence, they are evaluated as-is, and there is no need for an AntreaPolicyEgressDefaultTable.

EgressRuleTable (50)

For this table, you will need to keep in mind the Network Policy specification that we are using. We have 2 Pods running on the same Node, with IP addresses 10.10.1.2 and 10.10.1.3. They are allowed to talk to each other using TCP on port 80, but nothing else.

This table is used to implement the egress rules across all Network Policies. If you dump the flows for this table, you should see something like this:

1. table=50, priority=210,ct_state=-new+est,ip actions=goto_table:70
2. table=50, priority=200,ip,nw_src=10.10.1.2 actions=conjunction(2,1/3)
3. table=50, priority=200,ip,nw_src=10.10.1.3 actions=conjunction(2,1/3)
4. table=50, priority=200,ip,nw_dst=10.10.1.2 actions=conjunction(2,2/3)
5. table=50, priority=200,ip,nw_dst=10.10.1.3 actions=conjunction(2,2/3)
6. table=50, priority=200,tcp,tp_dst=80 actions=conjunction(2,3/3)
7. table=50, priority=190,conj_id=2,ip actions=load:0x2->NXM_NX_REG5[],ct(commit,table=61,zone=65520,exec(load:0x2->NXM_NX_CT_LABEL[32..63]))
8. table=50, priority=0 actions=goto_table:60

Notice how we use the OVS built-in conjunction action to implement policies efficiently. This enables us to do a conjunctive match across multiple dimensions (source IP, destination IP, port) efficiently without “exploding” the number of flows. By definition of a conjunctive match, we have at least 2 dimensions. For our use-case we have at most 3 dimensions.

The only requirement on conj_id is that it be a unique 32-bit integer within the table. At the moment we use a single custom allocator, which is common to all tables that can have NetworkPolicy flows installed (45, 50, 60, 85, 90 and 100). This is why conj_id is set to 2 in the above example (1 was allocated for the egress rule of our Antrea-native NetworkPolicy example in the previous section).

The above example flows read as follows: if the source IP address is in the set {10.10.1.2, 10.10.1.3}, and the destination IP address is in the set {10.10.1.2, 10.10.1.3}, and the destination TCP port is in the set {80}, then use the conjunction action with id 2, which goes to EgressMetricsTable, and then L3ForwardingTable. Otherwise, the packet goes to EgressDefaultTable.

If the Network Policy specification includes exceptions (except field), then the table will include multiple flows with conjunctive match, corresponding to each CIDR that is present in from or to fields, but not in except field. Network Policy implementation details are not covered in this document.

If the conjunction action is matched, packets are “allowed” and forwarded directly to L3ForwardingTable. Other packets go to EgressDefaultTable. If a connection is established - as a reminder all connections are committed in ConntrackCommitTable - its packets go straight to L3ForwardingTable, with no other match required (see flow 1 above, which has the highest priority). In particular, this ensures that reply traffic is never dropped because of a Network Policy rule. However, this also means that ongoing connections are not affected if the K8s Network Policies are updated.

One thing to keep in mind is that for Service traffic, these rules are applied after the packets have gone through the local gateway and through kube-proxy. At this point the ingress port is no longer the Pod port, but the local gateway port. Therefore we cannot use the port as the match condition to identify if the Pod has been applied a Network Policy - which is what we do for the IngressRuleTable -, but instead have to use the source IP address.

EgressDefaultTable (60)

This table complements EgressRuleTable for Network Policy egress rule implementation. In K8s, when a Network Policy is applied to a set of Pods, the default behavior for these Pods becomes “deny” (they become isolated Pods). This table is in charge of dropping traffic originating from Pods to which a Network Policy (with an egress rule) is applied, and which did not match any of the allowlist rules.

Accordingly, based on our Network Policy example, we would expect to see flows to drop traffic originating from our 2 Pods (10.10.1.2 and 10.10.1.3), which is confirmed by dumping the flows:

1. table=60, priority=200,ip,nw_src=10.10.1.2 actions=drop
2. table=60, priority=200,ip,nw_src=10.10.1.3 actions=drop
3. table=60, priority=0 actions=goto_table:61

This table is also used to implement Antrea-native policy egress rules that are created in the Baseline Tier. Since the Baseline Tier is meant to be enforced after K8s NetworkPolicies, the corresponding flows will be created at a lower priority than K8s default drop flows. For example, a baseline rule to drop egress traffic to 10.0.10.0/24 for a Namespace will look like the following:

1. table=60, priority=80,ip,nw_src=10.10.1.11 actions=conjunction(5,1/2)
2. table=60, priority=80,ip,nw_src=10.10.1.10 actions=conjunction(5,1/2)
3. table=60, priority=80,ip,nw_dst=10.0.10.0/24 actions=conjunction(5,2/2)
4. table=60, priority=80,conj_id=5,ip actions=load:0x3->NXM_NX_REG5[],load:0x1->NXM_NX_REG0[20],resubmit(,61)

The table-miss flow entry, which is used for non-isolated Pods, forwards traffic to the next table, EgressMetricsTable, and then L3ForwardingTable.

L3ForwardingTable (70)

This is the L3 routing table. It is in charge of the following:

  • Tunnelled traffic coming in from a peer Node and destined to a local Pod is directly forwarded to the Pod.
    • The source MAC is set to the local gateway's MAC address and the destination MAC to the Pod's MAC address.
    • The packet is then forwarded to L3DecTTLTable to decrement the IP TTL.
    • Such packets are identified by bit 19 of NXM_NX_REG0 and by their destination IP address (which should match the IP address of a local Pod).
    • One such flow is installed on the Node for each local Pod, as shown below:
table=70, priority=200,ip,reg0=0x80000/0x80000,nw_dst=10.10.0.2 actions=mod_dl_src:e2:e5:a4:9b:1c:b1,mod_dl_dst:12:9e:a6:47:d0:70,goto_table:72
  • All tunnelled traffic destined to the local gateway is forwarded to the gateway port by rewriting the destination MAC (from the Global Virtual MAC to the local gateway's MAC):
table=70, priority=200,ip,reg0=0x80000/0x80000,nw_dst=10.10.0.1 actions=mod_dl_dst:e2:e5:a4:9b:1c:b1,goto_table:80
  • All traffic destined to a remote Pod is forwarded through the appropriate tunnel. This means that we install one flow for each peer Node, each one matching the destination IP address of the packet against the Pod subnet for the Node. In case of a match the source MAC is set to the local gateway MAC, the destination MAC is set to the Global Virtual MAC and we set the OF tun_dst field to the appropriate value (i.e. the IP address of the remote gateway). Traffic then goes to L3DecTTLTable. For a given peer Node, the flow may look like this:
table=70, priority=200,ip,nw_dst=10.10.1.0/24 actions=mod_dl_src:e2:e5:a4:9b:1c:b1,mod_dl_dst:aa:bb:cc:dd:ee:ff,load:0x1->NXM_NX_REG1[],load:0x1->NXM_NX_REG0[16],load:0xc0a84d65->NXM_NX_TUN_IPV4_DST[],goto_table:72

If none of the flows described above are hit, traffic goes directly to L2ForwardingCalcTable. This is the case for external traffic, whose destination is outside the cluster (such traffic has already been forwarded to the local gateway by the local source Pod, and only L2 switching is required), as well as for local Pod-to-Pod traffic.

table=70, priority=0 actions=goto_table:80

When the Egress feature is enabled, there will be two extra flows added into L3ForwardingTable, which send the egress traffic from Pods to the external network to SNATTable (rather than sending the traffic to L2ForwardingCalcTable directly). One of the flows is for egress traffic from local Pods; another one is for egress traffic from remote Pods, which is tunnelled to this Node to be SNAT’d with a SNAT IP configured on the Node. In the latter case, the flow also rewrites the destination MAC to the local gateway interface MAC.

table=70, priority=190,ip,reg0=0x2/0xffff actions=goto_table:71
table=70, priority=190,ip,reg0=0/0xffff actions=mod_dl_dst:e2:e5:a4:9b:1c:b1,goto_table:71

SNATTable (71)

This table is created only when the Egress feature is enabled. It includes flows to implement Egresses and select the right SNAT IPs for egress traffic from Pods to external network.

When no Egress applies to Pods on the Node, and no SNAT IP is configured on the Node, SNATTable just has two flows. One drops egress traffic tunnelled from remote Nodes that does not match any SNAT IP configured on this Node, and the default flow that sends egress traffic from local Pods, which do not have any Egress applied, to L2ForwardingCalcTable. Such traffic will be SNAT’d with the default SNAT IP (by an iptables masquerade rule).

table=71, priority=190,ct_state=+new+trk,ip,reg0=0/0xffff actions=drop
table=71, priority=0 actions=goto_table:80

When there is an Egress applied to a Pod on the Node, a flow will be added for the Pod's egress traffic. If the SNAT IP of the Egress is configured on the local Node, the flow sets an 8-bit ID allocated for the SNAT IP to pkt_mark. The ID is used by iptables SNAT rules to match the packets and perform SNAT with the right SNAT IP (the Antrea Agent adds an iptables SNAT rule for each local SNAT IP that matches the ID).

table=71, priority=200,ct_state=+new+trk,ip,in_port="pod1-7e503a" actions=load:0x1->NXM_NX_PKT_MARK[0..7],goto_table:80
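The iptables side of this can be sketched as follows; the mark value, mask, and SNAT IP are illustrative, and the real rules are managed by antrea-agent in its own chains:

# SNAT packets carrying the pkt_mark ID allocated for this Egress SNAT IP
iptables -t nat -A POSTROUTING -m mark --mark 0x1/0xff -j SNAT --to-source 192.168.77.101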

When the SNAT IP of the Egress is on a remote Node, the flow will tunnel the packets to the remote Node with the tunnel’s destination IP to be the SNAT IP. The packets will be SNAT’d on the remote Node. The same as a normal tunnel flow in L3ForwardingTable, the flow will rewrite the packets’ source and destination MAC addresses, load the SNAT IP to NXM_NX_TUN_IPV4_DST, and send the packets to L3DecTTLTable.

table=71, priority=200,ct_state=+new+trk,ip,in_port="pod2-357c21" actions=mod_dl_src:e2:e5:a4:9b:1c:b1,mod_dl_dst:aa:bb:cc:dd:ee:ff,load:0x1->NXM_NX_REG1[],load:0x1->NXM_NX_REG0[16],load:0xc0a84d66->NXM_NX_TUN_IPV4_DST[],goto_table:72

Last, when a SNAT IP configured for Egresses is on the local Node, an additional flow is added in SNATTable for egress traffic from remote Node that should use the SNAT IP. The flow matches the tunnel destination IP (which should be equal to the SNAT IP), and sets the 8 bits ID of the SNAT IP to pkt_mark.

table=71, priority=200,ct_state=+new+trk,ip,tun_dst="192.168.77.101" actions=load:0x1->NXM_NX_PKT_MARK[0..7],goto_table:80

L3DecTTLTable (72)

This is the table to decrement TTL for the IP packets destined to remote Nodes through a tunnel, or the IP packets received from a tunnel. But for the packets that enter the OVS pipeline from the local gateway and are destined to a remote Node, TTL should not be decremented in OVS on the source Node, because the host IP stack should have already decremented TTL if that is needed.

If you dump the flows for this table, you should see flows like the following:

1. table=72, priority=210,ip,reg0=0x1/0xffff, actions=goto_table:80
2. table=72, priority=200,ip, actions=dec_ttl,goto_table:80
3. table=72, priority=0, actions=goto_table:80

The first flow is to bypass the TTL decrement for the packets from the gateway port.

L2ForwardingCalcTable (80)

This is essentially the “dmac” table of the switch. We program one flow for each port (tunnel port, gateway port, and local Pod ports), as you can see if you dump the flows:

1. table=80, priority=200,dl_dst=aa:bb:cc:dd:ee:ff actions=load:0x1->NXM_NX_REG1[],load:0x1->NXM_NX_REG0[16],goto_table:105
2. table=80, priority=200,dl_dst=e2:e5:a4:9b:1c:b1 actions=load:0x2->NXM_NX_REG1[],load:0x1->NXM_NX_REG0[16],goto_table:105
3. table=80, priority=200,dl_dst=12:9e:a6:47:d0:70 actions=load:0x3->NXM_NX_REG1[],load:0x1->NXM_NX_REG0[16],goto_table:90
4. table=80, priority=200,dl_dst=ba:a8:13:ca:ed:cf actions=load:0x4->NXM_NX_REG1[],load:0x1->NXM_NX_REG0[16],goto_table:90
5. table=80, priority=0 actions=goto_table:105

For each port flow (1 through 5 in the example above), we set bit 16 of the NXM_NX_REG0 register to indicate that there was a matching entry for the destination MAC address and that the packet must be forwarded. In the last table of the pipeline (L2ForwardingOutTable), we will drop all packets for which this bit is not set. We also use the NXM_NX_REG1 register to store the egress port for the packet, which will be used as a parameter to the output OpenFlow action in L2ForwardingOutTable.

The packets that match local Pods’ MAC entries will go to the first table (AntreaPolicyIngressRuleTable when AntreaPolicy is enabled, or IngressRuleTable when AntreaPolicy is not enabled) for NetworkPolicy ingress rules. Other packets will go to ConntrackCommitTable. Specifically, packets to the gateway port or the tunnel port will also go to ConntrackCommitTable and bypass the NetworkPolicy ingress rule tables, as NetworkPolicy ingress rules are not enforced for these packets on the source Node.

What about L2 multicast / broadcast traffic? ARP requests will never reach this table, as they will be handled by the OpenFlow normal action in the ArpResponderTable. As for the rest, if it is IP traffic, it will hit the “last” flow in this table and go to ConntrackCommitTable; and finally the last table of the pipeline (L2ForwardingOutTable), and get dropped there since bit 16 of the NXM_NX_REG0 will not be set. Traffic which is non-ARP and non-IP (assuming any can be received by the switch) is actually dropped much earlier in the pipeline (SpoofGuardTable). In the future, we may need to support more cases for L2 multicast / broadcast traffic.

AntreaPolicyIngressRuleTable (85)

This table is very similar to AntreaPolicyEgressRuleTable, but implements the ingress rules of Antrea-native Policies. Depending on the tier to which the policy belongs, the rules will be installed in a table corresponding to that tier. The ingress table-to-tier mapping is as follows:

Baseline Tier     ->  IngressDefaultTable(100)
K8s NetworkPolicy ->  IngressRuleTable(90)
All other Tiers   ->  AntreaPolicyIngressRuleTable(85)

Again for this table, you will need to keep in mind the ACNP specification that we are using. Since the example ACNP resides in the Application tier, if you dump the flows for table 85, you should see something like this:

1. table=85, priority=64990,ct_state=-new+est,ip actions=resubmit(,105)
2. table=85, priority=14000,conj_id=4,ip actions=load:0x4->NXM_NX_REG3[],load:0x1->NXM_NX_REG0[20],resubmit(,101)
3. table=85, priority=14000,ip,nw_src=10.10.1.7 actions=conjunction(4,1/3)
4. table=85, priority=14000,ip,reg1=0x19c actions=conjunction(4,2/3)
5. table=85, priority=14000,tcp,tp_dst=80 actions=conjunction(4,3/3)
6. table=85, priority=0 actions=resubmit(,90)

As for AntreaPolicyEgressRuleTable, flow 1 (highest priority) ensures that for established connections packets go straight to IngressMetricsTable, then L2ForwardingOutTable, with no other match required.

The rest of the flows read as follows: if the source IP address is in set {10.10.1.7}, and the destination OF port is in the set {412} (which correspond to IP addresses {10.10.1.6}), and the destination TCP port is in the set {80}, then use conjunction action with id 4, which loads the conj_id 4 into NXM_NX_REG3, a register used by Antrea internally to indicate the disposition of the packet is Drop, and forward the packet to IngressMetricsTable for it to be dropped.

Otherwise, go to IngressRuleTable if no conjunctive flow above priority 0 is matched. This corresponds to the case where the packet is not matched by any of the Antrea-native policy ingress rules in any tier (except for the “baseline” tier). One notable difference is how we use OF ports to identify the destination of the traffic, while we use IP addresses in AntreaPolicyEgressRuleTable to identify the source of the traffic. More details regarding this can be found in the following IngressRuleTable section.

As seen in AntreaPolicyEgressRuleTable, the default action is to evaluate the K8s Network Policy rules in IngressRuleTable, and there is no AntreaPolicyIngressDefaultTable.

IngressRuleTable (90)

This table is very similar to EgressRuleTable, but implements ingress rules for Network Policies. Once again, you will need to keep in mind the Network Policy specification that we are using. We have 2 Pods running on the same Node, with IP addresses 10.10.1.2 and 10.10.1.3. They are allowed to talk to each other using TCP on port 80, but nothing else.

If you dump the flows for this table, you should see something like this:

1. table=90, priority=210,ct_state=-new+est,ip actions=goto_table:101
2. table=90, priority=210,pkt_mark=0x1/0x1 actions=goto_table:105
3. table=90, priority=200,ip,nw_src=10.10.1.2 actions=conjunction(3,1/3)
4. table=90, priority=200,ip,nw_src=10.10.1.3 actions=conjunction(3,1/3)
5. table=90, priority=200,ip,reg1=0x3 actions=conjunction(3,2/3)
6. table=90, priority=200,ip,reg1=0x4 actions=conjunction(3,2/3)
7. table=90, priority=200,tcp,tp_dst=80 actions=conjunction(3,3/3)
8. table=90, priority=190,conj_id=3,ip actions=load:0x3->NXM_NX_REG6[],ct(commit,table=101,zone=65520,exec(load:0x3->NXM_NX_CT_LABEL[0..31]))
9. table=90, priority=0 actions=goto_table:100

As for EgressRuleTable, flow 1 (highest priority) ensures that for established connections - as a reminder all connections are committed in ConntrackCommitTable - packets go straight to IngressMetricsTable, then L2ForwardingOutTable, with no other match required.

Flow 2 ensures that the traffic initiated from the host network namespace cannot be dropped because of Network Policies. This ensures that K8s liveness probes can go through. An iptables rule in the mangle table of the host network namespace is responsible for marking the locally-generated packets with the 0x1/0x1 mark. Note that the flow will be different for Windows worker Node or when OVS userspace (netdev) datapath is used. This is because either there is no way to add mark for particular traffic (i.e. Windows) or matching the mark in OVS is not properly supported (i.e. netdev datapath). As a result, the flow will match source IP instead, however, NodePort Service access by external clients will be masqueraded as a local gateway IP to bypass Network Policies. This may be fixed after AntreaProxy can serve NodePort traffic.
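A conceptual sketch of such a mangle rule is shown below; it is not the exact rule antrea-agent installs, but it illustrates how locally-generated traffic gets the 0x1/0x1 mark that flow 2 matches:

# Mark host-generated packets leaving through the Antrea gateway port
iptables -t mangle -A OUTPUT -o antrea-gw0 -m addrtype --src-type LOCAL -j MARK --set-xmark 0x1/0x1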

The rest of the flows read as follows: if the source IP address is in the set {10.10.1.2, 10.10.1.3}, and the destination OF port is in the set {3, 4} (which correspond to IP addresses {10.10.1.2, 10.10.1.3}), and the destination TCP port is in the set {80}, then use the conjunction action with id 3, which stores the conj_id 3 in ct_label[0..31] for ingress metrics collection purposes, and forwards the packet to IngressMetricsTable, then L2ForwardingOutTable. Otherwise, go to IngressDefaultTable. One notable difference is how we use OF ports to identify the destination of the traffic, while we use IP addresses in EgressRuleTable to identify the source of the traffic. We do this as an increased security measure in case a local Pod is misbehaving and trying to access another local Pod using the correct destination MAC address but a different destination IP address to bypass an egress Network Policy rule. This is also why the Network Policy ingress rules are enforced after the egress port has been determined.

IngressDefaultTable (100)

This table is similar in its purpose to EgressDefaultTable, and it complements IngressRuleTable for Network Policy ingress rule implementation. In K8s, when a Network Policy is applied to a set of Pods, the default behavior for these Pods becomes “deny” (they become isolated Pods). This table is in charge of dropping traffic destined to Pods to which a Network Policy (with an ingress rule) is applied, and which did not match any of the allowlist rules.

Accordingly, based on our Network Policy example, we would expect to see flows to drop traffic destined to our 2 Pods (OF ports 3 and 4), which is confirmed by dumping the flows:

1. table=100, priority=200,ip,reg1=0x3 actions=drop
2. table=100, priority=200,ip,reg1=0x4 actions=drop
3. table=100, priority=0 actions=goto_table:105

Similar to the EgressDefaultTable, this table is also used to implement Antrea-native policy ingress rules that are created in the Baseline Tier. Since the Baseline Tier is meant to be enforced after K8s NetworkPolicies, the corresponding flows will be created at a lower priority than K8s default drop flows. For example, a baseline rule to isolate ingress traffic for a Namespace will look like the following:

table=100, priority=80,ip,reg1=0xb actions=conjunction(6,2/3)
table=100, priority=80,ip,reg1=0xc actions=conjunction(6,2/3)
table=100, priority=80,ip,nw_src=10.10.1.9 actions=conjunction(6,1/3)
table=100, priority=80,ip,nw_src=10.10.1.7 actions=conjunction(6,1/3)
table=100, priority=80,tcp,tp_dst=8080 actions=conjunction(6,3/3)
table=100, priority=80,conj_id=6,ip actions=load:0x6->NXM_NX_REG3[],load:0x1->NXM_NX_REG0[20],resubmit(,101)

The table-miss flow entry, which is used for non-isolated Pods, forwards traffic to the next table (ConntrackCommitTable).

ConntrackCommitTable (105)

As mentioned before, this table is in charge of committing all new connections which are not dropped because of Network Policies. If you dump the flows for this table, you should see something like this:

1. table=105, priority=200,ct_state=+new+trk,ip,reg0=0x1/0xffff actions=ct(commit,table=110,zone=65520,exec(load:0x20->NXM_NX_CT_MARK[]))
2. table=105, priority=190,ct_state=+new+trk,ip actions=ct(commit,table=110,zone=65520)
3. table=105, priority=0 actions=goto_table:110

Flow 1 ensures that we commit connections initiated through the gateway interface and mark them with a ct_mark of 0x20. This ensures that ConntrackStateTable can perform its functions correctly and rewrite the destination MAC address to the gateway's MAC address for connections which require it. Such connections include Pod-to-ClusterIP traffic. Note that the 0x20 mark is applied to all connections initiated through the gateway (i.e. for which the first packet of the connection was received through the gateway) and that ConntrackStateTable will perform the destination MAC rewrite for the reply traffic of all such connections. In some cases (the ones described for ConntrackStateTable), this rewrite is necessary. For others (e.g. a connection from the host to a local Pod), this rewrite is not necessary but is also harmless, as the destination MAC is already correct.

Flow 2 commits all other new connections.

All traffic then goes to the next table (L2ForwardingOutTable).

L2ForwardingOutTable (110)

It is a simple table and if you dump the flows for this table, you should only see 2 flows:

1. table=110, priority=200,ip,reg0=0x10000/0x10000 actions=output:NXM_NX_REG1[]
2. table=110, priority=0, actions=drop

The first flow outputs all unicast packets to the correct port (the port was resolved by the “dmac” table, L2ForwardingCalcTable). IP packets for which L2ForwardingCalcTable did not set bit 16 of NXM_NX_REG0 will be dropped.
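Finally, to see how a concrete packet traverses all the tables described above, OVS's ofproto/trace is handy. The port name and addresses below are taken from the earlier examples and must be replaced with real values from your Node:

# Trace a Pod-to-Pod packet through the whole pipeline (run inside the antrea-ovs container)
ovs-appctl ofproto/trace br-int in_port=coredns5-8ec607,ip,nw_src=10.10.0.2,nw_dst=10.10.0.3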

References