
[Kubernetes] Pod

A Pod is the smallest deployable unit that can be created, scheduled, and managed in a Kubernetes cluster. It is a group of one or more containers, the simplest Kubernetes object, and one of the most fundamental concepts in Kubernetes. Containers in the same Pod share a network namespace, IP address, and port space. In terms of lifecycle, a Pod is ephemeral rather than durable: it is scheduled onto a node and stays on that node until it is destroyed.

This article examines Pods in two parts: the first introduces the basic concepts and common features of a Pod, and the second walks through the source code to show how the entire lifecycle, from creation to deletion, is implemented.

Overview

As the basic unit of a Kubernetes cluster, the Pod is the smallest and simplest Kubernetes object, yet this simple object is enough to start a backend process on its own and serve callers inside the cluster. In the previous article, "从 Kubernetes 中的对象谈起" (Talking about Objects in Kubernetes), we showed how a simple Kubernetes Pod is described in YAML:

apiVersion: v1
kind: Pod
metadata:
  name: busybox
  labels:
    app: busybox
spec:
  containers:
  - image: busybox
    command:
      - sleep
      - "3600"
    imagePullPolicy: IfNotPresent
    name: busybox
  restartPolicy: Always

This YAML file describes the container and command the Pod runs when it starts, as well as its restart policy, that is, whether Kubernetes should bring the Pod back up after it errors out or finishes execution. Besides these prominent settings, the metadata section is also important: name uniquely identifies the object within its namespace, and labels let us select objects quickly.

Within a Pod, a few concepts deserve special attention. The first is the container: a Pod can run one or more containers at the same time, and these containers share resources such as the network, storage, CPU, and memory. In this section we will look at three key concepts inside a Pod: containers, volumes, and the network.

Pod Spec

// PodSpec is a description of a pod.
type PodSpec struct {
// List of volumes that can be mounted by containers belonging to the pod.
// More info: https://kubernetes.io/docs/concepts/storage/volumes
// +optional
// +patchMergeKey=name
// +patchStrategy=merge,retainKeys
Volumes []Volume `json:"volumes,omitempty" patchStrategy:"merge,retainKeys" patchMergeKey:"name" protobuf:"bytes,1,rep,name=volumes"`
// List of initialization containers belonging to the pod.
// Init containers are executed in order prior to containers being started. If any
// init container fails, the pod is considered to have failed and is handled according
// to its restartPolicy. The name for an init container or normal container must be
// unique among all containers.
// Init containers may not have Lifecycle actions, Readiness probes, Liveness probes, or Startup probes.
// The resourceRequirements of an init container are taken into account during scheduling
// by finding the highest request/limit for each resource type, and then using the max of
// of that value or the sum of the normal containers. Limits are applied to init containers
// in a similar fashion.
// Init containers cannot currently be added or removed.
// Cannot be updated.
// More info: https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
// +patchMergeKey=name
// +patchStrategy=merge
InitContainers []Container `json:"initContainers,omitempty" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,20,rep,name=initContainers"`
// List of containers belonging to the pod.
// Containers cannot currently be added or removed.
// There must be at least one container in a Pod.
// Cannot be updated.
// +patchMergeKey=name
// +patchStrategy=merge
Containers []Container `json:"containers" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,2,rep,name=containers"`
// List of ephemeral containers run in this pod. Ephemeral containers may be run in an existing
// pod to perform user-initiated actions such as debugging. This list cannot be specified when
// creating a pod, and it cannot be modified by updating the pod spec. In order to add an
// ephemeral container to an existing pod, use the pod's ephemeralcontainers subresource.
// This field is alpha-level and is only honored by servers that enable the EphemeralContainers feature.
// +optional
// +patchMergeKey=name
// +patchStrategy=merge
EphemeralContainers []EphemeralContainer `json:"ephemeralContainers,omitempty" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,34,rep,name=ephemeralContainers"`
// Restart policy for all containers within the pod.
// One of Always, OnFailure, Never.
// Default to Always.
// More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy
// +optional
RestartPolicy RestartPolicy `json:"restartPolicy,omitempty" protobuf:"bytes,3,opt,name=restartPolicy,casttype=RestartPolicy"`
// Optional duration in seconds the pod needs to terminate gracefully. May be decreased in delete request.
// Value must be non-negative integer. The value zero indicates delete immediately.
// If this value is nil, the default grace period will be used instead.
// The grace period is the duration in seconds after the processes running in the pod are sent
// a termination signal and the time when the processes are forcibly halted with a kill signal.
// Set this value longer than the expected cleanup time for your process.
// Defaults to 30 seconds.
// +optional
TerminationGracePeriodSeconds *int64 `json:"terminationGracePeriodSeconds,omitempty" protobuf:"varint,4,opt,name=terminationGracePeriodSeconds"`
// Optional duration in seconds the pod may be active on the node relative to
// StartTime before the system will actively try to mark it failed and kill associated containers.
// Value must be a positive integer.
// +optional
ActiveDeadlineSeconds *int64 `json:"activeDeadlineSeconds,omitempty" protobuf:"varint,5,opt,name=activeDeadlineSeconds"`
// Set DNS policy for the pod.
// Defaults to "ClusterFirst".
// Valid values are 'ClusterFirstWithHostNet', 'ClusterFirst', 'Default' or 'None'.
// DNS parameters given in DNSConfig will be merged with the policy selected with DNSPolicy.
// To have DNS options set along with hostNetwork, you have to specify DNS policy
// explicitly to 'ClusterFirstWithHostNet'.
// +optional
DNSPolicy DNSPolicy `json:"dnsPolicy,omitempty" protobuf:"bytes,6,opt,name=dnsPolicy,casttype=DNSPolicy"`
// NodeSelector is a selector which must be true for the pod to fit on a node.
// Selector which must match a node's labels for the pod to be scheduled on that node.
// More info: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
// +optional
NodeSelector map[string]string `json:"nodeSelector,omitempty" protobuf:"bytes,7,rep,name=nodeSelector"`

// ServiceAccountName is the name of the ServiceAccount to use to run this pod.
// More info: https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/
// +optional
ServiceAccountName string `json:"serviceAccountName,omitempty" protobuf:"bytes,8,opt,name=serviceAccountName"`
// DeprecatedServiceAccount is a depreciated alias for ServiceAccountName.
// Deprecated: Use serviceAccountName instead.
// +k8s:conversion-gen=false
// +optional
DeprecatedServiceAccount string `json:"serviceAccount,omitempty" protobuf:"bytes,9,opt,name=serviceAccount"`
// AutomountServiceAccountToken indicates whether a service account token should be automatically mounted.
// +optional
AutomountServiceAccountToken *bool `json:"automountServiceAccountToken,omitempty" protobuf:"varint,21,opt,name=automountServiceAccountToken"`

// NodeName is a request to schedule this pod onto a specific node. If it is non-empty,
// the scheduler simply schedules this pod onto that node, assuming that it fits resource
// requirements.
// +optional
NodeName string `json:"nodeName,omitempty" protobuf:"bytes,10,opt,name=nodeName"`
// Host networking requested for this pod. Use the host's network namespace.
// If this option is set, the ports that will be used must be specified.
// Default to false.
// +k8s:conversion-gen=false
// +optional
HostNetwork bool `json:"hostNetwork,omitempty" protobuf:"varint,11,opt,name=hostNetwork"`
// Use the host's pid namespace.
// Optional: Default to false.
// +k8s:conversion-gen=false
// +optional
HostPID bool `json:"hostPID,omitempty" protobuf:"varint,12,opt,name=hostPID"`
// Use the host's ipc namespace.
// Optional: Default to false.
// +k8s:conversion-gen=false
// +optional
HostIPC bool `json:"hostIPC,omitempty" protobuf:"varint,13,opt,name=hostIPC"`
// Share a single process namespace between all of the containers in a pod.
// When this is set containers will be able to view and signal processes from other containers
// in the same pod, and the first process in each container will not be assigned PID 1.
// HostPID and ShareProcessNamespace cannot both be set.
// Optional: Default to false.
// +k8s:conversion-gen=false
// +optional
ShareProcessNamespace *bool `json:"shareProcessNamespace,omitempty" protobuf:"varint,27,opt,name=shareProcessNamespace"`
// SecurityContext holds pod-level security attributes and common container settings.
// Optional: Defaults to empty. See type description for default values of each field.
// +optional
SecurityContext *PodSecurityContext `json:"securityContext,omitempty" protobuf:"bytes,14,opt,name=securityContext"`
// ImagePullSecrets is an optional list of references to secrets in the same namespace to use for pulling any of the images used by this PodSpec.
// If specified, these secrets will be passed to individual puller implementations for them to use. For example,
// in the case of docker, only DockerConfig type secrets are honored.
// More info: https://kubernetes.io/docs/concepts/containers/images#specifying-imagepullsecrets-on-a-pod
// +optional
// +patchMergeKey=name
// +patchStrategy=merge
ImagePullSecrets []LocalObjectReference `json:"imagePullSecrets,omitempty" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,15,rep,name=imagePullSecrets"`
// Specifies the hostname of the Pod
// If not specified, the pod's hostname will be set to a system-defined value.
// +optional
Hostname string `json:"hostname,omitempty" protobuf:"bytes,16,opt,name=hostname"`
// If specified, the fully qualified Pod hostname will be "<hostname>.<subdomain>.<pod namespace>.svc.<cluster domain>".
// If not specified, the pod will not have a domainname at all.
// +optional
Subdomain string `json:"subdomain,omitempty" protobuf:"bytes,17,opt,name=subdomain"`
// If specified, the pod's scheduling constraints
// +optional
Affinity *Affinity `json:"affinity,omitempty" protobuf:"bytes,18,opt,name=affinity"`
// If specified, the pod will be dispatched by specified scheduler.
// If not specified, the pod will be dispatched by default scheduler.
// +optional
SchedulerName string `json:"schedulerName,omitempty" protobuf:"bytes,19,opt,name=schedulerName"`
// If specified, the pod's tolerations.
// +optional
Tolerations []Toleration `json:"tolerations,omitempty" protobuf:"bytes,22,opt,name=tolerations"`
// HostAliases is an optional list of hosts and IPs that will be injected into the pod's hosts
// file if specified. This is only valid for non-hostNetwork pods.
// +optional
// +patchMergeKey=ip
// +patchStrategy=merge
HostAliases []HostAlias `json:"hostAliases,omitempty" patchStrategy:"merge" patchMergeKey:"ip" protobuf:"bytes,23,rep,name=hostAliases"`
// If specified, indicates the pod's priority. "system-node-critical" and
// "system-cluster-critical" are two special keywords which indicate the
// highest priorities with the former being the highest priority. Any other
// name must be defined by creating a PriorityClass object with that name.
// If not specified, the pod priority will be default or zero if there is no
// default.
// +optional
PriorityClassName string `json:"priorityClassName,omitempty" protobuf:"bytes,24,opt,name=priorityClassName"`
// The priority value. Various system components use this field to find the
// priority of the pod. When Priority Admission Controller is enabled, it
// prevents users from setting this field. The admission controller populates
// this field from PriorityClassName.
// The higher the value, the higher the priority.
// +optional
Priority *int32 `json:"priority,omitempty" protobuf:"bytes,25,opt,name=priority"`
// Specifies the DNS parameters of a pod.
// Parameters specified here will be merged to the generated DNS
// configuration based on DNSPolicy.
// +optional
DNSConfig *PodDNSConfig `json:"dnsConfig,omitempty" protobuf:"bytes,26,opt,name=dnsConfig"`
// If specified, all readiness gates will be evaluated for pod readiness.
// A pod is ready when all its containers are ready AND
// all conditions specified in the readiness gates have status equal to "True"
// More info: https://git.k8s.io/enhancements/keps/sig-network/0007-pod-ready%2B%2B.md
// +optional
ReadinessGates []PodReadinessGate `json:"readinessGates,omitempty" protobuf:"bytes,28,opt,name=readinessGates"`
// RuntimeClassName refers to a RuntimeClass object in the node.k8s.io group, which should be used
// to run this pod. If no RuntimeClass resource matches the named class, the pod will not be run.
// If unset or empty, the "legacy" RuntimeClass will be used, which is an implicit class with an
// empty definition that uses the default runtime handler.
// More info: https://git.k8s.io/enhancements/keps/sig-node/runtime-class.md
// This is a beta feature as of Kubernetes v1.14.
// +optional
RuntimeClassName *string `json:"runtimeClassName,omitempty" protobuf:"bytes,29,opt,name=runtimeClassName"`
// EnableServiceLinks indicates whether information about services should be injected into pod's
// environment variables, matching the syntax of Docker links.
// Optional: Defaults to true.
// +optional
EnableServiceLinks *bool `json:"enableServiceLinks,omitempty" protobuf:"varint,30,opt,name=enableServiceLinks"`
// PreemptionPolicy is the Policy for preempting pods with lower priority.
// One of Never, PreemptLowerPriority.
// Defaults to PreemptLowerPriority if unset.
// This field is alpha-level and is only honored by servers that enable the NonPreemptingPriority feature.
// +optional
PreemptionPolicy *PreemptionPolicy `json:"preemptionPolicy,omitempty" protobuf:"bytes,31,opt,name=preemptionPolicy"`
// Overhead represents the resource overhead associated with running a pod for a given RuntimeClass.
// This field will be autopopulated at admission time by the RuntimeClass admission controller. If
// the RuntimeClass admission controller is enabled, overhead must not be set in Pod create requests.
// The RuntimeClass admission controller will reject Pod create requests which have the overhead already
// set. If RuntimeClass is configured and selected in the PodSpec, Overhead will be set to the value
// defined in the corresponding RuntimeClass, otherwise it will remain unset and treated as zero.
// More info: https://git.k8s.io/enhancements/keps/sig-node/20190226-pod-overhead.md
// This field is alpha-level as of Kubernetes v1.16, and is only honored by servers that enable the PodOverhead feature.
// +optional
Overhead ResourceList `json:"overhead,omitempty" protobuf:"bytes,32,opt,name=overhead"`
// TopologySpreadConstraints describes how a group of pods ought to spread across topology
// domains. Scheduler will schedule pods in a way which abides by the constraints.
// This field is only honored by clusters that enable the EvenPodsSpread feature.
// All topologySpreadConstraints are ANDed.
// +optional
// +patchMergeKey=topologyKey
// +patchStrategy=merge
// +listType=map
// +listMapKey=topologyKey
// +listMapKey=whenUnsatisfiable
TopologySpreadConstraints []TopologySpreadConstraint `json:"topologySpreadConstraints,omitempty" patchStrategy:"merge" patchMergeKey:"topologyKey" protobuf:"bytes,33,opt,name=topologySpreadConstraints"`
}
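Most of these fields map directly to keys under spec in a Pod manifest. As a rough illustration only (the field names come from the struct above, while the concrete values are made up for the example), a manifest exercising a few of them might look like this:

apiVersion: v1
kind: Pod
metadata:
  name: spec-demo
spec:
  restartPolicy: OnFailure
  terminationGracePeriodSeconds: 60
  dnsPolicy: ClusterFirst
  nodeSelector:
    disktype: ssd
  hostAliases:
  - ip: "10.0.0.10"
    hostnames:
    - "internal.example.local"
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]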

Pod Phase

// These are the valid statuses of pods.
const (
// PodPending means the pod has been accepted by the system, but one or more of the containers
// has not been started. This includes time before being bound to a node, as well as time spent
// pulling images onto the host.
PodPending PodPhase = "Pending"
// PodRunning means the pod has been bound to a node and all of the containers have been started.
// At least one container is still running or is in the process of being restarted.
PodRunning PodPhase = "Running"
// PodSucceeded means that all containers in the pod have voluntarily terminated
// with a container exit code of 0, and the system is not going to restart any of these containers.
PodSucceeded PodPhase = "Succeeded"
// PodFailed means that all containers in the pod have terminated, and at least one container has
// terminated in a failure (exited with a non-zero exit code or was stopped by the system).
PodFailed PodPhase = "Failed"
// PodUnknown means that for some reason the state of the pod could not be obtained, typically due
// to an error in communicating with the host of the pod.
PodUnknown PodPhase = "Unknown"
)

Affinity

// Pod affinity is a group of inter pod affinity scheduling rules.
type PodAffinity struct {
// NOT YET IMPLEMENTED. TODO: Uncomment field once it is implemented.
// If the affinity requirements specified by this field are not met at
// scheduling time, the pod will not be scheduled onto the node.
// If the affinity requirements specified by this field cease to be met
// at some point during pod execution (e.g. due to a pod label update), the
// system will try to eventually evict the pod from its node.
// When there are multiple elements, the lists of nodes corresponding to each
// podAffinityTerm are intersected, i.e. all terms must be satisfied.
// +optional
// RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`

// If the affinity requirements specified by this field are not met at
// scheduling time, the pod will not be scheduled onto the node.
// If the affinity requirements specified by this field cease to be met
// at some point during pod execution (e.g. due to a pod label update), the
// system may or may not try to eventually evict the pod from its node.
// When there are multiple elements, the lists of nodes corresponding to each
// podAffinityTerm are intersected, i.e. all terms must be satisfied.
// +optional
RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty" protobuf:"bytes,1,rep,name=requiredDuringSchedulingIgnoredDuringExecution"`
// The scheduler will prefer to schedule pods to nodes that satisfy
// the affinity expressions specified by this field, but it may choose
// a node that violates one or more of the expressions. The node that is
// most preferred is the one with the greatest sum of weights, i.e.
// for each node that meets all of the scheduling requirements (resource
// request, requiredDuringScheduling affinity expressions, etc.),
// compute a sum by iterating through the elements of this field and adding
// "weight" to the sum if the node has pods which matches the corresponding podAffinityTerm; the
// node(s) with the highest sum are the most preferred.
// +optional
PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty" protobuf:"bytes,2,rep,name=preferredDuringSchedulingIgnoredDuringExecution"`
}

// Pod anti affinity is a group of inter pod anti affinity scheduling rules.
type PodAntiAffinity struct {
// NOT YET IMPLEMENTED. TODO: Uncomment field once it is implemented.
// If the anti-affinity requirements specified by this field are not met at
// scheduling time, the pod will not be scheduled onto the node.
// If the anti-affinity requirements specified by this field cease to be met
// at some point during pod execution (e.g. due to a pod label update), the
// system will try to eventually evict the pod from its node.
// When there are multiple elements, the lists of nodes corresponding to each
// podAffinityTerm are intersected, i.e. all terms must be satisfied.
// +optional
// RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`

// If the anti-affinity requirements specified by this field are not met at
// scheduling time, the pod will not be scheduled onto the node.
// If the anti-affinity requirements specified by this field cease to be met
// at some point during pod execution (e.g. due to a pod label update), the
// system may or may not try to eventually evict the pod from its node.
// When there are multiple elements, the lists of nodes corresponding to each
// podAffinityTerm are intersected, i.e. all terms must be satisfied.
// +optional
RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty" protobuf:"bytes,1,rep,name=requiredDuringSchedulingIgnoredDuringExecution"`
// The scheduler will prefer to schedule pods to nodes that satisfy
// the anti-affinity expressions specified by this field, but it may choose
// a node that violates one or more of the expressions. The node that is
// most preferred is the one with the greatest sum of weights, i.e.
// for each node that meets all of the scheduling requirements (resource
// request, requiredDuringScheduling anti-affinity expressions, etc.),
// compute a sum by iterating through the elements of this field and adding
// "weight" to the sum if the node has pods which matches the corresponding podAffinityTerm; the
// node(s) with the highest sum are the most preferred.
// +optional
PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty" protobuf:"bytes,2,rep,name=preferredDuringSchedulingIgnoredDuringExecution"`
}
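In a manifest these rules live under spec.affinity. A hedged example (the label keys, values, and topology key are illustrative): the Pod below must land on a node that already runs a Pod labeled app: cache, and it prefers to avoid nodes that already run another Pod labeled app: web:

spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: cache
        topologyKey: kubernetes.io/hostname
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: web
          topologyKey: kubernetes.io/hostname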

Containers

Every Kubernetes Pod can contain two kinds of containers, each with a clear responsibility. One kind is the init container (InitContainer), which runs when the Pod starts and is mainly used to initialize configuration; the other is the regular Container that lives inside the Pod while it is in the Running state, whose main job is to serve external traffic or to process asynchronous tasks as a worker, and so on.
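As a minimal sketch of how the two kinds of containers are declared (the Pod name, images, commands, and the my-db Service name are hypothetical), init containers go under spec.initContainers and run to completion, one by one, before anything under spec.containers is started:

apiVersion: v1
kind: Pod
metadata:
  name: init-demo
spec:
  initContainers:
  - name: wait-for-db             # runs to completion before the app container starts
    image: busybox
    command: ["sh", "-c", "until nslookup my-db; do echo waiting; sleep 2; done"]
  containers:
  - name: app                     # the long-running container
    image: busybox
    command: ["sleep", "3600"]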

kubernetes-pod-init-and-regular-containers

As the naming of the two container types suggests, init containers start before regular containers; the kubeGenericRuntimeManager.SyncPod method starts the two kinds of containers one after the other.

func (m *kubeGenericRuntimeManager) SyncPod(pod *v1.Pod, _ v1.PodStatus, podStatus *kubecontainer.PodStatus, pullSecrets []v1.Secret, backOff *flowcontrol.Backoff) (result kubecontainer.PodSyncResult) {
    // Step 1: Compute sandbox and container changes.
    // Step 2: Kill the pod if the sandbox has changed.
    // Step 3: kill any running containers in this pod which are not to keep.
    // Step 4: Create a sandbox for the pod if necessary.
    // ...

    // Step 5: start the init container.
    if container := podContainerChanges.NextInitContainerToStart; container != nil {
        msg, _ := m.startContainer(podSandboxID, podSandboxConfig, container, pod, podStatus, pullSecrets, podIP, kubecontainer.ContainerTypeInit)
    }

    // Step 6: start containers in podContainerChanges.ContainersToStart.
    for _, idx := range podContainerChanges.ContainersToStart {
        container := &pod.Spec.Containers[idx]

        msg, _ := m.startContainer(podSandboxID, podSandboxConfig, container, pod, podStatus, pullSecrets, podIP, kubecontainer.ContainerTypeRegular)
    }

    return
}

By analyzing the implementation of the private startContainer method we can see that the container type ultimately only affects the labels created for debugging, so from Kubernetes' point of view the only difference between the two kinds of containers is the order in which they are started and executed.

Containers in the same Pod can share directories through volumes (Volume), and these volumes can hold persistent data; when the Pod fails or is rolled over, the data in such a volume is not wiped but is mounted again into the expected directory after the Pod restarts:

kubernetes-containers-share-volumes
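A minimal sketch of two containers sharing a directory through a volume (the names are illustrative; note that an emptyDir only lives as long as the Pod, while data that must survive Pod deletion would come from a PersistentVolumeClaim instead):

apiVersion: v1
kind: Pod
metadata:
  name: shared-volume-demo
spec:
  volumes:
  - name: shared-data
    emptyDir: {}
  containers:
  - name: writer
    image: busybox
    command: ["sh", "-c", "echo hello > /data/hello && sleep 3600"]
    volumeMounts:
    - name: shared-data
      mountPath: /data
  - name: reader
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: shared-data
      mountPath: /data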

The private syncPod method in kubelet.go calls WaitForAttachAndMount to wait for the volumes required by the Pod to be attached and mounted before the Pod starts:

func (vm *volumeManager) WaitForAttachAndMount(pod *v1.Pod) error {
    expectedVolumes := getExpectedVolumes(pod)
    uniquePodName := util.GetUniquePodName(pod)

    vm.desiredStateOfWorldPopulator.ReprocessPod(uniquePodName)

    wait.PollImmediate(
        podAttachAndMountRetryInterval,
        podAttachAndMountTimeout,
        vm.verifyVolumesMountedFunc(uniquePodName, expectedVolumes))

    return nil
}

We will cover in detail how volumes are created and mounted in Kubernetes in a later chapter; for now, what we need to know is that mounting the volumes is work that must be finished before the Pod can start:

func (kl *Kubelet) syncPod(o syncPodOptions) error {
    // ...

    if !kl.podIsTerminated(pod) {
        kl.volumeManager.WaitForAttachAndMount(pod)
    }

    pullSecrets := kl.getPullSecretsForPod(pod)

    result := kl.containerRuntime.SyncPod(pod, apiPodStatus, podStatus, pullSecrets, kl.backOff)
    kl.reasonCache.Update(pod.UID, result)

    return nil
}

Once the Pod's volumes are ready, the exported SyncPod method mentioned in the previous section is called to continue synchronizing the Pod's state and creating and starting its containers.

Network

The containers in a Pod are placed on the same host and share a network stack, which means they can reach each other's ports and services via localhost, and they will conflict if they try to use the same port. All containers in a Pod attach to the same network device, and this device is created when the sandbox container of the Pod sandbox is started in the RunPodSandbox method:

func (ds *dockerService) RunPodSandbox(ctx context.Context, r *runtimeapi.RunPodSandboxRequest) (*runtimeapi.RunPodSandboxResponse, error) {
    config := r.GetConfig()

    // Step 1: Pull the image for the sandbox.
    image := defaultSandboxImage

    // Step 2: Create the sandbox container.
    createConfig, _ := ds.makeSandboxDockerConfig(config, image)
    createResp, _ := ds.client.CreateContainer(*createConfig)

    resp := &runtimeapi.RunPodSandboxResponse{PodSandboxId: createResp.ID}

    ds.setNetworkReady(createResp.ID, false)

    // Step 3: Create Sandbox Checkpoint.
    ds.checkpointManager.CreateCheckpoint(createResp.ID, constructPodSandboxCheckpoint(config))

    // Step 4: Start the sandbox container.
    ds.client.StartContainer(createResp.ID)

    // Step 5: Setup networking for the sandbox.
    cID := kubecontainer.BuildContainerID(runtimeName, createResp.ID)
    networkOptions := make(map[string]string)
    ds.network.SetUpPod(config.GetMetadata().Namespace, config.GetMetadata().Name, cID, config.Annotations, networkOptions)

    return resp, nil
}

The sandbox container is simply the pause container; the defaultSandboxImage referenced above is the official k8s.gcr.io/pause:3.1 image. This method pulls the sandbox image, creates the sandbox container and its checkpoint, and then starts the container.

kubernetes-pod-network

On each node, the Kubernetes network plugin Kubenet creates a basic cbr0 bridge and a veth virtual network device for every Pod. All containers in the same Pod share the network through this device, which is why they can reach each other's exposed ports and services via localhost.
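A small sketch of this behavior (the image names and the port are only for illustration): because both containers share the sandbox's network namespace, the second container can reach the first simply through localhost:

apiVersion: v1
kind: Pod
metadata:
  name: localhost-demo
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - containerPort: 80
  - name: sidecar
    image: busybox
    # polls the web container over the shared loopback interface
    command: ["sh", "-c", "while true; do wget -q -O- http://localhost:80 >/dev/null; sleep 10; done"]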

Summary

Every Pod in Kubernetes can contain multiple containers, and once they are created by Kubernetes these containers share network and storage. This is a very important property of Pods, and it lets us build fairly complex service topologies and dependency relationships.

Lifecycle

The best and fastest way to understand how Pods are implemented is to start from the Pod lifecycle. By understanding how Pods are created, restarted, and deleted, we can systematically grasp the Pod lifecycle and its core mechanics.

kubernetes-pod-lifecycle

After a Pod is created, it enters the health-checking phase; only when Kubernetes determines that the Pod is able to accept external requests does it route traffic to the new Pod and keep serving. If an error occurs during this time, the restart mechanism may kick in. Before a Pod is deleted, a PreStop hook is triggered, and only after that hook completes is the Pod removed. In what follows we walk through this "from birth to death" journey of a Pod in that order.

Creation

Pod creation goes through SyncPod, and the process can be divided into roughly six steps:

  1. Compute the sandbox and container changes for the Pod;
  2. Forcibly stop the Pod's sandbox if it has changed;
  3. Forcibly stop any containers that should not be running;
  4. Create a new sandbox for the Pod if necessary;
  5. Create the init containers specified in the Pod spec;
  6. Create, in order, the regular containers specified in the Pod spec.

As we can see, creating a Pod is fairly straightforward: first compute the changes to the Pod spec and the sandbox, then stop any containers that could interfere with this create or update, and finally create the sandbox, the init containers, and the regular containers in turn.

func (m *kubeGenericRuntimeManager) SyncPod(pod *v1.Pod, _ v1.PodStatus, podStatus *kubecontainer.PodStatus, pullSecrets []v1.Secret, backOff *flowcontrol.Backoff) (result kubecontainer.PodSyncResult) {
    podContainerChanges := m.computePodActions(pod, podStatus)
    if podContainerChanges.CreateSandbox {
        ref, _ := ref.GetReference(legacyscheme.Scheme, pod)
    }

    if podContainerChanges.KillPod {
        if podContainerChanges.CreateSandbox {
            m.purgeInitContainers(pod, podStatus)
        }
    } else {
        for containerID, containerInfo := range podContainerChanges.ContainersToKill {
            m.killContainer(pod, containerID, containerInfo.name, containerInfo.message, nil)
        }
    }

    podSandboxID := podContainerChanges.SandboxID
    if podContainerChanges.CreateSandbox {
        podSandboxID, _, _ = m.createPodSandbox(pod, podContainerChanges.Attempt)
    }
    podSandboxConfig, _ := m.generatePodSandboxConfig(pod, podContainerChanges.Attempt)

    if container := podContainerChanges.NextInitContainerToStart; container != nil {
        msg, _ := m.startContainer(podSandboxID, podSandboxConfig, container, pod, podStatus, pullSecrets, podIP, kubecontainer.ContainerTypeInit)
    }

    for _, idx := range podContainerChanges.ContainersToStart {
        container := &pod.Spec.Containers[idx]
        msg, _ := m.startContainer(podSandboxID, podSandboxConfig, container, pod, podStatus, pullSecrets, podIP, kubecontainer.ContainerTypeRegular)
    }

    return
}

The simplified SyncPod method has a very clear structure and gives a good picture of the whole Pod-creation workflow; both the init containers and the regular containers are started by calling startContainer:

func (m *kubeGenericRuntimeManager) startContainer(podSandboxID string, podSandboxConfig *runtimeapi.PodSandboxConfig, container *v1.Container, pod *v1.Pod, podStatus *kubecontainer.PodStatus, pullSecrets []v1.Secret, podIP string, containerType kubecontainer.ContainerType) (string, error) {
    imageRef, _, _ := m.imagePuller.EnsureImageExists(pod, container, pullSecrets)

    // ...
    containerID, _ := m.runtimeService.CreateContainer(podSandboxID, containerConfig, podSandboxConfig)

    m.internalLifecycle.PreStartContainer(pod, container, containerID)

    m.runtimeService.StartContainer(containerID)

    if container.Lifecycle != nil && container.Lifecycle.PostStart != nil {
        kubeContainerID := kubecontainer.ContainerID{
            Type: m.runtimeName,
            ID:   containerID,
        }
        msg, _ := m.runner.Run(kubeContainerID, pod, container, container.Lifecycle.PostStart)
    }

    return "", nil
}

Starting each container follows the same sequence of steps:

  1. Obtain a reference to the container's image through the image puller;
  2. Call the remote runtimeService to create the container;
  3. Call the internal lifecycle method PreStartContainer to set up the CPU and other resources allocated to the container;
  4. Call the remote runtimeService to start the container;
  5. If the container defines a PostStart hook, run that callback.

Not every call to SyncPod creates a new Pod object; the method is also responsible for updating, deleting, and synchronizing the Pod spec, performing the appropriate actions based on the new spec it receives.

Health Checks

If we follow Pod best practices, we should add livenessProbe and readinessProbe health checks to every Pod whenever possible. They give Kubernetes extra liveness and readiness information; with well-chosen probe methods and rules, we avoid problems such as traffic being sent to a service that has not finished starting, or a service that has stopped responding for a long time never being restarted.
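A hedged example of what such probes look like in a container spec (the image, path, and port are illustrative and must be adapted to the actual service):

spec:
  containers:
  - name: app
    image: nginx
    ports:
    - containerPort: 80
    readinessProbe:              # gates whether traffic is routed to the Pod
      httpGet:
        path: /
        port: 80
    livenessProbe:               # triggers a container restart when it keeps failing
      tcpSocket:
        port: 80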

When a Pod is created it is added to the ProbeManager on its node, and when it is removed it is deleted from the ProbeManager; the ProbeManager is responsible for the health checks of these Pods:

func (kl *Kubelet) HandlePodAdditions(pods []*v1.Pod) {
    start := kl.clock.Now()
    for _, pod := range pods {
        kl.podManager.AddPod(pod)
        kl.dispatchWork(pod, kubetypes.SyncPodCreate, mirrorPod, start)
        kl.probeManager.AddPod(pod)
    }
}

func (kl *Kubelet) HandlePodRemoves(pods []*v1.Pod) {
    start := kl.clock.Now()
    for _, pod := range pods {
        kl.podManager.DeletePod(pod)
        kl.deletePod(pod)
        kl.probeManager.RemovePod(pod)
    }
}

The simplified HandlePodAdditions and HandlePodRemoves methods are quite straightforward, so we can move directly to how the ProbeManager performs the health checks for each Pod.

kubernetes-probe-manager

For every new Pod, the ProbeManager's AddPod function is called; it creates a worker for each configured probe and starts a new goroutine in which the Pod's health checks run:

func (m *manager) AddPod(pod *v1.Pod) {
    key := probeKey{podUID: pod.UID}
    for _, c := range pod.Spec.Containers {
        key.containerName = c.Name

        if c.ReadinessProbe != nil {
            key.probeType = readiness
            w := newWorker(m, readiness, pod, c)
            m.workers[key] = w
            go w.run()
        }

        if c.LivenessProbe != nil {
            key.probeType = liveness
            w := newWorker(m, liveness, pod, c)
            m.workers[key] = w
            go w.run()
        }
    }
}

During health checking, the worker simply triggers a probe periodically based on the Pod's current state, choosing among three probe mechanisms, Exec, HTTPGet, and TCPSocket, according to the Pod's configuration:

func (pb *prober) runProbe(probeType probeType, p *v1.Probe, pod *v1.Pod, status v1.PodStatus, container v1.Container, containerID kubecontainer.ContainerID) (probe.Result, string, error) {
    timeout := time.Duration(p.TimeoutSeconds) * time.Second
    if p.Exec != nil {
        command := kubecontainer.ExpandContainerCommandOnlyStatic(p.Exec.Command, container.Env)
        return pb.exec.Probe(pb.newExecInContainer(container, containerID, command, timeout))
    }
    if p.HTTPGet != nil {
        scheme := strings.ToLower(string(p.HTTPGet.Scheme))
        host := p.HTTPGet.Host
        port, _ := extractPort(p.HTTPGet.Port, container)
        path := p.HTTPGet.Path
        url := formatURL(scheme, host, port, path)
        headers := buildHeader(p.HTTPGet.HTTPHeaders)
        if probeType == liveness {
            return pb.livenessHttp.Probe(url, headers, timeout)
        } else { // readiness
            return pb.readinessHttp.Probe(url, headers, timeout)
        }
    }
    if p.TCPSocket != nil {
        port, _ := extractPort(p.TCPSocket.Port, container)
        host := p.TCPSocket.Host
        return pb.tcp.Probe(host, port, timeout)
    }
    return probe.Unknown, "", fmt.Errorf("Missing probe handler for %s:%s", format.Pod(pod), container.Name)
}

Kubernetes waits for InitialDelaySeconds after the Pod starts to give it time to start up and initialize, and only then begins health checking; the default number of retries is three. When a probe runs and returns a definite result, the worker records it, but only after FailureThreshold consecutive failures or SuccessThreshold consecutive successes does it change the Pod's condition. This avoids flapping caused by an unstable service, as the snippet below illustrates.
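These knobs correspond directly to fields on the probe. A sketch with the timing and threshold fields spelled out (the path, port, and values are illustrative; failureThreshold defaults to 3 and successThreshold defaults to 1):

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10   # wait this long after the container starts before probing
  periodSeconds: 5          # probe interval
  timeoutSeconds: 1         # per-probe timeout
  failureThreshold: 3       # consecutive failures before the container is marked not ready
  successThreshold: 1       # consecutive successes before it is marked ready again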

Deletion

When the kubelet receives a deletion request from a client in its HandlePodRemoves method, it passes the event to the PodKiller through a channel inside a private method named deletePod:

func (kl *Kubelet) deletePod(pod *v1.Pod) error {
    kl.podWorkers.ForgetWorker(pod.UID)

    runningPods, _ := kl.runtimeCache.GetPods()
    runningPod := kubecontainer.Pods(runningPods).FindPod("", pod.UID)
    podPair := kubecontainer.PodPair{APIPod: pod, RunningPod: &runningPod}

    kl.podKillingCh <- &podPair
    return nil
}

Besides notifying the PodKiller of the event, the kubelet also removes the Pod's worker from the podWorkers it holds; the PodKiller itself is just a goroutine owned by the kubelet that keeps running in the background and listens for events on podKillingCh:

kubernetes-pod-killer

After a series of method calls, the container runtime's killContainersWithSyncResult method is eventually invoked; this method kills all of the containers in the Pod and synchronously collects the results:

func (m *kubeGenericRuntimeManager) killContainersWithSyncResult(pod *v1.Pod, runningPod kubecontainer.Pod, gracePeriodOverride *int64) (syncResults []*kubecontainer.SyncResult) {
    containerResults := make(chan *kubecontainer.SyncResult, len(runningPod.Containers))

    wg := sync.WaitGroup{}
    wg.Add(len(runningPod.Containers))
    for _, container := range runningPod.Containers {
        go func(container *kubecontainer.Container) {
            defer wg.Done()
            killContainerResult := kubecontainer.NewSyncResult(kubecontainer.KillContainer, container.Name)
            m.killContainer(pod, container.ID, container.Name, "Need to kill Pod", gracePeriodOverride)
            containerResults <- killContainerResult
        }(container)
    }
    // Wait until every container has been killed before closing the results channel.
    wg.Wait()
    close(containerResults)

    for containerResult := range containerResults {
        syncResults = append(syncResults, containerResult)
    }
    return
}

Before each container is stopped, its PreStop hook is invoked first, giving the application inside the container time to finish any outstanding work; after that, the remote runtime service is called to stop the running container:

func (m *kubeGenericRuntimeManager) killContainer(pod *v1.Pod, containerID kubecontainer.ContainerID, containerName string, reason string, gracePeriodOverride *int64) error {
    containerSpec := kubecontainer.GetContainerSpec(pod, containerName)

    gracePeriod := int64(minimumGracePeriodInSeconds)
    switch {
    case pod.DeletionGracePeriodSeconds != nil:
        gracePeriod = *pod.DeletionGracePeriodSeconds
    case pod.Spec.TerminationGracePeriodSeconds != nil:
        gracePeriod = *pod.Spec.TerminationGracePeriodSeconds
    }

    m.executePreStopHook(pod, containerID, containerSpec, gracePeriod)
    m.internalLifecycle.PreStopContainer(containerID.ID)
    err := m.runtimeService.StopContainer(containerID.ID, gracePeriod)
    m.containerRefManager.ClearRef(containerID)

    return err
}

From this simplified killContainer method we can see the rough logic for stopping a container: first compute the grace period for the current shutdown from the Pod's spec, then run the hook and the internal lifecycle method, and finally stop the container and clear its references.
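Both the grace period and the PreStop hook described here are set on the Pod spec. A hedged example (the image, command, and durations are illustrative): the kubelet runs the preStop handler first, then asks the runtime to stop the container, and the entire shutdown has to fit inside terminationGracePeriodSeconds:

spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: app
    image: nginx
    lifecycle:
      preStop:
        exec:
          # give in-flight requests a chance to drain before the container is stopped
          command: ["sh", "-c", "sleep 10"]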

Conclusion

In this article we covered several important concepts within a Pod, namely containers, volumes, and the network, as well as how the whole process from creation to deletion is implemented.

Running and managing Pods in Kubernetes is inseparable from the kubelet and its components; later articles will look at what exactly the kubelet is and what role it plays in Kubernetes as a whole.