After getting through the iptables basics, the best way to deepen your understanding is hands-on practice. In this post we work through a concrete example.

Environment Setup

I work on a MacBook Pro, and since Docker Desktop for Mac is far too much of a black box, I decided to use minikube to set up the Kubernetes lab environment.

I won't cover installing basic prerequisites such as the minikube and kubectl binaries.
With those in place, the following command brings up the Kubernetes cluster:

minikube start --image-mirror-country='cn' --image-repository='registry.cn-hangzhou.aliyuncs.com/google_containers'  --kubernetes-version=v1.21.0
😄 Darwin 11.6 上的 minikube v1.21.0
❗ Kubernetes 1.21.0 has a known performance issue on cluster startup. It might take 2 to 3 minutes for a cluster to start.
❗ For more information, see: https://github.com/kubernetes/kubeadm/issues/2395
🎉 minikube 1.23.2 is available! Download it: https://github.com/kubernetes/minikube/releases/tag/v1.23.2
💡 To disable this notice, run: 'minikube config set WantUpdateNotification false'

✨ 根据现有的配置文件使用 docker 驱动程序
👍 Starting control plane node minikube in cluster minikube
🚜 Pulling base image ...
💾 Downloading Kubernetes v1.21.0 preload ...
> preloaded-images-k8s-v11-v1...: 498.90 MiB / 498.90 MiB 100.00% 17.39 Mi


> index.docker.io/kicbase/sta...: 359.09 MiB / 359.09 MiB 100.00% 4.53 MiB


❗ minikube was unable to download gcr.io/k8s-minikube/kicbase:v0.0.23, but successfully downloaded kicbase/stable:v0.0.23 as a fallback image
🔥 Creating docker container (CPUs=2, Memory=4000MB) ...
❗ This container is having trouble accessing https://k8s.gcr.io
💡 To pull new external images, you may need to configure a proxy: https://minikube.sigs.k8s.io/docs/reference/networking/proxy/
🐳 正在 Docker 20.10.7 中准备 Kubernetes v1.21.0…
▪ Generating certificates and keys ...
▪ Booting up control plane ...
▪ Configuring RBAC rules ...
🔎 Verifying Kubernetes components...
▪ Using image gcr.io/k8s-minikube/storage-provisioner:v5
🌟 Enabled addons: storage-provisioner, default-storageclass
🏄 Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default

Next, build the experiment manifests. The YAML file is shown below.
The Service type in this example is ClusterIP.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        securityContext:
          privileged: true
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
---
apiVersion: v1
kind: Pod
metadata:
  name: web-server
  labels:
    app: web-server
spec:
  containers:
  - name: nginx
    image: nginx
    securityContext:
      privileged: true
    ports:
    - containerPort: 80
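
Apply the manifest. I'm assuming it was saved as nginx-demo.yaml here; the file name is arbitrary:

kubectl apply -f nginx-demo.yaml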

Once all resources are ready, inspect them, paying particular attention to the Pod and Service IPs:

kubectl get all -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/nginx-deployment-74bc56fb4b-9xqkt 1/1 Running 0 4m23s 172.17.0.5 minikube <none> <none>
pod/nginx-deployment-74bc56fb4b-c9zp4 1/1 Running 0 4m23s 172.17.0.4 minikube <none> <none>
pod/nginx-deployment-74bc56fb4b-h7rhz 1/1 Running 0 4m23s 172.17.0.6 minikube <none> <none>
pod/web-server 1/1 Running 0 4m23s 172.17.0.3 minikube <none> <none>

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 3h22m <none>
service/nginx-service ClusterIP 10.97.79.231 <none> 80/TCP 4m23s app=nginx

NAME READY UP-TO-DATE AVAILABLE AGE CONTAINERS IMAGES SELECTOR
deployment.apps/nginx-deployment 3/3 3 3 4m23s nginx nginx app=nginx

NAME DESIRED CURRENT READY AGE CONTAINERS IMAGES SELECTOR
replicaset.apps/nginx-deployment-74bc56fb4b 3 3 3 4m23s nginx nginx app=nginx,pod-template-hash=74bc56fb4b

The lab environment is now ready.

Traffic Forwarding Analysis

First, log in to the minikube node:

minikube ssh
docker@minikube:~$ iptables
iptables v1.8.4 (legacy): no command specified
Try `iptables -h' or 'iptables --help' for more information.
docker@minikube:~$ sudo su   # switch to the root user

kube-proxy iptables mode: ClusterIP analysis

The source and destination of the traffic look roughly like this:

graph LR
    web[web-server:172.17.0.3]-->nginx[nginx-service:10.97.79.231]

Why is this considered traffic that originates from the local host?
Because kube-proxy is deployed as a DaemonSet on every node, every node carries the same iptables rules. When a Pod on any node accesses a Service, the matching Service rules can be found in the iptables rules of the node hosting that Pod, and those rules lead to the Pods backing the Service. From the node's point of view, traffic emitted by a Pod it hosts is simply traffic leaving from a local process.
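
To generate this traffic yourself, you can curl the Service's ClusterIP from inside the web-server Pod. This is only a quick sanity check and assumes curl is available in the nginx image; if it is not, use wget or a debug container instead:

kubectl exec web-server -- curl -s http://10.97.79.231:80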

From our earlier study of iptables, we know that locally generated outgoing traffic traverses the following chains:

graph LR
   output[OUTPUT chain]-->postrouting[POSTROUTING chain]

OUTPUT chain analysis

The OUTPUT chain touches four tables: raw, mangle, nat, and filter. The analysis below ignores rules irrelevant to this example and focuses on the nat and filter tables; the raw and mangle tables are empty.
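
You can confirm that raw and mangle contribute nothing here by listing their OUTPUT chains yourself (run as root on the minikube node):

iptables -t raw -nvL OUTPUT     # no rules beyond the policy line in this environment
iptables -t mangle -nvL OUTPUT  # likewise empty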

NAT
iptables -t nat -nvL
...
Chain OUTPUT (policy ACCEPT 819 packets, 49140 bytes)
pkts bytes target prot opt in out source destination
15746 946K KUBE-SERVICES all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */
72 4789 DOCKER_OUTPUT all -- * * 0.0.0.0/0 192.168.65.2
10852 651K DOCKER all -- * * 0.0.0.0/0 !127.0.0.0/8 ADDRTYPE match dst-type LOCAL
...

As you can see, all OUTPUT traffic is directed to a custom chain named KUBE-SERVICES. Let's see what it does.
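
The chain can be listed directly by name (the output below is from my environment; the packet counters will differ on yours):

iptables -t nat -nvL KUBE-SERVICES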

Chain KUBE-SERVICES (2 references)
pkts bytes target prot opt in out source destination
...
0 0 KUBE-MARK-MASQ tcp -- * * !10.244.0.0/16 10.97.79.231 /* default/nginx-service cluster IP */ tcp dpt:80
0 0 KUBE-SVC-V2OKYYMBY3REGZOG tcp -- * * 0.0.0.0/0 10.97.79.231 /* default/nginx-service cluster IP */ tcp dpt:80
...
0 0 KUBE-SVC-TCOU7JCQXEZGVUNU udp -- * * 0.0.0.0/0 10.96.0.10 /* kube-system/kube-dns:dns cluster IP */ udp dpt:53
719 43140 KUBE-NODEPORTS all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL

Here we find rules for the Services in every namespace, including the nginx-service we created (the other entries belong to Services such as kube-dns and the apiserver; feel free to trace them yourself). Our rule matches packets whose destination address is 10.97.79.231 and destination port is 80, which is exactly what a packet sent to nginx-service looks like. The rule's target is another custom chain, KUBE-SVC-V2OKYYMBY3REGZOG, so let's keep digging:
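
As before, the chain can be dumped by name. Note that the hash suffix in KUBE-SVC-V2OKYYMBY3REGZOG is derived from the Service, so the name will differ in your cluster:

iptables -t nat -nvL KUBE-SVC-V2OKYYMBY3REGZOG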

Chain KUBE-SVC-V2OKYYMBY3REGZOG (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-SEP-3VDHYO53IOQ2XWUD all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/nginx-service */ statistic mode random probability 0.33333333349
0 0 KUBE-SEP-C54WIGIB4NQVIFB3 all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/nginx-service */ statistic mode random probability 0.50000000000
0 0 KUBE-SEP-KN3IA7DQGTHQJWSD all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/nginx-service */

The KUBE-SVC-V2OKYYMBY3REGZOG chain defines three rules. The first rule matches with probability 0.33333333349, i.e. a 1/3 chance. If the first rule misses, the second matches with probability 1/2, which works out to 2/3 * 1/2 = 1/3 overall; if that also misses, the packet falls through to the third rule. Clearly this is load balancing, so we can guess that the targets of these three rules are the rules for the three Pods backing this Service.
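
The three KUBE-SEP-* (service endpoint) chains can be dumped the same way; again, the chain names are hash-based and specific to my environment:

iptables -t nat -nvL KUBE-SEP-3VDHYO53IOQ2XWUD
iptables -t nat -nvL KUBE-SEP-C54WIGIB4NQVIFB3
iptables -t nat -nvL KUBE-SEP-KN3IA7DQGTHQJWSD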

Chain KUBE-SEP-3VDHYO53IOQ2XWUD (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- * * 172.17.0.4 0.0.0.0/0 /* default/nginx-service */
0 0 DNAT tcp -- * * 0.0.0.0/0 0.0.0.0/0 /* default/nginx-service */ tcp to:172.17.0.4:80

Chain KUBE-SEP-C54WIGIB4NQVIFB3 (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- * * 172.17.0.5 0.0.0.0/0 /* default/nginx-service */
0 0 DNAT tcp -- * * 0.0.0.0/0 0.0.0.0/0 /* default/nginx-service */ tcp to:172.17.0.5:80

Chain KUBE-SEP-KN3IA7DQGTHQJWSD (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- * * 172.17.0.6 0.0.0.0/0 /* default/nginx-service */
0 0 DNAT tcp -- * * 0.0.0.0/0 0.0.0.0/0 /* default/nginx-service */ tcp to:172.17.0.6:80

The rules in these three custom chains are very similar. Note that the first rule matches the case where a Pod accesses itself through the Service (the source address is the endpoint Pod's own IP); in that case the packet goes to the KUBE-MARK-MASQ target. Every other case falls through to the second rule, the DNAT. In our scenario a different Pod is accessing nginx-service, so the first rule does not match and the DNAT rule does.
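
For completeness, KUBE-MARK-MASQ does nothing more than set the 0x4000 mark that the POSTROUTING stage later checks to decide whether to masquerade (SNAT); you can confirm this on your own node with:

iptables -t nat -nvL KUBE-MARK-MASQ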

Suppose the load balancing in KUBE-SVC-V2OKYYMBY3REGZOG sends our packet to the first target, KUBE-SEP-3VDHYO53IOQ2XWUD. After the DNAT, the packet's destination changes from 10.97.79.231:80 to 172.17.0.4:80, i.e.:

graph LR
    original[original:172.17.0.3:xxx]-->dst1[10.97.79.231:80]
    afterdnat[after DNAT:172.17.0.3:xxx]-->dst2[172.17.0.4:80]
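
The DNAT can also be observed in the connection tracking table. Assuming the conntrack tool is present on the minikube node (it usually is, since kubeadm requires it), generate some traffic to the Service and then list the entries whose original destination is the ClusterIP:

conntrack -L -d 10.97.79.231
# the reply direction of each entry should show the chosen endpoint, e.g. src=172.17.0.4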

This completes the analysis of the OUTPUT chain in the nat table; next we move on to the filter table.

filter
iptables -t filter -nvL
...
Chain OUTPUT (policy ACCEPT 193K packets, 30M bytes)
pkts bytes target prot opt in out source destination
17326 1040K KUBE-SERVICES all -- * * 0.0.0.0/0 0.0.0.0/0 ctstate NEW /* kubernetes service portals */
1379K 225M KUBE-FIREWALL all -- * * 0.0.0.0/0 0.0.0.0/0
...
Chain KUBE-SERVICES (2 references)
pkts bytes target prot opt in out source destination

All new connections (ctstate NEW) match the first rule, KUBE-SERVICES. However, in the filter table KUBE-SERVICES turns out to be an empty chain, so the one to focus on here is the second rule, KUBE-FIREWALL:

Chain KUBE-FIREWALL (2 references)
pkts bytes target prot opt in out source destination
0 0 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes firewall for dropping marked packets */ mark match 0x8000/0x8000
0 0 DROP all -- * * !127.0.0.0/8 127.0.0.0/8 /* block incoming localnet connections */ ! ctstate RELATED,ESTABLISHED,DNAT

Any packet marked with 0x8000/0x8000 is dropped outright. Our packet has not picked up that mark anywhere along the way, so it is not dropped. With that, the filter table's OUTPUT rules are done and we finally reach the next stage: the POSTROUTING chain.
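
Where would a packet pick up the 0x8000 mark? In this setup it is applied by the KUBE-MARK-DROP chain in the nat table; our traffic never jumps there, which you can confirm by inspecting that chain yourself:

iptables -t nat -nvL KUBE-MARK-DROP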

POSTROUTING

The POSTROUTING chain mainly involves two tables: mangle and nat. Since the mangle table is empty, we only need to look at the POSTROUTING rules in the nat table.
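
As with OUTPUT, you can confirm that the mangle table has nothing to say here:

iptables -t mangle -nvL POSTROUTING   # empty in this environment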

iptables -t nat -nvL
...
Chain POSTROUTING (policy ACCEPT 488 packets, 29280 bytes)
pkts bytes target prot opt in out source destination
8165 491K KUBE-POSTROUTING all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes postrouting rules */
13 780 MASQUERADE all -- * !docker0 172.17.0.0/16 0.0.0.0/0
0 0 DOCKER_POSTROUTING all -- * * 0.0.0.0/0 192.168.65.2
...

The packet first enters the first target, KUBE-POSTROUTING:
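
Listed by name (counters are from my environment):

iptables -t nat -nvL KUBE-POSTROUTING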

Chain KUBE-POSTROUTING (1 references)
pkts bytes target prot opt in out source destination
486 29160 RETURN all -- * * 0.0.0.0/0 0.0.0.0/0 mark match ! 0x4000/0x4000
0 0 MARK all -- * * 0.0.0.0/0 0.0.0.0/0 MARK xor 0x4000
0 0 MASQUERADE all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes service traffic requiring SNAT */ random-fully

Our packet was never marked with 0x4000, so it matches the first RETURN rule and leaves KUBE-POSTROUTING without being masqueraded; the Docker MASQUERADE rule does not apply either, because the packet egresses via docker0. In the end the packet is sent out through the docker0 interface with no SNAT: the source IP is still 172.17.0.3, while the destination IP is now 172.17.0.4 rather than the Service IP 10.97.79.231.
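
If you want to see this on the wire, capture on the bridge while curling the Service from the web-server Pod (this assumes tcpdump is available on the minikube node; it may need to be installed first):

# on the minikube node, as root
tcpdump -i docker0 -nn 'tcp port 80 and host 172.17.0.3'
# expected: packets with src 172.17.0.3 and dst 172.17.0.x (an endpoint Pod), not 10.97.79.231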

Summary

A packet destined for a ClusterIP Service traverses the following path:

graph TB
    s((packet))-->1[nat:OUTPUT]
    1-->2[nat:KUBE-SERVICES]
    2-->3[nat:KUBE-SVC-V2OKYYMBY3REGZOG]
    3-->4[nat:KUBE-SEP-*: pick one of 3]
    4-->5[filter:OUTPUT]
    5-->6[filter:KUBE-FIREWALL]
    6-->7[nat:POSTROUTING]
    7-->8[nat:KUBE-POSTROUTING]
    8-->9((no SNAT))