Kubernetes Pod DNS 解析优化-卡咪卡咪哈-一个博客

引言

群脉 SRE 运维 Kubernetes 集群时发现在高并发的情况下，业务中的网络请求（API 调用）会有大量 DNS 解析失败的情况出现，比如 PHP 业务（使用的 cURL）会报 Resolving timed out after xxx milliseconds。一开始我们怀疑是云服务商的 DNS 服务器有问题，但排查下来并不是，而是大量未被缓存的重复甚至是无效 DNS 查询导致 Kubernetes 内的 CoreDNS 服务以及 host 网络负载过高，进而导致部分失败，下文记录了我们排查和解决该问题的过程。

背景知识

Kubernetes Pod DNS 策略

Default：Pod 继承所在节点的名称解析配置（即 kubelet 的配置）。ClusterFirst：使用自动生成的 /etc/resolv.conf，其 nameserver 通常是部署在 Kubernetes 内的 CoreDNS，以用来解析集群内部的 Pod 和 Service 地址。ClusterFirstWithHostNet：对于以 hostNetwork 方式运行的 Pod，应显式设置其 DNS 策略 ClusterFirstWithHostNet 以生成和 ClusterFirst 同样的 /etc/resolv.conf。None：此设置允许 Pod 忽略 Kubernetes 环境中的 DNS 设置。Pod 会使用其 dnsConfig 字段所提供的 DNS 设置。

需要注意的是 Default 不是默认的 DNS 策略，如果未明确指定 dnsPolicy，则使用 ClusterFirst。因此默认情况下 DNS 解析请求都被发给了 Kubernetes 内的 CoreDNS，这也是在高并发的情况下，CoreDNS 负载增高的原因。

CoreDNS 的作用及配置

CoreDNS 是一个灵活可扩展的 DNS 服务器，可以作为 Kubernetes 集群 DNS 服务器用以解析内部的 Pod 和 Service 地址。与 Kubernetes 一样，CoreDNS 项目由 CNCF 托管。从 Kubernetes 1.13 开始，CoreDNS 在默认情况下使用，以替换之前版本的 kube-dns，其 Service IP 依然沿用 10.96.0.10。

在 Kubernetes 中，CoreDNS 安装时使用如下默认 Corefile 配置。

apiVersion:v1kind:ConfigMapmetadata:name:corednsnamespace:kube-systemdata:Corefile:| .:53 { errors health { lameduck 5s } ready kubernetes cluster.local in-addr.arpa ip6.arpa { pods insecure fallthrough in-addr.arpa ip6.arpa ttl 30 } prometheus :9153 forward ./etc/resolv.conf cache 30 loop reload loadbalance }

Corefile 配置中声明启用了哪些插件，其中：

kubernetes：CoreDNS 将基于 Kubernetes 的 Pod 和 Service 的 IP 答复 DNS 查询。forward：不在 Kubernetes 集群域内的任何查询都将转发到预定义的解析器 (/etc/resolv.conf)。cache：启用前端缓存。

优化 Pod 中的 /etc/resolv.conf 配置

Kubernetes 默认配置导致的问题

Kubernetes 上运行的 Pod，其域名解析的规则和一般的 Linux 一样，都是根据 /etc/resolv.conf 文件中的配置。但 Kubernetes 为了使其内部域名在 Pod 中可以解析，在 /etc/resolv.conf 中配置了一些规则，使得可以解析 Kubernetes 集群的 Pod 和 Service 的名字，但在这个过程中也产生了很多不必要的请求。

如下是一个在 default namespace 中启动的 Pod 的 /etc/resolv.conf：

nameserver 10.96.0.10 search default.svc.cluster.local svc.cluster.local cluster.local options ndots:5

其中 nameserver 配置的 10.96.0.10 即为 CoreDNS 的 Service IP。

在该 Pod 中执行 ping -c 1 www.baidu.com，同时在 Pod 中抓包：

tcpdump -nnvXSs 0 -i any tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes 10.32.45.220.11405 > 10.96.0.10.53: 19056+ A? www.baidu.com.default.svc.cluster.local. (53) 10.32.45.220.58995 > 10.96.0.10.53: 7293+ AAAA? www.baidu.com.default.svc.cluster.local. (53) 10.32.45.220.13217 > 10.96.0.10.53: 7921+ A? www.baidu.com.svc.cluster.local. (49) 10.32.45.220.56436 > 10.96.0.10.53: 39158+ AAAA? www.baidu.com.svc.cluster.local. (49) 10.32.45.220.26641 > 10.96.0.10.53: 48382+ A? www.baidu.com.cluster.local. (45) 10.32.45.220.42045 > 10.96.0.10.53: 60930+ AAAA? www.baidu.com.cluster.local. (45) 10.32.45.220.46382 > 10.96.0.10.53: 64654+ A? www.baidu.com. (31) 10.96.0.10.53 > 10.32.45.220.46382: 64654 3/0/0 www.baidu.com. CNAME www.a.shifen.com., www.a.shifen.com. A 180.101.49.11, www.a.shifen.com. A 180.101.49.12 (138) 10.32.45.220.46895 > 10.96.0.10.53: 60052+ AAAA? www.baidu.com. (31) 10.96.0.10.53 > 10.32.45.220.46895: 60052 1/1/0 www.baidu.com. CNAME www.a.shifen.com. (164) 10.32.45.220 > 180.101.49.11: ICMP echo request, id 377, seq 1, length 64 180.101.49.11 > 10.32.45.220: ICMP echo reply, id 377, seq 1, length 64

可以看到，在正确解析之前依次查询了下面几个域名：

www.baidu.com.default.svc.cluster.local.www.baidu.com.svc.cluster.local.www.baidu.com.cluster.local.

但这些查询都没有返回正确结果，因为实际上并不存在这样的域名，这些请求属于不必要的请求。而之所以要发这些请求，是由 /etc/resolv.conf 中的 search 和 ndots 共同决定的。

search 和 ndots 的作用

search Search list for host-name lookup. By default, the search list contains one entry, the local domain name. It is determined from the local hostname returned by gethostname(2); the local domain name is taken to be everything after the first .. Finally, if the hostname does not contain a ., the root domain is assumed as the local domain name. This may be changed by listing the desired domain search path following the search keyword with spaces or tabs separating the names. Resolver queries having fewer than ndots dots (default is 1) in them will be attempted using each component of the search path in turn until a match is found. For environments with multiple subdomains please read options ndots:n below to avoid man-in-the-middle attacks and unnecessary traffic for the root-dns-servers. Note that this process may be slow and will generate a lot of network traffic if the servers for the listed domains are not local, and that queries will time out if no server is available for one of the domains. ndots:n Sets a threshold for the number of dots which must appear in a name given to res_query(3) (see resolver(3)) before an initial absolute query will be made. The default for n is 1, meaning that if there are any dots in a name, the name will be tried first as an absolute name before any search list elements are appended to it. The value for this option is silently capped to 15.

在进行域名的 DNS 解析时，操作系统先判断其是否是一个 FQDN（Fully qualified domain name，即完整域名，指以 . 结尾的），如果是，则会直接查询 DNS 服务器；如果不是，则就要根据 search 和 ndots 的设置进行 FQDN 的拼接再将其发到 DNS 服务器进行解析。

ndots 表示的是完整域名中必须出现的 . 的个数，如果域名中的 . 的个数不小于 ndots，则该域名会被认为是一个 FQDN，操作系统会直接将其发给 DNS 服务器进行查询；否则，操作系统会在 search 搜索域中依次查询。

例如上面的例子，ndots 为 5，查询的域名 www.baidu.com 不以 . 结尾，且 . 的个数少于 5，因此操作系统会依次在 default.svc.cluster.local svc.cluster.local cluster.local 三个域中进行搜索，这 3 个搜索域都是由 Kubernetes 注入的。

Kubernetes 为什么要使用搜索域？

目的是为了使 Pod 可以解析内部域名，比如通过 Service name 访问 Service。

例如 default namespace 下的 Pod a 需要访问同 namespace 中的 Service service-b 时，直接使用 service-b 就可以访问了，这就是通过在 default.svc.cluster.local 搜索域中搜索完成的。同理 svc.cluster.local 支持了对不同 namespace 下的 Service 访问，如 service-c.sre（意为 sre namespace 下的 service-c）。

Kubernetes 默认配置下 ndots 的值是 5，其原因官方在 issue 33554 中做了解释：

Kubernetes 需要一个标志来识别集群内的域名，因此 svc 是必须包含在域名内的。有些人需要配置集群 $zone 的后缀以支持多集群，所以一个完整的 Service 的域名为 $service.$namespace.svc.$zone，为了不在代码中配置完整的域名，就需要设置 ndots 和 search。同 namespace 下的 Service 的请求是最常用的，因此需要解析 $service，此时需 ndots >= 1，且 search 列表中第一个应为 $namespace.D.$zone。跨 namespace 的 Service 也经常会被请求，因此需要解析 $service.$namespace，此时需 ndots >= 2，且 search 列表中第二个应为 svc.$zone。为了解析 $service.$namespace.svc，此时需 ndots >= 3，且 search 列表包含 $zone。在 Kubernetes 1.4 之前，StatefulSet 为 PetSet，当每个 Pet 被创建时，它会获得一个匹配的 DNS 子域，域格式为 $petname.$service.$namespace.svc.$zone，此时需 ndots >= 4。Kubernetes 还支持 SRV 记录，因此 _$port.$proto.$service.$namespace.svc.$zone 需要可以解析，此时 ndots = 5。

这就是为什么 ndots 为 5。总结来说是为了支持更复杂的 Pod 内的域名解析，但通常情况下群脉只会用到同 namespace 下的 Service（形如 service-b）和跨 namespace 下 Service（形如 service-c.sre）的访问，因此 ndots 的默认值设为 2 便可满足业务需求并避免大量无效的解析请求。

修改 ndots

有以下可选方案：

通过基础镜像中的启动脚本统一修改，因为群脉有统一的基础镜像，所以我们采用的这种。

通过 pod.Spec.DNSConfig 改写。

apiVersion: v1 kind: Pod metadata: namespace: default name: dns-example spec: dnsConfig: options: – name: ndots value: “2”

结果验证

在修改后的 Pod 中执行 ping -c 1 www.baidu.com，同时在 Pod 中抓包：

tcpdump -nnvXSs 0 -i any tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes 10.32.45.220.2853 > 10.96.0.10.53: 23731+ A? www.baidu.com. (31) 10.96.0.10.53 > 10.32.45.220.2853: 23731* 3/0/0 www.baidu.com. CNAME www.a.shifen.com., www.a.shifen.com. A 180.101.49.12, www.a.shifen.com. A 180.101.49.11 (138) 10.32.45.220.32879 > 10.96.0.10.53: 16062+ AAAA? www.baidu.com. (31) 10.96.0.10.53 > 10.32.45.220.32879: 16062* 1/1/0 www.baidu.com. CNAME www.a.shifen.com. (164) 10.32.45.220 > 180.101.49.12: ICMP echo request, id 377, seq 1, length 64 180.101.49.12 > 10.32.45.220: ICMP echo reply, id 377, seq 1, length 64

可以看到并没有在搜索域中查询，而是直接查询了 www.baidu.com.。

options 其他选项的补充说明

timeout：设置等待 DNS 服务器返回的超时时间，默认为 5，单位为秒。这个有点高了，为了 fail fast，我们设为了 2。attempt：解析失败时的重试次数，默认为 2。为了提高成功率，我们设为了 3。single-request：默认情况下会并行执行 IPv4 and IPv6 的解析，但某些 DNS 服务器可能无法正确处理这些查询使请求超时。该选项可以禁用此行为，并依次执行 IPv6 and IPv4 的解析。rotate：采用轮询方式访问 nameserver。no-check-names：禁止对传入的主机名和邮件地址进行无效字符检查。use-vc：强制使用 TCP 进行 DNS 解析。

启用 nscd 做 DNS 缓存

在做了上述修改后，只是减少了单次请求中产生的无效的 DNS 查询，但是通过抓包可以看到，每次域名解析还是会请求 DNS 服务器，在高并发的情况下 CoreDNS 的压力仍然存在。

这种行为其实是不正常的，通常情况下，域名的解析不会经常变更，因此在域名解析获得真实 IP 之后，在语言运行时或操作系统中应当将其缓存一段时间，在下一次发起 DNS 解析请求时，直接返回，不再请求 DNS 服务器，直到缓存过期或请求失败。操作系统上一般会启用 nscd 服务来实现此功能，但 Pod 所用的 Docker 镜像中一般不会配置 nscd 服务，所以可以在有高并发业务的 Pod 上安装 nscd 服务来进行优化。

Dockerfile 配置

FROM registry.aliyuncs.com/acs/ubuntuRUN apt-get update && apt-get install -y nscd && rm -rf /var/lib/apt/lists/*CMD service nscd start; bash

nscd 配置

通过 man nscd.conf 命令可以看到 nscd 完整的配置，以下是其中一些需要注意的参数：

positive-time-to-live：解析成功时结果的缓存时间，nscd 会优先使用 DNS reply 的 TTL 值，因此这个配置通常不会起作用。negative-time-to-live：解析失败时结果的缓存时间。该配置一定要设为 0，避免在 DNS 解析失败时无法重试。reload-count：当缓存过期后，nscd 会主动发起 DNS 请求刷新缓存的次数（缓存被命中后会重置），当超过 roload-count 后，nscd 就不再缓存该域名直到有下次请求。默认值是 5，代表 SUCCESS 的缓存会在内存中自动 reload 5 次。

注：实际每次结果缓存的时间，也就是 reload 的间隔，为 DNS 应答 TTL + CACHE_PRUNE_INTERVAL，这里的 CACHE_PRUNE_INTERVAL 来自于相关的宏定义，无法修改。

// https://code.woboq.org/userspace/glibc/nscd/nscd.h.html#_M/CACHE_PRUNE_INTERVAL # define CACHE_PRUNE_INTERVAL 15

结果验证

在启用了 nscd 的 Pod 中执行两次 ping -c 1 www.baidu.com，同时在 Pod 中抓包：

# 第一次 tcpdump -nnvXSs 0 -i any tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes 10.32.45.220.2853 > 10.96.0.10.53: 23731+ A? www.baidu.com. (31) 10.96.0.10.53 > 10.32.45.220.2853: 23731* 3/0/0 www.baidu.com. CNAME www.a.shifen.com., www.a.shifen.com. A 180.101.49.12, www.a.shifen.com. A 180.101.49.11 (138) 10.32.45.220.32879 > 10.96.0.10.53: 16062+ AAAA? www.baidu.com. (31) 10.96.0.10.53 > 10.32.45.220.32879: 16062* 1/1/0 www.baidu.com. CNAME www.a.shifen.com. (164) 10.32.45.220 > 180.101.49.12: ICMP echo request, id 377, seq 1, length 64 180.101.49.12 > 10.32.45.220: ICMP echo reply, id 377, seq 1, length 64 # 第二次 10.32.45.220 > 180.101.49.12: ICMP echo request, id 377, seq 1, length 64 180.101.49.12 > 10.32.45.220: ICMP echo reply, id 377, seq 1, length 64 # nscd 刷新缓存 10.96.0.10.53 > 10.32.45.220.2853: 23731* 3/0/0 www.baidu.com. CNAME www.a.shifen.com., www.a.shifen.com. A 180.101.49.11, www.a.shifen.com. A 180.101.49.12 (138) 10.32.45.220.32879 > 10.96.0.10.53: 16062+ AAAA? www.baidu.com. (31) 10.96.0.10.53 > 10.32.45.220.32879: 16062* 1/1/0 www.baidu.com. CNAME www.a.shifen.com. (164)

可以看到第一次 ping 的时候仍请求了 10.96.0.10 即 CoreDNS，获取了 IP 地址，但在第二次 ping 时，没有再请求 10.96.0.10，而是直接访问了第一次得到的 IP 地址，说明缓存起了作用，这样就减少了在高并发的情况下发给 CoreDNS 的解析请求数。

在持续抓包的情况下可以看到，即使没有请求访问，仍抓到了 www.baidu.com 的 DNS 解析的包，这是 nscd 在更新自己的缓存结果，避免缓存的 IP 地址有误。

适用情况

由于将解析结果进行了缓存，在以下情况中域名的解析可能会受到影响，因此建议只对高并发的 pod 启用 nscd：

多条 Round-Robin 的情况下失去轮询功能，导致缓存周期内相关的负载均衡失效，比如 Kubernetes 内的 headless service。域名变更生效可能持续一个 TTL + 15s，对于一部分讲究变更快速生效的域名而言有一定的变更生效延误。

优化 nf_conntrack_* 内核参数

在调查的过程中我们发现，DNS 解析失败除了 CoreDNS 面对大量查询不堪重负之外，还有一个原因：就是 DNS 解析请求产生的 UDP 连接过多导致 Linux 中的 net.netfilter.nf_conntrack_count 不足，进而在 kernel 日志中报 conntrack table full, dropping packets，即部分 UDP 包被丢弃了。虽然经过前面两个优化后，不会再有这么多 UDP 连接了，但是为了保险起见，我们做了如下调整，仅供参考：

net.netfilter.nf_conntrack_max=1048576 net.netfilter.nf_conntrack_buckets=262144 net.netfilter.nf_conntrack_tcp_timeout_fin_wait=30 net.netfilter.nf_conntrack_tcp_timeout_time_wait=30 net.netfilter.nf_conntrack_tcp_timeout_close_wait=15