You may encounter various issues while installing and using Docker, Containerd, and Kubernetes. This page collects examples of commonly encountered situations and their solutions.


Problem

A server removed from a Kubernetes Cluster does not get the correct overlay network IP when added to another Kubernetes Cluster.

Reason/Cause

The Flannel files and settings used for the pod overlay network remain on the server and must be deleted manually.
Solution

On the control-plane Kubernetes node, the relevant server is removed from the cluster with "kubectl delete node <NODE_NAME>"; then, on the server being removed, the cluster settings are cleared with the "sudo kubeadm reset" command. After that, the following operations are performed on that server.
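
For reference, those two commands are:

kubectl delete node <NODE_NAME>     #run on the control-plane node
sudo kubeadm reset                  #run on the node being removed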


systemctl stop kubelet && systemctl stop containerd

rm -rf /var/lib/cni/

rm -rf /var/lib/kubelet/*

rm -rf /etc/cni/

ifconfig cni0 down && ip link delete cni0 

ifconfig flannel.1 down && ip link delete flannel.1

systemctl restart containerd && systemctl restart kubelet


Problem

Error while installing Docker on CentOS 8.3.x servers

Reason/Cause

With the release of RHEL 8 and CentOS 8, the docker package was removed from the default package repositories and replaced by podman and buildah; Red Hat decided not to provide official support for Docker. These packages therefore conflict with and block the Docker installation.
Solution

yum remove podman* -y
yum remove buildah* -y
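
After the conflicting packages are removed, Docker can be installed from Docker's own CentOS repository. A minimal sketch (the repository URL is Docker's standard one; adjust the version to your environment):

sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
sudo yum install -y docker-ce docker-ce-cli containerd.io
sudo systemctl enable --now docker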

Problem

kubeadm error: "kubelet isn’t running or healthy and connection refused"

Reason/Cause

In Linux operating systems, "swap" and "selinux", which are usually active, should be turned off.
Solution

sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
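
Since the cause above also mentions SELinux, it is usually disabled (set to permissive) as well. A minimal sketch for RHEL/CentOS-based systems:

sudo setenforce 0
sudo sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config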

sudo reboot

kubeadm reset
kubeadm init --ignore-preflight-errors all

Problem

Deleting a namespace gets stuck in the "Terminating" state

Reason/Cause

Finalizers remaining on the namespace block its deletion, so it stays in the "Terminating" state.

Solution

kubectl get namespace "<NAMESPACE>" -o json | tr -d "\n" | sed "s/\"finalizers\": \[[^]]\+\]/\"finalizers\": []/" | kubectl replace --raw /api/v1/namespaces/<NAMESPACE>/finalize -f -

Problem

"x509 certificate" issue during docker pull

Reason/Cause

If the relevant institution does not use HTTPS for its registry, the following line is added to Docker's daemon file. This process is repeated on all nodes that use Docker.
Solution

$ sudo vi /etc/docker/daemon.json

"insecure-registries" : ["hub.docker.com:443", "registry-1.docker.io:443", "quay.io"]

sudo systemctl daemon-reload
sudo systemctl restart docker

#It is checked with the following.
docker info
Reason/Cause

If the relevant institution uses HTTPS, its SSL certificate ("crt") must be added to the servers.
Solution

cp ssl.crt /usr/local/share/ca-certificates/
update-ca-certificates
service docker restart


#Centos 7
sudo cp -p ssl.crt /etc/pki/ca-trust/source
sudo cp ssl.crt /etc/pki/ca-trust/source/anchors/myregistrydomain.com.crt

sudo update-ca-trust extract
sudo systemctl daemon-reload
sudo systemctl restart docker

Problem

If Nexus proxy is in use

Reason/Cause

If the relevant institution uses a Nexus proxy, the servers running Docker are directed to this address.
Solution

$ sudo vi /etc/docker/daemon.json

{

"data-root":"/docker-data",

"insecure-registries":["nexusdocker.institutionaddress.com.tr"],

"registry-mirrors":["https://nexusdocker.institutionaddress.com.tr"],

"exec-opts": ["native.cgroupdriver=systemd"],

"log-driver": "json-file",

"log-opts": { "max-size": "100m" },

"storage-driver": "overlay2"

}
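
As in the previous section, Docker must be restarted after the daemon.json file is edited:

sudo systemctl daemon-reload
sudo systemctl restart docker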

Problem

Kubernetes DNS Problem (connection timed out; no servers could be reached)

Reason/Cause

The node stays in the Ready,SchedulingDisabled state.
Test
kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml

kubectl get pods dnsutils

kubectl exec -i -t dnsutils -- nslookup kubernetes.default

If the result is as below, everything is correct.

Server:    10.0.0.10
Address 1: 10.0.0.10

Name:      kubernetes.default
Address 1: 10.0.0.1

If the result is as follows, there is an error and the following steps need to be checked.

Server:    10.96.0.10
Address 1: 10.96.0.10

nslookup: can't resolve 'kubernetes.default'
command terminated with exit code 1

Check the resolv.conf file.

kubectl exec -ti dnsutils -- cat /etc/resolv.conf

(correct)

nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local institution.gov.tr
options ndots:5

(incorrect)

nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

After correcting the file, restart CoreDNS:

kubectl rollout restart -n kube-system deployment/coredns

Solution

In one customer environment, the issue was resolved by adding the institution's domain to the /etc/resolv.conf file:

search institution.gov.tr

Problem

On Ubuntu servers hosting Kubernetes clusters, host names cannot be resolved because DNS setting changes are not reflected in the /etc/resolv.conf file

Reason/Cause

On Ubuntu servers, changes made to the DNS server settings may not always be reflected in resolv.conf. Since Kubernetes by default falls back to the server's /etc/resolv.conf file after its own internal DNS, you should make sure this file is correct.

Solution

On all servers:

sudo rm /etc/resolv.conf

sudo ln -s /run/systemd/resolve/resolv.conf  /etc/resolv.conf

sudo systemctl restart systemd-resolved

ls -l  /etc/resolv.conf

cat  /etc/resolv.conf


Only on the master server:

kubectl -n kube-system rollout restart deployment coredns

Problem

docker: Error response from daemon: Get https://registry-1.docker.io/v2/: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "institutionCertificateName-CA").

Reason/Cause

The firewall performs SSL inspection and injects its own certificate.

Solution

docker.io must be added to the "SSL inspection exception" list on the firewall.
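
To confirm that a firewall is intercepting TLS traffic, the certificate issuer presented for the registry can be inspected from an affected server (a sketch using openssl):

openssl s_client -connect registry-1.docker.io:443 -servername registry-1.docker.io </dev/null 2>/dev/null | openssl x509 -noout -issuer
#If the issuer is the institution's CA instead of a public CA, SSL inspection is active.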

Problem

The node gets stuck in NotReady status and the error message is as follows: "Unable to update cni config: no networks found in /etc/cni/net.d"

Reason/Cause

On the master, kube-flannel somehow fails to create the required folder and files.
Solution

(Alternative solutions: https://github.com/kubernetes/kubernetes/issues/54918)

$ sudo mkdir -p /etc/cni/net.d

$ sudo vi /etc/cni/net.d/10-flannel.conflist

#Add one of the following configurations (the second variant pins cniVersion and omits hairpinMode):

{"name": "cbr0","plugins": [{"type": "flannel","delegate": {"hairpinMode": true,"isDefaultGateway": true}},{"type": "portmap","capabilities": {"portMappings": true}}]}

----------

{"name": "cbr0","cniVersion": "0.3.1","plugins": [{"type": "flannel","delegate": {"isDefaultGateway": true}},{"type": "portmap","capabilities": {"portMappings": true}}]}

------------

sudo chmod -Rf 777 /etc/cni /etc/cni/*

sudo chown -Rf apinizer:apinizer /etc/cni /etc/cni/*


sudo systemctl daemon-reload

sudo systemctl restart kubelet


#Check if there is still a pod that cannot pull an image:

kubectl get pods -n kube-system

kubectl describe pod <POD_NAME> -n kube-system

Problem

Client certificates generated by kubeadm expire after 1 year - "internal server error. Error Detail: operation: [list] for kind: [pod] with name: [null] in namespace: [prod] failed"

Reason/Cause

Unable to connect to the server: x509: certificate has expired or is not yet valid

Solution
#These operations should be done on all master servers.

sudo kubeadm alpha certs check-expiration
sudo kubeadm alpha certs renew all
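
Note: on newer kubeadm releases (v1.20 and later) the alpha subcommand was promoted, so the equivalent commands are:

sudo kubeadm certs check-expiration
sudo kubeadm certs renew all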

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

#On all the nodes
sudo reboot -i

#Further reading:
https://serverfault.com/questions/1065444/how-can-i-find-which-kubernetes-certificate-has-expired
https://www.oak-tree.tech/blog/k8s-cert-yearly-renewwal
Problem

The connection to the server x.x.x.:6443 was refused - did you specify the right host or port?

Reason/Cause

This problem can occur for the reasons below:

  • If the disk capacity was extended, swap might have been re-enabled.
  • The user may not have authorization.
  • You may not be on the Master Kubernetes Server.
Solution

sudo swapoff -a

sudo vi /etc/fstab (swap line must be commented out or deleted)

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

sudo reboot    (optional) 

Problem

kubelet.service: Main process exited, code=exited, status=255

Reason/Cause

Although this problem has various causes, if the error says that the /etc/kubernetes/bootstrap-kubelet.conf file or another .conf file cannot be found, all configs can be recreated from scratch by following the procedure below.
Solution

#Existing configs and certificates are backed up and operations are performed

cd /etc/kubernetes/pki/
sudo mkdir /tmp/backup && sudo mkdir /tmp/backup2
sudo mv {apiserver.crt,apiserver-etcd-client.key,apiserver-kubelet-client.crt,front-proxy-ca.crt,front-proxy-client.crt,front-proxy-client.key,front-proxy-ca.key,apiserver-kubelet-client.key,apiserver.key,apiserver-etcd-client.crt} /tmp/backup/

sudo kubeadm init phase certs all --apiserver-advertise-address <MasterIP>
cd /etc/kubernetes/
sudo mv {admin.conf,controller-manager.conf,kubelet.conf,scheduler.conf} /tmp/backup2

sudo kubeadm init phase kubeconfig all
sudo systemctl restart docker && sudo systemctl restart containerd && sudo systemctl restart kubelet

---

# If the /etc/kubernetes/bootstrap-kubelet.conf file not found error also occurs on Worker Nodes, removing the node from the cluster and adding it again will fix the problem.

# Commands to be executed on the master node:

# First the problematic worker node is removed from the cluster

kubectl delete node <WORKER_NODE_NAME>

# Then a new join token is created

sudo kubeadm token create --print-join-command

# Commands to be executed on the worker node:

# Kubernetes configuration is reset.

sudo kubeadm reset

# Execute the join command received from the master.

sudo kubeadm join <MASTER_NODE_IP>:<PORT> --token <TOKEN> --discovery-token-ca-cert-hash sha256:<HASH>

Problem

ctr:  failed to verify certificate: x509: certificate is not valid

Reason/Cause

This error occurs when pulling images from a private registry whose certificate is not trusted on the host.
Solution

The workaround is to use the "--skip-verify" parameter.

For example, the command to pull the image into the "k8s.io" namespace:


ctr --namespace k8s.io images pull xxx.harbor.com/apinizercloud/managerxxxx --skip-verify

Problem

Failure to distribute pods evenly

Reason/Cause

Kubernetes does not distribute pods evenly by default: pods are placed on the nodes deemed most suitable based on available resources, without any spreading strategy or constraints.
Solution

To distribute the pods evenly using Pod Topology Spread Constraints, add the following snippet under the pod template's spec section (the second spec in a Deployment manifest):

spec:
     topologySpreadConstraints:
       - maxSkew: 1
         topologyKey: kubernetes.io/hostname
         whenUnsatisfiable: ScheduleAnyway
         labelSelector:
           matchExpressions:
             - key: node-role.kubernetes.io/control-plane
               operator: DoesNotExist
YML


Warning:
If you rely on control-plane labels to keep pods off those nodes, make sure the control-plane nodes actually carry the label.


Check:
You can verify that a node has the node-role.kubernetes.io/control-plane label with the following command:

kubectl get nodes --show-labels
CODE
Problem

Non-Graceful Node Shutdown in Kubernetes 

Reason/Cause

When a node in Kubernetes shuts down unexpectedly (Non-Graceful Shutdown), the Kubernetes Master detects this situation and takes necessary actions. However, this detection process may be delayed because it depends on the timeout parameters of the system.

Solution

The main parameters to take into account to set this duration are:

1. Node Status Update Frequency

kubelet --node-status-update-frequency=5s 
CODE
  • The Node Status Update Frequency parameter determines how often the kubelet running on a node reports the node's status to the Kubernetes API server; by default it is set to 10s.
  • Setting this value lower than the default makes the kubelet update the node status more frequently, which allows Kubernetes to detect outages faster.

2. Node Monitor Grace Period

kube-controller-manager --node-monitor-grace-period=20s
CODE
  • The Node Monitor Grace Period parameter sets the maximum amount of time the node controller (part of kube-controller-manager) waits before marking a node as "NotReady"; by default it is set to 40s.
  • This default value can be changed so that nodes are marked as "NotReady" sooner or later.

3. Pod Eviction Timeout

kube-controller-manager --pod-eviction-timeout=2m
CODE
  • Pod Eviction Timeout defines how long to wait after a node enters the "NotReady" state before the pods on it are evicted to other nodes; by default it is set to 5 minutes.
  • The default value can be changed so that pods on the node can be moved to other nodes more quickly.
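
On a kubeadm-based cluster these flags are typically not passed on the command line directly; a sketch of where they are usually set (file paths assume kubeadm defaults and may differ in your environment):

#kubelet: add the flag to the kubelet extra args and restart the service
#  /etc/default/kubelet (Debian/Ubuntu) or /etc/sysconfig/kubelet (RHEL/CentOS)
#  KUBELET_EXTRA_ARGS=--node-status-update-frequency=5s
sudo systemctl restart kubelet

#controller manager: edit the static pod manifest; the kubelet recreates the pod automatically
#  add the flags under spec.containers[0].command in:
sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml
CODE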
Problem

Avoiding Potential Conflicts When Adding a Cloned Worker Node to a Kubernetes Cluster

Reason/Cause

When a clone of a worker node already running in a Kubernetes cluster is included in the cluster, some configurations and credentials may conflict with the old node.

These overlaps include:

  • Duplicate machine-id values
  • Legacy Kubernetes configuration remnants
  • CNI and overlay network configuration conflicts
Solution

1. Kubeadm reset.

Completely reset the existing cluster configuration of the cloned worker node:

sudo kubeadm reset
sudo rm -rf $HOME/.kube
CODE

2. Clean the Overlay Network Created by Kubernetes.

CNI and other network components are cleaned:

rm -rf /var/lib/cni/
rm -rf /var/lib/kubelet/*
rm -rf /etc/cni/
ifconfig cni0 down && ip link delete cni0 
ifconfig flannel.1 down && ip link delete flannel.1
systemctl restart containerd && systemctl restart kubelet
CODE

3. The machine-id of the cloned machine is reset.

Recreate the machine-id according to the operating system:

---machine-id reset command for RHEL---
rm -f /etc/machine-id
systemd-machine-id-setup

---machine-id reset command for Ubuntu---
rm -f /etc/machine-id /var/lib/dbus/machine-id
systemd-machine-id-setup
cat /etc/machine-id > /var/lib/dbus/machine-id
CODE

4. Rejoining the Cluster.

The join command is received on the master node and executed on the cloned node:

kubeadm token create --print-join-command
CODE
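
The printed join command is then executed on the cloned worker node; its general form is as follows (all values are placeholders produced by the command above):

sudo kubeadm join <MASTER_NODE_IP>:<PORT> --token <TOKEN> --discovery-token-ca-cert-hash sha256:<HASH>
CODE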
Problem

Changing the Hostname of a Worker Node in an Existing Kubernetes Cluster

Reason/Cause

When the hostname of a worker node already running in a Kubernetes cluster is changed, some configurations and credentials may conflict with the old hostname. Therefore, the worker node whose hostname will change should be removed from the cluster and re-added after the hostname is changed. (Be aware that this operation may cause interruptions in the current working environment.)

Solution

1. Drain and Delete Node.

Connect to the master node in the cluster:

kubectl get nodes
kubectl drain <NODES_OLD_HOSTNAME> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <NODES_OLD_HOSTNAME>
CODE

2. Connect to the worker node whose hostname will be changed.

After changing the hostname, CNI and other network components are cleaned up:

sudo kubeadm reset
sudo hostnamectl set-hostname <NODES_NEW_HOSTNAME>
sudo reboot

hostname
# If the old hostname is still listed for the IP address 127.0.0.1 in /etc/hosts, it should be replaced with the new hostname in this section.
sudo vi /etc/hosts

rm -rf /var/lib/cni/
rm -rf /var/lib/kubelet/*
rm -rf /etc/cni/

systemctl stop kubelet && systemctl stop containerd

ifconfig cni0 down && ip link delete cni0 
ifconfig flannel.1 down && ip link delete flannel.1

systemctl restart containerd && systemctl restart kubelet
CODE

3. Rejoining the Cluster.

The join command is obtained from the master node and executed on the renamed worker node:

kubeadm token create --print-join-command
CODE
Problem

Problem on Node Due to Read-Only Disk

Reason/Cause

When the Linux kernel detects an error in the underlying file system (ext4, xfs, etc.), it automatically places the partition in read-only mode to maintain data integrity.

When Kubernetes detects a disk error or other critical system issues on a node, it automatically adds a taint to prevent that node from receiving pods:

node.kubernetes.io/unschedulable:NoSchedule

Solution


1. The server is rebooted

sudo reboot
CODE

If the node has a disk or system error, some problems can be fixed after rebooting.
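
Before or after the reboot, the read-only mount and the underlying file system error can be confirmed (a sketch; adjust the filters to your file system):

#List file systems currently mounted read-only
findmnt -O ro

#Look for file system or I/O errors in the kernel log
dmesg | grep -i -E 'ext4|xfs|i/o error|read-only'
CODE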

2. The taint on the node is removed

kubectl taint nodes <node-name> node.kubernetes.io/unschedulable:NoSchedule-
CODE

3. If the node still does not accept the pod, uncordon is applied

kubectl uncordon <node-name>
CODE

The uncordon command makes the node schedulable and ensures that the unschedulable taint is removed in the background.

Problem

ImageStatus failed: Id or size of image "k8s.gcr.io/kube-scheduler:v1.18.20" is not set

Reason/Cause

This error occurs when the Docker version is upgraded along with a server package update. Kubernetes 1.18 does not support the new Docker version, which causes an incompatibility between kubelet and Docker.
Solution

The issue can be resolved by downgrading Docker to a release compatible with Kubernetes 1.18 (for example, the 19.03 series used below).

# Check that the image exists and that docker inspect works.
docker inspect k8s.gcr.io/kube-scheduler:v1.18.20

# The Docker version is checked.
docker --version

# Current docker packages are removed
sudo apt remove -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# List compatible Docker versions
apt-cache madison docker-ce | grep 19.03

# Install Docker version 19.03
sudo apt install -y docker-ce=5:19.03.15~3-0~ubuntu-focal \
    docker-ce-cli=5:19.03.15~3-0~ubuntu-focal \
    containerd.io

# Version checked.
docker --version

# To prevent incorrect updates, packages are locked with the hold command.
sudo apt-mark hold docker-ce docker-ce-cli containerd.io

# The Docker and kubelet services are restarted.
sudo systemctl restart docker.service
sudo systemctl restart kubelet.service
BASH