Server removed from a Kubernetes cluster not getting correct overlay network IP when added to another Kubernetes cluster
Reason/Why: The Flannel files and settings used for the pod overlay network remain on the server and must be deleted manually.
Solution: On the control-plane Kubernetes node, remove the relevant server from the cluster with the “kubectl delete node <NODE_NAME>” command, then clean the cluster settings on the server being disconnected with the “sudo kubeadm reset” command. Then perform the following operations on that server:
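The exact cleanup commands are not preserved here; a minimal sketch of removing the leftover Flannel/CNI state on the disconnected server (interface names assume a default Flannel setup) could look like this:

```bash
# Run on the server that was removed from the cluster, after "sudo kubeadm reset".
# Remove the leftover CNI configuration and Flannel state so the node gets a fresh
# overlay network IP when it joins the new cluster.
sudo rm -rf /etc/cni/net.d /var/lib/cni/
sudo rm -f /run/flannel/subnet.env

# Delete the network interfaces left behind by the old cluster.
sudo ip link delete cni0 2>/dev/null
sudo ip link delete flannel.1 2>/dev/null

# Flush the iptables rules created for the old pod network, then restart the runtime.
sudo iptables -F && sudo iptables -t nat -F && sudo iptables -X
sudo systemctl restart docker     # or: sudo systemctl restart containerd
```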
Error received when installing Docker on CentOS 8.3.x servers
Reason/Why: With the release of RHEL 8 and CentOS 8, the docker package was removed from the default package repositories and replaced with podman and buildah; Red Hat decided not to provide official support for Docker. These pre-installed packages therefore conflict with and block the Docker installation.
Solution:
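The original command list is missing here; a sketch of the usual workaround, removing the conflicting packages and installing Docker CE from Docker’s own repository, is:

```bash
# Remove the packages that conflict with Docker on CentOS 8 / RHEL 8.
sudo dnf remove -y podman buildah runc

# Add Docker's repository and install Docker CE.
sudo dnf install -y dnf-plugins-core
sudo dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
sudo dnf install -y docker-ce docker-ce-cli containerd.io --allowerasing

sudo systemctl enable --now docker
```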
Reason/Why: swap and SELinux, which are usually enabled by default on Linux operating systems, must be disabled for Kubernetes.
Solution:
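A sketch of disabling both permanently (standard kubeadm prerequisites):

```bash
# Disable swap now and comment it out of /etc/fstab so it stays off after reboot.
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab

# Put SELinux into permissive mode now and keep it that way after reboot.
sudo setenforce 0
sudo sed -i 's/^SELINUX=enforcing$/SELINUX=permissive/' /etc/selinux/config
```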
Deleting namespace stuck at "Terminating" state
Reason/Why: Namespace deletion operation may get stuck due to finalizers.
Solution:
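A commonly used sketch (requires jq): it empties the namespace’s finalizers and pushes the object back through the /finalize subresource so the API server can complete the deletion.

```bash
# Replace <NAMESPACE> with the namespace stuck in Terminating.
NS=<NAMESPACE>

kubectl get namespace "$NS" -o json \
  | jq '.spec.finalizers = []' \
  > /tmp/"$NS".json

kubectl replace --raw "/api/v1/namespaces/$NS/finalize" -f /tmp/"$NS".json
```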
Docker pull "x509 certificate" issue
Docker pull "x509 certificate" issue
Reason/Why: Docker refuses to pull from a registry whose certificate it cannot verify. If the relevant organization is not using HTTPS (or the registry uses a self-signed certificate), the registry must be declared in Docker’s daemon file; this operation is repeated on all nodes using Docker.
Solution (For organizations not using HTTPS):
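A sketch of the daemon.json entry (the registry address is a placeholder; merge it into the file if daemon.json already contains other settings):

```bash
# On every node that runs Docker: declare the registry as insecure, then restart Docker.
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "insecure-registries": ["registry.example.local:5000"]
}
EOF
sudo systemctl restart docker
```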
Solution (For organizations using HTTPS): If the relevant organization is using HTTPS, SSL certificate (“crt”) from the relevant organization needs to be added to servers.
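A sketch of installing the certificate (registry address, file name, and the CentOS/RHEL trust paths are assumptions; Ubuntu uses /usr/local/share/ca-certificates and update-ca-certificates instead):

```bash
# Trust the organization's CA for this registry in Docker.
sudo mkdir -p /etc/docker/certs.d/registry.example.local:5000
sudo cp organization-ca.crt /etc/docker/certs.d/registry.example.local:5000/ca.crt

# Also add it to the operating system trust store.
sudo cp organization-ca.crt /etc/pki/ca-trust/source/anchors/
sudo update-ca-trust

sudo systemctl restart docker
```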
If Nexus proxy is used
Reason/Why: If the relevant organization pulls images through a Nexus proxy, Docker on the servers must be redirected to this proxy address.
Solution:
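A sketch of the redirection via a registry mirror in daemon.json (the Nexus URL is a placeholder; merge with any existing settings):

```bash
# On every Docker node: point Docker at the Nexus proxy, then restart Docker.
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "registry-mirrors": ["https://nexus.example.local:8443"]
}
EOF
sudo systemctl restart docker
```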
Kubernetes DNS Problem (connection timed out; no servers could be reached)
Reason/Why: The node may also be stuck in the Ready,SchedulingDisabled state.
Test:
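A typical in-cluster DNS test, using the busybox image from the Kubernetes DNS-debugging documentation:

```bash
# Run a throwaway pod and try to resolve the in-cluster "kubernetes" service.
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default
```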
If we get the following result, everything is correct:
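Illustrative successful output; the addresses shown are the kubeadm defaults and will differ per cluster:

```
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      kubernetes.default
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
```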
If we get the following result, there is an error and the following steps need to be checked:
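The failing case returns the error from this section’s title:

```
nslookup: can't resolve 'kubernetes.default'
;; connection timed out; no servers could be reached
```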
Take a look at the resolv.conf file that the pods use; an illustrative correct example and an incorrect example are shown below:
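A sketch, assuming the default kubeadm cluster DNS service IP of 10.96.0.10; the exact addresses in your cluster may differ:

```bash
# Check the resolv.conf a pod actually sees.
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- cat /etc/resolv.conf

# (Correct) - points at the cluster DNS service and contains the cluster search domains:
#   nameserver 10.96.0.10
#   search default.svc.cluster.local svc.cluster.local cluster.local
#   options ndots:5

# (Incorrect) - points at an unreachable or host-level nameserver and/or is missing the search domains:
#   nameserver 8.8.8.8
```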
Solution: In one customer environment, the problem was resolved by adding the organization’s domain address to the /etc/resolv.conf file on the nodes.
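A sketch of the fix applied on the nodes (example.local stands in for the organization’s domain):

```bash
# Add the organization's domain to the search line of /etc/resolv.conf on each node.
sudo vi /etc/resolv.conf
#   search example.local
```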
Hostname resolution problems because DNS setting changes are not reflected in the `/etc/resolv.conf` file on Ubuntu servers hosting Kubernetes clusters
Reason/Why: On servers running the Ubuntu operating system, changes made to the DNS settings may not always be reflected in resolv.conf, or may be skipped. Since Kubernetes by default falls back to the node’s /etc/resolv.conf file after its internal DNS, this file must be correct.
Solution:
On all servers:
Only on master server:
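A sketch of the usual fix, assuming the stock Ubuntu setup where systemd-resolved manages resolv.conf (DNS addresses and domain are placeholders):

```bash
# On all servers: set the organization's DNS servers in systemd-resolved.
sudo vi /etc/systemd/resolved.conf
#   DNS=10.0.0.53 10.0.0.54
#   Domains=example.local
sudo systemctl restart systemd-resolved

# Make /etc/resolv.conf point at the file systemd-resolved actually maintains, then verify it.
sudo ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf
cat /etc/resolv.conf

# Only on the master server: restart CoreDNS so it re-reads the node's resolv.conf.
kubectl -n kube-system rollout restart deployment coredns
```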
docker: Error response from daemon: Get "https://registry-1.docker.io/v2/": x509: certificate signed by unknown authority
Node Stays NotReady and "Unable to update cni config: no networks found in /etc/cni/net.d"
Reason/Why: kube-flannel on the master fails to create the required CNI configuration directory and files under /etc/cni/net.d.
Solution: Create the CNI configuration manually (alternative solutions are also discussed in GitHub Issue #54918).
The following content is added:
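A sketch of the missing CNI configuration, using the content of Flannel’s default cni-conf.json (adjust if you run a customized Flannel):

```bash
sudo mkdir -p /etc/cni/net.d
sudo tee /etc/cni/net.d/10-flannel.conflist > /dev/null <<'EOF'
{
  "name": "cbr0",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "hairpinMode": true,
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}
EOF
sudo systemctl restart kubelet
```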
Client certificates generated by kubeadm expire after 1 year - "internal server error. Error Detail: operation: [list] for kind: [pod] with name: [null] in namespace: [prod] failed"
Reason/Why: The client certificates generated by kubeadm have expired, producing the error “Unable to connect to the server: x509: certificate has expired or is not yet valid”.
Solution: These operations must be performed on all master servers.
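A sketch of the renewal procedure with kubeadm (on kubeadm versions older than 1.20 the subcommand is “kubeadm alpha certs”):

```bash
# Check which certificates have expired, then renew them all.
sudo kubeadm certs check-expiration
sudo kubeadm certs renew all

# Force kubelet to recreate the control-plane static pods so they pick up the new certificates.
sudo mv /etc/kubernetes/manifests /etc/kubernetes/manifests.bak
sleep 60
sudo mv /etc/kubernetes/manifests.bak /etc/kubernetes/manifests

# Refresh the kubeconfig used by kubectl.
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
```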
Error: "The connection to the server x.x.x.:6443 was refused - did you specify the right host or port?"
Error: "The connection to the server x.x.x.:6443 was refused - did you specify the right host or port?"
Reason/Why: The above problem can occur due to any of the following reasons:
- Swap may have been re-enabled (for example after a new disk was added) and needs to be turned off again.
- The current user may not have the required permissions or a valid kubeconfig.
- The command may not be run on the master (control-plane) server.
Solution:
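A sketch of the checks matching the causes listed above:

```bash
# 1) Make sure swap is off again (it can come back after a disk is added or the server reboots).
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab

# 2) Make sure kubectl is using a valid kubeconfig for this user.
ls -l $HOME/.kube/config
export KUBECONFIG=$HOME/.kube/config

# 3) On the master server, check that kubelet and the API server are actually running.
sudo systemctl status kubelet
sudo crictl ps | grep kube-apiserver     # or: sudo docker ps | grep kube-apiserver
```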
kubelet.service: Main process exited, code=exited, status=255
Reason/Why: While this problem has various causes, if the error says that any .conf file such as /etc/kubernetes/bootstrap-kubelet.conf cannot be found, all configs can be recreated from scratch by applying the following operations.
Solution (Master Node): Operations are performed by backing up existing configs and certificates.
Solution (Worker Node): If the error of not finding the /etc/kubernetes/bootstrap-kubelet.conf file occurs on Worker Nodes as well, removing the node from the cluster and adding it again will solve the problem.
Commands to be run on master node:
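A sketch of regenerating the certificates and kubeconfig files on the master (back up first; extra flags may be needed if your API server address or SANs are not the defaults):

```bash
# Back up the existing configuration and certificates.
sudo cp -r /etc/kubernetes /etc/kubernetes.bak
sudo cp -r /var/lib/kubelet /var/lib/kubelet.bak

# Move the old kubeconfig files aside so kubeadm regenerates them.
sudo mv /etc/kubernetes/*.conf /tmp/

# Recreate certificates and kubeconfig files, then restart kubelet.
sudo kubeadm init phase certs all
sudo kubeadm init phase kubeconfig all
sudo systemctl restart kubelet
```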
Commands to be run on worker node:
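A sketch of removing the worker and joining it again (node name, master address, token, and hash are placeholders):

```bash
# On the master: remove the broken worker and print a fresh join command.
kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data   # older kubectl: --delete-local-data
kubectl delete node <NODE_NAME>
kubeadm token create --print-join-command

# On the worker: reset the old configuration and run the join command printed above.
sudo kubeadm reset -f
sudo kubeadm join <MASTER_IP>:6443 --token <TOKEN> --discovery-token-ca-cert-hash sha256:<HASH>
```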
ctr: failed to verify certificate: x509: certificate is not valid
Reason/Why: This problem occurs when pulling images from a private registry whose certificate is not trusted.
Solution: Use the --skip-verify parameter. An example command run in the “k8s.io” namespace is shown below:
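A sketch with a placeholder registry and image:

```bash
# Pull the image into containerd's "k8s.io" namespace, skipping certificate verification.
sudo ctr -n k8s.io images pull --skip-verify registry.example.local:5000/myapp:1.0
```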
Pods Cannot Be Distributed in a Balanced Manner
Reason/Why: Kubernetes does not distribute pods in a balanced manner because by default, pods are placed on nodes that seem most suitable according to available resources, without a specific strategy or constraint.
Solution: Add a topologySpreadConstraints block (Pod Topology Spread Constraints) under the pod template’s spec section (the second spec in a Deployment manifest) so that pods are spread evenly across nodes:
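A minimal Deployment sketch (name, labels, and image are placeholders); the topologySpreadConstraints block sits under the pod template’s spec and spreads pods evenly across nodes by hostname:

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 4
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: myapp
      containers:
        - name: myapp
          image: nginx:1.25
EOF
```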
Check: You can check if the node-role.kubernetes.io/control-plane label exists on the node with the following command.
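For example:

```bash
# List the nodes carrying the control-plane label (no output means the label is absent).
kubectl get nodes -l node-role.kubernetes.io/control-plane

# Or inspect a single node's labels.
kubectl get node <NODE_NAME> --show-labels
```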
Non-Graceful Node Shutdown in Kubernetes (Unexpected Shutdown of K8s Node)
Reason/Why: When a node shuts down unexpectedly in Kubernetes (Non-Graceful Shutdown), Kubernetes Master detects this situation and performs necessary operations. However, this detection process may be delayed as it depends on the system’s timeout parameters.
Solution: The main parameters to consider for adjusting this time are:
1. Node Status Update Frequency
- The Node Status Update Frequency parameter determines how often the kubelet running on a node updates the node’s status on the Kubernetes API server; the default value is 10s.
- Setting this lower than the default makes kubelet report node status more frequently, which allows Kubernetes to detect interruptions faster.
2. Node Monitor Grace Period
- The Node Monitor Grace Period parameter determines how long the node controller (kube-controller-manager) waits before marking a node as “NotReady”; the default value is 40s.
- This default can be lowered (or raised) so that the node is marked “NotReady” sooner or later.
3. Pod Eviction Timeout
- Pod Eviction Timeout (a kube-controller-manager flag) defines how long to wait after a node goes “NotReady” before the pods on it are evicted and relocated to other nodes; the default value is 5 minutes.
- Lowering the default moves the pods on the failed node to other nodes faster; a configuration sketch for all three parameters follows below.
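A sketch of where these parameters are set on a kubeadm cluster; the values below are examples, not recommendations:

```bash
# 1) Kubelet (on every node): set nodeStatusUpdateFrequency in the kubelet config, then restart kubelet.
sudo vi /var/lib/kubelet/config.yaml
#   nodeStatusUpdateFrequency: "5s"
sudo systemctl restart kubelet

# 2) and 3) kube-controller-manager (on the control plane): add the flags to the static pod
#    manifest; kubelet recreates the pod automatically after the file is saved.
sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml
#   - --node-monitor-grace-period=20s
#   - --pod-eviction-timeout=1m
```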
Preventing Possible Conflicts When Adding a Cloned Worker Node to a Kubernetes Cluster
Reason/Why: When a clone of a worker node already running in a Kubernetes cluster is added to the cluster, some configurations and credentials may conflict with the old node. These conflicts include:
- Duplicate machine-id values
- Old Kubernetes configuration remnants
- CNI and overlay network configuration conflicts
Solution:
1. Kubeadm reset. The cloned worker node’s existing cluster configuration is completely reset:
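A sketch:

```bash
# On the cloned worker node: wipe the configuration inherited from the original node.
sudo kubeadm reset -f
sudo rm -rf /etc/kubernetes/ /var/lib/kubelet/pki
```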
2. Kubernetes Created Overlay Network is Cleaned. CNI and other network components are cleaned:
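A sketch, assuming Flannel (interface names differ for other CNIs such as Calico):

```bash
sudo rm -rf /etc/cni/net.d /var/lib/cni/
sudo ip link delete cni0 2>/dev/null
sudo ip link delete flannel.1 2>/dev/null
sudo iptables -F && sudo iptables -t nat -F && sudo iptables -X
```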
3. Cloned Machine’s machine-id is Reset. machine-id is regenerated according to the operating system:
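A sketch for systemd-based distributions:

```bash
sudo rm -f /etc/machine-id
sudo rm -f /var/lib/dbus/machine-id        # Debian/Ubuntu keep a second copy here
sudo systemd-machine-id-setup              # regenerates /etc/machine-id
sudo cp /etc/machine-id /var/lib/dbus/machine-id
cat /etc/machine-id                        # verify it differs from the original node
```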
4. Rejoining Cluster. Join command is obtained from master node and executed on cloned node:
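A sketch (addresses, token, and hash are placeholders):

```bash
# On the master: print a fresh join command.
kubeadm token create --print-join-command

# On the cloned node: run the command that was printed, e.g.:
sudo kubeadm join <MASTER_IP>:6443 --token <TOKEN> --discovery-token-ca-cert-hash sha256:<HASH>
```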
Changing the Hostname of a Worker Node in an Existing Kubernetes Cluster
Reason/Why: When the hostname of a worker node already running in a Kubernetes cluster is changed, some configurations and credentials may conflict with the old hostname. Therefore, when performing this operation, the worker node whose hostname will be changed should be removed from the cluster and added again after the hostname information is changed.
Solution:
1. Drain and Delete Node. Connect to master node in cluster:
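A sketch (the node name is a placeholder):

```bash
kubectl drain <OLD_NODE_NAME> --ignore-daemonsets --delete-emptydir-data   # older kubectl: --delete-local-data
kubectl delete node <OLD_NODE_NAME>
```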
2. Connect to Worker Node Whose Hostname Will Change. After hostname is changed, CNI and other network components are cleaned:
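A sketch, assuming Flannel as the CNI:

```bash
sudo hostnamectl set-hostname <NEW_HOSTNAME>
sudo kubeadm reset -f
sudo rm -rf /etc/cni/net.d /var/lib/cni/
sudo ip link delete cni0 2>/dev/null
sudo ip link delete flannel.1 2>/dev/null
```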
3. Rejoining Cluster. The join command is obtained from the master node and executed on the renamed worker node:
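A sketch (addresses, token, and hash are placeholders):

```bash
# On the master:
kubeadm token create --print-join-command

# On the renamed worker: run the printed join command.
sudo kubeadm join <MASTER_IP>:6443 --token <TOKEN> --discovery-token-ca-cert-hash sha256:<HASH>
```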
Issue on Node Due to Read-Only Disk
Reason/Why: When Linux kernel detects an error in the underlying file system (ext4, xfs, etc.), it automatically puts the relevant disk partition into read-only mode to protect data integrity. When Kubernetes detects a disk error or other critical system issues on a node, it automatically adds a taint to prevent that node from accepting pods: node.kubernetes.io/unschedulable:NoSchedule
Solution:
1. Server is rebooted
If the node has disk or system errors, some issues may be fixed after reboot.
2. Taint on node is removed
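For example (note the trailing “-” that removes the taint):

```bash
kubectl taint nodes <NODE_NAME> node.kubernetes.io/unschedulable:NoSchedule-
```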
3. If node still doesn’t accept pods, uncordon is applied
The uncordon command makes the node schedulable and ensures that the unschedulable taint is removed in the background.
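For example:

```bash
kubectl uncordon <NODE_NAME>
```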
ImageStatus failed: Id or size of image "k8s.gcr.io/kube-scheduler:v1.18.20" is not set
Reason/Why: This error is encountered when the Docker version is upgraded along with a server package update. Kubernetes 1.18 does not support the newly installed Docker version, causing an incompatibility in the communication between kubelet and Docker.
Solution: The problem is solved by downgrading Docker to one of the versions (18.x, 19.x, 20.x) compatible with Kubernetes 1.18.
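A sketch on a yum-based system; the exact version string must be taken from the list output:

```bash
# List the available Docker versions and downgrade to one supported by Kubernetes 1.18.
sudo yum --showduplicates list docker-ce
sudo yum downgrade -y docker-ce-<VERSION> docker-ce-cli-<VERSION>
sudo systemctl restart docker kubelet
```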
Intermittent Timeout Error from Manager Pod to Worker Pod via ClusterIP (Service)
Reason/Why: The net.bridge.bridge-nf-call-iptables setting is missing. The kube-proxy component, which runs in iptables mode, relies on Linux bridge traffic being passed through iptables to route Service traffic. With net.bridge.bridge-nf-call-iptables set to 0, bridge traffic bypasses iptables, which breaks routing from the Service ClusterIP to the pod IPs. In this case kube-proxy usually logs a warning like: “Missing br-netfilter module or unset sysctl br-nf-call-iptables…”
Solution: Fix the sysctl setting to re-enable Service routing. This must be applied on all Kubernetes nodes.
1) Adding Setting to Configuration File (or Checking): Make sure the following setting is set to 1. You can open the configuration file with the following command:
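For example (the file name under /etc/sysctl.d/ is an assumption; /etc/sysctl.conf also works):

```bash
sudo vi /etc/sysctl.d/k8s.conf
```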
Add the following line to the file (or edit it to 1 if it exists):
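```
net.bridge.bridge-nf-call-iptables = 1
```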
2) Loading Settings Immediately: Run the following command to load changes in the file to the system:
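For example (loading the br_netfilter module first so the sysctl key exists):

```bash
sudo modprobe br_netfilter
sudo sysctl --system        # or: sudo sysctl -p /etc/sysctl.d/k8s.conf
```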
3) Verifying Settings: Use the following command to verify configuration:
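```bash
sysctl net.bridge.bridge-nf-call-iptables
```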
Output should return 1.
4) Restarting Kube-proxy: Restart the pod with the following command so kube-proxy can receive new settings:
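For example:

```bash
# Restart the kube-proxy pods; the DaemonSet recreates them with the new setting.
kubectl -n kube-system rollout restart daemonset kube-proxy
# or delete them individually:
kubectl -n kube-system delete pod -l k8s-app=kube-proxy
```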
When these steps are applied, timeout issues occurring in connections from Manager Pod to Worker Pod via ClusterIP will be resolved.

