 
Troubleshooting Kubernetes with Commands
   
 
Working with Kubernetes can be both exciting and challenging. While it offers powerful tools for managing containerized applications, issues are inevitable. Whether it's a pod stuck in CrashLoopBackOff or a service not responding, knowing how to troubleshoot effectively is crucial.
In this chapter, we'll walk through some common scenarios and the kubectl commands that can help us diagnose and resolve them.
Checking Cluster Health
Before diving into specific issues, it's essential to ensure our cluster is healthy.
View Cluster Nodes
Use the following command to view cluster nodes -
$ kubectl get nodes
Output
NAME           STATUS   ROLES           AGE   VERSION
controlplane   Ready    control-plane   71m   v1.31.6
node01         Ready    <none>          71m   v1.31.6
This command lists all nodes in the cluster. We should see all nodes in the Ready state. If any node is NotReady, it might indicate issues with the node's health or connectivity.
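On a larger cluster, the Ready check described above can be scripted. The following is a minimal sketch, assuming the input has the shape of `kubectl get nodes -o json`; the function name `not_ready_nodes` and the sample data are hypothetical and trimmed to the fields the function reads.

```python
import json

def not_ready_nodes(node_list):
    """Return the names of nodes whose Ready condition is not "True"."""
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A missing Ready condition is treated as not ready.
        if ready is None or ready.get("status") != "True":
            names.append(node["metadata"]["name"])
    return names

# Hypothetical sample, not output from the cluster shown in this chapter.
sample_nodes = json.loads("""
{"items": [
  {"metadata": {"name": "controlplane"},
   "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
  {"metadata": {"name": "node01"},
   "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}}
]}
""")

print(not_ready_nodes(sample_nodes))  # ['node01']
```

To run it against a real cluster, the JSON printed by `kubectl get nodes -o json` could be saved to a file and loaded with `json.load` in place of the sample.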
Inspect Node Details
Use the following command to inspect node details -
$ kubectl describe node <node-name>
It shows detailed information about a specific node, including its capacity, allocated resources, conditions, and recent events. It's useful for identifying issues like disk pressure or memory pressure.
Investigating Pods
Pods are the smallest deployable units in Kubernetes. When things go wrong, pods are often the first place to look.
List All Pods
You can use the following command to list all the pods -
$ kubectl get pods -A
Output
NAMESPACE      NAME                                    READY   STATUS    RESTARTS   AGE
kube-flannel   kube-flannel-ds-fbqdb                   1/1     Running   0          75m
kube-flannel   kube-flannel-ds-jln5p                   1/1     Running   0          75m
kube-system    coredns-7c65d6cfc9-4t6gh                1/1     Running   0          75m
kube-system    coredns-7c65d6cfc9-gprrn                1/1     Running   0          75m
kube-system    etcd-controlplane                       1/1     Running   0          75m
kube-system    hostpath-provisioner-5558658586-md76l   1/1     Running   0          75m
It lists all pods across all namespaces. Look for pods not in the Running or Completed state.
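Scanning a long pod list by eye is error-prone, so the filter can be sketched in code. This example assumes input shaped like `kubectl get pods -A -o json`. The helper names and the sample pods are hypothetical, and `display_status` only approximates the STATUS column (a container's waiting reason if there is one, otherwise the pod phase).

```python
def display_status(pod):
    """Approximate the STATUS column: a waiting reason if any, else the phase."""
    status = pod.get("status", {})
    for cs in status.get("containerStatuses", []):
        reason = cs.get("state", {}).get("waiting", {}).get("reason")
        if reason:
            return reason
    return status.get("phase", "Unknown")

def unhealthy_pods(pod_list):
    """Return (namespace, name, status) for pods that need attention."""
    healthy = {"Running", "Succeeded"}  # kubectl displays Succeeded as Completed
    rows = []
    for pod in pod_list.get("items", []):
        shown = display_status(pod)
        if shown not in healthy:
            meta = pod["metadata"]
            rows.append((meta.get("namespace", "default"), meta["name"], shown))
    return rows

# Hypothetical sample, trimmed to the fields the functions read.
sample_pods = {"items": [
    {"metadata": {"namespace": "kube-system", "name": "etcd-controlplane"},
     "status": {"phase": "Running",
                "containerStatuses": [{"name": "etcd", "state": {"running": {}}}]}},
    {"metadata": {"namespace": "default", "name": "crashloop-pod"},
     "status": {"phase": "Running",
                "containerStatuses": [{"name": "crashloop-container",
                                       "state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
    {"metadata": {"namespace": "dev", "name": "unscheduled-pod"},
     "status": {"phase": "Pending"}},
]}

for row in unhealthy_pods(sample_pods):
    print(row)
```

Note that a crash-looping pod can still report the Running phase, which is why the sketch looks at container states before falling back to the phase.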
Describe a Pod
Use the following command to describe a pod -
$ kubectl describe pod <pod-name> -n <namespace>
For instance:
$ kubectl describe pod kube-flannel-ds-fbqdb -n kube-flannel
Output
Name:                 kube-flannel-ds-fbqdb
Namespace:            kube-flannel
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      flannel
Node:                 controlplane/172.16.8.5
Start Time:           Tue, 29 Apr 2025 11:15:04 +0000
It provides detailed information about the pod, including events, which can indicate issues like failed scheduling or image pull errors.
View Pod Logs
To view pod logs, use the following command -
$ kubectl logs kube-flannel-ds-fbqdb -n kube-flannel
Output
I0429 11:15:25.762540 1 kube.go:139] Waiting 10m0s for node controller to sync
I0429 11:15:25.762636 1 kube.go:469] Starting kube subnet manager
I0429 11:15:25.769460 1 kube.go:490] Creating the node lease for IPv4. This is the n.Spec.PodCIDRs: [10.244.1.0/24]
Without the -c flag, the command shows the logs from the pod's default container. If the pod has multiple containers, specify the container name:
$ kubectl logs <pod-name> -c <container-name> -n <namespace>
Common Pod Issues
Common Pod issues include application errors, misconfigured environment variables, or insufficient resources. Let's explore some frequent pod-related problems and how to troubleshoot them.
CrashLoopBackOff
$ kubectl get pods
Output
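NAME            READY   STATUS             RESTARTS   AGE
crashloop-pod   0/1     CrashLoopBackOff   5          30s
This status indicates that a container in the pod keeps exiting shortly after it starts, and that the kubelet is waiting for a back-off delay before restarting it again.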
Describe the pod to view recent events:
$ kubectl describe pod crashloop-pod
Output
Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  30s   default-scheduler  Successfully assigned default/crashloop-pod to node1
  Normal   Pulled     30s   kubelet, node1     Container image "busybox" already present on machine
  Warning  BackOff    5s    kubelet, node1     Back-off restarting failed container
  Normal   Killing    5s    kubelet, node1     Killing container with id docker://crashloop-container:container has runAsNonRoot and image has non-numeric user (busybox), cannot find user busybox in /etc/passwd
Check the pod's logs for error messages:
$ kubectl logs crashloop-pod
Output
sh: nonexistent-command: not found
This error occurs when the container tries to run a command that does not exist in the image. To resolve it:
- Set a valid command, or remove the invalid command, in the pod's YAML file.
- Delete and recreate the pod after making the changes.
- Confirm that the pod's status is Running and check the logs to verify the fix.
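The exit code of the last failed container run often narrows down the cause before we even open the logs. The sketch below assumes a single pod object shaped like `kubectl get pod crashloop-pod -o json`. The function name, the hint table, and the sample values (including exit code 127, which is what a shell returns for "command not found") are illustrative assumptions rather than output captured from the pod above.

```python
# Conventional meanings of common exit codes (values above 128 are 128 + signal number).
EXIT_CODE_HINTS = {
    126: "command found but not executable",
    127: "command not found",
    137: "killed by SIGKILL (128 + 9), often the OOM killer",
    139: "segmentation fault (128 + 11)",
    143: "terminated by SIGTERM (128 + 15)",
}

def crash_summary(pod):
    """Summarize waiting reason, restart count, and last exit code per container."""
    rows = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting", {})
        terminated = cs.get("lastState", {}).get("terminated", {})
        code = terminated.get("exitCode")
        rows.append({
            "container": cs["name"],
            "reason": waiting.get("reason"),
            "restarts": cs.get("restartCount", 0),
            "exit_code": code,
            "hint": EXIT_CODE_HINTS.get(code, "see container logs"),
        })
    return rows

# Hypothetical sample, trimmed to the fields the function reads.
sample_pod = {
    "metadata": {"name": "crashloop-pod"},
    "status": {"containerStatuses": [{
        "name": "crashloop-container",
        "restartCount": 5,
        "state": {"waiting": {"reason": "CrashLoopBackOff"}},
        "lastState": {"terminated": {"exitCode": 127, "reason": "Error"}},
    }]},
}

for row in crash_summary(sample_pod):
    print(row)
```

The hint is only a starting point; the logs of the previous container run (`kubectl logs crashloop-pod --previous`) usually give the definitive error message.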
ImagePullBackOff / ErrImagePull
These statuses appear when Kubernetes can't pull the container image, usually because of authentication issues or an incorrect image name or tag.
Describe the pod to get error details:
$ kubectl describe pod <pod-name> -n <namespace>
Verify the image name and tag in your deployment configuration. Ensure the image exists in the specified registry and that Kubernetes has access to it. Authentication issues with private registries are a common culprit.
Pending Pods
If a pod is stuck in the Pending state, it usually means it hasn't been scheduled to a node, possibly due to resource constraints or node selectors.
Describe the pod to see scheduling events:
$ kubectl describe pod <pod-name> -n <namespace>
Look for FailedScheduling events in the output. Their messages name the cause, such as insufficient CPU or memory, a node selector or affinity rule that no node satisfies, or a taint the pod does not tolerate.
Service and Networking Issues
Services expose applications running on pods. Networking problems can prevent communication between services and pods.
List the Services
Use the following command to list the services -
$ kubectl get svc -A
Output
NAMESPACE     NAME                   TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                  AGE
default       kubernetes             ClusterIP   10.96.0.1     <none>        443/TCP                  50m
kube-system   kube-dns               ClusterIP   10.96.0.10    <none>        53/UDP,53/TCP,9153/TCP   50m
kube-system   kubelet-csr-approver   ClusterIP   10.107.94.7   <none>        8080/TCP                 50m
It lists all the services across namespaces.
Describe a Service
Use the following command to describe a service -
$ kubectl describe svc kubelet-csr-approver -n kube-system
Output
Type:                     ClusterIP
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.107.94.7
IPs:                      10.107.94.7
Port:                     metrics  8080/TCP
TargetPort:               metrics/TCP
Endpoints:                10.244.1.3:8080,10.244.1.6:8080
Session Affinity:         None
Internal Traffic Policy:  Cluster
Events:                   <none>
It provides details about the service's configuration and endpoints.
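When the Endpoints field is empty, a frequent cause is a selector that matches no pod labels. The following sketch shows how that matching works, assuming objects shaped like `kubectl get svc <name> -o json` and `kubectl get pods -A -o json`. The service `web`, the pod names, and the labels are hypothetical.

```python
def selector_matches(selector, labels):
    """True when every selector key/value pair appears in the pod labels."""
    return bool(selector) and all(labels.get(k) == v for k, v in selector.items())

def pods_for_service(service, pod_list):
    """Return names of pods in the Service's namespace that its selector matches."""
    selector = service.get("spec", {}).get("selector") or {}
    namespace = service["metadata"].get("namespace", "default")
    return [
        pod["metadata"]["name"]
        for pod in pod_list.get("items", [])
        if pod["metadata"].get("namespace", "default") == namespace
        and selector_matches(selector, pod["metadata"].get("labels") or {})
    ]

# Hypothetical sample: one matching pod, one with a typo in its label,
# and one with the right label in a different namespace.
service = {"metadata": {"name": "web", "namespace": "dev"},
           "spec": {"selector": {"app": "web"}}}
pods = {"items": [
    {"metadata": {"name": "web-1", "namespace": "dev", "labels": {"app": "web"}}},
    {"metadata": {"name": "web-2", "namespace": "dev", "labels": {"app": "wbe"}}},
    {"metadata": {"name": "web-3", "namespace": "prod", "labels": {"app": "web"}}},
]}

print(pods_for_service(service, pods))  # ['web-1']
```

This covers label matching only. A matching pod must also be Ready before its address is listed as a ready endpoint.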
Test Service Connectivity
Use port forwarding to test service access:
$ kubectl port-forward svc/kubelet-csr-approver 8080:8080 -n kube-system
Output
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
Handling connection for 8080
Handling connection for 8080
Then, access the service at http://localhost:8080.
DNS Resolution
If a pod can't resolve service names, check the DNS configuration:
$ kubectl exec -it <pod-name> -n <namespace> -- nslookup <service-name>
Ensure the CoreDNS pods are running and healthy.
Node-Level Troubleshooting
Sometimes, issues stem from the nodes themselves.
Check Node Status
Check the node status using the following command -
$ kubectl get nodes
Output
NAME           STATUS   ROLES           AGE     VERSION
controlplane   Ready    control-plane   6m37s   v1.31.6
node01         Ready    <none>          6m23s   v1.31.6
Nodes should be in the Ready state.
Describe a Node
To see the details of a node, use the following command -
$ kubectl describe node <node-name>
Look for conditions like MemoryPressure or DiskPressure.
Configuration and Secrets
Misconfigured ConfigMaps or Secrets can cause application failures.
List ConfigMaps
List ConfigMaps using the following command -
$ kubectl get configmaps -n dev
Output
NAME           DATA   AGE
app-config     3      25m
env-settings   2      25m
Describe a ConfigMap
Use the following command to describe a ConfigMap -
$ kubectl describe configmap app-config -n dev
Output
Name:       app-config
Namespace:  dev

Data
====
PORT: 80
DB_HOST: db-service
DB_PORT: not-a-number
In this example, the DB_PORT value is incorrectly set to not-a-number, but it should be an integer (a valid port number). Here's how we can resolve it:
$ kubectl patch configmap app-config -n dev -p '{"data":{"DB_PORT":"5432"}}'
Output
configmap/app-config patched
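This sets the DB_PORT value to 5432 without opening an editor. Pods that read the value through environment variables pick up the change only after they are restarted.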
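A mistake like the non-numeric DB_PORT can be caught before it reaches the application. Here is a minimal sketch of such a check, assuming the ConfigMap's `data` field has been loaded into a dictionary (for example from `kubectl get configmap app-config -n dev -o json`). The function name `invalid_ports` and the choice of which keys count as ports are assumptions.

```python
def invalid_ports(data, port_keys):
    """Return {key: problem} for port keys that are missing or not in 1..65535."""
    problems = {}
    for key in port_keys:
        value = data.get(key)
        if value is None:
            problems[key] = "missing"
        elif not (value.isascii() and value.isdigit() and 1 <= int(value) <= 65535):
            problems[key] = f"not a valid port: {value!r}"
    return problems

# The data section from the example above (ConfigMap values are always strings).
config_data = {"PORT": "80", "DB_HOST": "db-service", "DB_PORT": "not-a-number"}

print(invalid_ports(config_data, ["PORT", "DB_PORT"]))
# {'DB_PORT': "not a valid port: 'not-a-number'"}
```

After the patch sets DB_PORT to "5432", the same call would return an empty dictionary.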
Events and Audit Trails
Events provide a timeline of significant occurrences in the cluster.
View Events
Use the following command to view events -
$ kubectl get events -A --sort-by='.metadata.creationTimestamp'
Output
kube-system   15m   Normal   Pulled    pod/metrics-server-54bf7cdd6-7khch   Successfully pulled image "registry.k8s.io/metrics-server/metrics-server:v0.7.2" in 2.458s (2.458s including waiting). Image size: 19494617 bytes.
kube-system   15m   Normal   Created   pod/metrics-server-54bf7cdd6-7khch   Created container: metrics-server
kube-system   15m   Normal   Started   pod/metrics-server-54bf7cdd6-7khch   Started container metrics-server
It lists all the events across namespaces, sorted by time. Events can reveal issues like failed mounts, scheduling problems, or image pull errors.
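Most events on a healthy cluster are of type Normal, so it helps to isolate the warnings. The sketch below assumes input shaped like `kubectl get events -A -o json`. The function name and the sample events are hypothetical.

```python
def warning_events(event_list):
    """Return (object, reason, message) for Warning events, oldest first."""
    warnings = [e for e in event_list.get("items", []) if e.get("type") == "Warning"]
    # RFC 3339 UTC timestamps sort correctly as plain strings.
    warnings.sort(key=lambda e: e.get("lastTimestamp") or "")
    return [
        (f'{e["involvedObject"]["kind"]}/{e["involvedObject"]["name"]}',
         e.get("reason"), e.get("message"))
        for e in warnings
    ]

# Hypothetical sample, trimmed to the fields the function reads.
sample_events = {"items": [
    {"type": "Warning", "reason": "BackOff", "lastTimestamp": "2025-04-29T11:20:00Z",
     "involvedObject": {"kind": "Pod", "name": "crashloop-pod"},
     "message": "Back-off restarting failed container"},
    {"type": "Normal", "reason": "Started", "lastTimestamp": "2025-04-29T11:15:00Z",
     "involvedObject": {"kind": "Pod", "name": "metrics-server"},
     "message": "Started container metrics-server"},
    {"type": "Warning", "reason": "FailedScheduling", "lastTimestamp": "2025-04-29T11:10:00Z",
     "involvedObject": {"kind": "Pod", "name": "unscheduled-pod"},
     "message": "0/2 nodes are available"},
]}

for row in warning_events(sample_events):
    print(row)
```

For a quick look without any scripting, kubectl can apply the same filter itself with `kubectl get events -A --field-selector type=Warning`.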
Advanced Debugging
For more complex issues, additional tools and commands can help.
Execute Commands in a Pod
To open a shell in a pod, use the following command -
$ kubectl exec -it <pod-name> -n <namespace> -- /bin/bash
Example: Exec into a Pod
$ kubectl exec -it my-app-5f4d8c6f6c-rm87x -n dev -- /bin/bash
It opens a shell inside the pod. Use this shell to inspect configuration files, test DNS (nslookup), or run service-level commands (curl, ping, etc.).
Run a Debug Pod
Deploy a temporary pod with debugging tools:
$ kubectl run debug-pod --rm -i --tty --image=busybox -- /bin/sh
This is useful for testing network connectivity or DNS resolution.
Debug Stateful Applications
For StatefulSets or persistent volumes, check the Persistent Volume Claims (PVCs):
$ kubectl get pvc -n dev
Output
NAME           STATUS    VOLUME   CAPACITY   ACCESS MODES
data-my-db-0   Pending
Describe the PVC
Use the following command to describe the PVC -
$ kubectl describe pvc data-my-db-0 -n dev
Output
Events:
  Type     Reason              Message
  ----     ------              -------
  Warning  ProvisioningFailed  no persistent volumes available for this claim
For this issue, check whether a compatible StorageClass and PV exist. Create one if it is missing, or update the PVC to use an available class.
Inspect DaemonSets and Jobs
DaemonSets and Jobs can cause cluster-level issues if improperly configured.
Check DaemonSet Status
Use the following command to check DaemonSet status -
$ kubectl get ds -n kube-system
Output
NAME         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-proxy   2         2         2       2            2           kubernetes.io/os=linux   39m
Check Job Status
To check the status of a Job, use the following command -
$ kubectl get jobs -n dev
Output
NAME         COMPLETIONS   DURATION   AGE
my-job       1/1           2s         10m
failed-job   0/1           1m         15m
The COMPLETIONS column shows how many of the required completions have succeeded. 1/1 means the Job completed successfully; 0/1 means it either has not completed yet or has failed.
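Because 0/1 alone does not say whether a Job is still running or has failed, the Job's conditions are the more reliable signal. The following sketch assumes Job objects shaped like the items of `kubectl get jobs -n dev -o json`. The function name, the "InProgress" label, and the sample objects are assumptions.

```python
def job_state(job):
    """Return (name, "succeeded/wanted", state) for one Job object."""
    spec = job.get("spec", {})
    status = job.get("status", {})
    wanted = spec.get("completions") or 1
    succeeded = status.get("succeeded", 0)
    state = "InProgress"  # no terminal condition reported yet
    for cond in status.get("conditions", []):
        if cond.get("status") == "True" and cond.get("type") in ("Complete", "Failed"):
            state = cond["type"]
    return job["metadata"]["name"], f"{succeeded}/{wanted}", state

# Hypothetical samples, trimmed to the fields the function reads.
sample_jobs = [
    {"metadata": {"name": "my-job"}, "spec": {"completions": 1},
     "status": {"succeeded": 1,
                "conditions": [{"type": "Complete", "status": "True"}]}},
    {"metadata": {"name": "failed-job"}, "spec": {"completions": 1},
     "status": {"failed": 7,
                "conditions": [{"type": "Failed", "status": "True",
                                "reason": "BackoffLimitExceeded"}]}},
    {"metadata": {"name": "running-job"}, "spec": {"completions": 1},
     "status": {"active": 1}},
]

for job in sample_jobs:
    print(job_state(job))
```

With this distinction, a Job that shows 0/1 and the Failed state is worth investigating immediately, while one that is still in progress may simply need more time.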
For failed Jobs, inspect pod logs or events:
$ kubectl logs <pod-name>
$ kubectl describe job <job-name>
For DaemonSets:
- Ensure all nodes that should be running the DaemonSet have pods.
- If pods aren't created, inspect taints or node selectors.
Check RBAC Permissions
When an application or user can't access resources, it might be due to Role-Based Access Control (RBAC).
Verify the permissions:
$ kubectl auth can-i create deployments --as=dev-user
Output
no
It means that the dev-user does not have the necessary permissions to create deployments in the cluster.
To fix this, update the Role-Based Access Control (RBAC) configuration to give the dev-user the necessary permissions to create deployments.
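One way to grant that permission is a namespaced Role plus a RoleBinding. The sketch below builds both manifests as JSON, which kubectl accepts in the same way as YAML. The role name `deployment-creator`, the `default` namespace, and the function name are assumptions for illustration, and a real cluster may call for a broader or narrower set of verbs.

```python
import json

RBAC_GROUP = "rbac.authorization.k8s.io"

def deployment_creator_manifests(user, namespace, role_name="deployment-creator"):
    """Build a Role allowing `create` on deployments and bind it to one user."""
    role = {
        "apiVersion": f"{RBAC_GROUP}/v1",
        "kind": "Role",
        "metadata": {"name": role_name, "namespace": namespace},
        # Deployments live in the "apps" API group.
        "rules": [{"apiGroups": ["apps"], "resources": ["deployments"],
                   "verbs": ["create"]}],
    }
    binding = {
        "apiVersion": f"{RBAC_GROUP}/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-{user}", "namespace": namespace},
        "subjects": [{"kind": "User", "name": user, "apiGroup": RBAC_GROUP}],
        "roleRef": {"kind": "Role", "name": role_name, "apiGroup": RBAC_GROUP},
    }
    # A v1 List lets both objects be applied from a single file.
    return {"apiVersion": "v1", "kind": "List", "items": [role, binding]}

manifests = deployment_creator_manifests("dev-user", "default")
print(json.dumps(manifests, indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f <file>`. Running the `kubectl auth can-i` check again afterwards should then confirm whether the permission took effect.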
Debugging a Pod with an Ephemeral Container
We can add an ephemeral debug container to a running pod to inspect its processes, network, or file system without restarting the main application.
$ kubectl debug -it pod/my-app -n dev --image=busybox --target=main-container
This command adds an ephemeral container that uses the busybox image to the existing my-app pod in the dev namespace and attaches an interactive shell to it. It does not create a new pod. The --target flag places the debug container in the process namespace of main-container (the primary container in the pod), provided the container runtime supports it.
kubectl prints the generated name of the debug container when it creates it. While that container is still running, we can open another shell in it with:
$ kubectl exec -it my-app -c <debug-container-name> -n dev -- /bin/sh
Because the ephemeral container runs alongside main-container, we can investigate interactively without modifying or restarting the main container.
Leverage External Tools
Third-party tools can enhance your troubleshooting capabilities:
- k9s: A terminal UI for managing Kubernetes clusters.
- Lens: A Kubernetes dashboard with real-time metrics.
- Prometheus/Grafana: For advanced monitoring and alerting.
Conclusion
Troubleshooting Kubernetes involves a combination of inspecting resources, analyzing logs, and testing connectivity. By mastering these command-line techniques, we can efficiently diagnose and resolve issues in our clusters.
Remember, the key steps include:
- Inspecting pods, deployments, and services.
- Analyzing logs for error messages.
- Testing network connectivity between components.
- Monitoring resource usage to identify bottlenecks.
With these tools and strategies, we're well-equipped to maintain healthy and resilient Kubernetes environments.