In some rare cases, the CCM/CSI migration might fail. This document provides a quick checklist that you can follow to debug the potential issue.
If you don’t manage to solve the problem by following this guide, you can create a new issue in the KubeOne GitHub repository. Include details about the problem, such as which migration phase failed, and logs for the failing components.
Check the status of your nodes:
kubectl get nodes
All nodes in the cluster should be Ready. You should have 3 control plane nodes, while the number of worker nodes depends on your configuration. In case there’s a node that’s NotReady, describe the node to check its status and events:
kubectl describe node NODE_NAME
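If the cluster has many nodes, you can filter the output to show only nodes that aren’t ready; a simple sketch using grep (one of several ways to do this):
kubectl get nodes | grep NotReady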
Check the status of pods in the kube-system namespace. All pods should be Running and not restarting or crashlooping:
kubectl get pods -n kube-system
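If the namespace has many pods, sorting the output by restart count can help spot crashlooping pods more quickly; a sketch that assumes the first container’s restart count is representative:
kubectl get pods -n kube-system --sort-by='.status.containerStatuses[0].restartCount'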
If there’s a pod that’s not running properly, describe the pod to check its events and inspect its logs:
kubectl describe pod -n kube-system POD_NAME
kubectl logs -n kube-system POD_NAME
Note: you can get logs for the previous run of the pod by using the -p flag, for example: kubectl logs -p -n kube-system POD_NAME
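If the pod has more than one container, you may need to pick which container’s logs to fetch using the -c flag, for example:
kubectl logs -n kube-system POD_NAME -c CONTAINER_NAME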
a) In case there’s a control plane component that’s failing (such as kube-apiserver or kube-controller-manager), you’ll need to restart the container itself. In this case, you can’t use kubectl delete to restart the component because the control plane components are managed by static manifests.
The pods for those components are named following the <component-name>-<node-name> pattern. To restart the failing component, SSH to the node it’s running on, find its container, and remove it:
sudo crictl ps
sudo crictl stop CONTAINER_ID
sudo crictl rm CONTAINER_ID
Kubelet will automatically recreate the removed container. You can then use kubectl to confirm that the pod is Running again.
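If there are many containers on the node, you can narrow down the crictl output by filtering on the container name; for example, for kube-apiserver:
sudo crictl ps --name kube-apiserver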
b) In case some other component is failing, you can try restarting it by deleting the pod:
kubectl delete pod -n kube-system POD_NAME
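After deleting the pod, you can watch it get recreated and confirm that it reaches the Running state:
kubectl get pods -n kube-system -w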
If the previous steps didn’t reveal the issue, SSH to the node and inspect the Kubelet logs by running the following command:
sudo journalctl -fu kubelet.service
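If you prefer inspecting a specific time window instead of following the logs, you can filter by time, for example:
sudo journalctl -u kubelet.service --since "1 hour ago"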
You can try restarting kubelet by running the following command:
sudo systemctl restart kubelet
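After restarting Kubelet, check that the service is active and not crashing again:
sudo systemctl status kubelet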
If none of the previous steps help you resolve the issue, you can try restarting the affected instance. In some cases, restarting the instance can make the issue go away.
Before restarting the instance, cordon and drain the node so that the workload is moved to the other nodes:
kubectl cordon NODE_NAME
kubectl drain NODE_NAME
SSH to the instance and reboot it:
sudo reboot
Once the instance is back and the node rejoins the cluster, uncordon the node so that it can receive workload again:
kubectl uncordon NODE_NAME
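Once the node is uncordoned, you can wait for it to report Ready before moving on, for example:
kubectl wait --for=condition=Ready node/NODE_NAME --timeout=10m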