The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
- Do you have i2c_core and ipmi_msghandler loaded on the nodes?
- Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)? ==> Yes, it is deployed via the GPU operator.
1. Issue or feature description
On CentOS 7.9, the GPU operator installed successfully and all pods became ready, but after rebooting the node the pods went into a CrashLoopBackOff state.
NAMESPACE                NAME                                       READY   STATUS                  RESTARTS   AGE
gpu-operator-resources   gpu-feature-discovery-qcdp4                0/1     Init:CrashLoopBackOff   10         44m
gpu-operator-resources   nvidia-container-toolkit-daemonset-rkg4b   0/1     Init:CrashLoopBackOff   10         44m
gpu-operator-resources   nvidia-cuda-validator-ssgbh                0/1     Completed               0          42m
gpu-operator-resources   nvidia-dcgm-exporter-kj45b                 0/1     Init:CrashLoopBackOff   10         44m
gpu-operator-resources   nvidia-device-plugin-daemonset-zdc4w       0/1     Init:CrashLoopBackOff   11         44m
gpu-operator-resources   nvidia-device-plugin-validator-qbhtk       0/1     Completed               0          42m
gpu-operator-resources   nvidia-driver-daemonset-svsmn              0/1     CrashLoopBackOff        10         44m
gpu-operator-resources   nvidia-operator-validator-j9m2z            0/1     Init:CrashLoopBackOff   11         44m
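The listing above and the per-pod details were gathered with the standard status commands from the checklist; for the crashing driver pod that looks roughly like this (the pod name is taken from the listing above):

# Pod status across all namespaces (source of the listing above)
kubectl get pods --all-namespaces

# Daemonset status
kubectl get ds --all-namespaces

# Events and details for the crashing driver pod
kubectl describe pod -n gpu-operator-resources nvidia-driver-daemonset-svsmn

# Container logs, including init containers
kubectl logs -n gpu-operator-resources nvidia-driver-daemonset-svsmn --all-containers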
2. Steps to reproduce the issue
- Install Kubernetes
- Install the GPU operator and make sure all pods are running. Test a sample GPU pod to check that everything works (a sketch of this test follows the list).
- Reboot the node
- Pods go into CrashLoopBackOff
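For step 2, the install and the sanity test were along these lines; this is a sketch only, and the Helm release name, chart version, and sample image tag are assumptions rather than the exact commands used:

# Add NVIDIA's Helm repo and install the GPU operator with default values
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait --generate-name nvidia/gpu-operator

# Sample GPU pod used to confirm the stack works before the reboot
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvidia/samples:vectoradd-cuda11.2.1
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
kubectl logs -f cuda-vectoradd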
3. Information to attach (optional if deemed irrelevant)
kubectl describe pod -n gpu-operator-resources nvidia-driver-daemonset-svsmn reports:

Warning  BackOff  3m16s (x143 over 34m)  kubelet  Back-off restarting failed container
Reason: ContainerCannotRun
**Message: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown**
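The error above points at the nvidia prestart hook not finding a working driver after the reboot. The node-side directories and kubelet logs referenced in the template can be captured with:

# NVIDIA shared directory, operator-installed toolkit, and containerized driver root
ls -la /run/nvidia
ls -la /usr/local/nvidia/toolkit
ls -la /run/nvidia/driver

# kubelet logs covering the window around the reboot
journalctl -u kubelet > kubelet.logs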
Docker runtime configuration (docker info | grep -i runtime):
Runtimes: nvidia runc
Default Runtime: nvidia
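The daemon.json itself was not attached; checking it is a single command, and on a node where the GPU operator manages the container toolkit it usually points the nvidia runtime at the operator-installed binary under /usr/local/nvidia/toolkit (the content sketched below is an assumption, not output from this node):

cat /etc/docker/daemon.json
# Typically along the lines of:
# {
#   "default-runtime": "nvidia",
#   "runtimes": {
#     "nvidia": {
#       "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime",
#       "runtimeArgs": []
#     }
#   }
# }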
Output of running a container on the GPU machine:

$ docker run -it alpine echo foo
foo
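The plain alpine run does not exercise the nvidia prestart hook (no NVIDIA_VISIBLE_DEVICES is set), so a CUDA image is a closer match to the failing pods; a run along these lines would go through the same hook (image tag is an assumption):

# Exercises the nvidia runtime hook; expected to fail with the same
# "nvidia-container-cli: initialization error" while the driver pod is down
docker run --rm nvidia/cuda:11.0-base nvidia-smi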