โถ๏ธ Kubernetes crash recovery commands I used 99% of the time:
1. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ด๐ฒ๐ ๐ฝ๐ผ๐ฑ๐ --๐ฎ๐น๐น-๐ป๐ฎ๐บ๐ฒ๐๐ฝ๐ฎ๐ฐ๐ฒ๐: Check the status of all pods across namespaces to identify failures.
2. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฑ๐ฒ๐๐ฐ๐ฟ๐ถ๐ฏ๐ฒ ๐ฝ๐ผ๐ฑ ๐ฝ๐ผ๐ฑ_๐ป๐ฎ๐บ๐ฒ: Gather detailed information about a failed pod.
3. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐น๐ผ๐ด๐ ๐ฝ๐ผ๐ฑ_๐ป๐ฎ๐บ๐ฒ -๐ฐ ๐ฐ๐ผ๐ป๐๐ฎ๐ถ๐ป๐ฒ๐ฟ_๐ป๐ฎ๐บ๐ฒ: View logs of a specific container inside a pod to troubleshoot issues.
4. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ด๐ฒ๐ ๐ฒ๐๐ฒ๐ป๐๐ --๐ฎ๐น๐น-๐ป๐ฎ๐บ๐ฒ๐๐ฝ๐ฎ๐ฐ๐ฒ๐ --๐๐ผ๐ฟ๐-๐ฏ๐='.๐บ๐ฒ๐๐ฎ๐ฑ๐ฎ๐๐ฎ.๐ฐ๐ฟ๐ฒ๐ฎ๐๐ถ๐ผ๐ป๐ง๐ถ๐บ๐ฒ๐๐๐ฎ๐บ๐ฝ': Review recent events for clues on crashes and errors.
5. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ด๐ฒ๐ ๐ป๐ผ๐ฑ๐ฒ๐: Verify the status of nodes in the cluster, checking for node failures.
โ6. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฑ๐ฟ๐ฎ๐ถ๐ป ๐ป๐ผ๐ฑ๐ฒ_๐ป๐ฎ๐บ๐ฒ --๐ถ๐ด๐ป๐ผ๐ฟ๐ฒ-๐ฑ๐ฎ๐ฒ๐บ๐ผ๐ป๐๐ฒ๐๐: Safely evacuate and cordon a node for recovery operations.
7. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฐ๐ผ๐ฟ๐ฑ๐ผ๐ป ๐ป๐ผ๐ฑ๐ฒ_๐ป๐ฎ๐บ๐ฒ: Mark a node as unschedulable to prevent new pods from being scheduled during recovery.
8. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฑ๐ฒ๐น๐ฒ๐๐ฒ ๐ฝ๐ผ๐ฑ ๐ฝ๐ผ๐ฑ_๐ป๐ฎ๐บ๐ฒ --๐ด๐ฟ๐ฎ๐ฐ๐ฒ-๐ฝ๐ฒ๐ฟ๐ถ๐ผ๐ฑ=0 --๐ณ๐ผ๐ฟ๐ฐ๐ฒ: Forcefully delete a crashed pod to restart it or clear it for recovery.
9. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฟ๐ผ๐น๐น๐ผ๐๐ ๐๐ป๐ฑ๐ผ ๐ฑ๐ฒ๐ฝ๐น๐ผ๐๐บ๐ฒ๐ป๐ ๐ฑ๐ฒ๐ฝ๐น๐ผ๐๐บ๐ฒ๐ป๐_๐ป๐ฎ๐บ๐ฒ: Roll back a deployment in case a new rollout causes crashes.
10. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฒ๐ ๐ฒ๐ฐ -๐ถ๐ ๐ฝ๐ผ๐ฑ_๐ป๐ฎ๐บ๐ฒ -- /๐ฏ๐ถ๐ป/๐๐ต: Access a container to debug and resolve application issues directly inside the pod.
11. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ด๐ฒ๐ ๐ฐ๐ผ๐บ๐ฝ๐ผ๐ป๐ฒ๐ป๐๐๐๐ฎ๐๐๐๐ฒ๐: Check the health of core cluster components like etcd, kube-apiserver, and more.
โ12. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐๐ผ๐ฝ ๐ป๐ผ๐ฑ๐ฒ๐: Monitor node resource usage to detect resource exhaustion causing crashes.
13. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐๐ผ๐ฝ ๐ฝ๐ผ๐ฑ๐ --๐ฎ๐น๐น-๐ป๐ฎ๐บ๐ฒ๐๐ฝ๐ฎ๐ฐ๐ฒ๐: Check pod resource usage across namespaces, identifying bottlenecks leading to crashes.
14. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฑ๐ฒ๐น๐ฒ๐๐ฒ ๐ป๐ผ๐ฑ๐ฒ ๐ป๐ผ๐ฑ๐ฒ_๐ป๐ฎ๐บ๐ฒ: Remove a failed node from the cluster to allow recovery operations.
15. ๐ฒ๐๐ฐ๐ฑ๐ฐ๐๐น --๐ฒ๐ป๐ฑ๐ฝ๐ผ๐ถ๐ป๐๐=๐ต๐๐๐ฝ๐://๐ฒ๐๐ฐ๐ฑ-๐๐ฒ๐ฟ๐๐ฒ๐ฟ:2379 ๐๐ป๐ฎ๐ฝ๐๐ต๐ผ๐ ๐ฟ๐ฒ๐๐๐ผ๐ฟ๐ฒ ๐ฏ๐ฎ๐ฐ๐ธ๐๐ฝ.๐ฑ๐ฏ: Restore etcd from a snapshot in case of etcd failure..
โ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฎ๐ฝ๐ฝ๐น๐ -๐ณ ๐ฏ๐ฎ๐ฐ๐ธ๐๐ฝ.๐๐ฎ๐บ๐น: Reapply configurations from a backup manifest during recovery.
17. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐๐ฎ๐ถ๐ป๐ ๐ป๐ผ๐ฑ๐ฒ๐ ๐ป๐ผ๐ฑ๐ฒ_๐ป๐ฎ๐บ๐ฒ ๐ธ๐ฒ๐=๐๐ฎ๐น๐๐ฒ:๐ก๐ผ๐ฆ๐ฐ๐ต๐ฒ๐ฑ๐๐น๐ฒ: Prevent scheduling on a node experiencing issues during recovery.
18. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ด๐ฒ๐ ๐ฒ๐ป๐ฑ๐ฝ๐ผ๐ถ๐ป๐๐ ๐๐ฒ๐ฟ๐๐ถ๐ฐ๐ฒ_๐ป๐ฎ๐บ๐ฒ: Verify service endpoints during recovery to ensure services are resolving correctly.
๐ฑ ๐๐ผ๐น๐น๐ผ๐ @prodevopsguy ๐๐จ๐ซ ๐ฆ๐จ๐ซ๐ ๐ฌ๐ฎ๐๐ก ๐๐จ๐ง๐ญ๐๐ง๐ญ ๐๐ซ๐จ๐ฎ๐ง๐ ๐๐ฅ๐จ๐ฎ๐ & ๐๐๐ฏ๐๐ฉ๐ฌ!!! // ๐๐จ๐ข๐ง ๐๐จ๐ซ ๐๐๐ฏ๐๐ฉ๐ฌ ๐๐๐๐ฌ: @devopsdocs
1. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ด๐ฒ๐ ๐ฝ๐ผ๐ฑ๐ --๐ฎ๐น๐น-๐ป๐ฎ๐บ๐ฒ๐๐ฝ๐ฎ๐ฐ๐ฒ๐: Check the status of all pods across namespaces to identify failures.
2. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฑ๐ฒ๐๐ฐ๐ฟ๐ถ๐ฏ๐ฒ ๐ฝ๐ผ๐ฑ ๐ฝ๐ผ๐ฑ_๐ป๐ฎ๐บ๐ฒ: Gather detailed information about a failed pod.
3. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐น๐ผ๐ด๐ ๐ฝ๐ผ๐ฑ_๐ป๐ฎ๐บ๐ฒ -๐ฐ ๐ฐ๐ผ๐ป๐๐ฎ๐ถ๐ป๐ฒ๐ฟ_๐ป๐ฎ๐บ๐ฒ: View logs of a specific container inside a pod to troubleshoot issues.
4. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ด๐ฒ๐ ๐ฒ๐๐ฒ๐ป๐๐ --๐ฎ๐น๐น-๐ป๐ฎ๐บ๐ฒ๐๐ฝ๐ฎ๐ฐ๐ฒ๐ --๐๐ผ๐ฟ๐-๐ฏ๐='.๐บ๐ฒ๐๐ฎ๐ฑ๐ฎ๐๐ฎ.๐ฐ๐ฟ๐ฒ๐ฎ๐๐ถ๐ผ๐ป๐ง๐ถ๐บ๐ฒ๐๐๐ฎ๐บ๐ฝ': Review recent events for clues on crashes and errors.
5. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ด๐ฒ๐ ๐ป๐ผ๐ฑ๐ฒ๐: Verify the status of nodes in the cluster, checking for node failures.
โ6. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฑ๐ฟ๐ฎ๐ถ๐ป ๐ป๐ผ๐ฑ๐ฒ_๐ป๐ฎ๐บ๐ฒ --๐ถ๐ด๐ป๐ผ๐ฟ๐ฒ-๐ฑ๐ฎ๐ฒ๐บ๐ผ๐ป๐๐ฒ๐๐: Safely evacuate and cordon a node for recovery operations.
7. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฐ๐ผ๐ฟ๐ฑ๐ผ๐ป ๐ป๐ผ๐ฑ๐ฒ_๐ป๐ฎ๐บ๐ฒ: Mark a node as unschedulable to prevent new pods from being scheduled during recovery.
8. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฑ๐ฒ๐น๐ฒ๐๐ฒ ๐ฝ๐ผ๐ฑ ๐ฝ๐ผ๐ฑ_๐ป๐ฎ๐บ๐ฒ --๐ด๐ฟ๐ฎ๐ฐ๐ฒ-๐ฝ๐ฒ๐ฟ๐ถ๐ผ๐ฑ=0 --๐ณ๐ผ๐ฟ๐ฐ๐ฒ: Forcefully delete a crashed pod to restart it or clear it for recovery.
9. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฟ๐ผ๐น๐น๐ผ๐๐ ๐๐ป๐ฑ๐ผ ๐ฑ๐ฒ๐ฝ๐น๐ผ๐๐บ๐ฒ๐ป๐ ๐ฑ๐ฒ๐ฝ๐น๐ผ๐๐บ๐ฒ๐ป๐_๐ป๐ฎ๐บ๐ฒ: Roll back a deployment in case a new rollout causes crashes.
10. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฒ๐ ๐ฒ๐ฐ -๐ถ๐ ๐ฝ๐ผ๐ฑ_๐ป๐ฎ๐บ๐ฒ -- /๐ฏ๐ถ๐ป/๐๐ต: Access a container to debug and resolve application issues directly inside the pod.
11. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ด๐ฒ๐ ๐ฐ๐ผ๐บ๐ฝ๐ผ๐ป๐ฒ๐ป๐๐๐๐ฎ๐๐๐๐ฒ๐: Check the health of core cluster components like etcd, kube-apiserver, and more.
โ12. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐๐ผ๐ฝ ๐ป๐ผ๐ฑ๐ฒ๐: Monitor node resource usage to detect resource exhaustion causing crashes.
13. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐๐ผ๐ฝ ๐ฝ๐ผ๐ฑ๐ --๐ฎ๐น๐น-๐ป๐ฎ๐บ๐ฒ๐๐ฝ๐ฎ๐ฐ๐ฒ๐: Check pod resource usage across namespaces, identifying bottlenecks leading to crashes.
14. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฑ๐ฒ๐น๐ฒ๐๐ฒ ๐ป๐ผ๐ฑ๐ฒ ๐ป๐ผ๐ฑ๐ฒ_๐ป๐ฎ๐บ๐ฒ: Remove a failed node from the cluster to allow recovery operations.
15. ๐ฒ๐๐ฐ๐ฑ๐ฐ๐๐น --๐ฒ๐ป๐ฑ๐ฝ๐ผ๐ถ๐ป๐๐=๐ต๐๐๐ฝ๐://๐ฒ๐๐ฐ๐ฑ-๐๐ฒ๐ฟ๐๐ฒ๐ฟ:2379 ๐๐ป๐ฎ๐ฝ๐๐ต๐ผ๐ ๐ฟ๐ฒ๐๐๐ผ๐ฟ๐ฒ ๐ฏ๐ฎ๐ฐ๐ธ๐๐ฝ.๐ฑ๐ฏ: Restore etcd from a snapshot in case of etcd failure..
โ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ฎ๐ฝ๐ฝ๐น๐ -๐ณ ๐ฏ๐ฎ๐ฐ๐ธ๐๐ฝ.๐๐ฎ๐บ๐น: Reapply configurations from a backup manifest during recovery.
17. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐๐ฎ๐ถ๐ป๐ ๐ป๐ผ๐ฑ๐ฒ๐ ๐ป๐ผ๐ฑ๐ฒ_๐ป๐ฎ๐บ๐ฒ ๐ธ๐ฒ๐=๐๐ฎ๐น๐๐ฒ:๐ก๐ผ๐ฆ๐ฐ๐ต๐ฒ๐ฑ๐๐น๐ฒ: Prevent scheduling on a node experiencing issues during recovery.
18. ๐ธ๐๐ฏ๐ฒ๐ฐ๐๐น ๐ด๐ฒ๐ ๐ฒ๐ป๐ฑ๐ฝ๐ผ๐ถ๐ป๐๐ ๐๐ฒ๐ฟ๐๐ถ๐ฐ๐ฒ_๐ป๐ฎ๐บ๐ฒ: Verify service endpoints during recovery to ensure services are resolving correctly.
๐ฑ ๐๐ผ๐น๐น๐ผ๐ @prodevopsguy ๐๐จ๐ซ ๐ฆ๐จ๐ซ๐ ๐ฌ๐ฎ๐๐ก ๐๐จ๐ง๐ญ๐๐ง๐ญ ๐๐ซ๐จ๐ฎ๐ง๐ ๐๐ฅ๐จ๐ฎ๐ & ๐๐๐ฏ๐๐ฉ๐ฌ!!! // ๐๐จ๐ข๐ง ๐๐จ๐ซ ๐๐๐ฏ๐๐ฉ๐ฌ ๐๐๐๐ฌ: @devopsdocs