In this article, we will look at an interesting issue I faced on a production server that rebooted automatically during a SAN switch replacement.
One of our servers panicked and rebooted automatically when all the LUN paths to the server were disconnected.
We found messages in the logs reporting that a task was hung, and the same warning had been generated multiple times.
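To see how often such warnings were logged, a small helper like the one below can count them. This is only a sketch: the function name is hypothetical, the match string assumes the usual kernel wording ("blocked for more than N seconds"), and the log path in the usage comment is an assumption that varies by distribution.

```shell
#!/bin/sh
# Hypothetical helper: count hung-task warnings in a log file.
# Assumes the kernel's usual wording "blocked for more than N seconds".
count_hung_task_warnings() {
    # grep -c prints the number of matching lines (prints 0 when none match)
    grep -c 'blocked for more than [0-9]* seconds' "$1"
}

# Usage (log path is an assumption, adjust for your system):
#   count_hung_task_warnings /var/log/messages
```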
So we checked and found that hung task panic was enabled on the server.
### hung_task_panic = 1 in /etc/sysctl.conf, or /proc/sys/kernel/hung_task_panic contains the value 1
# sysctl -p | grep -i hung_task
kernel.hung_task_panic = 1
Because hung task panic was enabled, taking down all the paths led to the panic: I/O to the storage devices stalled, the tasks waiting on that I/O were reported as hung, and the kernel then panicked the system.
kernel.hung_task_panic should be disabled on a production server unless it is required for a special situation where a problem is being diagnosed.
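One possible way to disable it is sketched below. The function name is hypothetical; it only rewrites the sysctl-style file it is given, so the running kernel still has to be updated separately with sysctl as root, as shown in the usage comment.

```shell
#!/bin/sh
# Hypothetical helper: force kernel.hung_task_panic to 0 in a sysctl-style file.
disable_hung_task_panic() {
    conf="$1"
    if grep -Eq '^[[:space:]]*kernel\.hung_task_panic[[:space:]]*=' "$conf"; then
        # Setting already present: rewrite it in place (a .bak backup is kept)
        sed -i.bak -E 's/^[[:space:]]*kernel\.hung_task_panic[[:space:]]*=.*/kernel.hung_task_panic = 0/' "$conf"
    else
        # Setting absent: append it
        echo 'kernel.hung_task_panic = 0' >> "$conf"
    fi
}

# Usage, as root:
#   disable_hung_task_panic /etc/sysctl.conf   # persist across reboots
#   sysctl -w kernel.hung_task_panic=0         # apply to the running kernel
#   sysctl kernel.hung_task_panic              # verify the current value
```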
Another option is to set a limit on multipath device queueing so that I/O does not wait indefinitely and trigger a kernel panic.
Usually, when all LUNs become unavailable and no_path_retry is set to a high value like 300, the processes waiting on these LUNs stay blocked in uninterruptible sleep for a long time, which causes the panic.
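A rough way to check whether a given no_path_retry value is too high is to compare the resulting queueing time with the hung task timeout. The sketch below assumes that multipathd retries about once per polling_interval (commonly 5 seconds by default) and that kernel.hung_task_timeout_secs is at its usual default of 120. The function names and the example values 12 and 5 are illustrative, not recommendations from the original setup.

```shell
#!/bin/sh
# Approximate seconds that I/O stays queued after all paths fail.
# Assumption: one retry per polling interval.
queue_seconds() {
    retries="$1"
    interval="$2"
    echo $(( retries * interval ))
}

# Succeeds (exit 0) if queueing would outlast the hung task timeout.
# Third argument is the timeout in seconds; defaults to 120.
exceeds_hung_timeout() {
    [ "$(queue_seconds "$1" "$2")" -gt "${3:-120}" ]
}

# Print a multipath.conf defaults stanza with the chosen values.
print_defaults_stanza() {
    printf 'defaults {\n    polling_interval %s\n    no_path_retry %s\n}\n' "$2" "$1"
}

# Usage:
#   queue_seconds 300 5            # 1500 seconds, far beyond 120
#   exceeds_hung_timeout 300 5 && echo "queueing outlasts the hung task timeout"
#   print_defaults_stanza 12 5     # 12 * 5 = 60 seconds, below 120
```

After changing /etc/multipath.conf, the multipathd configuration has to be reloaded for the new values to take effect.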