Solved: kernel:NMI watchdog: BUG: soft lockup - CPU#20 stuck for 22s!
In this article we look at a case where a server was running slowly and kept logging error messages like the ones below.
I. Error Details
Message from syslogd@ngelinux001 at May 6 17:15:16 ... kernel:NMI watchdog: BUG: soft lockup - CPU#20 stuck for 22s! [fvDCCUpload:161205]
Message from syslogd@ngelinux001 at May 6 17:15:44 ... kernel:NMI watchdog: BUG: soft lockup - CPU#20 stuck for 23s! [fvDCCUpload:161205]
Message from syslogd@ngelinux001 at May 6 17:16:16 ... kernel:NMI watchdog: BUG: soft lockup - CPU#20 stuck for 22s! [fvDCCUpload:161205]
May 6 17:16:16 ngelinux001 kernel: crc32c_intel mlxfw(OE) devlink libahci drm ixgbe vfio_mdev(OE) igb vfio_iommu_type1 libata vfio mdev(OE) mlx_compat(OE) i2c_algo_bit megaraid_sas mdio ptp i2c_core pps_core dca dm_mirror dm_region_hash dm_log dm_mod
May 6 17:16:16 ngelinux001 kernel: CPU: 20 PID: 161205 Comm: fvDCCUpload Tainted: G OEL ------------ 3.10.0-693.el7.x86_64 #1
May 6 17:16:16 ngelinux001 kernel: Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.4.3 01/17/2017
May 6 17:16:16 ngelinux001 kernel: task: ffff887e7862dee0 ti: ffff8819b6ac0000 task.ti: ffff8819b6ac0000
May 6 17:16:16 ngelinux001 kernel: RIP: 0010:[] [] _raw_spin_unlock_irqrestore+0x15/0x20
May 6 17:16:16 ngelinux001 kernel: RSP: 0018:ffff8819b6ac3778 EFLAGS: 00000282
May 6 17:16:16 ngelinux001 kernel: RAX: 0000000000000282 RBX: 0000000100010401 RCX: 0000000000001000
May 6 17:16:16 ngelinux001 kernel: RDX: 0000000000000800 RSI: 0000000000000282 RDI: 0000000000000282
May 6 17:16:16 ngelinux001 kernel: RBP: ffff8819b6ac3778 R08: ffff883531ea1b60 R09: ffff88712f95e880
May 6 17:16:16 ngelinux001 kernel: R10: 000000be5b721000 R11: ffffea00d4c7a840 R12: ffff883531ea1b60
May 6 17:16:16 ngelinux001 kernel: R13: ffff8819b6ac3780 R14: ffff8819b6ac3710 R15: ffffffff811de581
May 6 17:16:16 ngelinux001 kernel: FS: 00002b0561c0d700(0000) GS:ffff887e7e880000(0000) knlGS:0000000000000000
May 6 17:16:16 ngelinux001 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 6 17:16:16 ngelinux001 kernel: CR2: 00000000006757e8 CR3: 000000fe7c69b000 CR4: 00000000003407e0
May 6 17:16:16 ngelinux001 kernel: DR0: 00002b3a1aff5ea8 DR1: 00002b3a734ce2cc DR2: 00002aaaad4b8870
May 6 17:16:16 ngelinux001 kernel: DR3: 00002aaaad4b8878 DR6: 00000000fffe0ff0 DR7: 0000000000000600
May 6 17:16:16 ngelinux001 kernel: Stack:
May 6 17:16:16 ngelinux001 kernel: ffff8819b6ac37a8 ffffffff811c96f7 ffff883531ea1b40 ffff883531ea1b20
May 6 17:16:16 ngelinux001 kernel: ffff88fe76180060 ffff881a17b463c0 ffff8819b6ac37d8 ffffffffc022cb32
May 6 17:16:16 ngelinux001 kernel: ffff881a17b463c0 ffff88fe76186570 ffff88fe76180060 00000000de903368
May 6 17:16:16 ngelinux001 kernel: Call Trace:
May 6 17:16:16 ngelinux001 kernel: [] dma_pool_free+0xa7/0xd0
May 6 17:16:16 ngelinux001 kernel: [] mlx5_free_cmd_msg+0x42/0x60 [mlx5_core]
May 6 17:16:16 ngelinux001 kernel: [] free_msg+0x55/0x60 [mlx5_core]
May 6 17:16:16 ngelinux001 kernel: [] cmd_exec+0x41e/0x9a0 [mlx5_core]
May 6 17:16:16 ngelinux001 kernel: [] ? __get_free_pages+0xe/0x40
May 6 17:16:16 ngelinux001 kernel: [] mlx5_cmd_exec+0x33/0x50 [mlx5_core]
May 6 17:16:16 ngelinux001 kernel: [] mlx5_core_create_mkey_cb+0x109/0x230 [mlx5_core]
May 6 17:16:16 ngelinux001 kernel: [] mlx5_core_create_mkey+0x36/0x40 [mlx5_core]
May 6 17:16:16 ngelinux001 kernel: [] reg_create+0x28b/0x4d0 [mlx5_ib]
May 6 17:16:16 ngelinux001 kernel: [] mlx5_ib_reg_user_mr+0x31d/0x9b0 [mlx5_ib]
May 6 17:16:16 ngelinux001 kernel: [] ib_uverbs_reg_mr+0x1de/0x300 [ib_uverbs]
May 6 17:16:16 ngelinux001 kernel: [] ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xe5/0x120 [ib_uverbs]
May 6 17:16:16 ngelinux001 kernel: [] ib_uverbs_cmd_verbs.isra.6+0xac3/0xca0 [ib_uverbs]
May 6 17:16:16 ngelinux001 kernel: [] ? ib_uverbs_handler_UVERBS_METHOD_GET_CONTEXT+0xd0/0xd0 [ib_uverbs]
May 6 17:16:16 ngelinux001 kernel: [] ? physflat_send_IPI_mask+0xe/0x10
May 6 17:16:16 ngelinux001 kernel: [] ? tlb_flush_mmu.part.63+0x6c/0xc0
May 6 17:16:16 ngelinux001 kernel: [] ? tlb_finish_mmu+0x55/0x60
May 6 17:16:16 ngelinux001 kernel: [] ? unmap_region+0xf4/0x140
May 6 17:16:16 ngelinux001 kernel: [] ib_uverbs_ioctl+0xc1/0x1a0 [ib_uverbs]
May 6 17:16:16 ngelinux001 kernel: [] ? blk_finish_plug+0x14/0x40
May 6 17:16:16 ngelinux001 kernel: [] do_vfs_ioctl+0x33d/0x540
May 6 17:16:16 ngelinux001 kernel: [] SyS_ioctl+0xa1/0xc0
May 6 17:16:16 ngelinux001 kernel: [] system_call_fastpath+0x16/0x1b
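To see at a glance which CPU, process, and PID the watchdog keeps reporting, the soft lockup lines can be pulled out of the log with a short script. The following is a minimal Python sketch and was not part of the original troubleshooting; the names parse_soft_lockups and summarize, and the regular expression itself, are illustrative assumptions based on the message format shown in the log above.

```python
import re
from collections import Counter

# Pattern for the watchdog line as it appears in the log above, e.g.
# "BUG: soft lockup - CPU#20 stuck for 22s! [fvDCCUpload:161205]"
SOFT_LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)

# The three watchdog messages quoted in this article.
SAMPLE = """\
Message from syslogd@ngelinux001 at May 6 17:15:16 ... kernel:NMI watchdog: BUG: soft lockup - CPU#20 stuck for 22s! [fvDCCUpload:161205]
Message from syslogd@ngelinux001 at May 6 17:15:44 ... kernel:NMI watchdog: BUG: soft lockup - CPU#20 stuck for 23s! [fvDCCUpload:161205]
Message from syslogd@ngelinux001 at May 6 17:16:16 ... kernel:NMI watchdog: BUG: soft lockup - CPU#20 stuck for 22s! [fvDCCUpload:161205]
"""


def parse_soft_lockups(log_text):
    """Return one dict per soft lockup report found in log_text."""
    events = []
    for match in SOFT_LOCKUP_RE.finditer(log_text):
        events.append({
            "cpu": int(match.group("cpu")),
            "seconds": int(match.group("secs")),
            "comm": match.group("comm"),
            "pid": int(match.group("pid")),
        })
    return events


def summarize(events):
    """Count reports per (cpu, comm, pid) to spot a repeat offender."""
    return Counter((e["cpu"], e["comm"], e["pid"]) for e in events)


if __name__ == "__main__":
    for key, count in summarize(parse_soft_lockups(SAMPLE)).items():
        print(key, count)
```

On a real server the same functions could be fed the contents of the system log instead of SAMPLE. A single (cpu, comm, pid) key with a high count, as here, suggests one task stuck in one code path rather than a system-wide stall.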
II. Root Cause of the Issue
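If you analyze the trace, it is clear that the lockup occurs inside the Mellanox mlx5_core driver, which appears to conflict with the current kernel version. The module list marks the Mellanox modules as (OE), and the driver is installed under extra/mlnx-ofa_kernel, so it is an out-of-tree driver rather than the one shipped with the kernel.
The error is raised while mlx5_core is managing memory for a driver command issued during a memory region registration (ib_uverbs_reg_mr, mlx5_ib_reg_user_mr) for the fvDCCUpload process: the trace runs through mlx5_cmd_exec, cmd_exec, and free_msg and ends in dma_pool_free, with __get_free_pages also appearing as an unreliable ("?") frame.

Driver Version
filename: /lib/modules/3.10.0-693.el7.x86_64/extra/mlnx-ofa_kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko
version: 5.0-2.1.8
license: Dual BSD/GPL
description: Mellanox 5th generation network adapters (ConnectX series) core driver
author: Eli Cohen <eli@mellanox.com>
rhelversion: 7.4
srcversion: B6BBE780C74E0654351333D
alias: pci:v000015B3d0000A2D6sv

Kernel Version
Linux ngelinux001 3.10.0-693.el7.x86_64 #1 SMP Thu Jul 6 19:56:57 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux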
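When the same check has to be repeated on several servers, the driver details can be read programmatically instead of by eye. The sketch below assumes the text comes from running modinfo mlx5_core and simply parses its "key: value" lines; the function names parse_modinfo, is_out_of_tree, and installed_for_kernel are hypothetical helpers, and treating an "extra" directory as a sign of an out-of-tree module is a heuristic, not an official rule.

```python
# Driver details exactly as quoted in this article (the alias is truncated
# in the source and is kept as-is).
SAMPLE_MODINFO = """\
filename: /lib/modules/3.10.0-693.el7.x86_64/extra/mlnx-ofa_kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko
version: 5.0-2.1.8
license: Dual BSD/GPL
description: Mellanox 5th generation network adapters (ConnectX series) core driver
author: Eli Cohen <eli@mellanox.com>
rhelversion: 7.4
srcversion: B6BBE780C74E0654351333D
alias: pci:v000015B3d0000A2D6sv
"""


def parse_modinfo(text):
    """Parse 'key: value' lines into a dict.

    Keys that repeat (modinfo usually prints many alias lines) are
    collected into a list.
    """
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a key/value separator
        key, value = key.strip(), value.strip()
        if key not in info:
            info[key] = value
        elif isinstance(info[key], list):
            info[key].append(value)
        else:
            info[key] = [info[key], value]
    return info


def is_out_of_tree(info):
    """Heuristic (assumption): a module under .../extra/ was installed
    separately from the distribution kernel package."""
    return "/extra/" in info.get("filename", "")


def installed_for_kernel(info, kernel_release):
    """True if the module file lives in the tree of the given kernel release."""
    return f"/lib/modules/{kernel_release}/" in info.get("filename", "")


if __name__ == "__main__":
    info = parse_modinfo(SAMPLE_MODINFO)
    print("driver version:", info["version"])
    print("out of tree:", is_out_of_tree(info))
    print("installed for 3.10.0-693.el7.x86_64:",
          installed_for_kernel(info, "3.10.0-693.el7.x86_64"))
```

On a live system the kernel release string could be taken from os.uname().release rather than typed in, so the check follows whichever kernel is currently booted.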
III. Solution
To resolve this, upgrade both the driver and the kernel to the latest available versions. This requires server downtime.
Advise the team that owns the server to schedule the downtime and carry out both upgrades to resolve this bug.
If the soft lockup still occurs after the upgrade, share the crash dump with Red Hat and Mellanox so they can determine why this driver is stalling in the memory operation.
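A crash dump is only written when the kernel actually panics, and a soft lockup typically just logs a warning. On kernels that expose the kernel.softlockup_panic sysctl, a non-zero value makes a soft lockup trigger a panic, which lets a configured kdump service capture a vmcore. The Python sketch below only reads the current setting; the function name and its path parameter are illustrative assumptions, and whether to enable panic-on-lockup on a production server is a decision for the team that owns it.

```python
from pathlib import Path

# Assumption: the running kernel exposes this sysctl under /proc. The file
# can be absent, for example on kernels built without the lockup detector.
SOFTLOCKUP_PANIC = "/proc/sys/kernel/softlockup_panic"


def softlockup_panic_enabled(path=SOFTLOCKUP_PANIC):
    """Return True or False for the sysctl, or None if it cannot be read."""
    try:
        raw = Path(path).read_text().strip()
        return int(raw) != 0
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    state = softlockup_panic_enabled()
    if state is None:
        print("softlockup_panic is not available on this system")
    elif state:
        print("a soft lockup will panic the kernel (a vmcore can be captured)")
    else:
        print("a soft lockup only logs a warning; no crash dump is produced")
```

The path is a parameter so the function can be pointed at a test file or at a sysctl tree copied from another host.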