Solved: kernel:NMI watchdog: BUG: soft lockup - CPU#20 stuck for 22s!

Today we will look at an interesting case in which a server was responding slowly and logging error messages like the ones below.

 

I. Error Details

Message from syslogd@ngelinux001 at May 6 17:15:16 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#20 stuck for 22s! [fvDCCUpload:161205]

Message from syslogd@ngelinux001 at May 6 17:15:44 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#20 stuck for 23s! [fvDCCUpload:161205]

Message from syslogd@ngelinux001 at May 6 17:16:16 ...
kernel:NMI watchdog: BUG: soft lockup - CPU#20 stuck for 22s! [fvDCCUpload:161205]

May 6 17:16:16 ngelinux001 kernel: crc32c_intel mlxfw(OE) devlink libahci drm ixgbe vfio_mdev(OE) igb vfio_iommu_type1 libata vfio mdev(OE) mlx_compat(OE) i2c_algo_bit megaraid_sas mdio ptp i2c_core pps_core dca dm_mirror dm_region_hash dm_log dm_mod
May 6 17:16:16 ngelinux001 kernel: CPU: 20 PID: 161205 Comm: fvDCCUpload Tainted: G OEL ------------ 3.10.0-693.el7.x86_64 #1
May 6 17:16:16 ngelinux001 kernel: Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.4.3 01/17/2017
May 6 17:16:16 ngelinux001 kernel: task: ffff887e7862dee0 ti: ffff8819b6ac0000 task.ti: ffff8819b6ac0000
May 6 17:16:16 ngelinux001 kernel: RIP: 0010:[] [] _raw_spin_unlock_irqrestore+0x15/0x20
May 6 17:16:16 ngelinux001 kernel: RSP: 0018:ffff8819b6ac3778 EFLAGS: 00000282
May 6 17:16:16 ngelinux001 kernel: RAX: 0000000000000282 RBX: 0000000100010401 RCX: 0000000000001000
May 6 17:16:16 ngelinux001 kernel: RDX: 0000000000000800 RSI: 0000000000000282 RDI: 0000000000000282
May 6 17:16:16 ngelinux001 kernel: RBP: ffff8819b6ac3778 R08: ffff883531ea1b60 R09: ffff88712f95e880
May 6 17:16:16 ngelinux001 kernel: R10: 000000be5b721000 R11: ffffea00d4c7a840 R12: ffff883531ea1b60
May 6 17:16:16 ngelinux001 kernel: R13: ffff8819b6ac3780 R14: ffff8819b6ac3710 R15: ffffffff811de581
May 6 17:16:16 ngelinux001 kernel: FS: 00002b0561c0d700(0000) GS:ffff887e7e880000(0000) knlGS:0000000000000000
May 6 17:16:16 ngelinux001 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 6 17:16:16 ngelinux001 kernel: CR2: 00000000006757e8 CR3: 000000fe7c69b000 CR4: 00000000003407e0
May 6 17:16:16 ngelinux001 kernel: DR0: 00002b3a1aff5ea8 DR1: 00002b3a734ce2cc DR2: 00002aaaad4b8870
May 6 17:16:16 ngelinux001 kernel: DR3: 00002aaaad4b8878 DR6: 00000000fffe0ff0 DR7: 0000000000000600
May 6 17:16:16 ngelinux001 kernel: Stack:
May 6 17:16:16 ngelinux001 kernel: ffff8819b6ac37a8 ffffffff811c96f7 ffff883531ea1b40 ffff883531ea1b20
May 6 17:16:16 ngelinux001 kernel: ffff88fe76180060 ffff881a17b463c0 ffff8819b6ac37d8 ffffffffc022cb32
May 6 17:16:16 ngelinux001 kernel: ffff881a17b463c0 ffff88fe76186570 ffff88fe76180060 00000000de903368
May 6 17:16:16 ngelinux001 kernel: Call Trace:
May 6 17:16:16 ngelinux001 kernel: [] dma_pool_free+0xa7/0xd0
May 6 17:16:16 ngelinux001 kernel: [] mlx5_free_cmd_msg+0x42/0x60 [mlx5_core]
May 6 17:16:16 ngelinux001 kernel: [] free_msg+0x55/0x60 [mlx5_core]
May 6 17:16:16 ngelinux001 kernel: [] cmd_exec+0x41e/0x9a0 [mlx5_core]
May 6 17:16:16 ngelinux001 kernel: [] ? __get_free_pages+0xe/0x40
May 6 17:16:16 ngelinux001 kernel: [] mlx5_cmd_exec+0x33/0x50 [mlx5_core]
May 6 17:16:16 ngelinux001 kernel: [] mlx5_core_create_mkey_cb+0x109/0x230 [mlx5_core]
May 6 17:16:16 ngelinux001 kernel: [] mlx5_core_create_mkey+0x36/0x40 [mlx5_core]
May 6 17:16:16 ngelinux001 kernel: [] reg_create+0x28b/0x4d0 [mlx5_ib]
May 6 17:16:16 ngelinux001 kernel: [] mlx5_ib_reg_user_mr+0x31d/0x9b0 [mlx5_ib]
May 6 17:16:16 ngelinux001 kernel: [] ib_uverbs_reg_mr+0x1de/0x300 [ib_uverbs]
May 6 17:16:16 ngelinux001 kernel: [] ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xe5/0x120 [ib_uverbs]
May 6 17:16:16 ngelinux001 kernel: [] ib_uverbs_cmd_verbs.isra.6+0xac3/0xca0 [ib_uverbs]
May 6 17:16:16 ngelinux001 kernel: [] ? ib_uverbs_handler_UVERBS_METHOD_GET_CONTEXT+0xd0/0xd0 [ib_uverbs]
May 6 17:16:16 ngelinux001 kernel: [] ? physflat_send_IPI_mask+0xe/0x10
May 6 17:16:16 ngelinux001 kernel: [] ? tlb_flush_mmu.part.63+0x6c/0xc0
May 6 17:16:16 ngelinux001 kernel: [] ? tlb_finish_mmu+0x55/0x60
May 6 17:16:16 ngelinux001 kernel: [] ? unmap_region+0xf4/0x140
May 6 17:16:16 ngelinux001 kernel: [] ib_uverbs_ioctl+0xc1/0x1a0 [ib_uverbs]
May 6 17:16:16 ngelinux001 kernel: [] ? blk_finish_plug+0x14/0x40
May 6 17:16:16 ngelinux001 kernel: [] do_vfs_ioctl+0x33d/0x540
May 6 17:16:16 ngelinux001 kernel: [] SyS_ioctl+0xa1/0xc0
May 6 17:16:16 ngelinux001 kernel: [] system_call_fastpath+0x16/0x1b
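
When many of these messages pile up, it helps to pull out the CPU, the stall duration, and the process name and PID from each one. The following is a minimal Python sketch of that step. The function name and regular expression are my own illustration, not part of any standard tool, and they assume the message format shown above.

```python
import re

# Matches the watchdog message, e.g.
# "BUG: soft lockup - CPU#20 stuck for 22s! [fvDCCUpload:161205]"
SOFT_LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)

def parse_soft_lockups(lines):
    """Return one dict per soft lockup report found in the given log lines."""
    events = []
    for line in lines:
        m = SOFT_LOCKUP_RE.search(line)
        if m:
            events.append({
                "cpu": int(m.group("cpu")),
                "seconds": int(m.group("secs")),
                "comm": m.group("comm"),
                "pid": int(m.group("pid")),
            })
    return events

# The three watchdog messages from the log above.
sample = [
    "kernel:NMI watchdog: BUG: soft lockup - CPU#20 stuck for 22s! [fvDCCUpload:161205]",
    "kernel:NMI watchdog: BUG: soft lockup - CPU#20 stuck for 23s! [fvDCCUpload:161205]",
    "kernel:NMI watchdog: BUG: soft lockup - CPU#20 stuck for 22s! [fvDCCUpload:161205]",
]
events = parse_soft_lockups(sample)
```

In this log every report names the same CPU and the same PID, which suggests one task is stuck repeatedly rather than a system-wide stall.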

 

II. Root Cause of the Issue

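If you analyze the call trace, it is clear that the driver below (mlx5_core) is conflicting with the running kernel version.

The lockup is reported while mlx5_core is creating a memory key for a user memory registration (mlx5_ib_reg_user_mr, mlx5_core_create_mkey, cmd_exec) and is releasing the command's message buffers through dma_pool_free. The __get_free_pages entry is prefixed with "?", which marks it as an unreliable stack entry rather than a confirmed caller.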
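
The modules involved can be read off the call trace by counting frames per module tag (the name in square brackets at the end of a frame) and setting aside the frames marked with "?", which the kernel prints for addresses it found on the stack but could not confirm as callers. Below is a minimal Python sketch of that reading. The helper name and regular expression are hypothetical, and it assumes the frame format shown in the trace above, including the empty address brackets.

```python
import re
from collections import Counter

# A frame looks like "[] ? func+0x1/0x2 [module]".
# The "?" marker and the "[module]" tag are both optional.
FRAME_RE = re.compile(
    r"\[[^\]]*\]\s+(?P<unreliable>\?\s+)?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)

def summarize_trace(lines):
    """Count reliable frames per module; frames without a module tag count as 'kernel'."""
    reliable = Counter()
    unreliable = []
    for line in lines:
        m = FRAME_RE.search(line)
        if not m:
            continue
        if m.group("unreliable"):
            unreliable.append(m.group("func"))
        else:
            reliable[m.group("module") or "kernel"] += 1
    return reliable, unreliable

# A subset of the frames from the call trace above.
frames = [
    "[] dma_pool_free+0xa7/0xd0",
    "[] mlx5_free_cmd_msg+0x42/0x60 [mlx5_core]",
    "[] free_msg+0x55/0x60 [mlx5_core]",
    "[] cmd_exec+0x41e/0x9a0 [mlx5_core]",
    "[] ? __get_free_pages+0xe/0x40",
    "[] mlx5_cmd_exec+0x33/0x50 [mlx5_core]",
    "[] reg_create+0x28b/0x4d0 [mlx5_ib]",
    "[] ib_uverbs_ioctl+0xc1/0x1a0 [ib_uverbs]",
    "[] SyS_ioctl+0xa1/0xc0",
]
reliable, unreliable = summarize_trace(frames)
```

For this trace the reliable frames are dominated by mlx5_core, with mlx5_ib and ib_uverbs above it, which is why the Mellanox driver stack is the first suspect.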

Driver Version
filename:       /lib/modules/3.10.0-693.el7.x86_64/extra/mlnx-ofa_kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko
version:        5.0-2.1.8
license:        Dual BSD/GPL
description:    Mellanox 5th generation network adapters (ConnectX series) core driver
author:         Eli Cohen <eli@mellanox.com>
rhelversion:    7.4
srcversion:     B6BBE780C74E0654351333D
alias:          pci:v000015B3d0000A2D6sv

Kernel Version
Linux ngelinux001 3.10.0-693.el7.x86_64 #1 SMP Thu Jul 6 19:56:57 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
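
The driver details above are in the "key: value" format that modinfo prints, and the module path sits under an extra/ directory, which is where externally built modules are normally installed. This matches the (OE) flags next to the Mellanox modules in the log. A minimal Python sketch for checking this from captured output follows. The function names are hypothetical, and the extra/ test is only a heuristic.

```python
import re

def parse_modinfo(text):
    """Parse 'key: value' lines into a dict; the first value for a key wins."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info.setdefault(key.strip(), value.strip())
    return info

def is_out_of_tree(info):
    """Heuristic: modules under .../extra/ usually come from a vendor package."""
    return "/extra/" in info.get("filename", "")

def built_for_kernel(info):
    """Return the kernel release embedded in the module path, or None."""
    m = re.search(r"/lib/modules/([^/]+)/", info.get("filename", ""))
    return m.group(1) if m else None

# Driver details copied from the section above.
sample = """\
filename:       /lib/modules/3.10.0-693.el7.x86_64/extra/mlnx-ofa_kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko
version:        5.0-2.1.8
rhelversion:    7.4
alias:          pci:v000015B3d0000A2D6sv
"""
info = parse_modinfo(sample)
running_kernel = "3.10.0-693.el7.x86_64"  # taken from the uname output above
same_kernel = built_for_kernel(info) == running_kernel
```

On a live system the same text could be captured with `modinfo mlx5_core` and compared against `uname -r`.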

 

III. Solution

To resolve this, upgrade both the driver and the kernel to the latest available versions. This requires server downtime.

Ask the responsible team to schedule the downtime and perform the upgrade to resolve this bug.

If the issue persists even after the upgrade, share the crash dump with Red Hat and Mellanox so they can determine why this driver is blocking the memory reservation.
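
Before collecting a crash dump, it is worth checking the watchdog settings on the server. The kernel reports a soft lockup after 2 * watchdog_thresh seconds, so the 22s and 23s values in the log are consistent with the usual default of 10. The softlockup_panic sysctl controls whether a soft lockup triggers a panic, which is what produces a dump when kdump is configured. The sketch below is a minimal Python example with hypothetical function names; it only reads the values and does not change them.

```python
from pathlib import Path

def read_watchdog_settings(sysctl_dir="/proc/sys/kernel"):
    """Read soft lockup related sysctls; unreadable entries are reported as None."""
    settings = {}
    for name in ("watchdog_thresh", "softlockup_panic"):
        path = Path(sysctl_dir) / name
        try:
            settings[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            settings[name] = None
    return settings

def soft_lockup_threshold(settings):
    """Seconds a CPU may be stuck before the watchdog reports a soft lockup."""
    thresh = settings.get("watchdog_thresh")
    return None if thresh is None else 2 * thresh

settings = read_watchdog_settings()
threshold = soft_lockup_threshold(settings)
```

If softlockup_panic reads 0, the server only logs the lockup and keeps running, so no dump is written at the time of the event.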