kernel: BUG: soft lockup - CPU#2 stuck for 10s! [exp2:5725]

dev.null

Well-Known Member
May 27, 2003
89
2
158
Wow. I just hit my first major problem with this box, which has been running fine for over a year.

CentOS 5.2, 64bit.

My box was completely locked up; the only thing I could do was hit reset to get it restarted. I looked in the logs and found this:

Code:
Sep  1 05:30:55 vhost3 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [exp2:5725]
Sep  1 05:30:55 vhost3 kernel: CPU 2:
Sep  1 05:30:55 vhost3 kernel: Modules linked in: nfs lockd fscache nfs_acl sunrpc iptable_nat ip_nat deflate zlib_deflate ccm serpent blowfish twofish ecb xcbc crypto_hash cbc crypto_blkcipher md5 sha256 sh
Sep  1 05:30:55 vhost3 kernel: libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Sep  1 05:30:55 vhost3 kernel: Pid: 5725, comm: exp2 Tainted: G      2.6.18-92.el5 #1
Sep  1 05:30:55 vhost3 kernel: RIP: 0010:[<0000000000000001>]  [<0000000000000001>]
Sep  1 05:30:55 vhost3 kernel: RSP: 0018:ffff81006bcabb20  EFLAGS: 00000246
Sep  1 05:30:55 vhost3 kernel: RAX: 0000000000000000 RBX: ffff810063b3fcc0 RCX: 0000000000000001
Sep  1 05:30:55 vhost3 kernel: RDX: 00000000000004d0 RSI: ffffffff884687d0 RDI: ffff8100338ade80
Sep  1 05:30:55 vhost3 kernel: RBP: ffffffff80231b65 R08: 00000000d1b48344 R09: ffffffff80231b65
Sep  1 05:30:55 vhost3 kernel: R10: 0000000080000000 R11: 00000000000003f8 R12: ffffffff804c9590
Sep  1 05:30:55 vhost3 kernel: R13: ffff81006bcabb30 R14: 0000000000000003 R15: 00000000000004d0
Sep  1 05:30:55 vhost3 kernel: FS:  0000000045975940(0000) GS:ffff81011bc3ae40(0063) knlGS:00000000f7eed6c0
Sep  1 05:30:55 vhost3 kernel: CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
Sep  1 05:30:55 vhost3 kernel: CR2: 0000000000402ba0 CR3: 000000008a8a4000 CR4: 00000000000006e0
Sep  1 05:30:55 vhost3 kernel: 
Sep  1 05:30:55 vhost3 kernel: Call Trace:
Sep  1 05:30:55 vhost3 kernel:  [<ffffffff80232128>] ip_push_pending_frames+0x383/0x45e
Sep  1 05:30:55 vhost3 kernel:  [<ffffffff80242226>] udp_push_pending_frames+0x236/0x25b
Sep  1 05:30:55 vhost3 kernel:  [<ffffffff8005203c>] udp_sendmsg+0x4d3/0x5ce
Sep  1 05:30:55 vhost3 kernel:  [<ffffffff8011f4af>] socket_has_perm+0x5b/0x68
Sep  1 05:30:55 vhost3 kernel:  [<ffffffff80054924>] sock_sendmsg+0xf3/0x110
Sep  1 05:30:55 vhost3 kernel:  [<ffffffff800b9da9>] delayacct_end+0x5d/0x86
Sep  1 05:30:55 vhost3 kernel:  [<ffffffff8009dde2>] autoremove_wake_function+0x0/0x2e
Sep  1 05:30:55 vhost3 kernel:  [<ffffffff8000769e>] find_get_page+0x21/0x50
Sep  1 05:30:55 vhost3 kernel:  [<ffffffff80013325>] filemap_nopage+0x188/0x322
Sep  1 05:30:55 vhost3 kernel:  [<ffffffff80008b39>] __handle_mm_fault+0x4e9/0xe23
Sep  1 05:30:55 vhost3 kernel:  [<ffffffff8020d4de>] sys_sendto+0x11c/0x14f
Sep  1 05:30:55 vhost3 kernel:  [<ffffffff80066852>] do_page_fault+0x4fe/0x830
Sep  1 05:30:55 vhost3 kernel:  [<ffffffff80041922>] d_rehash+0x21/0x34
Sep  1 05:30:55 vhost3 kernel:  [<ffffffff8020d10f>] sock_attach_fd+0x8f/0xfd
Sep  1 05:30:55 vhost3 kernel:  [<ffffffff80005d36>] level2_kernel_pgt+0xd36/0x1000
Sep  1 05:30:55 vhost3 kernel:  [<ffffffff800b3fd8>] audit_syscall_entry+0x16e/0x1a1
Sep  1 05:30:55 vhost3 kernel:  [<ffffffff80221cb1>] compat_sys_socketcall+0xf1/0x172
Sep  1 05:30:55 vhost3 kernel:  [<ffffffff80061618>] cstar_do_call+0x1b/0x65
Sep  1 05:30:55 vhost3 kernel:
This happened 7 more times within a 2-minute period, all with the same process ID and CPU. After that there is nothing in the log until the reboot.
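To confirm that every report really names the same CPU and process, the lockup lines can be tallied straight from the log. This is only a rough sketch that assumes the exact message format shown above; `count_soft_lockups` is a made-up helper name, not part of any tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   Sep  1 05:30:55 vhost3 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [exp2:5725]
# Assumption: the message keeps this exact wording on the kernel in use.
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


def count_soft_lockups(lines):
    """Return a Counter keyed by (cpu, comm, pid), one count per lockup line."""
    counts = Counter()
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            key = (
                int(match.group("cpu")),
                match.group("comm"),
                int(match.group("pid")),
            )
            counts[key] += 1
    return counts
```

Feeding it an open log file (for example `count_soft_lockups(open("/var/log/messages"))`, where the path is an assumption about where syslog writes kernel messages) gives one counter entry per CPU/process pair, so a single entry with a count of 8 would match what is described here.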

Then it happened again today, different CPU (and of course different process ID). Call stack looks the same.

I tried to run yum update and found that 274 packages need updating, but yum hangs on libpq.so.4 being needed (even though it's there), so it won't update. That's in another thread.

Any ideas on the CPU lockup?

Thanks!
 

MattCurry

Well-Known Member
Aug 18, 2009
275
0
66
Houston, Tx
CPU#2 stuck for 10s! [exp2:5725]

Hello,

I do see the issue that you are running into, and I am sorry you have had issues. Unfortunately, you would need to submit a ticket with your datacenter to have them take a look at this machine. I do not believe it has anything to do with cPanel. If you find that it does, or would like to submit a ticket with us, there is a link at the bottom of this post.

Thank you,
Matthew Curry
 

dev.null

I don't think it's cPanel per se. I'm just checking with you other server admins for advice on what to do.

I am the datacenter guy... ;-D

I've been running Linux servers for 10+ years and have never seen this problem before. I'm hoping someone will tell me "it's not your hardware going bad, this is a software/driver problem". That's the big one for me.

I currently have a script in place that records all the processes and their IDs. Next lockup I'll know what the process is.
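For anyone who wants to do the same, here is a minimal sketch of that kind of recorder, not the actual script in use. It assumes Python 3 is available and reads the command name from `/proc/<pid>/stat`; the function names and the log path are made up for illustration. Each write is synced to disk because the box hard-locks and would otherwise lose the last snapshot.

```python
import os
import time


def snapshot_processes(proc_root="/proc"):
    """Return {pid: command name} for every numeric directory under proc_root."""
    processes = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                stat = f.read()
            # The command name is the parenthesised second field, e.g. "5725 (exp2) S ..."
            comm = stat[stat.index("(") + 1 : stat.rindex(")")]
        except (OSError, ValueError):
            # The process exited mid-read or the line was malformed; skip it.
            continue
        processes[int(entry)] = comm
    return processes


def append_snapshot(log_path, proc_root="/proc"):
    """Append one timestamped "pid comm" line per process and force it to disk."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    snapshot = snapshot_processes(proc_root)
    with open(log_path, "a") as log:
        for pid, comm in sorted(snapshot.items()):
            log.write(f"{stamp} {pid} {comm}\n")
        log.flush()
        os.fsync(log.fileno())
```

Calling `append_snapshot("/root/proc-snapshots.log")` from cron every minute (the path is hypothetical) would leave a record that can be matched against the PID in the next lockup message.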

Thanks!
 

d_t

Well-Known Member
Sep 20, 2003
245
3
168
Bucharest
It may be a problem with the RAID controller or storage system. Check whether there is any error message in the controller's BIOS. (I had a similar problem several months ago and it was a bad Adaptec controller.)
 

dev.null

No RAID, just a couple of SATA drives right off the mobo.

When you say to check for error messages in the controller's BIOS, do you mean in a log file or in the BIOS on boot-up?

Thanks!
 

dev.null

what does sar show?

# sar

we want to see the I/O load most specifically

you might want to recompile kernel as well
sar doesn't go back far enough (the last time it died was yesterday, and sar's data starts at midnight).

Next time it happens I'll be all over it like a cheap suit and let you know.

Should I set sar to dump to a file via cron? (IOW does sar get reset at boot/shutdown?)
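On the sar question: as far as I know, the sysstat package on CentOS normally collects data from cron and keeps one file per day of the month, so earlier days can be read back with `sar -f` even after a reboot. The sketch below only builds that command line; the `/var/log/sa` directory is the usual default but should be treated as an assumption, and `sar_history_command` is a made-up helper name.

```python
import datetime


def sar_history_command(day, sa_dir="/var/log/sa", report_flag="-b"):
    """Build the argv that reads one day's saved sar data.

    day         -- a datetime.date; only the day of the month is used
    sa_dir      -- where sysstat keeps its daily files (assumed default)
    report_flag -- which report to print; "-b" is the I/O and transfer rate report
    """
    data_file = f"{sa_dir}/sa{day.day:02d}"
    return ["sar", report_flag, "-f", data_file]


# Example: build the command for yesterday's file.
yesterday = datetime.date.today() - datetime.timedelta(days=1)
command = sar_history_command(yesterday)
# To actually run it: import subprocess; subprocess.run(command, check=True)
```

If the daily file for the day of the lockup exists, the last few samples before the gap are the ones worth looking at.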

Thanks!
 

hostmedic

Well-Known Member
Apr 30, 2003
543
0
166
Washington Court House, Ohio, United States
cPanel Access Level
DataCenter Provider
Not sure. I don't think so, but it might.

You could get it to log out to a file, just to be safe.

Is this the only server? It might be good to set things up so that the logs go elsewhere, just in case.

Honestly, I think it's an issue with the kernel not liking a drive controller, but I have been known to be wrong before.
 

kran

Well-Known Member
Jul 5, 2003
75
0
156
Colombia
cPanel Access Level
Root Administrator
I'm having a similar problem

I've looked at every possible cause I can think of. I believe it might be the firewall, because the box ran for many hours and then, after I reinstalled the firewall, I started having the same problem. It seems it runs out of swap space and locks up. This is what I get:

Oct 3 01:20:03 tiburon kernel: Firewall: *ICMP_IN Blocked* IN=eth0 OUT= MAC=00:e0:81:34:cd:1d:00:d0:03:9c:68:0a:08:00 SRC=190.84.247.230 DST=66.197.xxx.xxx LEN=60 TOS=0x00 PREC=0x00 TTL=112 ID=23554 PROTO=ICMP TYPE=8 CODE=0 ID=512 SEQ=56270
Oct 3 01:20:04 tiburon kernel: BUG: soft lockup - CPU#1 stuck for 10s! [kswapd0:185]
Oct 3 01:20:04 tiburon kernel:
Oct 3 01:20:04 tiburon kernel: Pid: 185, comm: kswapd0
Oct 3 01:20:04 tiburon kernel: EIP: 0060:[<c049e068>] CPU: 1
Oct 3 01:20:04 tiburon kernel: EIP is at dqput+0xda/0x15d
Oct 3 01:20:04 tiburon kernel: EFLAGS: 00000202 Not tainted (2.6.18-164.el5 #1)
Oct 3 01:20:04 tiburon kernel: EAX: 00000000 EBX: ea75dd80 ECX: f75b6400 EDX: 00000002
Oct 3 01:20:04 tiburon kernel: ESI: 00000000 EDI: ffffffe2 EBP: f7f6af10 DS: 007b ES: 007b
Oct 3 01:20:04 tiburon kernel: CR0: 8005003b CR2: 45d0e290 CR3: 0073b000 CR4: 000006d0
Oct 3 01:20:04 tiburon kernel: [<c049e5d7>] dquot_drop+0x26/0x4c
Oct 3 01:20:04 tiburon kernel: [<f8890b2e>] ext3_dquot_drop+0x3b/0x5d [ext3]
Oct 3 01:20:04 tiburon kernel: [<c048aad3>] clear_inode+0x9f/0x104
Oct 3 01:20:04 tiburon kernel: [<c048ad9a>] dispose_list+0x33/0xb1
Oct 3 01:20:04 tiburon kernel: [<c048af94>] shrink_icache_memory+0x17c/0x1a4
Oct 3 01:20:04 tiburon kernel: [<c045f2f2>] shrink_slab+0xd3/0x13c
Oct 3 01:20:04 tiburon kernel: [<c045f67d>] kswapd+0x2a6/0x3ab
Oct 3 01:20:04 tiburon kernel: [<c0434907>] autoremove_wake_function+0x0/0x2d
Oct 3 01:20:04 tiburon kernel: [<c045f3d7>] kswapd+0x0/0x3ab
Oct 3 01:20:04 tiburon kernel: [<c0434845>] kthread+0xc0/0xeb
Oct 3 01:20:04 tiburon kernel: [<c0434785>] kthread+0x0/0xeb
Oct 3 01:20:04 tiburon kernel: [<c0405c53>] kernel_thread_helper+0x7/0x10
Oct 3 01:20:04 tiburon kernel: =======================
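One way to check the "runs out of swap" theory is to log swap usage from `/proc/meminfo` at regular intervals and see what it looked like just before the lockup. A minimal sketch, assuming the usual `Name:   value kB` layout of that file; the helper names are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_used_percent(meminfo):
    """Return swap usage as a percentage, or None when no swap is configured."""
    total = meminfo.get("SwapTotal", 0)
    if total == 0:
        return None
    free = meminfo.get("SwapFree", 0)
    return 100.0 * (total - free) / total
```

Running `swap_used_percent(parse_meminfo(open("/proc/meminfo").read()))` from cron and appending the result to a file would show whether swap is actually close to full when kswapd0 gets stuck.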