Recurring server lock-ups. Any advice?

Hoster2k

Well-Known Member
Jun 17, 2002
131
0
166
UK
We have one machine that keeps locking up under high load. The server isn't actually very busy and has no cron jobs running at the times of the crashes, which happen every 1-2 days. The system locks up within the space of a minute as the load average climbs above 100. Our logging software is not able to get a printout of the running processes in time. We are modifying it and still trying, but we're still having issues.

Any advice on what this may be? It has me and the other admins stumped. Is this a known issue for anyone?

Thanks

P.S. There aren't one or two scripts using 100% CPU/memory etc. In fact, CPU usage is nearly zero and memory usage is fine.
 

Hoster2k

To add to this: we have checked the hard disks and they seem to be fine. There are no messages at all in /var/log/*, i.e. everything is running fine and then the next messages are the restart ones.
 

jester.ro

Well-Known Member
PartnerNOC
Feb 6, 2004
304
0
166
Bucharest, Romania
cPanel Access Level
DataCenter Provider
Does locking up mean you have to restart it from the button?
And you have no kernel logs from exactly the time of the crash?

I had the same problem with a server: it was a dual [email protected], ServerWorks motherboard, 3ware RAID IDE controller (no cPanel).

I had problems with some of the kernels; not all of them worked.
What I did was compile the kernel with support for serial console logging, so I had the kernel logs on another server.
There was a big kernel panic happening and, of course, the logs were not recorded to disk.
Anyway, the problem was fixed by changing kernel versions.
It seems it's an undocumented bug in ServerWorks chipsets.
 

Hoster2k

We actually run the same setup on about half a dozen other machines without a problem. This only started happening in the last week or two, with two different kernels.

It isn't panicking (afaik); the load just goes sky high very fast, with a lot of processes launched in that minute, but the CPU doesn't seem to be doing much work.
 

Hoster2k

Out of interest - did you have the same problem with the load rocketing, or a different experience?
 

dgbaker

Well-Known Member
PartnerNOC
Sep 20, 2002
2,531
10
343
Toronto, Ontario Canada
cPanel Access Level
DataCenter Provider
Does it happen at around the same time all the time?

If so, look for cron jobs, keep top running in an SSH session sorted by CPU, and look at your mail queue; I have seen exim/MailScanner tear up an entire box before. Check the last logins to see who is/was on the box other than yourself, etc.

The top command would be the better one to have up as when/if the server locks you should be able to see the highest processes.
 

Hoster2k

Hi,

That's what I've been doing, but there is almost zero CPU usage, i.e. it doesn't look like one specific user/script is eating up all the memory/CPU.

Further, no one has shell access on this system - just me and the other admins, and I am the only person logged in.

Here's one top snapshot:

21:51:53 up 2 days, 8 min, 1 user, load average: 25.24, 6.80, 2.78
364 processes: 362 sleeping, 1 running, 1 zombie, 0 stopped
CPU states: cpu user nice system irq softirq iowait idle
total 7.2% 0.0% 6.4% 0.0% 0.0% 0.0% 386.0%
cpu00 3.5% 0.0% 2.3% 0.0% 0.0% 0.0% 94.1%
cpu01 2.7% 0.0% 1.5% 0.0% 0.0% 0.0% 95.7%
cpu02 0.9% 0.0% 2.3% 0.0% 0.0% 0.0% 96.6%
cpu03 0.1% 0.0% 0.1% 0.0% 0.0% 0.0% 99.6%
Mem: 1032040k av, 1019296k used, 12744k free, 0k shrd, 79788k buff
435480k active, 489124k inactive
Swap: 2096376k av, 91256k used, 2005120k free 563812k cached

PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
12548 mailnull 16 0 1248 1084 1024 S 2.9 0.1 0:16 3 exim
17437 bhadmin 16 0 1444 1444 892 R 2.1 0.1 2:28 2 top
25307 mailman 16 0 4440 2336 1436 S 1.1 0.2 1:11 0 python2
2759 root 17 0 15752 13M 13712 S 0.9 1.3 3:13 2 httpd
22550 mailman 16 0 5172 2504 1340 S 0.9 0.2 0:54 0 python2
3361 named 18 0 7952 4416 3104 S 0.5 0.4 4:01 3 named
22548 mailman 16 0 5152 2460 1324 S 0.5 0.2 0:57 0 python2
22555 mailman 16 0 5116 2464 1316 S 0.5 0.2 0:59 0 python2
17632 james 18 0 4004 4004 2436 S 0.5 0.3 0:00 2 php
17637 pensieve 18 0 3572 3572 2452 S 0.5 0.3 0:00 1 php
17643 gpeden 17 0 4000 4000 2440 S 0.5 0.3 0:00 0 php
25308 mailman 16 0 4048 2072 1320 S 0.3 0.2 0:55 0 python2
17642 robert 15 0 3808 3808 2336 D 0.3 0.3 0:00 0 php
2268 mysql 16 0 24520 15M 5868 S 0.1 1.5 2:13 3 mysqld
22551 mailman 16 0 4092 2080 1308 S 0.1 0.2 0:51 2 python2
17328 bhadmin 16 0 2148 2100 1880 S 0.1 0.2 0:05 1 sshd
13305 mailman 16 0 4716 4716 1936 S 0.1 0.4 0:01 2 python2
17635 seaguy78 15 0 5032 5028 4272 D 0.1 0.4 0:00 3 cppop
1 root 16 0 408 380 356 S 0.0 0.0 0:12 1 init
2 root RT 0 0 0 0 SW 0.0 0.0 0:00 0 swapper
3 root RT 0 0 0 0 SW 0.0 0.0 0:00 1 swapper
4 root RT 0 0 0 0 SW 0.0 0.0 0:00 2 swapper
5 root RT 0 0 0 0 SW 0.0 0.0 0:00 3 swapper
6 root 15 0 0 0 0 SW 0.0 0.0 0:00 2 keventd
7 root 34 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd/0
8 root 34 19 0 0 0 SWN 0.0 0.0 0:00 1 ksoftirqd/1
9 root 34 19 0 0 0 SWN 0.0 0.0 0:00 2 ksoftirqd/2
10 root 34 19 0 0 0 SWN 0.0 0.0 0:01 3 ksoftirqd/3
12 root 25 0 0 0 0 SW 0.0 0.0 0:00 0 bdflush
11 root 15 0 0 0 0 SW 0.0 0.0 5:11 2 kswapd
13 root 15 0 0 0 0 DW 0.0 0.0 0:27 0 kupdated
14 root 23 0 0 0 0 SW 0.0 0.0 0:00 0 mdrecoveryd
20 root 15 0 0 0 0 SW 0.0 0.0 0:00 1 ahd_dv_0
21 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 ahd_dv_1
22 root 25 0 0 0 0 SW 0.0 0.0 0:00 1 scsi_eh_0
23 root 25 0 0 0 0 SW 0.0 0.0 0:00 0 scsi_eh_1
27 root 15 0 0 0 0 SW 0.0 0.0 0:31 1 raid1d
28 root 15 0 0 0 0 SW 0.0 0.0 0:21 1 raid1syncd
29 root 15 0 0 0 0 SW 0.0 0.0 0:00 2 raid1d
30 root 15 0 0 0 0 SW 0.0 0.0 0:00 2 raid1syncd
31 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 raid1d
32 root 15 0 0 0 0 SW 0.0 0.0 0:00 2 raid1syncd
33 root 15 0 0 0 0 SW 0.0 0.0 0:01 1 raid1d
34 root 15 0 0 0 0 SW 0.0 0.0 0:00 1 raid1syncd
35 root 15 0 0 0 0 SW 0.0 0.0 0:03 3 raid1d
36 root 15 0 0 0 0 SW 0.0 0.0 0:01 0 raid1syncd
37 root 15 0 0 0 0 SW 0.0 0.0 0:04 0 raid1d
38 root 15 0 0 0 0 SW 0.0 0.0 0:03 0 raid1syncd
39 root 15 0 0 0 0 SW 0.0 0.0 0:00 3 raid1d
40 root 15 0 0 0 0 SW 0.0 0.0 0:00 1 raid1syncd
41 root 15 0 0 0 0 SW 0.0 0.0 0:21 3 kjournald
156 root 18 0 0 0 0 SW 0.0 0.0 0:00 0 kjournald
157 root 15 0 0 0 0 DW 0.0 0.0 0:47 2 kjournald
159 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 loop0
160 root 15 0 0 0 0 DW 0.0 0.0 0:26 1 kjournald
161 root 15 0 0 0 0 DW 0.0 0.0 0:38 1 kjournald
162 root 15 0 0 0 0 DW 0.0 0.0 3:02 0 kjournald
163 root 15 0 0 0 0 SW 0.0 0.0 2:29 1 kjournald

and here is the latest:

08:41:14 up 1 day, 10:37, 1 user, load average: 100.07, 33.06, 13.29
469 processes: 463 sleeping, 1 running, 4 zombie, 1 stopped
CPU states: cpu user nice system irq softirq iowait idle
total 1.2% 0.0% 2.8% 0.0% 0.0% 0.0% 395.6%
cpu00 0.1% 0.0% 2.1% 0.0% 0.0% 0.0% 97.6%
cpu01 0.1% 0.0% 0.1% 0.0% 0.0% 0.0% 99.6%
cpu02 0.7% 0.0% 0.1% 0.0% 0.0% 0.0% 99.0%
cpu03 0.1% 0.0% 0.3% 0.0% 0.0% 0.0% 99.4%
Mem: 1032040k av, 1018804k used, 13236k free, 0k shrd, 61104k buff
432468k active, 503076k inactive
Swap: 2096376k av, 86180k used, 2010196k free 615904k cached

PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
1041 root 16 0 1468 1452 812 R 2.3 0.1 7:13 0 top
2198 mailnull 15 0 1568 1412 1360 S 0.9 0.1 0:14 2 exim
3479 root 16 0 15864 12M 12908 S 0.3 1.2 2:41 2 httpd
23071 root 16 0 3324 2368 1972 S 0.1 0.2 0:01 1 cppop
11647 root 16 0 1008 1008 864 D 0.1 0.0 0:00 3 suexec
11648 root 16 0 1012 1012 864 D 0.1 0.0 0:00 1 suexec
1 root 16 0 408 376 356 S 0.0 0.0 0:10 0 init
2 root RT 0 0 0 0 SW 0.0 0.0 0:00 0 swapper
3 root RT 0 0 0 0 SW 0.0 0.0 0:00 1 swapper
4 root RT 0 0 0 0 SW 0.0 0.0 0:00 2 swapper
5 root RT 0 0 0 0 SW 0.0 0.0 0:00 3 swapper
6 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 keventd
7 root 34 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd/0
8 root 34 19 0 0 0 SWN 0.0 0.0 0:00 1 ksoftirqd/1
9 root 34 19 0 0 0 SWN 0.0 0.0 0:00 2 ksoftirqd/2
10 root 34 19 0 0 0 SWN 0.0 0.0 0:00 3 ksoftirqd/3
12 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 bdflush
11 root 15 0 0 0 0 SW 0.0 0.0 5:42 0 kswapd
13 root 15 0 0 0 0 DW 0.0 0.0 0:21 3 kupdated
14 root 23 0 0 0 0 SW 0.0 0.0 0:00 0 mdrecoveryd
20 root 15 0 0 0 0 SW 0.0 0.0 0:00 2 ahd_dv_0
21 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 ahd_dv_1
22 root 16 0 0 0 0 DW 0.0 0.0 0:00 2 scsi_eh_0
23 root 25 0 0 0 0 SW 0.0 0.0 0:00 0 scsi_eh_1
27 root 15 0 0 0 0 SW 0.0 0.0 0:31 3 raid1d
28 root 15 0 0 0 0 SW 0.0 0.0 0:21 3 raid1syncd
29 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 raid1d
30 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 raid1syncd
31 root 15 0 0 0 0 SW 0.0 0.0 0:00 3 raid1d
32 root 15 0 0 0 0 SW 0.0 0.0 0:00 3 raid1syncd
33 root 15 0 0 0 0 SW 0.0 0.0 0:01 2 raid1d
34 root 15 0 0 0 0 SW 0.0 0.0 0:01 1 raid1syncd
35 root 15 0 0 0 0 SW 0.0 0.0 0:02 1 raid1d
36 root 15 0 0 0 0 SW 0.0 0.0 0:01 0 raid1syncd
37 root 15 0 0 0 0 SW 0.0 0.0 0:04 3 raid1d
38 root 15 0 0 0 0 SW 0.0 0.0 0:03 3 raid1syncd
39 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 raid1d
40 root 15 0 0 0 0 SW 0.0 0.0 0:00 2 raid1syncd
41 root 15 0 0 0 0 DW 0.0 0.0 0:17 0 kjournald
155 root 18 0 0 0 0 SW 0.0 0.0 0:00 3 kjournald
156 root 15 0 0 0 0 DW 0.0 0.0 0:48 1 kjournald
158 root 15 0 0 0 0 SW 0.0 0.0 0:00 3 loop0
159 root 15 0 0 0 0 DW 0.0 0.0 0:21 2 kjournald
160 root 15 0 0 0 0 DW 0.0 0.0 0:29 0 kjournald
161 root 15 0 0 0 0 DW 0.0 0.0 2:24 1 kjournald
162 root 15 0 0 0 0 SW 0.0 0.0 2:36 2 kjournald
1928 root 16 0 576 560 496 D 0.0 0.0 1:20 1 syslogd
1932 root 15 0 368 316 316 D 0.0 0.0 0:00 2 klogd
1942 root 16 0 364 360 308 S 0.0 0.0 0:06 1 irqbalance
2038 root 17 0 560 528 512 S 0.0 0.0 0:00 1 smartd
2053 root 16 0 1592 1260 1184 D 0.0 0.1 0:03 3 cupsd
2106 root 15 0 1320 1236 1180 S 0.0 0.1 0:08 2 sshd
2120 root 16 0 764 724 648 S 0.0 0.0 0:00 2 xinetd
2139 root 16 0 3000 2024 1592 S 0.0 0.1 0:06 1 chkservd

--

Initially it was every 1 or 2 days, in between the same time period. Today it happened in the morning rather than in the evening, as it had been.

EDIT: The system was checked for rootkits earlier and just now, and is reporting clean.
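
A note on reading the snapshots above: Linux counts processes in uninterruptible sleep (state D, typically waiting on disk or network I/O) toward the load average alongside runnable ones, so a load of 100 with idle CPUs often points at blocked I/O rather than CPU work. A minimal Python sketch that tallies process states, assuming the text comes from something like `ps -eo stat,comm`; the function name and the sample data are illustrative only, not taken from this server:

```python
from collections import Counter

def count_states(ps_text):
    """Tally the first letter of the STAT column (R, S, D, Z, T...)."""
    counts = Counter()
    for line in ps_text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if not fields or fields[0] == "STAT":
            continue
        counts[fields[0][0]] += 1
    return counts

# Hypothetical sample in the layout printed by `ps -eo stat,comm`.
sample = """STAT COMMAND
S    init
DW   kjournald
D    syslogd
D    suexec
R    top
Z    php
"""

states = count_states(sample)
print(states["D"], "processes in uninterruptible sleep")
```

If the D count climbs together with the load, the next thing to check would be whatever those processes are waiting on (disk, RAID, NFS, and so on).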
 

Tos

Member
Verified Vendor
Oct 27, 2002
22
0
151
It would be nice to see a resolution to this. I am experiencing the same issue on one of our servers.
When SSH'd into the server, it will 'freeze' for 10 seconds or so, and when it comes back, top shows the load has gone through the roof. Eventually the server just locks.
Happens every day :(

Nothing in the mail queue, nothing in the process list to blame this on.
 

Hoster2k

When ours does this, it doesn't come back, or rather, we don't allow it time to.. we're immediately on it and rebooting.
 

netwrkr

Well-Known Member
Apr 12, 2003
202
0
166
Originally posted by Hoster2k
When ours does this, it doesn't come back, or rather, we don't allow it time to.. we're immediately on it and rebooting.

469 processes? Wow. What are all those processes?
 

Hoster2k

Now that's a question. We run a logging script every 15 seconds to catch these; however, it doesn't get a chance to run. It happens so fast that nothing is caught in time.
 

netwrkr

Originally posted by Hoster2k
Now that's a question. We run a logging script every 15 seconds to catch these; however, it doesn't get a chance to run. It happens so fast that nothing is caught in time.

do you have linux 'sar' installed?
 

Hoster2k

As per above, it's not possible to do that because it happens so fast and at a more or less random time, and neither I nor anyone else here is available to sit and watch the screen. Basically, a script we have written does just this: every 15 seconds, if the load is over a predefined threshold, it logs the output of top and ps aux to a file. However, it's not getting a chance to run.
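
The threshold check described here could look roughly like the sketch below. It is not the script used on this server; `read_load`, `watch`, and their parameters are hypothetical names. The sketch assumes a single long-running process that polls `/proc/loadavg`, on the theory that a process already in memory has a better chance of firing during a sudden spike than one that has to be started fresh every 15 seconds.

```python
import time

def read_load(path="/proc/loadavg"):
    """Return the 1-minute load average (first field of /proc/loadavg)."""
    with open(path) as f:
        return float(f.read().split()[0])

def watch(threshold, interval, on_trigger, read=read_load, max_checks=None):
    """Poll the load and call on_trigger(load) whenever it reaches threshold.

    max_checks=None loops forever; a number stops after that many polls.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        load = read()
        if load >= threshold:
            on_trigger(load)
        checks += 1
        time.sleep(interval)

# Example call (runs until interrupted), with an assumed low threshold
# so that capture starts well before the load reaches 100:
# watch(threshold=5.0, interval=2, on_trigger=lambda load: print("load", load))
```

A low threshold and a short interval are the design choice here: by the time the load is over 100 it may already be too late to start anything new.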
 

netwrkr

Originally posted by Hoster2k
As per above, it's not possible to do that because it happens so fast and at a more or less random time, and neither I nor anyone else here is available to sit and watch the screen. Basically, a script we have written does just this: every 15 seconds, if the load is over a predefined threshold, it logs the output of top and ps aux to a file. However, it's not getting a chance to run.

Hrm. I see you pasted the top output above with 469 processes running. Is it not possible to get ps aux, or am I missing something? If you want to know what's wrong you may just have to sit there and look at the screen :)
 

Hoster2k

The top is from an open session left on - and that is the last screen left when we had to hit the reset switch..
 

netwrkr

Originally posted by Hoster2k
The top is from an open session left on - and that is the last screen left when we had to hit the reset switch..

ouch. Couple of thoughts.


1. Have you looked at CPU/Memory/MySQL Usage under Server Status within WHM and does it give you anything useful?

2. Are you using PHPSuEXEC? If so I think you might be able to configure some limits with /etc/security/limits.conf -- put all users in the same group and limit their resource usage with limits.conf - not necessarily a fix but a start....

3. Any chance of running 'ps aux' every 30 seconds and dumping the output to a file?

4. Do you allow interactive (SSH) users on the server?
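
Point 3 could be sketched along these lines in Python. This is an illustration under assumptions, not a tested tool: `snapshot` and the output path are hypothetical names, and the sketch assumes Python 3.7+ and a `ps` binary on the PATH. Each run is appended with a timestamp and flushed to disk so the last snapshot has a chance of surviving a hard reset.

```python
import os
import subprocess
import time

def snapshot(outfile, command=("ps", "aux")):
    """Append one timestamped run of `command` to outfile and sync it to disk."""
    result = subprocess.run(list(command), capture_output=True, text=True)
    with open(outfile, "a") as f:
        f.write("==== " + time.strftime("%Y-%m-%d %H:%M:%S") + " ====\n")
        f.write(result.stdout)
        f.flush()
        os.fsync(f.fileno())  # push the data out of the page cache

# Example loop, every 30 seconds (the path is an assumption):
# while True:
#     snapshot("/root/ps-snapshots.log")
#     time.sleep(30)
```

One caveat: if the stall is in disk I/O itself, the write may block along with everything else, so sending the output to another machine may turn out to be more reliable than a local file.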
 

r00t316

Active Member
Nov 29, 2003
34
0
156
Couple of questions...
1) What OS? Red Hat Enterprise?
2) When you say locked up, does it just not respond to ping, HTTP, etc.?

Open the kernel log:
less /var/log/kernel
If you do not log to there, edit /etc/syslog.conf
and make the first line look like:
kern.* /var/log/kernel

killall -HUP syslogd
Then wait.
Once the server crashes again:
less /var/log/kernel

Look for something called NETDEV.
If you find it and it says NETDEV: eth# device timed out or something along those lines, I had the exact same problem.

I actually fixed it by tweaking some of the TCP/IP settings.

Please let us know if this was the case.
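
Searching the kernel log for NETDEV entries after a crash can be scripted. A minimal Python sketch, with `find_netdev_lines` as a hypothetical helper name; the sample lines are invented for illustration, and the exact wording of real kernel messages will depend on the driver and kernel version:

```python
def find_netdev_lines(log_text):
    """Return the log lines that mention NETDEV (case-sensitive)."""
    return [line for line in log_text.splitlines() if "NETDEV" in line]

# Hypothetical sample; real messages and timestamps will differ.
sample = """Mar  3 08:40:01 host kernel: eth0: link up
Mar  3 08:40:59 host kernel: NETDEV WATCHDOG: eth0: transmit timed out
"""

for line in find_netdev_lines(sample):
    print(line)
```

Against a real log, the call would be something like `find_netdev_lines(open("/var/log/kernel").read())`, assuming the syslog.conf change above is in place.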
 

Hoster2k

When I say locked up, it's locked up due to high load. Pings are fine, but services do not respond, SSH will not connect, etc.