The Community Forums

Reoccurring server lock ups. Any advice?

Discussion in 'General Discussion' started by Hoster2k, Jul 6, 2004.

  1. Hoster2k

    Hoster2k Well-Known Member

    Joined:
    Jun 17, 2002
    Messages:
    131
    Likes Received:
    0
    Trophy Points:
    16
    Location:
    UK
    We have one machine that keeps locking up due to high load. The server isn't actually heavily loaded and has no cron jobs running at the times of the crashes, which happen every 1-2 days. The system locks up within the space of a minute as the load average climbs above 100. Our logging software is not able to get a printout of the running processes in time. We are modifying it and retrying, but we're still having issues.

    Any advice on what this may be? It has me and the other admins stumped. Is this a known issue for anyone?

    Thanks

    P.S. There aren't one or two scripts using 100% CPU/mem etc. In fact, CPU usage is nearly zero and mem usage is fine.
     
  2. Hoster2k

    To add on: we have checked the hard disks and they seem to be fine. There are no messages at all in /var/log/*, i.e. the logs show the system running fine and then the next messages are the restart ones.
     
  3. jester.ro

    jester.ro Well-Known Member
    PartnerNOC

    Joined:
    Feb 6, 2004
    Messages:
    304
    Likes Received:
    0
    Trophy Points:
    16
    Location:
    Bucharest, Romania
    cPanel Access Level:
    DataCenter Provider
    Locking means you have to restart it from the button?
    And you have no kernel logs from exactly the time of the crash?

    I had the same problem with a server; it was a dual P3 @ 1 GHz with a ServerWorks motherboard and a 3ware RAID IDE controller (no cPanel).

    I had problems with some of the kernels; not all of them worked.
    What I did was compile the kernel with support for serial console logging, so I had the kernel logs on another server.
    There was a big kernel panic happening, and of course the logs were not recorded to disk.
    Anyway, the problem was fixed by changing kernel versions.
    It seems it's an undocumented bug in ServerWorks chipsets.
     
  4. Hoster2k

    We actually run the same setup on about half a dozen other machines without a problem. This only started happening in the last week or two, with two different kernels.

    It isn't panicking (afaik); the load just goes sky high very fast, with a lot of processes launched in that minute, but the CPU doesn't seem to be doing much work.
     
  5. Hoster2k

    Out of interest - did you have the same problem with the load rocketing, or a different experience?
     
  6. dgbaker

    dgbaker Well-Known Member
    PartnerNOC

    Joined:
    Sep 20, 2002
    Messages:
    2,578
    Likes Received:
    3
    Trophy Points:
    38
    Location:
    Toronto, Ontario Canada
    cPanel Access Level:
    DataCenter Provider
    Does it happen at around the same time every time?

    If so, look for cron jobs, keep top running in an SSH session sorted by CPU, and look at your mail queue; I have seen exim/MailScanner tear up an entire box before. Check the last logins to see who is/was on the box other than yourself, etc.

    The top command would be the better one to have up, since when/if the server locks you should be able to see the highest processes.
     
  7. Hoster2k

    Hi,

    That's what I've been doing, but there is almost zero CPU usage, i.e. it doesn't look like one specific user/script is eating up all the mem/CPU.

    Further, no one has shell access on this system except me and the other admins, and I am the only person logged in.

    Here's one top snapshot:

    21:51:53 up 2 days, 8 min, 1 user, load average: 25.24, 6.80, 2.78
    364 processes: 362 sleeping, 1 running, 1 zombie, 0 stopped
    CPU states: cpu user nice system irq softirq iowait idle
    total 7.2% 0.0% 6.4% 0.0% 0.0% 0.0% 386.0%
    cpu00 3.5% 0.0% 2.3% 0.0% 0.0% 0.0% 94.1%
    cpu01 2.7% 0.0% 1.5% 0.0% 0.0% 0.0% 95.7%
    cpu02 0.9% 0.0% 2.3% 0.0% 0.0% 0.0% 96.6%
    cpu03 0.1% 0.0% 0.1% 0.0% 0.0% 0.0% 99.6%
    Mem: 1032040k av, 1019296k used, 12744k free, 0k shrd, 79788k buff
    435480k active, 489124k inactive
    Swap: 2096376k av, 91256k used, 2005120k free 563812k cached

    PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
    12548 mailnull 16 0 1248 1084 1024 S 2.9 0.1 0:16 3 exim
    17437 bhadmin 16 0 1444 1444 892 R 2.1 0.1 2:28 2 top
    25307 mailman 16 0 4440 2336 1436 S 1.1 0.2 1:11 0 python2
    2759 root 17 0 15752 13M 13712 S 0.9 1.3 3:13 2 httpd
    22550 mailman 16 0 5172 2504 1340 S 0.9 0.2 0:54 0 python2
    3361 named 18 0 7952 4416 3104 S 0.5 0.4 4:01 3 named
    22548 mailman 16 0 5152 2460 1324 S 0.5 0.2 0:57 0 python2
    22555 mailman 16 0 5116 2464 1316 S 0.5 0.2 0:59 0 python2
    17632 james 18 0 4004 4004 2436 S 0.5 0.3 0:00 2 php
    17637 pensieve 18 0 3572 3572 2452 S 0.5 0.3 0:00 1 php
    17643 gpeden 17 0 4000 4000 2440 S 0.5 0.3 0:00 0 php
    25308 mailman 16 0 4048 2072 1320 S 0.3 0.2 0:55 0 python2
    17642 robert 15 0 3808 3808 2336 D 0.3 0.3 0:00 0 php
    2268 mysql 16 0 24520 15M 5868 S 0.1 1.5 2:13 3 mysqld
    22551 mailman 16 0 4092 2080 1308 S 0.1 0.2 0:51 2 python2
    17328 bhadmin 16 0 2148 2100 1880 S 0.1 0.2 0:05 1 sshd
    13305 mailman 16 0 4716 4716 1936 S 0.1 0.4 0:01 2 python2
    17635 seaguy78 15 0 5032 5028 4272 D 0.1 0.4 0:00 3 cppop
    1 root 16 0 408 380 356 S 0.0 0.0 0:12 1 init
    2 root RT 0 0 0 0 SW 0.0 0.0 0:00 0 swapper
    3 root RT 0 0 0 0 SW 0.0 0.0 0:00 1 swapper
    4 root RT 0 0 0 0 SW 0.0 0.0 0:00 2 swapper
    5 root RT 0 0 0 0 SW 0.0 0.0 0:00 3 swapper
    6 root 15 0 0 0 0 SW 0.0 0.0 0:00 2 keventd
    7 root 34 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd/0
    8 root 34 19 0 0 0 SWN 0.0 0.0 0:00 1 ksoftirqd/1
    9 root 34 19 0 0 0 SWN 0.0 0.0 0:00 2 ksoftirqd/2
    10 root 34 19 0 0 0 SWN 0.0 0.0 0:01 3 ksoftirqd/3
    12 root 25 0 0 0 0 SW 0.0 0.0 0:00 0 bdflush
    11 root 15 0 0 0 0 SW 0.0 0.0 5:11 2 kswapd
    13 root 15 0 0 0 0 DW 0.0 0.0 0:27 0 kupdated
    14 root 23 0 0 0 0 SW 0.0 0.0 0:00 0 mdrecoveryd
    20 root 15 0 0 0 0 SW 0.0 0.0 0:00 1 ahd_dv_0
    21 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 ahd_dv_1
    22 root 25 0 0 0 0 SW 0.0 0.0 0:00 1 scsi_eh_0
    23 root 25 0 0 0 0 SW 0.0 0.0 0:00 0 scsi_eh_1
    27 root 15 0 0 0 0 SW 0.0 0.0 0:31 1 raid1d
    28 root 15 0 0 0 0 SW 0.0 0.0 0:21 1 raid1syncd
    29 root 15 0 0 0 0 SW 0.0 0.0 0:00 2 raid1d
    30 root 15 0 0 0 0 SW 0.0 0.0 0:00 2 raid1syncd
    31 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 raid1d
    32 root 15 0 0 0 0 SW 0.0 0.0 0:00 2 raid1syncd
    33 root 15 0 0 0 0 SW 0.0 0.0 0:01 1 raid1d
    34 root 15 0 0 0 0 SW 0.0 0.0 0:00 1 raid1syncd
    35 root 15 0 0 0 0 SW 0.0 0.0 0:03 3 raid1d
    36 root 15 0 0 0 0 SW 0.0 0.0 0:01 0 raid1syncd
    37 root 15 0 0 0 0 SW 0.0 0.0 0:04 0 raid1d
    38 root 15 0 0 0 0 SW 0.0 0.0 0:03 0 raid1syncd
    39 root 15 0 0 0 0 SW 0.0 0.0 0:00 3 raid1d
    40 root 15 0 0 0 0 SW 0.0 0.0 0:00 1 raid1syncd
    41 root 15 0 0 0 0 SW 0.0 0.0 0:21 3 kjournald
    156 root 18 0 0 0 0 SW 0.0 0.0 0:00 0 kjournald
    157 root 15 0 0 0 0 DW 0.0 0.0 0:47 2 kjournald
    159 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 loop0
    160 root 15 0 0 0 0 DW 0.0 0.0 0:26 1 kjournald
    161 root 15 0 0 0 0 DW 0.0 0.0 0:38 1 kjournald
    162 root 15 0 0 0 0 DW 0.0 0.0 3:02 0 kjournald
    163 root 15 0 0 0 0 SW 0.0 0.0 2:29 1 kjournald

    and here is the latest:

    08:41:14 up 1 day, 10:37, 1 user, load average: 100.07, 33.06, 13.29
    469 processes: 463 sleeping, 1 running, 4 zombie, 1 stopped
    CPU states: cpu user nice system irq softirq iowait idle
    total 1.2% 0.0% 2.8% 0.0% 0.0% 0.0% 395.6%
    cpu00 0.1% 0.0% 2.1% 0.0% 0.0% 0.0% 97.6%
    cpu01 0.1% 0.0% 0.1% 0.0% 0.0% 0.0% 99.6%
    cpu02 0.7% 0.0% 0.1% 0.0% 0.0% 0.0% 99.0%
    cpu03 0.1% 0.0% 0.3% 0.0% 0.0% 0.0% 99.4%
    Mem: 1032040k av, 1018804k used, 13236k free, 0k shrd, 61104k buff
    432468k active, 503076k inactive
    Swap: 2096376k av, 86180k used, 2010196k free 615904k cached

    PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
    1041 root 16 0 1468 1452 812 R 2.3 0.1 7:13 0 top
    2198 mailnull 15 0 1568 1412 1360 S 0.9 0.1 0:14 2 exim
    3479 root 16 0 15864 12M 12908 S 0.3 1.2 2:41 2 httpd
    23071 root 16 0 3324 2368 1972 S 0.1 0.2 0:01 1 cppop
    11647 root 16 0 1008 1008 864 D 0.1 0.0 0:00 3 suexec
    11648 root 16 0 1012 1012 864 D 0.1 0.0 0:00 1 suexec
    1 root 16 0 408 376 356 S 0.0 0.0 0:10 0 init
    2 root RT 0 0 0 0 SW 0.0 0.0 0:00 0 swapper
    3 root RT 0 0 0 0 SW 0.0 0.0 0:00 1 swapper
    4 root RT 0 0 0 0 SW 0.0 0.0 0:00 2 swapper
    5 root RT 0 0 0 0 SW 0.0 0.0 0:00 3 swapper
    6 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 keventd
    7 root 34 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd/0
    8 root 34 19 0 0 0 SWN 0.0 0.0 0:00 1 ksoftirqd/1
    9 root 34 19 0 0 0 SWN 0.0 0.0 0:00 2 ksoftirqd/2
    10 root 34 19 0 0 0 SWN 0.0 0.0 0:00 3 ksoftirqd/3
    12 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 bdflush
    11 root 15 0 0 0 0 SW 0.0 0.0 5:42 0 kswapd
    13 root 15 0 0 0 0 DW 0.0 0.0 0:21 3 kupdated
    14 root 23 0 0 0 0 SW 0.0 0.0 0:00 0 mdrecoveryd
    20 root 15 0 0 0 0 SW 0.0 0.0 0:00 2 ahd_dv_0
    21 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 ahd_dv_1
    22 root 16 0 0 0 0 DW 0.0 0.0 0:00 2 scsi_eh_0
    23 root 25 0 0 0 0 SW 0.0 0.0 0:00 0 scsi_eh_1
    27 root 15 0 0 0 0 SW 0.0 0.0 0:31 3 raid1d
    28 root 15 0 0 0 0 SW 0.0 0.0 0:21 3 raid1syncd
    29 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 raid1d
    30 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 raid1syncd
    31 root 15 0 0 0 0 SW 0.0 0.0 0:00 3 raid1d
    32 root 15 0 0 0 0 SW 0.0 0.0 0:00 3 raid1syncd
    33 root 15 0 0 0 0 SW 0.0 0.0 0:01 2 raid1d
    34 root 15 0 0 0 0 SW 0.0 0.0 0:01 1 raid1syncd
    35 root 15 0 0 0 0 SW 0.0 0.0 0:02 1 raid1d
    36 root 15 0 0 0 0 SW 0.0 0.0 0:01 0 raid1syncd
    37 root 15 0 0 0 0 SW 0.0 0.0 0:04 3 raid1d
    38 root 15 0 0 0 0 SW 0.0 0.0 0:03 3 raid1syncd
    39 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 raid1d
    40 root 15 0 0 0 0 SW 0.0 0.0 0:00 2 raid1syncd
    41 root 15 0 0 0 0 DW 0.0 0.0 0:17 0 kjournald
    155 root 18 0 0 0 0 SW 0.0 0.0 0:00 3 kjournald
    156 root 15 0 0 0 0 DW 0.0 0.0 0:48 1 kjournald
    158 root 15 0 0 0 0 SW 0.0 0.0 0:00 3 loop0
    159 root 15 0 0 0 0 DW 0.0 0.0 0:21 2 kjournald
    160 root 15 0 0 0 0 DW 0.0 0.0 0:29 0 kjournald
    161 root 15 0 0 0 0 DW 0.0 0.0 2:24 1 kjournald
    162 root 15 0 0 0 0 SW 0.0 0.0 2:36 2 kjournald
    1928 root 16 0 576 560 496 D 0.0 0.0 1:20 1 syslogd
    1932 root 15 0 368 316 316 D 0.0 0.0 0:00 2 klogd
    1942 root 16 0 364 360 308 S 0.0 0.0 0:06 1 irqbalance
    2038 root 17 0 560 528 512 S 0.0 0.0 0:00 1 smartd
    2053 root 16 0 1592 1260 1184 D 0.0 0.1 0:03 3 cupsd
    2106 root 15 0 1320 1236 1180 S 0.0 0.1 0:08 2 sshd
    2120 root 16 0 764 724 648 S 0.0 0.0 0:00 2 xinetd
    2139 root 16 0 3000 2024 1592 S 0.0 0.1 0:06 1 chkservd

    --

    Initially it happened every 1 or 2 days, within the same time period. Today it happened in the morning rather than in the evening, as it had been.

    EDIT: The system was checked for rootkits earlier and again just now, and is reporting clean.
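
One thing worth checking in snapshots like the two above: on Linux, the load average counts tasks in uninterruptible sleep (state D, usually waiting on I/O) as well as runnable ones, so a load of 100 with idle CPUs is possible when many processes are blocked. A minimal sketch for tallying the STAT column of pasted top output follows. The function name `count_states` is hypothetical, and it assumes STAT is the eighth column, as in the header row above.

```python
from collections import Counter


def count_states(top_rows: str) -> Counter:
    """Tally the first letter of the STAT column (R, S, D, Z, T) in top process rows.

    Assumes the layout shown above: PID USER PRI NI SIZE RSS SHARE STAT ...
    so STAT is the 8th whitespace-separated field.
    """
    states = Counter()
    for line in top_rows.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything that is not a process row.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        states[fields[7][0]] += 1
    return states


# A few rows copied from the second snapshot above.
sample = """\
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
1041 root 16 0 1468 1452 812 R 2.3 0.1 7:13 0 top
11647 root 16 0 1008 1008 864 D 0.1 0.0 0:00 3 suexec
11648 root 16 0 1012 1012 864 D 0.1 0.0 0:00 1 suexec
41 root 15 0 0 0 0 DW 0.0 0.0 0:17 0 kjournald
1928 root 16 0 576 560 496 D 0.0 0.0 1:20 1 syslogd
2106 root 15 0 1320 1236 1180 S 0.0 0.1 0:08 2 sshd
"""

print(count_states(sample))
```

In the second snapshot, several kjournald threads plus syslogd, klogd, suexec and scsi_eh_0 are in state D. That would be consistent with processes piling up behind stalled I/O, though the snapshot alone does not prove it.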
     
    #7 Hoster2k, Jul 6, 2004
    Last edited: Jul 6, 2004
  8. Tos

    Tos Member

    Joined:
    Oct 27, 2002
    Messages:
    21
    Likes Received:
    0
    Trophy Points:
    1
    It would be nice to see a resolution to this. I am experiencing the same issue on one of our servers.
    When ssh'd into the server, the server will 'freeze' for 10 seconds or so, and when it comes back, top will show the load has gone through the roof. Eventually the server just locks.
    Happens every day :(

    Nothing in the mail queue, nothing in the process list to blame this on.
     
  9. Hoster2k

    When ours does this, it doesn't come back, or rather, we don't allow it time to. We're immediately on it and rebooting.
     
  10. netwrkr

    netwrkr Well-Known Member

    Joined:
    Apr 12, 2003
    Messages:
    203
    Likes Received:
    0
    Trophy Points:
    16

    469 processes? Wow. What are all those processes?
     
  11. Hoster2k

    Now that's a question. We run a logging script every 15 seconds to catch these. However, it doesn't get a chance to run; it happens so fast that nothing is caught in time.
     
  12. netwrkr

    Do you have Linux 'sar' installed?
     
  13. Hoster2k

  14. netwrkr

    Log in to the server and run 'top'. When the process count gets above 300, do a 'ps aux' and paste the output here. Something is obviously spawning out of control.
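
If a `ps aux` dump can be captured, a quick per-user and per-command tally makes a runaway spawner easier to spot. A minimal sketch, assuming the standard `ps aux` column order (USER first, COMMAND as the 11th field); the function name `tally_ps` and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Tuple


def tally_ps(ps_output: str) -> Tuple[Counter, Counter]:
    """Count processes per user and per command name in `ps aux` output."""
    by_user, by_command = Counter(), Counter()
    for line in ps_output.splitlines():
        # ps aux has 11 columns; COMMAND is last and may contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or fields[0] == "USER":
            continue
        by_user[fields[0]] += 1
        # Keep only the executable name, dropping its arguments.
        by_command[fields[10].split()[0]] += 1
    return by_user, by_command


# Hypothetical sample; real input would be the saved output of `ps aux`.
sample = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      2759  0.9  1.3  15752 13312 ?        S    Jul04   3:13 /usr/sbin/httpd -DSSL
robert   17642  0.3  0.3   3808  3808 ?        D    21:51   0:00 /usr/bin/php index.php
james    17632  0.5  0.3   4004  4004 ?        S    21:51   0:00 /usr/bin/php index.php
robert   17650  0.0  0.3   3808  3808 ?        D    21:51   0:00 /usr/bin/php index.php
"""

users, commands = tally_ps(sample)
print(users.most_common(3))
print(commands.most_common(3))
```

Sorting by count rather than by CPU is deliberate here: in this thread the symptom is a growing number of mostly idle processes, which a CPU-sorted view would not surface.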
     
  15. Hoster2k

    As per above, it's not possible to do that because it happens so fast and at a more or less random time; neither I nor anyone else here is available to sit and watch the screen. Basically, a script we have written does just this: every 15 seconds, if the load is over a predefined threshold, it logs a top and ps aux to a file. However, it's not getting a chance to run.
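
For reference, a logger of the kind described can be sketched as follows. This is not the poster's script: the threshold, log path and function names are all assumptions. It reads the 1-minute load from `/proc/loadavg` and, when the load crosses the threshold, appends `ps aux` output to a file and forces it to disk with `os.fsync`, so the snapshot is more likely to survive a hard reset.

```python
import os
import subprocess
import time

LOAD_THRESHOLD = 10.0   # assumed value; tune for the machine
INTERVAL_SECONDS = 15   # matches the interval mentioned in the post
LOG_PATH = "/var/log/load-snapshots.log"  # hypothetical path


def parse_loadavg(text: str) -> float:
    """Return the 1-minute load average from the contents of /proc/loadavg."""
    return float(text.split()[0])


def write_snapshot(path: str, load: float, body: str) -> None:
    """Append a timestamped snapshot and flush it to disk before returning."""
    with open(path, "a") as log:
        log.write(f"=== {time.strftime('%Y-%m-%d %H:%M:%S')} load={load:.2f} ===\n")
        log.write(body)
        log.write("\n")
        log.flush()
        os.fsync(log.fileno())


def watch() -> None:
    """Poll the load average forever, snapshotting `ps aux` when it is high."""
    while True:
        with open("/proc/loadavg") as f:
            load = parse_loadavg(f.read())
        if load >= LOAD_THRESHOLD:
            ps = subprocess.run(["ps", "aux"], capture_output=True, text=True)
            write_snapshot(LOG_PATH, load, ps.stdout)
        time.sleep(INTERVAL_SECONDS)
```

Calling `watch()` as root starts the loop. A user-space poller like this may still be starved when the box stalls within seconds, which matches what is reported here; a lower threshold and a shorter interval improve the odds of catching the build-up but do not guarantee it.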
     
  16. netwrkr

    Hrm. I see you pasted the top above with 469 processes running. Is it not possible to get ps aux, or am I missing something? If you want to know what's wrong you may just have to sit there and look at the screen :)
     
  17. Hoster2k

    The top is from a session that was left open, and that is the last screen it showed before we had to hit the reset switch.
     
  18. netwrkr

    Ouch. A couple of thoughts.

    1. Have you looked at CPU/Memory/MySQL Usage under Server Status within WHM, and does it give you anything useful?

    2. Are you using PHPSuEXEC? If so, I think you might be able to configure some limits with /etc/security/limits.conf: put all users in the same group and limit that group's resource usage there. Not necessarily a fix, but a start.

    3. Any chance of running 'ps aux' every 30 seconds and dumping the output to a file?

    4. Do you allow interactive (SSH) users on the server?
     
  19. r00t316

    r00t316 Active Member

    Joined:
    Nov 29, 2003
    Messages:
    34
    Likes Received:
    0
    Trophy Points:
    6
    A couple of questions...
    1) What OS? Red Hat Enterprise?
    2) When you say locked up, does it just not respond to ping, HTTP, etc.?

    Open the kernel log:
    less /var/log/kernel
    If you do not log there, edit /etc/syslog.conf and make the first line look like:
    kern.* /var/log/kernel

    Then reload syslogd:
    killall -HUP syslogd
    and wait.
    Once the server crashes again, run:
    less /var/log/kernel

    Look for something called NETDEV:
    If you find it and it says "NETDEV: eth# device timed out" or something along those lines, I had the exact same problem.

    I actually fixed it by tweaking some of the TCP/IP settings.

    Please let us know if this was the case.
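
Once kernel messages are going to /var/log/kernel, the search for NETDEV lines can be scripted. A minimal sketch follows; the function name `find_netdev_lines` is hypothetical and the sample log lines are made up for illustration. The exact wording of the message varies by kernel and driver, so it matches on the substring only.

```python
from typing import Iterable, List


def find_netdev_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that mention NETDEV, case-insensitively."""
    return [line.rstrip("\n") for line in lines if "netdev" in line.lower()]


# Made-up sample lines; on the server, pass an open file instead:
#     with open("/var/log/kernel") as log:
#         hits = find_netdev_lines(log)
sample = [
    "Jul  6 08:40:01 host kernel: kjournald starting.\n",
    "Jul  6 08:40:58 host kernel: NETDEV WATCHDOG: eth0: transmit timed out\n",
]

for hit in find_netdev_lines(sample):
    print(hit)
```

Comparing the timestamps of any hits against the time the load started climbing would show whether the network device timeout comes before the lock-up or is only a side effect of it.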
     
  20. Hoster2k

    When I say locked up, it's locked up due to high load. Pings are fine; services just do not respond, SSH will not connect, etc.
     