The Community Forums

Interact with an entire community of cPanel & WHM users!

Server shuts down at 2am every night. I have 18 1/2 hours to figure out why :)

Discussion in 'General Discussion' started by schwim, Dec 4, 2008.

  1. schwim

    schwim Well-Known Member

    Joined:
    Aug 2, 2006
    Messages:
    198
    Likes Received:
    0
    Trophy Points:
    16
    Hi there guys,

    This started two nights ago. Initially, I thought it was caused by a looping email delivery attempt with to to: address(5 errors in a row prior to it shutting down), but that seems to have just been a coincidence.

    I'm waiting for the server to come back up so I can begin perusing the logs and such, but I'm really hoping that someone(s) can give me a smart way to track this down. It's obviously caused by something happening at a set time, so are there any locations I need to look in first?

    Suggestions and thoughts are welcome!

    thanks,
    json
     
  2. darren.nolan

    darren.nolan Well-Known Member

    Joined:
    Oct 4, 2007
    Messages:
    259
    Likes Received:
    0
    Trophy Points:
    16
    Well, if you know to check your /var/log/messages to see what's happening at the server level, you could also check every user's cron jobs to see if they are doing anything they shouldn't be at 2am.

    Check out /var/spool/cron

    Each file is named after the user it belongs to.
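A minimal sketch of that check, assuming the usual five-field crontab format and a spool directory holding one file per user. The function name `jobs_at_hour` is made up for illustration. It only understands `*`, plain numbers and comma lists in the hour field, so ranges such as `1-3` or steps such as `*/2` still need to be read by eye.

```shell
#!/bin/sh
# jobs_at_hour DIR HOUR
# Print "user: entry" for every crontab line in DIR whose hour field
# is "*", HOUR, or a comma list containing HOUR.
# Hypothetical helper; ranges (1-3) and steps (*/2) are NOT expanded.
jobs_at_hour() {
    dir=$1
    hour=$2
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        user=$(basename "$f")
        awk -v u="$user" -v h="$hour" '
            /^[ \t]*#/ { next }   # skip comment lines
            NF < 6     { next }   # skip blank lines and VAR=value lines
            {
                # field 2 of a crontab entry is the hour
                n = split($2, parts, ",")
                for (i = 1; i <= n; i++)
                    if (parts[i] == "*" || (parts[i] ~ /^[0-9]+$/ && parts[i] + 0 == h + 0)) {
                        print u ": " $0
                        break
                    }
            }' "$f"
    done
}
```

Run as root, something like `jobs_at_hour /var/spool/cron 1` and `jobs_at_hour /var/spool/cron 2` would list what is scheduled in the two hours around the failure.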

    If it's 2am, that's usually after the cPanel backup and updates, unless you've chosen different times for those. Your server message log should give you an indication of whether those are running and just killing the server for whatever reason.

    Perhaps log in one evening, sit in top, and watch closely to see if anything spawns that shouldn't just before the server goes down (or maybe something spawned well before 2am and it just takes that long to eat your resources).

    Anyway - hope that helps.
     
  3. schwim

    schwim Well-Known Member

    Joined:
    Aug 2, 2006
    Messages:
    198
    Likes Received:
    0
    Trophy Points:
    16
    Hi there and thanks very much for the response. The server refuses to respond after a reboot, so it seems my problem is a little more severe than normal, but I'm continuing to do what I can without actually being able to contact the server, which isn't much :)

    I can't get to the logs of course, but I viewed the mailed logs that got sent with the 5 min high load alert last night. Here's what got attached to the high load alert:

    ps.txt
    vmstat.txt
    apachestatus.html

    I can't see anything that helps, but I'm a tard when it comes to abbreviated logs. Does anyone see anything that causes concern?

    thanks,
    json
     
  4. mtindor

    mtindor Well-Known Member

    Joined:
    Sep 14, 2004
    Messages:
    1,279
    Likes Received:
    36
    Trophy Points:
    48
    Location:
    inside a catfish
    cPanel Access Level:
    Root Administrator
    Looks like you had a backup process going at the time that email was sent to you. And since your server doesn't respond after coming back up, perhaps it's stuck in single-user mode waiting for somebody to perform an fsck on one of the drives. It may be that you have a failing drive, and that all of the thrashing during the backup, while it is accessing corrupt areas, is causing the server to crash. Then, when it boots back up, it may want an fsck performed to fix any file system errors before it continues into multiuser mode.

    You need to get somebody to get a console up on it and tell you what it's showing at the console.

    Mike
     
  5. schwim

    schwim Well-Known Member

    Joined:
    Aug 2, 2006
    Messages:
    198
    Likes Received:
    0
    Trophy Points:
    16
    Hi there guys and thanks for the suggestions and help so far.

    First, let me say that I know an emergency on my part doesn't constitute one for the whole forum. That being said, the server is back online and I don't know if I'm going to run into the problem again at 2am, so I am hoping to try to find the root of the cause, if possible. I need help determining the best logs to start off with. I have two questions:

    1) What would be the order of most to least pertinent logs to view?

    2) During a reboot, will the previous log get renamed, or do I just read further along the existing log? I've seen some in /var/log that get archived.

    Thanks very much for any help you might be able to provide.

    thanks,
    json
     
  6. Sys Admin

    Sys Admin Well-Known Member

    Joined:
    Apr 29, 2007
    Messages:
    67
    Likes Received:
    0
    Trophy Points:
    6
    cPanel Access Level:
    Root Administrator
    Please provide us with the outputs of:

    1) cat /etc/crontab

    2) crontab -l (this lists root's crontab; crontab -e would open it in an editor instead)

    3)

    ls -l /etc/cron.hourly

    ls -l /etc/cron.daily

    ls -l /etc/cron.weekly

    ls -l /etc/cron.monthly

    Regards,
     
  7. schwim

    schwim Well-Known Member

    Joined:
    Aug 2, 2006
    Messages:
    198
    Likes Received:
    0
    Trophy Points:
    16
    Please let me know if there's anything else you'd like me to provide. Thanks very much in advance for your time.

    1) cat /etc/crontab

    root@server [~]# cat /etc/crontab
    SHELL=/bin/bash
    PATH=/sbin:/bin:/usr/sbin:/usr/bin
    MAILTO=root
    HOME=/

    # run-parts
    01 * * * * root run-parts /etc/cron.hourly
    02 4 * * * root run-parts /etc/cron.daily
    22 4 * * 0 root run-parts /etc/cron.weekly
    42 4 1 * * root run-parts /etc/cron.monthly

    2) crontab -e

    0 5 * * * /root/sa_rules.sh > /dev/null 2>&1
    10 0 * * * perl /usr/mscpanel/mscpanel.pl > /dev/null 2>&1
    20 0 * * * /etc/init.d/clamd restart > /dev/null 2>&1

    0 8,20 * * * /root/chkrootkit.sh | grep -v .packlist
    0 8,20 * * * /root/rkhunter.sh

    10 4 * * * /scripts/upcp
    0 1 * * * /scripts/cpbackup
    */15 * * * * /usr/local/cpanel/whostmgr/bin/dnsqueue > /dev/null 2>&1
    2,58 * * * * /usr/local/bandmin/bandmin
    0 0 * * * /usr/local/bandmin/ipaddrmap
    3 4 * * * /usr/local/cpanel/whostmgr/docroot/cgi/cpaddons_report.pl --notify
    */5 * * * * /usr/local/cpanel/bin/dcpumon >/dev/null 2>&1
    0 6 * * * /scripts/exim_tidydb > /dev/null 2>&1

    3)

    ls -l /etc/cron.hourly

    root@server [~]# ls -l /etc/cron.hourly
    total 20
    drwxr-xr-x 2 root root 4096 Sep 4 08:35 ./
    drwx--x--x 101 root root 8192 Dec 4 17:14 ../
    lrwxrwxrwx 1 root root 20 Aug 19 11:01 logcheck.sh -> /usr/bin/logcheck.sh*
    -rwxr-x--- 1 root root 6118 Sep 4 08:35 modsecparse.pl*
    lrwxrwxrwx 1 root root 42 Aug 19 11:16 update_virus_scanners -> /usr/mailscanner/bin/update_virus_scanners*

    ls -l /etc/cron.daily

    root@server [~]# ls -l /etc/cron.daily
    total 48
    drwxr-xr-x 2 root root 4096 Aug 19 11:16 ./
    drwx--x--x 101 root root 8192 Dec 4 17:15 ../
    -rwxr-xr-x 1 root root 379 Mar 28 2007 0anacron*
    lrwxrwxrwx 1 root root 39 Aug 19 11:02 0logwatch -> /usr/share/logwatch/scripts/logwatch.pl*
    -rwxr-xr-x 1 root root 1001 Aug 1 2007 clean.incoming.cron*
    lrwxrwxrwx 1 root root 47 Aug 19 11:16 clean.quarantine.cron -> /usr/mailscanner/bin/cron/clean.quarantine.cron*
    -rwxr-xr-x 1 root root 118 Jun 21 21:53 cups*
    -rwxr-xr-x 1 root root 219 Aug 16 22:50 logrotate*
    -rwxr-xr-x 1 root root 418 Jan 6 2007 makewhatis.cron*
    -rwxr-xr-x 1 root root 137 Mar 14 2007 mlocate.cron*
    -rwxr-xr-x 1 root root 2181 Jun 21 2006 prelink*
    -rwxr-xr-x 1 root root 114 May 24 2008 rpm*
    -rwxr-xr-x 1 root root 290 Mar 14 2007 tmpwatch*

    ls -l /etc/cron.weekly

    root@server [~]# ls -l /etc/cron.weekly
    total 20
    drwxr-xr-x 2 root root 4096 Feb 4 2008 ./
    drwx--x--x 101 root root 8192 Dec 4 17:15 ../
    -rwxr-xr-x 1 root root 380 Mar 28 2007 0anacron*
    -rwxr-xr-x 1 root root 414 Jan 6 2007 makewhatis.cron*

    ls -l /etc/cron.monthly

    root@server [~]# ls -l /etc/cron.monthly
    total 16
    drwxr-xr-x 2 root root 4096 Feb 4 2008 ./
    drwx--x--x 101 root root 8192 Dec 4 17:15 ../
    -rwxr-xr-x 1 root root 381 Mar 28 2007 0anacron*
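With this many entries it can help to see the fixed-time jobs in clock order. The following is a rough sketch, assuming user-style crontab lines on standard input (no user column, unlike /etc/crontab); `daily_schedule` is a made-up name. Entries whose minute or hour is not a plain number (`*/15`, `8,20` and so on) are skipped, and day-of-week or day-of-month restrictions are ignored.

```shell
#!/bin/sh
# daily_schedule
# Read user-style crontab lines on stdin and print "HH:MM command"
# for entries with a plain numeric minute and hour, sorted by time.
# Hypothetical helper; day fields are ignored, lists and steps skipped.
daily_schedule() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
        # fields 1-5 are the schedule, the rest is the command
        cmd = $6
        for (i = 7; i <= NF; i++) cmd = cmd " " $i
        printf "%02d:%02d %s\n", $2, $1, cmd
    }' | sort
}
```

Usage would be along the lines of `crontab -l | daily_schedule`. Read against the root crontab posted above, the last fixed-time job before 2am is /scripts/cpbackup at 01:00.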
     
  8. schwim

    schwim Well-Known Member

    Joined:
    Aug 2, 2006
    Messages:
    198
    Likes Received:
    0
    Trophy Points:
    16
    Just for an update:

    I called the locator to find out whether the issue was resolved (hardware) or whether they just got it running. Short answer is they just got it running :)

    From the resolution email:

    Now, before considering mailscanner as being the cause: It did throw errors yesterday because when the server went down that day, it corrupted a spool file. I believe those are the errors he found, so I believe that this is a symptom of the crash, not the cause.

    Then the guy stated that the gent actually booting the server said "it wasn't that the server reboot failed, your server was running some kind of diagnostic and it just took a long time to run". Note that the server was unresponsive for over 9 hours after reboot and another tech had already stated that two recycle attempts had failed.

    So the problem has not been resolved and I'm still looking for a reason.

    It seems to happen around the time that backups are running. Should I temporarily disable backups?

    thanks,
    json
     
    #8 schwim, Dec 4, 2008
    Last edited: Dec 4, 2008
  9. mtindor

    mtindor Well-Known Member

    Joined:
    Sep 14, 2004
    Messages:
    1,279
    Likes Received:
    36
    Trophy Points:
    48
    Location:
    inside a catfish
    cPanel Access Level:
    Root Administrator
    I think you really ought to entrust somebody with root access so they can take a look for you. It's one thing to give recommendations, but really I think everybody is flying blind with what information you have. You should check every log in /var/log that has a recent timestamp on it (within the past day or so). Go over the last day's worth of entries in each of those files: dmesg, messages, cron, etc.

    The fact that they say your server was up but 'running a diagnostic' for 8 or 9 hours is ludicrous... what diagnostic? Was it doing an fsck for nine hours? (That's more than a simple diagnostic. If it's fsck, then there is an underlying reason for it running.) I don't think the people who got console on the server for you are giving you any useful information. Perhaps they aren't too clueful? I mean, if I asked a datacenter to check things out, I'd expect them to report more than 'it's running a diagnostic' without telling me which diagnostic it is supposedly running.

    Feel free to PM me if you want me to take a look. I can be on MSN or Skype. I'd rather you find it yourself (that's how you learn), but if you don't find anything useful in the logs you may have trouble finding the problem, and I don't know how many days you want to continue to have a similar experience.

    Again, make sure and check /var/log/messages and /var/log/dmesg and any other log with a recent timestamp.
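One way to build that list of recently touched logs is sketched below. The helper name `recent_logs` is an assumption for illustration; it relies only on `find` with `-mtime`, and the default of one day matches the "past day or so" suggested above.

```shell
#!/bin/sh
# recent_logs DIR [DAYS]
# List regular files under DIR modified less than DAYS days ago
# (default 1). Hypothetical helper name.
recent_logs() {
    find "$1" -type f -mtime "-${2:-1}" | sort
}
```

For example, `recent_logs /var/log` would print the files worth opening first. Rotated copies such as messages.1 show up too if they were written to within the window.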

    Mike
     
  10. schwim

    schwim Well-Known Member

    Joined:
    Aug 2, 2006
    Messages:
    198
    Likes Received:
    0
    Trophy Points:
    16
    Hi there Mike,

    Thanks very much for your help and if it's possible, I'd like to be able to resolve it myself. If I come up empty handed, then I would most definitely be interested in your assistance.

    Thanks very much for your time. I'm very much looking forward to sleeping again :)

    thanks,
    json
     
    #10 schwim, Dec 4, 2008
    Last edited: Dec 5, 2008
  11. schwim

    schwim Well-Known Member

    Joined:
    Aug 2, 2006
    Messages:
    198
    Likes Received:
    0
    Trophy Points:
    16
    Hi there guys,

    I woke up this morning to a crippled server again. It was clear that something was taking all of the resources. If I loaded a page, most times it would simply stall; sometimes I would get a page after a 3 minute load time, and other times a corrupt page load (mysql server down).

    I shelled in (again with a 3 minute lag for any command to be processed) and managed to run top. From what I can tell, all processes look within normal boundaries except for the mysql process. That's not right, is it?

    I tried to restart MySQL from the shell, but the server had quit responding by then.

    Something to note is that the server didn't shut down, and at 7 am it was still running, just not properly.

    I'm waiting for it to come back up so I can get the logs but wanted someone to confirm whether that mysql process is problematic.

    thanks,
    json
     
  12. schwim

    schwim Well-Known Member

    Joined:
    Aug 2, 2006
    Messages:
    198
    Likes Received:
    0
    Trophy Points:
    16
    Hi there again guys,

    I've viewed my logs from around the time of the shutdowns and found things of interest in cron, maillog and messages from all three days. Problem is, I'm not sure what to do with the information.

    I don't know if these are symptoms or causes.

    What would be the next logical step in finding the problem and do you see anything in what I posted that raises concern?

    thanks,
    json
     
  13. mtindor

    mtindor Well-Known Member

    Joined:
    Sep 14, 2004
    Messages:
    1,279
    Likes Received:
    36
    Trophy Points:
    48
    Location:
    inside a catfish
    cPanel Access Level:
    Root Administrator
    It's pointing more and more toward something that is running as a cron job in the early morning. Check your cron log in more detail for anything happening between midnight and the time the machine becomes responsive again.
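A small sketch for pulling one time window out of the cron log, assuming the traditional syslog timestamp at the start of each line ("Dec  5 01:00:01 ..."), the same layout as the /var/log/messages lines quoted elsewhere in this thread. `log_window` and its argument order are made up for illustration.

```shell
#!/bin/sh
# log_window FILE MONTH DAY FROM TO
# Print lines from a syslog-style FILE dated MONTH DAY whose
# HH:MM:SS timestamp lies between FROM and TO (inclusive).
# Hypothetical helper; assumes "Mon DD HH:MM:SS host ..." lines.
log_window() {
    awk -v mon="$2" -v day="$3" -v from="$4" -v to="$5" '
        # fixed-width HH:MM:SS strings compare correctly as text
        $1 == mon && $2 + 0 == day + 0 && $3 >= from && $3 <= to
    ' "$1"
}
```

Something like `log_window /var/log/cron Dec 5 00:00:00 07:00:00` would show every job cron started overnight on the 5th. The same call works on /var/log/messages.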

    Mike
     
  14. schwim

    schwim Well-Known Member

    Joined:
    Aug 2, 2006
    Messages:
    198
    Likes Received:
    0
    Trophy Points:
    16
    Are there files other than just cron that I should be looking in for this? Are there separate, more detailed cron logs per user, or per daily/weekly run, etc.?

    I ask because I'm not seeing anything that shows a problem. For instance, this is an hour in advance of my initial lockup:

    thanks,
    json
     
  15. Sys Admin

    Sys Admin Well-Known Member

    Joined:
    Apr 29, 2007
    Messages:
    67
    Likes Received:
    0
    Trophy Points:
    6
    cPanel Access Level:
    Root Administrator
    Try to run each script manually to see which one is causing this issue.
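A hedged sketch of doing that one directory at a time. `run_each` is a made-up helper, not a cPanel or cron tool. It runs the scripts in name order, discards their output, and reports the exit status and elapsed seconds of each. A symlinked entry (shown by `ls -l` as `name -> target`) is run by its path like any other file. If one of the scripts really is the culprit, running it may well hang the server again, so it is best tried at a quiet time with a console available.

```shell
#!/bin/sh
# run_each DIR
# Run every executable regular file in DIR, one at a time, and print
# "name exit=STATUS seconds=N" for each. Hypothetical helper.
run_each() {
    for s in "$1"/*; do
        # skip subdirectories and anything not marked executable
        if [ ! -f "$s" ] || [ ! -x "$s" ]; then
            continue
        fi
        start=$(date +%s)
        if "$s" > /dev/null 2>&1; then
            status=0
        else
            status=$?
        fi
        end=$(date +%s)
        echo "$(basename "$s") exit=$status seconds=$((end - start))"
    done
}
```

For instance, `run_each /etc/cron.daily` while watching top in a second session would show which script, if any, drives the load up.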
     
  16. schwim

    schwim Well-Known Member

    Joined:
    Aug 2, 2006
    Messages:
    198
    Likes Received:
    0
    Trophy Points:
    16
    Hi there sysadmin, and thanks very much for the reply.

    Before I do this, can I ask the best way to monitor the issue? Am I just to wait to see if the server becomes unreachable, or should I monitor top for that process? Also, how do I run entries that are listed in the form "0logwatch -> /usr/share/logwatch/scripts/logwatch.pl*"?

    [EDIT]
    While browsing the error notices, I found this for Dec 4:

    Matching:

    But there's no way to track down which processes it's talking about is there?

    thanks,
    json
     
    #16 schwim, Dec 5, 2008
    Last edited: Dec 5, 2008
  17. rhenderson

    rhenderson Well-Known Member

    Joined:
    Apr 21, 2005
    Messages:
    785
    Likes Received:
    0
    Trophy Points:
    16
    Location:
    Oklahoma
    cPanel Access Level:
    Root Administrator
    Just curious if you have Selinux enabled?
     
  18. schwim

    schwim Well-Known Member

    Joined:
    Aug 2, 2006
    Messages:
    198
    Likes Received:
    0
    Trophy Points:
    16
    SELinux is not enabled.

    Thanks,
    json
     
  19. schwim

    schwim Well-Known Member

    Joined:
    Aug 2, 2006
    Messages:
    198
    Likes Received:
    0
    Trophy Points:
    16
    Well, at 1:30 am I opened up three ssh connections, one running top, one tailing cron and one tailing messages.

    I've watched it for an hour now and I see nothing out of the ordinary. The load spikes when running the site backups (1.00 - 2.00), but that has happened on every server I have.

    I'll hope it's running tomorrow morning, but if the tradition holds, then I hope at least that I'm not dealing with any data corruption ;)

    thanks,
    json
     
  20. stdout

    stdout Well-Known Member

    Joined:
    Apr 10, 2003
    Messages:
    189
    Likes Received:
    5
    Trophy Points:
    18
    Location:
    Nelspruit, Mpumalanga, South Africa
    cPanel Access Level:
    Root Administrator
    Just a piece of information.
    What I've always done to track down spontaneous reboots is to:
    "grep BIOS-provided /var/log/messages" - This will give you the time the server bootup occurred.

    Now that you have the exact time/date:
    "less /var/log/messages" and inside "less" type "/" and then paste the time and date of the occurrence.

    This will take you to the area of the syslog where the BIOS messages showed up - scroll a bit higher to see if anything was written to syslog prior to the system booting up.

    Example:
    Code:
    root@localhost [~]# grep BIOS-provided /var/log/messages*
    /var/log/messages.2:Nov 21 21:35:11 localhost kernel: BIOS-provided physical RAM map:
    /var/log/messages.2:Nov 22 17:11:35 localhost kernel: BIOS-provided physical RAM map:
    I would "less /var/log/messages.2", hit "/" and paste "Nov 22 17:11:35".
    Now scroll up about 10 lines - see if you spot any potential bugs or errors prior to the reboot/failure.

    This may sound like a backward way of troubleshooting this issue.
    However, it's accurate and works a lot faster than the other suggestions I've been seeing.

    Of course use the same technique to get the failure time and to search in other logs such as cron, mysql, apache and dmesg.
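The search-then-scroll-up steps can also be collapsed into one command using grep's `-B` (lines of context before a match). A minimal sketch, with `before_boot` as a made-up wrapper name:

```shell
#!/bin/sh
# before_boot N FILE...
# For each boot (marked by the kernel's "BIOS-provided" line), print
# that line together with the N lines logged just before it.
# Hypothetical wrapper around grep -B.
before_boot() {
    n=$1
    shift
    grep -B "$n" 'BIOS-provided' "$@"
}
```

For example, `before_boot 10 /var/log/messages*` prints the ten lines leading up to every recorded boot. When several files are given, grep prefixes each line with its file name. The timestamp of the last line before each boot marker gives a rough failure time to look up in the other logs.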
     