Server shuts down at 2am every night. I have 18 1/2 hours to figure out why :)

schwim

Well-Known Member
Aug 2, 2006
213
0
166
Hi there guys,

This started two nights ago. Initially, I thought it was caused by a looping email delivery attempt with to to: address(5 errors in a row prior to it shutting down), but that seems to have just been a coincidence.

I'm waiting for the server to come back up so I can begin perusing the logs, but I'm really hoping that someone can suggest a smart way to track this down. It's obviously caused by something happening at a set time, so are there any locations I should look in first?

Suggestions and thoughts are welcome!

thanks,
json
 

darren.nolan

Well-Known Member
Oct 4, 2007
257
0
66
Well, if you know to check /var/log/messages to see what's happening at the server level, you could also check every user's cron jobs to see whether any of them is doing something it shouldn't at 2am.

Check out /var/spool/cron

Each file is named after the user whose crontab it holds.
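
A minimal sketch of that check in bash, assuming the usual five-field crontab layout. The function name `cron_jobs_at_hour` is made up for illustration. It treats `*` and stepped hours such as `*/2` as "may run in this hour", so it errs on the side of listing too much rather than too little.

```shell
# List crontab entries that can fire during a given hour.
# Usage: cron_jobs_at_hour HOUR [SPOOL_DIR]   (SPOOL_DIR defaults to /var/spool/cron)
cron_jobs_at_hour() {
    local hour="$1" dir="${2:-/var/spool/cron}" f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        awk -v h="$hour" -v user="$(basename "$f")" '
            /^[ \t]*(#|$)/  { next }   # skip comments and blank lines
            $1 !~ /^[0-9*]/ { next }   # skip VAR=value lines and @reboot-style entries
            {
                n = split($2, parts, ",")          # the hour field may be a comma list
                for (i = 1; i <= n; i++) {
                    p = parts[i]
                    sub(/\/.*/, "", p)             # drop a step suffix such as /2
                    if (p == "*") { print user ": " $0; next }
                    m = split(p, r, "-")           # a range such as 0-3
                    if (m == 2 && h + 0 >= r[1] + 0 && h + 0 <= r[2] + 0) { print user ": " $0; next }
                    if (m == 1 && p + 0 == h + 0) { print user ": " $0; next }
                }
            }
        ' "$f"
    done
}
```

Running `cron_jobs_at_hour 2` as root would list every entry that can start between 02:00 and 02:59. It is worth repeating for hour 1 as well, since a job that starts earlier can still be running at 2am.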

If it's 2am, that's usually after the cPanel backup and updates, unless you've chosen different times for those. Your server's messages log should give you an indication of whether those are running and killing the server for whatever reason.

Perhaps log in one evening, sit in top, and watch closely to see if anything spawns that shouldn't just before the server goes down (or maybe something spawned well before 2am and it just takes that long to eat your resources).
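
If staying up to watch top isn't practical, one option is to leave a small logger running that records the load and the busiest processes to a file. This is only a sketch in bash; the function names `snapshot` and `watch_load` and the log path in the usage note are hypothetical.

```shell
# Append one timestamped snapshot of load and the busiest processes to a log file.
# Usage: snapshot LOGFILE
snapshot() {
    local log="$1"
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        uptime || true
        # Processes sorted by CPU use, highest first (skipped if ps is unavailable)
        if command -v ps >/dev/null 2>&1; then
            ps -eo pid,user,pcpu,pmem,args | sort -k3 -nr | head -n 10 || true
        fi
        echo
    } >> "$log" 2>&1
}

# Take a snapshot every INTERVAL seconds until interrupted.
# Usage: watch_load LOGFILE [INTERVAL]
watch_load() {
    local log="$1" interval="${2:-30}"
    while true; do
        snapshot "$log"
        sync    # flush to disk so the last entries survive a hard lockup
        sleep "$interval"
    done
}
```

Starting it with something like `watch_load /root/load-watch.log 30 &` before the backup window means the last few snapshots before the lockup are still on disk after the reboot.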

Anyway - hope that helps.
 

schwim

Well-Known Member
Aug 2, 2006
213
0
166
Hi there and thanks very much for the response. The server refuses to respond after a reboot, so it seems my problem is a little more severe than normal, but I'm continuing to do what I can without actually being able to contact the server, which isn't much :)

I can't get to the logs of course, but I viewed the mailed logs that got sent with the 5 min high load alert last night. Here's what got attached to the high load alert:

ps.txt
vmstat.txt
apachestatus.html

I can't see anything that helps, but I'm hopeless when it comes to abbreviated logs. Does anyone see anything that causes concern?

thanks,
json
 

mtindor

Well-Known Member
Sep 14, 2004
1,454
110
193
inside a catfish
cPanel Access Level
Root Administrator
Looks like you had a backup process going at the time that email was sent to you. And since your server doesn't respond after coming back up, perhaps it's stuck in single-user mode waiting for somebody to perform an fsck on one of the drives. It may be that you have a failing drive, and that all of the thrashing during backup, while it is accessing corrupt areas, is causing the server to crash. Then, when it boots back up, it may want an fsck to be performed to fix any file system errors before continuing into multiuser mode.

You need to get somebody to get a console up on it and tell you what it's showing at the console.
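
Once the server is reachable again, the failing-drive theory can be checked against the kernel messages. The sketch below is bash; the function name `disk_error_lines` is made up, and the patterns are only a few strings the Linux kernel commonly logs for disk and ext file system trouble, not an exhaustive list.

```shell
# Print log lines that commonly accompany a failing disk or a damaged file system.
# Usage: disk_error_lines FILE...
# Returns non-zero when nothing matches (standard grep behaviour).
disk_error_lines() {
    grep -hiE 'I/O error|EXT[234]-fs error|Medium Error|DriveReady SeekComplete Error' "$@"
}
```

For example, `disk_error_lines /var/log/messages*` covers the current and rotated logs. If smartmontools happens to be installed, `smartctl -H` on the drive (the device name will vary) gives a second opinion.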

Mike
 

schwim

Well-Known Member
Aug 2, 2006
213
0
166
Hi there guys and thanks for the suggestions and help so far.

First, let me say that I know an emergency on my part doesn't constitute one for the whole forum. That being said, the server is back online and I don't know if I'm going to run into the problem again at 2am, so I am hoping to try to find the root of the cause, if possible. I need help determining the best logs to start off with. I have two questions:

1) What would be the order of most to least pertinent logs to view?

2) During a reboot, will the previous log get renamed, or do I just look further along in the existing log? I've seen some in /var/log that get archived.

Thanks very much for any help you might be able to provide.

thanks,
json
 

Sys Admin

Well-Known Member
Apr 29, 2007
67
0
156
cPanel Access Level
Root Administrator
Please provide us with the outputs of:

1) cat /etc/crontab

2) crontab -l (this prints root's crontab; crontab -e would open it in an editor instead)

3)

ls -l /etc/cron.hourly

ls -l /etc/cron.daily

ls -l /etc/cron.weekly

ls -l /etc/cron.monthly
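
The requests above can be gathered in one pass. This is a sketch in bash with a made-up function name, `collect_cron_info`. It uses `crontab -l`, which prints the current user's crontab, rather than `crontab -e`, which opens it in an editor.

```shell
# Print the system crontab, the current user's crontab, and the run-parts directories.
# Usage: collect_cron_info > cron-info.txt    (run as root)
collect_cron_info() {
    local d
    echo "### cat /etc/crontab"
    cat /etc/crontab 2>&1 || true
    echo "### crontab -l"
    crontab -l 2>&1 || true
    for d in hourly daily weekly monthly; do
        echo "### ls -l /etc/cron.$d"
        ls -l "/etc/cron.$d" 2>&1 || true
    done
}
```

Each section is printed under a `###` header so the output can be pasted into a post as-is; a missing file or directory shows up as an error message under its header instead of stopping the run.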

Regards,
 

schwim

Well-Known Member
Aug 2, 2006
213
0
166
Please let me know if there's anything else you'd like me to provide. Thanks very much in advance for your time.

1) cat /etc/crontab

[email protected] [~]# cat /etc/crontab
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
HOME=/

# run-parts
01 * * * * root run-parts /etc/cron.hourly
02 4 * * * root run-parts /etc/cron.daily
22 4 * * 0 root run-parts /etc/cron.weekly
42 4 1 * * root run-parts /etc/cron.monthly

2) crontab -e

0 5 * * * /root/sa_rules.sh > /dev/null 2>&1
10 0 * * * perl /usr/mscpanel/mscpanel.pl > /dev/null 2>&1
20 0 * * * /etc/init.d/clamd restart > /dev/null 2>&1

0 8,20 * * * /root/chkrootkit.sh | grep -v .packlist
0 8,20 * * * /root/rkhunter.sh

10 4 * * * /scripts/upcp
0 1 * * * /scripts/cpbackup
*/15 * * * * /usr/local/cpanel/whostmgr/bin/dnsqueue > /dev/null 2>&1
2,58 * * * * /usr/local/bandmin/bandmin
0 0 * * * /usr/local/bandmin/ipaddrmap
3 4 * * * /usr/local/cpanel/whostmgr/docroot/cgi/cpaddons_report.pl --notify
*/5 * * * * /usr/local/cpanel/bin/dcpumon >/dev/null 2>&1
0 6 * * * /scripts/exim_tidydb > /dev/null 2>&1

3)

ls -l /etc/cron.hourly

[email protected] [~]# ls -l /etc/cron.hourly
total 20
drwxr-xr-x 2 root root 4096 Sep 4 08:35 ./
drwx--x--x 101 root root 8192 Dec 4 17:14 ../
lrwxrwxrwx 1 root root 20 Aug 19 11:01 logcheck.sh -> /usr/bin/logcheck.sh*
-rwxr-x--- 1 root root 6118 Sep 4 08:35 modsecparse.pl*
lrwxrwxrwx 1 root root 42 Aug 19 11:16 update_virus_scanners -> /usr/mailscanner/bin/update_virus_scanners*

ls -l /etc/cron.daily

[email protected] [~]# ls -l /etc/cron.daily
total 48
drwxr-xr-x 2 root root 4096 Aug 19 11:16 ./
drwx--x--x 101 root root 8192 Dec 4 17:15 ../
-rwxr-xr-x 1 root root 379 Mar 28 2007 0anacron*
lrwxrwxrwx 1 root root 39 Aug 19 11:02 0logwatch -> /usr/share/logwatch/scripts/logwatch.pl*
-rwxr-xr-x 1 root root 1001 Aug 1 2007 clean.incoming.cron*
lrwxrwxrwx 1 root root 47 Aug 19 11:16 clean.quarantine.cron -> /usr/mailscanner/bin/cron/clean.quarantine.cron*
-rwxr-xr-x 1 root root 118 Jun 21 21:53 cups*
-rwxr-xr-x 1 root root 219 Aug 16 22:50 logrotate*
-rwxr-xr-x 1 root root 418 Jan 6 2007 makewhatis.cron*
-rwxr-xr-x 1 root root 137 Mar 14 2007 mlocate.cron*
-rwxr-xr-x 1 root root 2181 Jun 21 2006 prelink*
-rwxr-xr-x 1 root root 114 May 24 2008 rpm*
-rwxr-xr-x 1 root root 290 Mar 14 2007 tmpwatch*

ls -l /etc/cron.weekly

[email protected] [~]# ls -l /etc/cron.weekly
total 20
drwxr-xr-x 2 root root 4096 Feb 4 2008 ./
drwx--x--x 101 root root 8192 Dec 4 17:15 ../
-rwxr-xr-x 1 root root 380 Mar 28 2007 0anacron*
-rwxr-xr-x 1 root root 414 Jan 6 2007 makewhatis.cron*

ls -l /etc/cron.monthly

[email protected] [~]# ls -l /etc/cron.monthly
total 16
drwxr-xr-x 2 root root 4096 Feb 4 2008 ./
drwx--x--x 101 root root 8192 Dec 4 17:15 ../
-rwxr-xr-x 1 root root 381 Mar 28 2007 0anacron*
 

schwim

Well-Known Member
Aug 2, 2006
213
0
166
Just for an update:

I called the locator to find out whether the issue was resolved (hardware) or whether they just got it running. Short answer is they just got it running :)

From the resolution email:

There are extensive changes made to the default cPanel setup (iptables and mailscanner) and mailscanner is showing errors.
Now, before considering mailscanner as being the cause: It did throw errors yesterday because when the server went down that day, it corrupted a spool file. I believe those are the errors he found, so I believe that this is a symptom of the crash, not the cause.

Then the guy stated that the gent actually booting the server said "it wasn't that the server reboot failed, your server was running some kind of diagnostic and it just took a long time to run". Note that the server was unresponsive for over 9 hours after reboot and another tech had already stated that two recycle attempts had failed.

So the problem has not been resolved and I'm still looking for a reason.

It seems to happen around the time that backups are running. Should I temporarily disable backups?

thanks,
json
 

mtindor

Well-Known Member
Sep 14, 2004
1,454
110
193
inside a catfish
cPanel Access Level
Root Administrator
I think you really ought to entrust somebody with root access so they can take a look for you. It's one thing to give recommendations, but really I think everybody is flying blind with what information you have. You should check every log in /var/log that has a recent timestamp on it (within the past day or so) and go over the last day's worth of entries in each of those files: dmesg, messages, cron, etc.

The fact that they say your server was up but 'running a diagnostic' for 8 or 9 hours is ludicrous... what diagnostic? Was it doing an fsck for nine hours? (That's more than a simple diagnostic - if it's fsck, then there is an underlying reason for it running.) I don't think the people who got console on the server for you are giving you any useful information - perhaps they aren't too clueful? I mean, if I asked a datacenter to check things out, I'd expect them to report more than 'it's running a diagnostic' without telling me what supposed diagnostic it is running.

Feel free to PM me if you want me to take a look. I can be on MSN or Skype. I'd rather you find it yourself (that's how you learn), but if you don't find anything useful in the logs you may have trouble finding the problem, and I don't know how many days you want to continue to have a similar experience.

Again, make sure and check /var/log/messages and /var/log/dmesg and any other log with a recent timestamp.
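
A quick way to find which logs have a recent timestamp, sketched in bash. The function name `recent_logs` is made up; `find -mtime -1` matches files modified within the last 24 hours.

```shell
# List regular files under a directory that were modified within the last N days.
# Usage: recent_logs [DIR] [DAYS]   (defaults: /var/log, 1)
recent_logs() {
    local dir="${1:-/var/log}" days="${2:-1}"
    find "$dir" -type f -mtime "-$days" | sort
}
```

Running `recent_logs` with no arguments gives the list of files under /var/log worth reading first.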

Mike
 

schwim

Well-Known Member
Aug 2, 2006
213
0
166
Hi there Mike,

Thanks very much for your help and if it's possible, I'd like to be able to resolve it myself. If I come up empty handed, then I would most definitely be interested in your assistance.

Thanks very much for your time. I'm very much looking forward to sleeping again :)

thanks,
json
 

schwim

Well-Known Member
Aug 2, 2006
213
0
166
Hi there guys,

I woke up this morning to a crippled server again. It was clear that something was taking all of the resources. If I loaded a page, most times it would simply stall; sometimes I would get a page after a 3 min load time, and other times a corrupt page load (mysql server down).

I shelled in (again with a 3 minute lag for any command to be processed) and managed to run top. From what I can tell, all processes look within normal boundaries except for the mysql process. That's not right, is it?

I tried to restart MySQL from the shell, but the server had quit responding by then.

Something to note is that the server didn't shut down, and at 7 am it was still running, just not properly.

I'm waiting for it to come back up so I can get the logs but wanted someone to confirm whether that mysql process is problematic.

thanks,
json
 

schwim

Well-Known Member
Aug 2, 2006
213
0
166
Hi there again guys,

I've viewed my logs from around the time of the shutdowns and found things of interest in cron, maillog and messages from all three days. Problem is, I'm not sure what to do with the information.

Two days ago:

Cron:
Dec 3 02:01:08 server crond[5266]: (root) CMD (run-parts /etc/cron.hourly)
Dec 3 02:02:27 server crond[5308]: (root) CMD (/usr/local/bandmin/bandmin)
Dec 3 02:05:11 server crond[5579]: (root) CMD (/usr/local/cpanel/bin/dcpumon >/dev/null 2>&1)
Dec 3 09:23:36 server crond[3640]: (CRON) STARTUP (V5.0)
Dec 3 09:20:19 server crond[4729]: (root) CMD (/usr/local/cpanel/bin/dcpumon >/dev/null 2>&1)
Dec 3 09:20:19 server crond[4731]: (root) CMD (/usr/lib/sa/sa1 1 1)

maillog:
Dec 3 02:14:11 server pop3d: LOGIN, user=************, ip=[::ffff:75.106.184.10], port=[62725]
Dec 3 02:17:04 server pop3d: Connection, ip=[::ffff:64.233.182.140]
Dec 3 02:18:36 server pop3d: Disconnected, ip=[::ffff:75.106.184.10]
Dec 3 02:18:36 server pop3d: Maximum connection limit reached for ::ffff:75.106.184.10
Dec 3 09:23:05 server authdaemond: modules="authpipe", daemons=5
Dec 3 09:23:05 server authdaemond: Installing libauthpipe
Dec 3 09:23:05 server authdaemond: Installation complete: authpipe
Dec 3 09:23:15 server pop3d: Connection, ip=[::ffff:75.106.184.10]

messages:

Dec 3 01:58:13 server pure-ftpd: ([email protected]) [INFO] New connection from 127.0.0.1
Dec 3 01:58:14 server pure-ftpd: ([email protected]) [INFO] Logout.
Dec 3 02:06:46 server named[31018]: lame server resolving '96.101.246.218.in-addr.arpa' (in '101.246.218.in-addr.arpa'?): 211.99.129.210#53
Dec 3 02:06:47 server named[31018]: lame server resolving '96.101.246.218.in-addr.arpa' (in '101.246.218.in-addr.arpa'?): 211.99.129.211#53
Dec 3 02:06:50 server named[31018]: lame server resolving '96.101.246.218.in-addr.arpa' (in '101.246.218.in-addr.arpa'?): 211.99.129.210#53
Dec 3 02:06:51 server named[31018]: lame server resolving '96.101.246.218.in-addr.arpa' (in '101.246.218.in-addr.arpa'?): 211.99.129.211#53
Dec 3 02:06:51 server named[31018]: lame server resolving '96.101.246.218.in-addr.arpa' (in '101.246.218.in-addr.arpa'?): 211.99.129.210#53
Dec 3 02:06:51 server named[31018]: lame server resolving '96.101.246.218.in-addr.arpa' (in '101.246.218.in-addr.arpa'?): 211.99.129.211#53
Dec 3 02:06:52 server named[31018]: lame server resolving '96.101.246.218.in-addr.arpa' (in '101.246.218.in-addr.arpa'?): 211.99.129.210#53
Dec 3 02:06:52 server named[31018]: lame server resolving '96.101.246.218.in-addr.arpa' (in '101.246.218.in-addr.arpa'?): 211.99.129.211#53
Dec 3 02:10:18 server named[31018]: unexpected RCODE (REFUSED) resolving 'ns.wuhan.net.cn/A/IN': 202.103.0.117#53
Dec 3 02:10:18 server named[31018]: unexpected RCODE (REFUSED) resolving 'ns.wuhan.net.cn/AAAA/IN': 202.103.0.117#53
Dec 3 02:10:19 server named[31018]: lame server resolving 'ns.gdgzptt.net.cn' (in 'gdgzptt.net.cn'?): 202.103.224.69#53
Dec 3 09:23:02 server syslogd 1.4.1: restart.

Yesterday's lockup:

cron:
Dec 4 02:12:40 server crond[30159]: (root) error: Job execution of per-minute job scheduled for 02:10 delayed into subsequent minute 02:12. Skipping job run.
Dec 4 02:12:40 server crond[30159]: CRON (root) ERROR: cannot set security context
Dec 4 02:14:09 server crond[30160]: (root) error: Job execution of per-minute job scheduled for 02:10 delayed into subsequent minute 02:14. Skipping job run.
Dec 4 02:14:09 server crond[30160]: CRON (root) ERROR: cannot set security context
Dec 4 02:14:50 server crond[30161]: (root) error: Job execution of per-minute job scheduled for 02:10 delayed into subsequent minute 02:14. Skipping job run.
Dec 4 02:14:50 server crond[30161]: CRON (root) ERROR: cannot set security context
Dec 4 02:17:49 server crond[30217]: (root) error: Job execution of per-minute job scheduled for 02:15 delayed into subsequent minute 02:17. Skipping job run.
Dec 4 02:17:49 server crond[30217]: CRON (root) ERROR: cannot set security context
Dec 4 07:47:49 server crond[3624]: (CRON) STARTUP (V5.0)
Dec 4 07:45:16 server crond[4832]: (root) CMD (/usr/local/cpanel/whostmgr/bin/dnsqueue > /dev/null 2>&1)

maillog:

Dec 4 02:22:20 server pop3d: Connection, ip=[::ffff:66.119.98.183]
Dec 4 02:22:20 server pop3d: Disconnected, ip=[::ffff:66.119.98.183]
Dec 4 02:25:36 server pop3d: Connection, ip=[::ffff:66.119.98.183]
Dec 4 02:25:56 server pop3d: Disconnected, ip=[::ffff:66.119.98.183]
Dec 4 07:47:17 server authdaemond: modules="authpipe", daemons=5
Dec 4 07:47:17 server authdaemond: Installing libauthpipe
Dec 4 07:47:17 server authdaemond: Installation complete: authpipe
Dec 4 07:47:40 server MailScanner[3552]: MailScanner E-Mail Virus Scanner version 4.73.4 starting...

messages:

Dec 4 01:52:24 server named[2026]: lame server resolving '44.194.80.208.in-addr.arpa' (in '194.80.208.in-addr.arpa'?): 204.15.67.53#53
Dec 4 01:52:24 server named[2026]: lame server resolving '44.194.80.208.in-addr.arpa' (in '194.80.208.in-addr.arpa'?): 204.15.69.53#53
Dec 4 01:52:24 server named[2026]: lame server resolving '44.194.80.208.in-addr.arpa' (in '194.80.208.in-addr.arpa'?): 204.15.66.53#53
Dec 4 01:59:44 server pure-ftpd: ([email protected]) [INFO] New connection from 127.0.0.1
Dec 4 01:59:45 server pure-ftpd: ([email protected]) [INFO] Logout.
Dec 4 07:47:15 server syslogd 1.4.1: restart.

and from today:

Cron:

Dec 5 08:13:00 server crond[12045]: (root) error: Job execution of per-minute job scheduled for 08:10 delayed into subsequent minute 08:13. Skipping job run.
Dec 5 08:13:00 server crond[12045]: CRON (root) ERROR: cannot set security context
Dec 5 08:54:57 server crond[3622]: (CRON) STARTUP (V5.0)
Dec 5 08:55:01 server crond[3903]: (root) CMD (/usr/local/cpanel/bin/dcpumon >/dev/null 2>&1)

maillog:

Dec 5 08:11:12 server MailScanner[1748]: File checker /usr/bin/file timed out!
Dec 5 08:11:28 server pop3d: LOGIN, user=************, ip=[::ffff:74.170.251.172], port=[50990]
Dec 5 08:11:29 server MailScanner[1748]: Virus and Content Scanning: Starting
Dec 5 08:11:31 server pop3d: LOGIN FAILED, user=************, ip=[::ffff:74.125.46.160]
Dec 5 08:11:31 server pop3d: authentication error: Input/output error
Dec 5 08:12:20 server pop3d: LOGOUT, user=************, ip=[::ffff:74.170.251.172], port=[50990], top=0, retr=861758, rcvd=378, sent=877241, time=53
Dec 5 08:12:25 server pop3d: Connection, ip=[::ffff:66.119.98.183]
Dec 5 08:12:25 server MailScanner[2192]: Commercial virus checker failed with real error: Can't fork at /usr/mailscanner/lib/MailScanner/SweepViruses.pm line 999.
Dec 5 08:12:29 server MailScanner[1339]: Commercial virus checker failed with real error: Can't fork at /usr/mailscanner/lib/MailScanner/SweepViruses.pm line 999.
Dec 5 08:12:55 server pop3d: LOGIN FAILED, user=************, ip=[::ffff:66.119.98.183]
Dec 5 08:12:55 server pop3d: authentication error: Input/output error
Dec 5 08:13:38 server pop3d: Connection, ip=[::ffff:66.119.98.183]
Dec 5 08:13:38 server pop3d: Connection, ip=[::ffff:66.119.98.183]
Dec 5 08:14:31 server pop3d: LOGIN FAILED, user=************, ip=[::ffff:66.119.98.183]
Dec 5 08:14:33 server pop3d: authentication error: Input/output error
Dec 5 08:54:27 server authdaemond: modules="authpipe", daemons=5
Dec 5 08:54:27 server authdaemond: Installing libauthpipe

messages:

Dec 5 08:06:10 server pure-ftpd: ([email protected]) [INFO] New connection from 127.0.0.1
Dec 5 08:06:11 server pure-ftpd: ([email protected]) [INFO] Logout.
Dec 5 08:08:30 server named[2110]: lame server resolving 'adsl196-163-166-217-196.adsl196-14.iam.net.ma' (in 'IAM.NET.ma'?): 212.217.1.1#53
Dec 5 08:08:39 server named[2110]: lame server resolving 'adsl196-163-166-217-196.adsl196-14.iam.net.ma' (in 'IAM.NET.ma'?): 212.217.0.1#53
Dec 5 08:09:23 server named[2110]: lame server resolving 'adsl196-163-166-217-196.adsl196-14.iam.net.ma' (in 'IAM.NET.ma'?): 212.217.0.12#53
Dec 5 08:54:24 server syslogd 1.4.1: restart.
I don't know if these are symptoms or causes.

What would be the next logical step in finding the problem and do you see anything in what I posted that raises concern?

thanks,
json
 

mtindor

Well-Known Member
Sep 14, 2004
1,454
110
193
inside a catfish
cPanel Access Level
Root Administrator
This is pointing more and more toward something that is running as a cron job in the early morning. Check your cron log in more detail for anything happening between midnight and the time the machine becomes responsive again.
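
To pull just that window out of the cron log, something like the following could work. It is a bash sketch with a made-up function name, `log_window`, and it assumes the classic syslog line format seen in this thread (month, day, HH:MM:SS, host, message).

```shell
# Print syslog-style lines for one day whose time falls inside [START, END].
# Usage: log_window FILE MONTH DAY START END
log_window() {
    local file="$1" mon="$2" day="$3" start="$4" end="$5"
    # HH:MM:SS is zero-padded, so a plain string comparison orders times correctly.
    awk -v mon="$mon" -v day="$day" -v s="$start" -v e="$end" '
        $1 == mon && $2 + 0 == day + 0 && $3 >= s && $3 <= e
    ' "$file"
}
```

For example, `log_window /var/log/cron Dec 3 00:00:00 02:30:00 | grep -v dcpumon` shows the early-morning jobs with the every-five-minutes noise filtered out.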

Mike
 

schwim

Well-Known Member
Aug 2, 2006
213
0
166
Are there files other than just cron that I should be looking in for this? Are there separate, more detailed logs for cron per user or per daily/weekly, etc?

I ask because I'm not seeing anything that shows a problem. For instance, this is the hour leading up to my initial lockup:

Dec 3 01:15:01 server crond[720]: (root) CMD (/usr/local/cpanel/bin/dcpumon >/dev/null 2>&1)
Dec 3 01:15:01 server crond[722]: (root) CMD (/usr/local/cpanel/whostmgr/bin/dnsqueue > /dev/null 2>&1)
Dec 3 01:15:41 server crontab[775]: (root) LIST (root)
Dec 3 01:16:28 server crontab[851]: (root) LIST (root)
Dec 3 01:16:42 server crontab[905]: (root) LIST (root)
Dec 3 01:16:56 server crontab[956]: (root) LIST (root)
Dec 3 01:17:22 server crontab[1091]: (root) LIST (root)
Dec 3 01:17:47 server crontab[1151]: (root) LIST (root)
Dec 3 01:18:08 server crontab[1213]: (root) LIST (root)
Dec 3 01:20:01 server crond[1473]: (root) CMD (perl /usr/mscpanel/msbe.pl > /dev/null 2>&1)
Dec 3 01:20:01 server crond[1474]: (root) CMD (/usr/lib/sa/sa1 1 1)
Dec 3 01:20:01 server crond[1477]: (root) CMD (/usr/local/cpanel/bin/dcpumon >/dev/null 2>&1)
Dec 3 01:25:01 server crond[1958]: (root) CMD (/usr/local/cpanel/bin/dcpumon >/dev/null 2>&1)
Dec 3 01:30:02 server crond[2268]: (root) CMD (perl /usr/mscpanel/msbe.pl > /dev/null 2>&1)
Dec 3 01:30:02 server crond[2269]: (root) CMD (/usr/lib/sa/sa1 1 1)
Dec 3 01:30:02 server crond[2270]: (root) CMD (/usr/local/cpanel/bin/dcpumon >/dev/null 2>&1)
Dec 3 01:30:02 server crond[2273]: (root) CMD (/usr/local/cpanel/whostmgr/bin/dnsqueue > /dev/null 2>&1)
Dec 3 01:34:02 server crontab[2686]: (root) LIST (root)
Dec 3 01:35:01 server crond[2763]: (root) CMD (/usr/local/cpanel/bin/dcpumon >/dev/null 2>&1)
Dec 3 01:40:01 server crond[3274]: (root) CMD (perl /usr/mscpanel/msbe.pl > /dev/null 2>&1)
Dec 3 01:40:01 server crond[3277]: (root) CMD (/usr/lib/sa/sa1 1 1)
Dec 3 01:40:01 server crond[3279]: (root) CMD (/usr/local/cpanel/bin/dcpumon >/dev/null 2>&1)
Dec 3 01:45:01 server crond[3614]: (root) CMD (/usr/local/cpanel/bin/dcpumon >/dev/null 2>&1)
Dec 3 01:45:01 server crond[3615]: (root) CMD (/usr/local/cpanel/whostmgr/bin/dnsqueue > /dev/null 2>&1)
Dec 3 01:50:01 server crond[4041]: (root) CMD (perl /usr/mscpanel/msbe.pl > /dev/null 2>&1)
Dec 3 01:50:01 server crond[4044]: (root) CMD (/usr/lib/sa/sa1 1 1)
Dec 3 01:50:01 server crond[4046]: (root) CMD (/usr/local/cpanel/bin/dcpumon >/dev/null 2>&1)
Dec 3 01:55:21 server crond[4600]: (root) CMD (/usr/local/cpanel/bin/dcpumon >/dev/null 2>&1)
Dec 3 01:58:19 server crond[4852]: (root) CMD (/usr/local/bandmin/bandmin)
Dec 3 02:00:08 server crond[5167]: (root) CMD (/usr/local/cpanel/bin/dcpumon >/dev/null 2>&1)
Dec 3 02:00:08 server crond[5168]: (root) CMD (perl /usr/mscpanel/msbe.pl > /dev/null 2>&1)
Dec 3 02:00:08 server crond[5169]: (root) CMD (/usr/local/cpanel/whostmgr/bin/dnsqueue > /dev/null 2>&1)
Dec 3 02:00:08 server crond[5173]: (root) CMD (/usr/lib/sa/sa1 1 1)
Dec 3 02:01:08 server crond[5266]: (root) CMD (run-parts /etc/cron.hourly)
Dec 3 02:02:27 server crond[5308]: (root) CMD (/usr/local/bandmin/bandmin)
Dec 3 02:05:11 server crond[5579]: (root) CMD (/usr/local/cpanel/bin/dcpumon >/dev/null 2>&1)
Dec 3 09:23:36 server crond[3640]: (CRON) STARTUP (V5.0)
thanks,
json
 

Sys Admin

Well-Known Member
Apr 29, 2007
67
0
156
cPanel Access Level
Root Administrator
[email protected] [~]# ls -l /etc/cron.daily
total 48
drwxr-xr-x 2 root root 4096 Aug 19 11:16 ./
drwx--x--x 101 root root 8192 Dec 4 17:15 ../
-rwxr-xr-x 1 root root 379 Mar 28 2007 0anacron*
lrwxrwxrwx 1 root root 39 Aug 19 11:02 0logwatch -> /usr/share/logwatch/scripts/logwatch.pl*
-rwxr-xr-x 1 root root 1001 Aug 1 2007 clean.incoming.cron*
lrwxrwxrwx 1 root root 47 Aug 19 11:16 clean.quarantine.cron -> /usr/mailscanner/bin/cron/clean.quarantine.cron*
-rwxr-xr-x 1 root root 118 Jun 21 21:53 cups*
-rwxr-xr-x 1 root root 219 Aug 16 22:50 logrotate*
-rwxr-xr-x 1 root root 418 Jan 6 2007 makewhatis.cron*
-rwxr-xr-x 1 root root 137 Mar 14 2007 mlocate.cron*
-rwxr-xr-x 1 root root 2181 Jun 21 2006 prelink*
-rwxr-xr-x 1 root root 114 May 24 2008 rpm*
-rwxr-xr-x 1 root root 290 Mar 14 2007 tmpwatch*
Try to run each script manually to see which one is causing this issue.
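
One way to do that without losing track of which script was running is a small wrapper that runs them one at a time and reports how long each took. This is a bash sketch; the function name `run_each` is made up. Note that these are real maintenance jobs (logrotate, prelink, tmpwatch and so on), so running them by hand does real work on the server.

```shell
# Run every executable in a directory one at a time, reporting duration and exit status.
# Symlinks such as 0logwatch are followed, so they run like any other script.
# Usage: run_each DIR
run_each() {
    local dir="$1" script start end status
    for script in "$dir"/*; do
        [ -f "$script" ] && [ -x "$script" ] || continue
        echo ">>> $(date '+%H:%M:%S') starting $script"
        start=$(date +%s)
        status=0
        "$script" >/dev/null 2>&1 || status=$?
        end=$(date +%s)
        echo "<<< $script exit=$status seconds=$((end - start))"
    done
}
```

With `run_each /etc/cron.daily` in one terminal and top in another, a script that never prints its closing `<<<` line, or takes far longer than the rest, is the one to look at. A single entry can also be run on its own by path, for example `/etc/cron.daily/0logwatch`.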
 

schwim

Well-Known Member
Aug 2, 2006
213
0
166
Hi there sysadmin, and thanks very much for the reply.

Before I do this, can I ask the best way to monitor the issue? Am I just to wait to see if the server becomes unreachable, or should I monitor top for that process? Also, how do I run processes in the form of "0logwatch -> /usr/share/logwatch/scripts/logwatch.pl*"?

[EDIT]
While browsing the error notices, I found this for Dec 4:

--------------------- Cron Begin ------------------------


Wrong file owner (): 4 Time(s)

**Unmatched Entries**
Dec 4 02:12:40 server crond[30159]: CRON (root) ERROR: cannot set security context
Dec 4 02:14:09 server crond[30160]: CRON (root) ERROR: cannot set security context
Dec 4 02:14:50 server crond[30161]: CRON (root) ERROR: cannot set security context
Dec 4 02:17:49 server crond[30217]: CRON (root) ERROR: cannot set security context

---------------------- Cron End -------------------------
Matching:

Dec 4 02:12:40 server crond[30159]: (root) error: Job execution of per-minute job scheduled for 02:10 delayed into subsequent minute 02:12. Skipping job run.
Dec 4 02:12:40 server crond[30159]: CRON (root) ERROR: cannot set security context
Dec 4 02:14:09 server crond[30160]: (root) error: Job execution of per-minute job scheduled for 02:10 delayed into subsequent minute 02:14. Skipping job run.
Dec 4 02:14:09 server crond[30160]: CRON (root) ERROR: cannot set security context
Dec 4 02:14:50 server crond[30161]: (root) error: Job execution of per-minute job scheduled for 02:10 delayed into subsequent minute 02:14. Skipping job run.
Dec 4 02:14:50 server crond[30161]: CRON (root) ERROR: cannot set security context
Dec 4 02:17:49 server crond[30217]: (root) error: Job execution of per-minute job scheduled for 02:15 delayed into subsequent minute 02:17. Skipping job run.
Dec 4 02:17:49 server crond[30217]: CRON (root) ERROR: cannot set security context
But there's no way to track down which processes it's talking about, is there?

thanks,
json
 

schwim

Well-Known Member
Aug 2, 2006
213
0
166
Well, at 1:30 am I opened up three ssh connections, one running top, one tailing cron and one tailing messages.

I've watched it for an hour now and I see nothing out of the ordinary. The load spikes when running the site backups (1.00 - 2.00) but that has happened on every server I have.

I'll hope it's running tomorrow morning, but if the tradition holds, then I hope at least that I'm not dealing with any data corruption ;)

thanks,
json
 

stdout

Well-Known Member
Apr 10, 2003
189
7
168
Nelspruit, Mpumalanga, South Africa
cPanel Access Level
Root Administrator
Just a piece of information.
What I've always done to track down spontaneous reboots is to:
"grep BIOS-provided /var/log/messages" - This will give you the time each server bootup occurred.

Now that you have the exact time/date:
"less /var/log/messages" and inside "less" type "/" and then paste the time and date of the occurrence.

This will take you to the area of the syslog where the BIOS messages showed up - scroll a bit higher to see if anything was written to syslog prior to the system booting up.

Example:
Code:
[email protected] [~]# grep BIOS-provided /var/log/messages*
/var/log/messages.2:Nov 21 21:35:11 localhost kernel: BIOS-provided physical RAM map:
/var/log/messages.2:Nov 22 17:11:35 localhost kernel: BIOS-provided physical RAM map:
I would "less /var/log/messages.2", hit "/" and paste "Nov 22 17:11:35".
Now scroll up about 10 lines - see if you spot any potential bugs or errors prior to the reboot/failure.

This may sound like a backward way of troubleshooting this issue.
However, it's accurate and works a lot faster than the other suggestions I've been seeing.

Of course use the same technique to get the failure time and to search in other logs such as cron, mysql, apache and dmesg.
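
The two steps can also be collapsed into one command with grep's `-B` option, which prints the lines before each match. A minimal bash sketch, with a made-up function name `before_boot`:

```shell
# Show the N lines logged just before each boot marker in one or more syslog files.
# Usage: before_boot N FILE...
before_boot() {
    local n="$1"
    shift
    grep -B "$n" 'BIOS-provided physical RAM map' "$@"
}
```

For example, `before_boot 10 /var/log/messages*` prints the ten lines preceding every bootup in the current and rotated logs. In the logs posted earlier in this thread, the first line after each outage is `syslogd 1.4.1: restart.`, so that string could serve as the marker instead if the BIOS line isn't present.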