Strange problem resulting in hung server.

dc2447

Well-Known Member
Apr 18, 2003
49
0
156
I have a dual Xeon server with 4GB of RAM. The server runs fine pretty much all the time, but this morning it became unresponsive and I couldn't access it over httpd or ssh.

The server was rebooted and I managed to get an ssh session just after the reboot for a few moments, until I couldn't continue as the server had apparently run out of RAM. Another reboot and I got on again via ssh, but when I tried to su, ssh couldn't fork as it was out of RAM already (less than 1 minute after coming up). There is something obviously wrong, as the server rarely uses more than about 1.2GB of RAM at any time.
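For the record, a quick way to see what is actually consuming memory in a situation like this - a minimal sketch, assuming standard procps free and ps (sort flags may differ on very old distributions):

```shell
# Snapshot overall memory and the largest resident processes.
# Assumes procps-style free/ps; flags may differ on very old systems.
free -m                          # RAM and swap usage in MB
ps aux --sort=-rss | head -n 15  # top 15 processes by resident memory
```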

I would normally put this behaviour down to high traffic, but the server is really quiet today and there is nothing to suggest that we are being DoS'd or that there are any large traffic spikes. MRTG on the switch shows no large bursts of traffic.

My second suspicion is that there is a hardware fault on the server - however, we had the exact same experience with the server hanging a few months ago, which went away as quickly as it came, and we have been fine since then.

No new scripts have been uploaded to the server recently either.

Has anyone seen anything like this before on a server running cPanel? The forums don't suggest much - mostly people running unoptimised MySQL (mine is optimised).

Any thoughts?
 

chirpy

Well-Known Member
Verified Vendor
Jun 15, 2002
13,437
33
473
Go on, have a guess
A few things that are simple to check:

1. What OS and kernel are you running? If it's RHEL or CentOS with an old kernel, make sure you upgrade it to their latest release

2. Do you have the laus rpm installed? If so, uninstall it (search the forum on how to best do that)

As you say, such problems can happen with hardware problems, especially bad memory sticks.
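A sketch of those two checks, assuming an RPM-based system (the rpm query exits non-zero and prints "package laus is not installed" if it is absent):

```shell
# Check the running kernel and whether the laus package is present.
uname -r                          # running kernel version
if command -v rpm >/dev/null 2>&1; then
    rpm -q kernel                 # installed kernel package(s)
    rpm -q laus || true           # non-zero exit just means "not installed"
fi
```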
 

dc2447

Well-Known Member
Apr 18, 2003
49
0
156
1. What OS and kernel are you running? If it's RHEL or CentOS with an old kernel, make sure you upgrade it to their latest release
Rather pathetically, I am on Red Hat 7.2 - albeit running the latest 2.4 kernel.

2. Do you have the laus rpm installed? If so, uninstall it (search the forum on how to best do that)
Nope

As you say, such problems can happen with hardware problems, especially bad memory sticks.
The memory was actually swapped out before but it made no difference.

I really need to get off this server as soon as possible - it's just that the thought of migrating everything puts me off.

The server has now been up for about 15 minutes, is serving 3 times the traffic it was when it last crashed, and is using less than a gig of RAM.

Weird.
 

AndyReed

Well-Known Member
PartnerNOC
May 29, 2004
2,217
4
193
Minneapolis, MN
Did you check the error messages log /var/log/messages?
Do any errors point to a software and/or hardware fault?
 

dc2447

Well-Known Member
Apr 18, 2003
49
0
156
Did you check the error messages log /var/log/messages?
For sure - nothing in syslog at all.

Do any errors point to a software and/or hardware fault?
Not sure - it seems the OS just starts believing that all the RAM is gone and starts swapping, when in fact nothing is actually using any RAM at all.
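One caveat when reading free on a 2.4-era kernel: Linux uses spare RAM for disk cache, so the first "used" figure overstates real pressure. A rough sketch for checking genuine availability and swap activity (assumes procps free and vmstat; the "-/+ buffers/cache" row only appears in older procps versions):

```shell
# The first "used" figure from free includes buffers/cache; on older
# procps the "-/+ buffers/cache" row shows what applications really
# have. Sustained non-zero si/so columns in vmstat indicate genuine
# swap thrashing rather than harmless caching.
free -m
if command -v vmstat >/dev/null 2>&1; then
    vmstat 1 3   # watch si/so (swap-in/out) over a few seconds
fi
```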
 

dc2447

Well-Known Member
Apr 18, 2003
49
0
156
The server just stopped responding again, for no apparent reason other than, I assume, suddenly having no free memory.



I actually got a shell on the box:

Code:
5:09pm  up  5:41,  1 user,  load average: 123.55, 237.06, 211.77
331 processes: 300 sleeping, 26 running, 5 zombie, 0 stopped
CPU0 states: 83.4% user, 16.0% system,  0.0% nice,  0.0% idle
CPU1 states: 85.1% user, 14.3% system,  0.0% nice,  0.0% idle
CPU2 states: 83.3% user, 16.1% system,  0.0% nice,  0.0% idle
CPU3 states: 86.0% user, 14.0% system,  0.0% nice,  0.0% idle
Mem:  4068524K av, 3793132K used,  275392K free,       0K shrd,   13504K buff
Swap: 2048276K av,  119036K used, 1929240K free                  212960K cached
 

chirpy

Well-Known Member
Verified Vendor
Jun 15, 2002
13,437
33
473
Go on, have a guess
That load average seems to suggest a looping process rather than a memory thrashing problem to me. TBH, I'd go with getting off the server and onto a new one as you're running RH7.2 if it's something that you want to do anyway. In the long run, it'll definitely be worth it. Unless you've done a load of OS level customisations, moving server using cPanel full backups is reasonably painless these days.

Have you checked things like /tmp, /var/tmp, /usr/local/apache/proxy and /dev/shm for exploits, and run chkrootkit and rkhunter - just in case there are exploits running? The processes in a CPU-sorted top might help (press P and then c in top).
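A non-interactive equivalent of that CPU-sorted top, handy for pasting output into a post - a sketch assuming procps ps (older versions may want --sort -pcpu instead):

```shell
# Top 15 processes by CPU usage, with full command lines shown.
ps aux --sort=-%cpu | head -n 15
```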
 

dc2447

Well-Known Member
Apr 18, 2003
49
0
156
That load average seems to suggest a looping process rather than a memory thrashing problem to me. TBH, I'd go with getting off the server and onto a new one as you're running RH7.2 if it's something that you want to do anyway. In the long run, it'll definitely be worth it. Unless you've done a load of OS level customisations, moving server using cPanel full backups is reasonably painless these days.
I agree with what you are saying. The problem appears to be with Apache: some processes are taking a lot of CPU time whilst most of the others behave normally. I have attempted to mitigate this with the old Solaris trick of setting MaxRequestsPerChild to a very low level, so that even if some children are running amok, Apache should catch them and kill them. There is an extra overhead, but I don't see what else I can do.
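The mitigation described above, as an Apache 1.3-style httpd.conf fragment - the value here is illustrative, not a recommendation:

```apache
# Recycle children aggressively: even a child stuck in a loop is killed
# once it has served this many requests. Lower values mean more fork
# overhead, so tune to taste.
MaxRequestsPerChild 500
```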

With regard to moving servers - I think I have to. I was kind of waiting until RHEL 4 U1 was available so I could get a 2.6 kernel etc., but I will probably push ahead with moving now.

Have you checked things like /tmp, /var/tmp, /usr/local/apache/proxy and /dev/shm for exploits, and run chkrootkit and rkhunter - just in case there are exploits running? The processes in a CPU-sorted top might help (press P and then c in top).
Yes - I ran chkrootkit and had an audit - nothing seems amiss.
 

chirpy

Well-Known Member
Verified Vendor
Jun 15, 2002
13,437
33
473
Go on, have a guess
The other thing you can do is to disable KeepAlives if too many children are hanging around claiming resources unnecessarily, though this could cause a hike in CPU as new children are started. If you have KeepAlives already disabled, you could enable them but with a low Timeout, e.g. 3 seconds. It can be a balancing act with Apache, but as you know, it's always difficult with too many processes chasing too few resources.
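For reference, the low-timeout variant as an httpd.conf sketch (values illustrative):

```apache
# Enable KeepAlive but reclaim idle children quickly; 3 seconds is the
# example timeout mentioned above.
KeepAlive On
KeepAliveTimeout 3
MaxKeepAliveRequests 100
```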
 

dc2447

Well-Known Member
Apr 18, 2003
49
0
156
chirpy said:
The other thing you can do is to disable KeepAlives if too many children are hanging around claiming resources unnecessarily, though this could cause a hike in CPU as new children are started. If you have KeepAlives already disabled, you could enable them but with a low Timeout, e.g. 3 seconds. It can be a balancing act with Apache, but as you know, it's always difficult with too many processes chasing too few resources.
I have KeepAlives disabled already - I don't think the issue is resources in general, just that a certain set of criteria is being met that results in the server spiralling out of control every so often.

Am pricing a new server as we speak.
 

dc2447

Well-Known Member
Apr 18, 2003
49
0
156
I have ordered a new server and am transferring accounts as we speak; however, the problematic server [above] is exhibiting the exact same problems again today. I keep rebooting the server, and within 10 seconds it is unusable - the server is a dual Xeon with 4GB of RAM.

Nightmare.