Ghost on Server! crash without reason

wimp

Well-Known Member
Jul 13, 2002
301
0
166
Hi all,
It really seems like there is a ghost on my server. It freezes a few times a day because of high server load.
I couldn't find any reason for this problem. I also had two admin companies upgrade the kernel (for the iowait problem), apply other server tweaks and security hardening, and install the APF firewall, but nothing helped.
I installed SIM and PRM, disabled the spam and antivirus filters, and restricted SMTP so that the user nobody cannot send e-mail. I asked the NOC to check the server for hardware problems; they also installed a new NIC, but nothing changed. The server load is normally between 0.50 and 2.50 and suddenly goes up to 30.00-80.00.
So, is there anyone who could give me a tip on what to check to solve this problem?
When the load goes high I have the following in "messages":


==========================

Jan 3 04:21:02 server kernel: ** IN_TCP DROP ** IN=eth0 OUT= MAC=00:01:02:9b:3f:86:00:02:85:0d:7c:80:08:00 SRC=69.105.56.3$
Jan 3 04:21:02 server kernel: ** IN_TCP DROP ** IN=eth0 OUT= MAC=00:01:02:9b:3f:86:00:02:85:0d:7c:80:08:00 SRC=69.105.56.3$
Jan 3 04:21:03 server kernel: ** IN_TCP DROP ** IN=eth0 OUT= MAC=00:01:02:9b:3f:86:00:02:85:0d:7c:80:08:00 SRC=63.151.206.$
Jan 3 04:21:09 server kernel: ** IN_TCP DROP ** IN=eth0 OUT= MAC=00:01:02:9b:3f:86:00:02:85:0d:7c:80:08:00 SRC=63.151.206.$
Jan 3 04:22:09 server kernel: ** IN_TCP DROP ** IN=eth0 OUT= MAC=00:01:02:9b:3f:86:00:02:85:0d:7c:80:08:00 SRC=69.152.39.1$
Jan 3 04:22:12 server kernel: ** IN_TCP DROP ** IN=eth0 OUT= MAC=00:01:02:9b:3f:86:00:02:85:0d:7c:80:08:00 SRC=69.152.39.1$
Jan 3 04:22:18 server kernel: ** IN_TCP DROP ** IN=eth0 OUT= MAC=00:01:02:9b:3f:86:00:02:85:0d:7c:80:08:00 SRC=69.152.39.1$
Jan 3 04:22:26 server proftpd[28395]: server.an-dns.com (127.0.0.1[127.0.0.1]) - FTP login timed out, disconnected
Jan 3 04:22:26 server proftpd[28395]: server.an-dns.com (127.0.0.1[127.0.0.1]) - FTP session closed.
Jan 3 04:23:11 server kernel: ** IN_TCP DROP ** IN=eth0 OUT= MAC=00:01:02:9b:3f:86:00:02:85:0d:7c:80:08:00 SRC=202.9.99.25$
Jan 3 04:24:39 server kernel: ** IN_TCP DROP ** IN=eth0 OUT= MAC=00:01:02:9b:3f:86:00:02:85:0d:7c:80:08:00 SRC=205.207.184$
Jan 3 04:24:49 server kernel: ** IN_TCP DROP ** IN=eth0 OUT= MAC=00:01:02:9b:3f:86:00:02:85:0d:7c:80:08:00 SRC=206.168.193$
Jan 3 04:26:53 server kernel: ** IN_TCP DROP ** IN=eth0 OUT= MAC=00:01:02:9b:3f:86:00:02:85:0d:7c:80:08:00 SRC=217.18.64.7$
Jan 3 04:26:58 server kernel: ** IN_TCP DROP ** IN=eth0 OUT= MAC=00:01:02:9b:3f:86:00:02:85:0d:7c:80:08:00 SRC=209.237.25.$
Jan 3 04:27:45 server wall[2964]: wall: user root broadcasted 1 lines (58 chars)
Jan 3 04:27:51 server shutdown: shutting down for system reboot
Jan 3 04:27:57 server kernel: ** IN_TCP DROP ** IN=eth0 OUT= MAC=00:01:02:9b:3f:86:00:02:85:0d:7c:80:08:00 SRC=209.15.115.$
Jan 3 04:28:02 server init: Switching to runlevel: 6
Jan 3 04:28:11 server kernel: ** IN_TCP DROP ** IN=eth0 OUT= MAC=00:01:02:9b:3f:86:00:02:85:0d:7c:80:08:00 SRC=218.7.153.2$
Jan 3 04:28:16 server portsentry[3299]: securityalert: Psionic PortSentry is shutting down
Jan 3 04:28:16 server portsentry[3299]: adminalert: Psionic PortSentry is shutting down
Jan 3 04:28:20 server portsentry: portsentry shutdown succeeded
Jan 3 04:28:21 server xfs[2099]: terminating
Jan 3 04:28:24 server xfs: xfs shutdown succeeded
Jan 3 04:28:26 server mysql: Killing mysqld with pid 3016
Jan 3 04:28:27 server mysql: Wait for mysqld to exit
Jan 3 04:28:28 server mysql: .
Jan 3 04:28:59 server last message repeated 31 times
Jan 3 04:29:00 server wall[7103]: wall: user root broadcasted 1 lines (58 chars)
Jan 3 04:29:00 server mysql: .
Jan 3 04:29:01 server mysql: gave up waiting!
Jan 3 04:29:01 server rc: Stopping mysql: succeeded
Jan 3 04:29:01 server antirelayd: antirelayd shutdown failed
Jan 3 04:29:02 server exim: exim shutdown succeeded
Jan 3 04:29:02 server exim: antirelayd shutdown failed
Jan 3 04:30:10 server syslogd 1.4.1: restart.
Jan 3 04:30:10 server syslog: syslogd startup succeeded
Jan 3 04:30:10 server syslog: klogd startup succeeded
Jan 3 04:30:10 server kernel: klogd 1.4.1, log source = /proc/kmsg started.
Jan 3 04:30:10 server kernel: Linux version 2.4.21-20.0.1.EL ([email protected]) (gcc version 3.2.3 20030502 (Red Hat Li$
Jan 3 04:30:10 server kernel: BIOS-provided physical RAM map:
=============================


Sometimes I see this before it crashes:
============================
Jan 3 22:31:56 server named[1899]: zone domain.com/IN: loaded serial 2004112201
Jan 3 22:31:56 server named[1899]: zone domain.com/IN: loaded serial 2004112601
Jan 3 22:31:56 server named[1899]: zone domain.com/IN: loaded serial 2004112001
Jan 3 22:31:56 server named[1899]: zone domain.com/IN: loaded serial 2004122301
Jan 3 22:31:56 server named[1899]: zone domain.com/IN: loaded serial 2004112301
Jan 3 22:31:56 server named[1899]: zone domain.com/IN: loaded serial 2004112701
Jan 3 22:31:56 server named[1899]: zone domain.com/IN: loaded serial 2004112301
Jan 3 22:31:56 server named[1899]: zone domain.com/IN: loaded serial 2004112003
Jan 3 22:31:56 server named[1899]: zone domain.com/IN: loaded serial 2004112502
and so on...
===========================
And sometimes I have this before it crashes:
===========================
Jan 3 21:27:37 server proftpd[10367]: server.an-dns.com (127.0.0.1[127.0.0.1]) - FTP session opened.
Jan 3 21:27:37 server proftpd[10367]: server.an-dns.com (127.0.0.1[127.0.0.1]) - FTP session closed.
Jan 3 21:35:58 server proftpd[11236]: server.an-dns.com (127.0.0.1[127.0.0.1]) - FTP session opened.
Jan 3 21:35:58 server proftpd[11236]: server.an-dns.com (127.0.0.1[127.0.0.1]) - FTP session closed.
Jan 3 21:49:06 server proftpd[12260]: server.an-dns.com (127.0.0.1[127.0.0.1]) - FTP login timed out, disconnected
Jan 3 21:49:06 server proftpd[12260]: server.an-dns.com (127.0.0.1[127.0.0.1]) - FTP session closed.
Jan 3 22:31:03 server syslogd 1.4.1: restart.
==========================



And I can see something like this in "top":

=============================
16:23:33 up 51 min, 2 users, load average: 85.73, 38.67, 16.35
375 processes: 361 sleeping, 5 running, 9 zombie, 0 stopped
CPU states: cpu user nice system irq softirq iowait idle
total 9.6% 2.0% 2.7% 0.3% 1.2% 83.8% 0.0%
Mem: 962848k av, 953632k used, 9216k free, 0k shrd, 112048k buff
720240k actv, 135076k in_d, 10068k in_c
Swap: 1959920k av, 156764k used, 1803156k free 290416k cached

PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
2004 root 19 4 16796 124 64 S N 2.5 0.0 1:19 0 httpd
1969 mailnull 15 0 544 272 188 S 1.9 0.0 0:07 0 exim
21544 someuser 19 4 2788 2788 1088 D N 0.4 0.2 0:00 0 search.pl
2548 root 24 8 3560 1372 472 S N 0.3 0.1 0:04 0 cppop
21532 root 15 0 1276 1276 656 R 0.3 0.1 0:00 0 top
5 root 15 0 0 0 0 SW 0.2 0.0 0:01 0 kscand
21002 root 15 0 2528 1940 636 D 0.2 0.2 0:01 0 mkvhostspasswd
21619 root 15 0 3440 3440 2432 S 0.2 0.3 0:00 0 exim
4 root 15 0 0 0 0 SW 0.1 0.0 0:01 0 kswapd
7 root 15 0 0 0 0 DW 0.1 0.0 0:00 0 kupdated
12 root 15 0 0 0 0 DW 0.1 0.0 0:09 0 kjournald
20707 otherus 23 8 4288 2764 968 S N 0.1 0.2 0:00 0 cppop
21631 root 19 0 3168 3168 2376 D 0.1 0.3 0:00 0 exim
1 root 15 0 128 80 56 S 0.0 0.0 0:04 0 init
2 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 keventd
3 root 34 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd/0
6 root 25 0 0 0 0 SW 0.0 0.0 0:00 0 bdflush
8 root 25 0 0 0 0 SW 0.0 0.0 0:00 0 mdrecoveryd
104 root 25 0 0 0 0 SW 0.0 0.0 0:00 0 khubd
576 root 25 0 0 0 0 SW 0.0 0.0 0:00 0 kjournald
577 root 25 0 0 0 0 SW 0.0 0.0 0:00 0 kjournald
579 root 15 0 0 0 0 DW 0.0 0.0 0:00 0 loop0
580 root 15 0 0 0 0 DW 0.0 0.0 0:00 0 kjournald
1835 root 15 0 244 216 156 D 0.0 0.0 0:01 0 syslogd
1839 root 15 0 72 4 0 S 0.0 0.0 0:00 0 klogd
1884 named 25 0 10004 8752 724 S 0.0 0.9 0:25 0 named
1916 root 15 0 1388 408 88 D 0.0 0.0 0:00 0 chkservd
1975 mailnull 25 0 344 4 0 S 0.0 0.0 0:00 0 exim
1982 root 15 0 780 728 432 S 0.0 0.0 0:13 0 antirelayd
2018 root 25 0 152 4 0 S 0.0 0.0 0:00 0 mysqld_safe
2052 mysql 21 6 29572 17M 1164 S N 0.0 1.8 0:07 0 mysqld
3101 root 15 0 3560 756 304 S 0.0 0.0 0:00 0 cpsrvd
3119 root 21 6 2736 984 340 D N 0.0 0.1 0:00 0 eximstats
3126 root 34 19 6164 2200 40 S N 0.0 0.2 0:08 0 cpanellogd
3152 root 23 8 3212 324 124 S N 0.0 0.0 0:00 0 cppop

==============================
 

kris1351

Well-Known Member
Apr 18, 2003
961
0
166
Lewisville, Tx
Take a look at your resources during the crash time. I am seeing a very low amount of free memory on that last top you had.

Mem: 962848k av, 953632k used, 9216k free, 0k shrd, 112048k buff

It is very possible that you are eating up your memory and causing a crash there.
 

dezignguy

Well-Known Member
Sep 26, 2004
533
0
166
kris1351: Linux uses as much memory as it possibly can. It keeps programs in RAM instead of using slower swap space, and it puts otherwise idle RAM to work as buffers and disk cache, so it's normal for 'free' memory to be low on Linux (though not on Windows). It will push unused pages out to swap space on the drive if it needs more RAM for something else. The last top was using a bit of swap space, but that doesn't seem like anything to be concerned about...
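
To put rough numbers on that, here is a crude estimate using the figures from the top output in the first post. Buffers and cache are mostly reclaimable, so free + buff + cached gives an upper bound on what the kernel could hand back. This is only a sketch: the helper name is made up for illustration, and not every cached page can really be dropped.

```shell
# reclaimable_kb FREE BUFF CACHED  (all in kB, as printed by top)
# Hypothetical helper: crude upper bound on memory that is free or could be freed.
reclaimable_kb() {
    echo $(( $1 + $2 + $3 ))
}

# Figures from the top output in the first post: free, buff, cached
reclaimable_kb 9216 112048 290416
```

With those numbers it comes to 411680k out of 962848k, which is a lot more than the 9216k 'free' figure suggests.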

I haven't carefully looked over much of your stuff, but it seems the concern should be that there are 9 zombies after only 51 minutes of uptime in that last top... that's very high, and is likely related to your troubles (right now, my server has been up for 8 days with 0 zombie processes... and it's gone for months without zombies as well). Find out what those processes are and why they aren't working properly.
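
To see which processes are zombies and who their parents are, something along these lines should work. It assumes a procps-style ps that accepts -eo, and the helper name is just for illustration. A zombie can't be killed directly; it only goes away once its parent reaps it (or the parent exits), so the PPID column is the interesting one.

```shell
# list_zombies: reads "PID PPID STAT COMMAND" lines on stdin and prints
# "PID PPID COMMAND" for every process whose state starts with Z (zombie).
list_zombies() {
    awk '$3 ~ /^Z/ { print $1, $2, $4 }'
}

# On a live system (assumes procps ps):
#   ps -eo pid,ppid,stat,comm | list_zombies
# Then look up a parent found that way:
#   ps -p <PPID> -o pid,user,args
```

If most of the zombies share one parent, that parent (or whatever script it is running) is the place to start looking.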
 

GOT

Get Proactive!
PartnerNOC
Apr 8, 2003
1,770
324
363
Chesapeake, VA
cPanel Access Level
DataCenter Provider
There really is not enough info here to give you a solid answer.

You MIGHT want to look at the CPU usage stats in WHM, as that can give you a rough feel for what is taking up a lot of CPU cycles overall.

I know you indicated that you had some people come in and take a look. Were they not able to tell you anything useful?
 

bimal

Member
Jan 1, 2002
7
0
301
I had the same problem. I deleted and moved huge log files, and the server has now been running for 15 days without any problem. Try this:

1. Clean up /tmp
2. Clean up your huge log files.

find . -size +100000 -print
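
Note that with no unit suffix, find -size counts 512-byte blocks, so +100000 in that command means files larger than roughly 50 MB. A slightly more explicit variant is sketched here, assuming a find that understands the k suffix (GNU find does); the helper name is made up for illustration.

```shell
# big_files DIR MIN_KB: print regular files under DIR larger than MIN_KB kilobytes.
# Hypothetical helper around find; sizes are rounded up to whole kilobytes.
big_files() {
    find "$1" -type f -size +"$2"k -print
}

# Example: anything over about 50 MB under /var/log
#   big_files /var/log 51200
```

Check what is writing to a log before deleting it; a file removed while a daemon still has it open keeps using disk space until that daemon is restarted.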
 

wimp

Well-Known Member
Jul 13, 2002
301
0
166
I already emptied the log files on the server, but I will look around to see what else is there. Thanks.
 

philb

Well-Known Member
Jan 28, 2004
118
4
168
Jan 3 04:27:45 server wall[2964]: wall: user root broadcasted 1 lines (58 chars)
Jan 3 04:27:51 server shutdown: shutting down for system reboot
Did you reboot the server, or is this something it did 'by itself'?
 

wimp

Well-Known Member
Jul 13, 2002
301
0
166
Hi,
How can I see the processes associated with those zombies? I currently have 11 after only 30 minutes.

Thanks
 

Promethyl

Well-Known Member
Mar 27, 2004
68
0
156
Processor util is pretty damn high. I mean, it's not 300 or anything, but I'd stay as far from 100 as possible.

What about mysql... is the max_connections_per_hour getting you?
( http://www.fedoraforum.org/forum/showthread.php?t=26074 )

hrm...

Using APF? Turn off the drop logging.
http://www.crucialparadigm.com/reso...ng/changing-apf-log-for-tdp-udp-tcp-drops.php

General search:
http://www.google.com/search?q=serv...ient=firefox-a&rls=org.mozilla:en-US:official

Good luck to you, mate. If push comes to shove, review the iptables/ifconfig/ipf/apf rules. If you have to, get a new box and migrate all the accounts.

Having customers means you have very little time to play games.
 

wish

Member
Aug 14, 2003
9
0
151
There is a known but poorly documented VM bug in early RHEL 3 kernels. It caused us some problems for a while and shows up as high I/O wait states and system crashes. The two latest kernel updates fix it. If you haven't solved the problem and you're using an "older" RHEL 3 kernel, this may be worth looking into.
 

Promethyl

Well-Known Member
Mar 27, 2004
68
0
156
So are you a proponent of rebuilding/updating the system kernel?

For future reference, could you provide a link that shows other users how to update the kernel?
 

dezignguy

Well-Known Member
Sep 26, 2004
533
0
166
Hmm, he does seem to have a high I/O wait percentage, so either there are heavily disk-intensive scripts/programs running, or he's running an old (insecure) kernel with the RHEL VM bug.
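
One way to see who is stuck waiting on disk when iowait climbs: processes in uninterruptible sleep show up with state D (there are several in the top output in the first post, search.pl and mkvhostspasswd among them), and on Linux they count toward the load average. A rough sketch, assuming a procps-style ps; the helper name is hypothetical.

```shell
# list_blocked: reads "PID STAT COMMAND" lines on stdin and prints "PID COMMAND"
# for processes in uninterruptible sleep (state starts with D), which usually
# means they are waiting on disk I/O.
list_blocked() {
    awk '$2 ~ /^D/ { print $1, $3 }'
}

# On a live system, sample it a few times while the load is high:
#   ps -eo pid,stat,comm | list_blocked
```

If the same user scripts keep turning up in that list, they are the likely disk hogs; if it is mostly kernel threads like kjournald and kupdated, the kernel itself is the more likely suspect.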

Run 'uname -a' and make sure that you're running the latest kernel, currently "2.4.21-27.0.1.EL".
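
A quick way to compare the running kernel against that version, as a sketch only. It relies on sort -V (version sort), which is a GNU coreutils feature and may well be missing on an older box, in which case just compare the two strings by eye. The function name is made up.

```shell
# kernel_is_older CURRENT TARGET: succeeds (exit 0) if CURRENT sorts strictly
# before TARGET under GNU sort's version ordering.
kernel_is_older() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Typical use against the running kernel:
#   kernel_is_older "$(uname -r)" 2.4.21-27.0.1.EL && echo "kernel update available"
```

The boot log in the first post shows 2.4.21-20.0.1.EL, which this check would flag as older than 2.4.21-27.0.1.EL.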

Promethyl, keeping the kernel up to date is highly recommended: it fixes bugs, improves performance, and closes serious security holes. That applies to the standard RHN/up2date kernels, though. Running a bleeding-edge or custom-compiled kernel isn't something to do unless you really know what you're doing.