Sudden Serious Load

Secret Agent

Guest
Specs are:

Dual Xeon 2.8GHz
4GB Memory
100MBit Port
300GB SCSI drives (21% full)
5GB /tmp partition (15% full)

PHP 4.4.2 w/ zend optimizer and eaccelerator
MySQL 4.1
Apache 1.3x

About 128 IPs on this server (resellers)

Problem:
Server load has suddenly been floating around 10-12 for the past few days. It usually averaged only 2-3. top shows nothing abnormal (see screenshots) and /tmp shows nothing unusual.

Ran rkhunter and chkrootkit and did a graceful reboot: nothing suspicious.

CPU, memory, and MySQL stats are typical, with the usual results.

Installed:
APF
BFD
LSM
LES
SIM
MOD DOSINFLATE
MOD SECURITY
EACCELERATOR
MOD THROTTLE
ZEND OPTIMIZER
SECURED TMP AND VAR PARTITIONS

RUBY RAILS
FAST CGI
EXIM ACL DICTIONARY ATTACK


/etc/my.cnf

#DO NOT MODIFY THE FOLLOWING COMMENTED LINES!
#Created with ELS from www.nsonetworks.com
#els-build=4.1
[mysqld]
datadir=/var/lib/mysql
skip-locking
#skip-networking
safe-show-database
query_cache_limit=1M
query_cache_size=128M ## 32MB for every 1GB of RAM
query_cache_type=1
max_user_connections=200
max_connections=500
interactive_timeout=10
wait_timeout=20
connect_timeout=20
thread_cache_size=128
key_buffer=256M ## 64MB for every 1GB of RAM
join_buffer=1M
max_connect_errors=20
max_allowed_packet=16M
table_cache=1024
record_buffer=1M
sort_buffer_size=4M ## 1MB for every 1GB of RAM
read_buffer_size=4M ## 1MB for every 1GB of RAM
read_rnd_buffer_size=4M ## 1MB for every 1GB of RAM
thread_concurrency=8 ## Number of CPUs x 2
myisam_sort_buffer_size=64M
server-id=1
log_slow_queries=/var/log/mysql-slow-queries.log
long_query_time=2
collation-server=latin1_general_ci
old-passwords

[mysql.server]
user=mysql
basedir=/var/lib

[safe_mysqld]
err-log=/var/log/mysqld.log
pid-file=/var/lib/mysql/mysql.pid
open_files_limit=8192

[mysqldump]
quick
max_allowed_packet=16M

[mysql]
no-auto-rehash
#safe-updates

[isamchk]
key_buffer=256M
sort_buffer=64M
read_buffer=16M
write_buffer=16M

[myisamchk]
key_buffer=256M
sort_buffer=64M
read_buffer=16M
write_buffer=16M

[mysqlhotcopy]
interactive-timeout

httpd.conf file

Timeout 300
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 15
MinSpareServers 5
MaxSpareServers 10
StartServers 5
MaxClients 300
MaxRequestsPerChild 0

php.ini file (resource limits)

max_execution_time = 30
memory_limit = 8M
post_max_size = 55M
 

Attachments

Secret Agent

Guest
I'd like to add that after I did the reboot, I shut down httpd and the server load was minimal, normal. Once I restarted it, it shot back up.

I had MaxClients set to 150 the entire time. I changed it to 300 this morning and nothing changed.
 
Secret Agent

Guest
If I answered my own question and knew where to find the process I would not have posted this, correct?
 

dalem

Well-Known Member
PartnerNOC
Oct 24, 2003
2,980
156
368
SLC
cPanel Access Level
DataCenter Provider
We can't find it from the forums.

strace -p <pid>
lsof -p <pid>
ps auxf


I would start with the obvious: the Apache process that's been running for over 30 hours, as seen in your top output,

and then all of your zombie processes.
 
Secret Agent

Guest
lsof -p 5009

gave me a list of domlogs on all domains

example:

Code:
httpd   5009 nobody  703w   REG     8,3         0  3119609 /usr/local/apache/domlogs/clientdomains.com-bytes_log
httpd   5009 nobody  704w   REG     8,3         0  3118763 /usr/local/apache/domlogs/clientdomains.com-bytes_log
httpd   5009 nobody  705w   REG     8,3      1965  3119384 /usr/local/apache/domlogs/clientdomains.clientdomains.net-bytes_log
httpd   5009 nobody  706w   REG     8,3       179  3119557 /usr/local/apache/domlogs/clientdomains.com-bytes_log
httpd   5009 nobody  707w   REG     8,3         0  3119610 /usr/local/apache/domlogs/beta.clientdomains.co.uk-bytes_log
httpd   5009 nobody  708w   REG     8,3       206  3117986 /usr/local/apache/domlogs/clientdomains.com-bytes_log
httpd   5009 nobody  709w   REG     8,3    470926  3119743 /usr/local/apache/domlogs/clientdomains.net-bytes_log
httpd   5009 nobody  710w   REG     8,3       287  3119621 /usr/local/apache/domlogs/clientdomains.com-bytes_log
I attached the ps auxf results

The strace output was nearly endless and mentioned numerous domains, not a single one. If you need me to attach the entire file, I will.
 

Attachments

dalem

I am not inside your box so it would not do me any good

Use top and ps aux to find the resource-hogging PID.

Use strace to see what script/file it is hitting.

Start with the domain that's using the most resources and work backwards, shutting each site down and killing off the PID, and see if the same resource hog starts up again.

It's a process of elimination.
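
The first step above (finding the busiest PID) could be sketched as a small shell helper. This is only a minimal sketch and was not posted in the thread: `top_cpu` is a hypothetical name, and it assumes `ps aux`-style input in which the third column is %CPU.

```shell
# Print the N busiest processes (by %CPU) from `ps aux`-style input on stdin.
# Assumption: column 3 is %CPU, as in standard `ps aux` output.
# top_cpu is a hypothetical helper name, not an existing tool.
top_cpu() {
    n="${1:-10}"
    # Drop the header line, sort numerically on column 3 (descending),
    # and keep the first N rows.
    awk 'NR > 1' | sort -k3,3nr | head -n "$n"
}
```

Usage would be something like `ps aux | top_cpu 5`, after which `strace -p <pid>` and `lsof -p <pid>` from earlier in the thread can be pointed at whichever PID comes out on top.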
 

xidica

Well-Known Member
Apr 21, 2005
63
0
156
Texas
First off, you're using a stock, untuned Apache configuration, judging by the following settings:

Timeout 300
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 15
MinSpareServers 5
MaxSpareServers 10
StartServers 5
MaxClients 300
MaxRequestsPerChild 0

I'd recommend changing the following:

Timeout 50-100
KeepAliveTimeout 5-10
MaxRequestsPerChild 1000

The reduced timeouts cut down how long child processes stay tied up on idle or stalled connections, and a finite MaxRequestsPerChild limits the effect of memory leaks, intentional or not, because each child exits after serving that many requests. This is just a suggestion, of course, and may not be perfectly suited to your deployment.
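
Before and after editing, it may help to confirm what the three directives are actually set to. The helper below is a minimal sketch under assumptions: `show_tuning` is a hypothetical name, and the path to httpd.conf is passed in as an argument because it differs between installs.

```shell
# Print the current values of the three directives discussed above.
# $1: path to httpd.conf (an assumption; the location varies by install).
# show_tuning is a hypothetical helper name.
show_tuning() {
    # Anchor at the start of the line so that "KeepAlive On" and
    # "MaxKeepAliveRequests" are not matched by accident.
    grep -E '^[[:space:]]*(Timeout|KeepAliveTimeout|MaxRequestsPerChild)[[:space:]]' "$1"
}
```

After changing the values, `apachectl configtest` followed by `apachectl graceful` is one way to check the syntax and apply the change without dropping active connections.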
 
Secret Agent

Guest
I was told that the nslookups were suspicious

Code:
nobody   13554  0.0  0.0     0    0 ?        Z    14:48   0:00  |   \_ [nslookup] <defunct>
nobody   13559  0.0  0.0     0    0 ?        Z    14:48   0:00  |   \_ [nslookup] <defunct>
nobody   13563  0.0  0.0     0    0 ?        Z    14:48   0:00  |   \_ [nslookup] <defunct>
nobody   13567  0.0  0.0     0    0 ?        Z    14:48   0:00  |   \_ [nslookup] <defunct>
nobody   13571  0.0  0.0     0    0 ?        Z    14:48   0:00  |   \_ [nslookup] <defunct>
nobody   13576  0.0  0.0     0    0 ?        Z    14:48   0:00  |   \_ [nslookup] <defunct>
nobody   13580  0.0  0.0     0    0 ?        Z    14:48   0:00  |   \_ [nslookup] <defunct>
nobody   13585  0.0  0.0     0    0 ?        Z    14:48   0:00  |   \_ [nslookup] <defunct>
nobody   13589  0.0  0.0     0    0 ?        Z    14:48   0:00  |   \_ [nslookup] <defunct>
nobody   15835  0.0  0.0     0    0 ?        Z    14:53   0:00  |   \_ [nslookup] <defunct>
nobody   15839  0.0  0.0     0    0 ?        Z    14:53   0:00  |   \_ [nslookup] <defunct>
nobody   15849  0.0  0.0     0    0 ?        Z    14:53   0:00  |   \_ [nslookup] <defunct>
nobody   15853  0.0  0.0     0    0 ?        Z    14:53   0:00  |   \_ [nslookup] <defunct>
nobody   15861  0.0  0.0     0    0 ?        Z    14:53   0:00  |   \_ [nslookup] <defunct>
nobody   25512  0.0  0.0     0    0 ?        Z    15:21   0:00  |   \_ [nslookup] <defunct>
nobody   25518  0.0  0.0     0    0 ?        Z    15:21   0:00  |   \_ [nslookup] <defunct>
nobody   25522  0.0  0.0     0    0 ?        Z    15:21   0:00  |   \_ [nslookup] <defunct>
nobody   25534  0.0  0.0     0    0 ?        Z    15:21   0:00  |   \_ [nslookup] <defunct>
nobody   25542  0.0  0.0     0    0 ?        Z    15:21   0:00  |   \_ [nslookup] <defunct>
nobody   25549  0.0  0.0     0    0 ?        Z    15:21   0:00  |   \_ [nslookup] <defunct>
How can I trace these and stop them?
 

richy

Well-Known Member
Jun 30, 2003
274
1
168
If you enable phpSuExec (and suExec), PHP and Perl scripts will no longer run as "nobody", but as the username - which should make them easier for you to track.

Going into:
/proc/PID
should help show you where nslookup is being run from.
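
A defunct process has already exited, so its own /proc entry says little. The parent that has not reaped it is usually the more useful lead. The sketch below counts zombies per parent PID. It is only an illustration: `zombie_parents` is a hypothetical name, and it assumes input in the column order produced by `ps -eo pid,ppid,stat,comm`.

```shell
# Count zombie (state Z) processes per parent PID.
# Expects `ps -eo pid,ppid,stat,comm` output on stdin:
#   column 1 = PID, column 2 = PPID, column 3 = STAT.
# zombie_parents is a hypothetical helper name.
zombie_parents() {
    awk 'NR > 1 && $3 ~ /^Z/ { count[$2]++ }
         END { for (p in count) print count[p], p }' | sort -k1,1nr
}
```

Usage would be `ps -eo pid,ppid,stat,comm | zombie_parents`, which prints one "count parent-PID" line per parent, busiest first. For each parent PID, `ls -l /proc/<ppid>/cwd /proc/<ppid>/exe` and `tr '\0' ' ' < /proc/<ppid>/cmdline` should then show where it is running from and how it was started.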
 
Secret Agent

Guest
The problem with enabling PHP SuExec is that it conflicts with customers' scripts. What would be required (what would I tell them to do) to work around this if it were enabled?

Also, SuExec is enabled.

I never saw nslookup in top, so I can't find a place to trace it.
 
Secret Agent

Guest
See attached.

More details:

Code:
[email protected] [~]# cd /proc/22560
[email protected] [/proc/22560]# ls
/bin/ls: cannot read symbolic link cwd: No such file or directory
/bin/ls: cannot read symbolic link root: No such file or directory
/bin/ls: cannot read symbolic link exe: No such file or directory
./  ../  attr/  auxv  cmdline  cwd@  environ  exe@  fd/  maps  mem  mounts  root@  stat  statm  status  task/  wchan
 

Attachments

Secret Agent

Guest
Ok found this in /tmp

-rw------- 1 nobody nobody 128K Mar 16 21:05 bot.tar

How do I trace where this came from?
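
One possible starting point, sketched under assumptions: since the file is owned by nobody, it may have been written by a script running under Apache, in which case a request logged in the same minute as the file's timestamp could point at the domain and script involved. `requests_at` is a hypothetical helper name, and it assumes the access logs use the usual Apache timestamp format (`[DD/Mon/YYYY:HH:MM:SS ...]`).

```shell
# List access-log lines stamped with a given minute, across all logs in a directory.
# $1: log directory (e.g. the domlogs directory seen earlier in the thread)
# $2: timestamp prefix in Apache log format, DD/Mon/YYYY:HH:MM
# requests_at is a hypothetical helper name.
requests_at() {
    # -F treats the pattern as a fixed string, so "[" is matched literally;
    # -r searches every file under the directory and prefixes each hit
    # with the name of the log it came from.
    grep -rF -- "[$2" "$1"
}
```

Usage would be along the lines of `requests_at /usr/local/apache/domlogs '16/Mar/YYYY:21:05'`, with YYYY replaced by the actual year. If nothing turns up, the file may have arrived some other way, so this is only a first check.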