All sites down but server seems healthy

PeteS

Well-Known Member
Jun 8, 2017
380
85
78
Oregon
cPanel Access Level
Root Administrator
Tonight a server suddenly stopped responding to website requests (they timed out). WHM was fine and reported no services down and no problems with RAM or CPU usage. SSH and ping were fine. Email was fine. Before the outage everything seemed fine and load was within the normal range.

upcp had just run about 10 minutes prior (no errors reported). I'm not saying it's related, just adding it to the context.

My first thought was Apache, but I started with a restart of Nginx, then PHP-FPM, then Apache. All sites were still unresponsive. Ping times were normal, and top showed the server nearly at rest (not a surprise).

My next thought was a NOC issue, but I saw nothing wrong there, so I decided to do a reboot just in case before opening a ticket. To my surprise, that fixed it!

Afterward, sar showed that load and resource usage were medium to high during the 10-minute block when upcp was running, and very low during the next 10-minute block, in which the outage began. After the reboot, load and usage in sar and top were back to normal.

I took a quick look at the Apache error log. About 8 minutes before the outage began, all ModSecurity errors stopped; then, about a minute before the outage, almost all events stopped. (Before that I see nothing odd compared to after the reboot.) I can see the reboot in the log, and everything after it looks normal.
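To pin down when the log went quiet, one option is to count error-log entries per minute and note when ModSecurity entries were last seen. The following is only a minimal sketch: it assumes the default Apache 2.4 error-log timestamp prefix (for example `[Wed Oct 11 14:32:52.123456 2023]`) and an English locale for day and month names, and the function names `per_minute_counts` and `last_seen` are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: lines start with the default Apache 2.4 error-log timestamp,
# e.g. "[Wed Oct 11 14:32:52.123456 2023] [:error] [pid 1234] ..."
TIMESTAMP_RE = re.compile(
    r"^\[(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2})(?:\.\d+)? (\d{4})\]"
)


def per_minute_counts(lines, needle=None):
    """Count log lines per minute, optionally only those containing `needle`."""
    counts = Counter()
    for line in lines:
        if needle is not None and needle not in line:
            continue
        match = TIMESTAMP_RE.match(line)
        if not match:
            continue  # skip continuation lines or unexpected formats
        stamp = datetime.strptime(
            f"{match.group(1)} {match.group(2)}", "%a %b %d %H:%M:%S %Y"
        )
        counts[stamp.replace(second=0)] += 1
    return counts


def last_seen(lines, needle):
    """Return the last minute in which `needle` appeared, or None if never."""
    counts = per_minute_counts(lines, needle)
    return max(counts) if counts else None
```

Feeding it the lines of the error log, `last_seen(lines, "ModSecurity")` gives the minute of the last ModSecurity entry, and gaps in `per_minute_counts(lines)` show where logging stopped altogether.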

I was surprised to see no services in error, SSH and ping normal, WHM normal, and that no service restarts helped. But the reboot fixed whatever had failed. I'm wondering if anyone has seen similar and/or has ideas of what else to check.
 

cPRex

Jurassic Moderator
Staff member
Oct 19, 2014
14,260
2,220
363
cPanel Access Level
Root Administrator
Hey hey! That's definitely odd. I never have a good explanation when a reboot seems to magically fix things. Is there likely a real reason this happened? Sure. Is there going to be a good way to find out after a reboot? Probably not from the symptoms you've described.

Perhaps an "apachectl status" or a similar test of Apache could have shown whether it was actually handling requests, or a simple telnet to port 80 on the system to see whether that traffic was being handled.
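The telnet-to-port-80 idea can also be scripted so it is quick to run during an outage. This is a rough sketch rather than a cPanel tool: it uses only Python's standard `socket` module, and the function names `port_accepts_connections` and `http_status_line` are hypothetical. The first checks whether a TCP connection completes at all; the second sends a bare HEAD request and returns whatever status line comes back.

```python
import socket


def port_accepts_connections(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def http_status_line(host, port=80, timeout=5.0):
    """Send a minimal HEAD request; return the first response line or None."""
    request = f"HEAD / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request.encode("ascii"))
            data = sock.recv(1024)
    except OSError:
        return None
    if not data:
        return None
    return data.split(b"\r\n", 1)[0].decode("latin-1")
```

If the port accepts connections but `http_status_line` returns None, the listener is up but nothing is answering HTTP, which points at the web server stack rather than the network. If the connection itself fails while SSH and ping work, a firewall or the listener is the more likely suspect.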
 

PeteS

D'oh! apachectl status would have been a good idea (same for telnet to port 80)! It looked like a NOC network issue and I wanted to get things working again (so I could sleep!), so I rebooted as a last resort before yelling at the NOC. ;) Thanks for the response; it lets me know I'm not way off base in my thinking about it.

I figure it's worth noting here, but hopefully it's a one-off thing...
 