Tonight a server suddenly stopped responding to website requests (they timed out). WHM was fine and reported no services down and no problems with RAM or CPU usage. SSH and ping were fine. Email was fine. Before the outage everything seemed fine, and load was within the normal range.
upcp had just run about 10 minutes prior (no errors reported). I'm not saying it's related, just adding it to the context.
My first thought was Apache, but I started with a restart of Nginx, then PHP-FPM, then Apache. All sites were still unresponsive. Ping times were normal, and top showed the server nearly at rest (not a surprise).
My next thought was a NOC issue, but I saw nothing wrong there, so I decided to do a reboot just in case before opening a ticket. To my surprise, that fixed it!
Afterward, looking at sar, all loads and usage were medium to high during the 10-minute block when upcp was running, and very low during the next 10-minute block, when the outage began. After the reboot, loads and usage in sar and top were normal.
I took a quick look at the Apache error log. About 8 minutes before the outage began, all ModSecurity errors stopped; then, about a minute before the outage, almost all events stopped. (Before that I see nothing odd compared to after the reboot.) The reboot shows up in the log, and after that everything looks normal.
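For anyone wanting to line up the same kind of timeline, here is a rough Python sketch that counts Apache error-log entries per minute, with ModSecurity entries counted separately, so the point where logging goes quiet stands out. It assumes the default Apache 2.4 error-log timestamp format and an English locale for the day and month names. The function names are made up for illustration, and the log path is passed on the command line rather than assumed.

```python
import re
import sys
from collections import Counter
from datetime import datetime

# Assumption: default Apache 2.4 error-log prefix, for example
# [Wed Oct 11 14:32:52.123456 2023] [security2:error] [pid 1234] ...
TS_RE = re.compile(
    r"^\[(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}):\d{2}(?:\.\d+)? (\d{4})\]"
)

def per_minute_counts(lines):
    """Return (all entries, ModSecurity entries) counted per minute."""
    total = Counter()
    modsec = Counter()
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            continue  # continuation lines or a different log format
        # Drop the seconds so that entries bucket by minute.
        minute = datetime.strptime(
            f"{m.group(1)} {m.group(2)}", "%a %b %d %H:%M %Y"
        )
        total[minute] += 1
        if "ModSecurity" in line:
            modsec[minute] += 1
    return total, modsec

def print_timeline(total, modsec):
    """Print one row per minute that has at least one log entry."""
    for minute in sorted(total):
        print(f"{minute:%Y-%m-%d %H:%M}  "
              f"all={total[minute]:4d}  modsec={modsec[minute]:4d}")

if __name__ == "__main__":
    # Usage: python3 log_timeline.py /path/to/apache/error_log
    with open(sys.argv[1], errors="replace") as fh:
        print_timeline(*per_minute_counts(fh))
```

Minutes with no entries at all are simply missing from the output, so a gap between consecutive rows marks a quiet period.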
I was surprised to see no services in error, SSH and ping normal, WHM normal, and that no service restarts helped. But the reboot fixed whatever had failed. I'm wondering if anyone has seen something similar and/or has ideas of what else to check.
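One check that might help narrow this down if it happens again is a plain TCP connect test against each listening port, run both on the server itself and from outside. It separates "nothing accepts the connection" from "the connection opens but the request hangs". Below is a minimal Python sketch. The port list (80 and 443 for the web stack, 2087 for WHM over SSL) is an assumption about a typical setup, and `probe` is just an illustrative helper name.

```python
import socket
import time

def probe(host, port, timeout=5.0):
    """Try a TCP connect and return (status, seconds elapsed)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            status = "open"
    except socket.timeout:
        status = "timeout"   # no answer at all, e.g. packets dropped
    except ConnectionRefusedError:
        status = "refused"   # host reachable, nothing listening
    except OSError as exc:
        status = f"error: {exc}"
    return status, time.monotonic() - start

if __name__ == "__main__":
    # Assumed ports: 80/443 for websites, 2087 for WHM (SSL).
    for port in (80, 443, 2087):
        status, elapsed = probe("127.0.0.1", port)
        print(f"port {port}: {status} ({elapsed:.2f}s)")
```

If the ports open locally but time out from outside, that points at something in front of the web server (firewall rules, connection tracking, the network) rather than at Apache or Nginx themselves.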