Hi there,
Last Thursday we had a couple of hours of downtime during work hours on a huge dedicated server with over 800 accounts. Needless to say, it was bad, but the worst thing is that all services came back up again and we still don't know why! Here's what happened:
Right before noon, Apache and Exim stopped responding correctly, with browsers and e-mail clients receiving a "time-out" response. WHM and SSH were still working (responding) perfectly, and the server load was low.
At that moment I tried restarting Apache and Exim, and when that didn't work I tried stopping the firewall, because it seemed like a network issue... But no change.
Finally I gave up and restarted the whole server... Still no change.
After that I logged into another server at the same hosting company (this one was working with no hiccups) and tried to reach a website on the problematic server from the command line using "wget"... It worked instantly on every page I tried.
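In case it helps anyone compare results from different locations, here is a minimal sketch of the kind of check I mean: a plain TCP connect to the web and mail ports with a timeout, so a "time-out" can be told apart from a refused connection. The hostname below is only a placeholder, and ports 80 and 25 are my assumption for the default Apache and Exim ports; adjust both for your own setup. The idea would be to run it from inside the hosting company's network and from outside it and compare.

```python
import socket


def probe(host, port, timeout=5.0):
    """Try a TCP connect and report 'open', 'timeout', 'refused' or an error string."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all within the timeout: packets are likely dropped on the way.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Anything else (DNS failure, unreachable network, ...).
        return f"error: {exc}"


if __name__ == "__main__":
    host = "example.com"  # placeholder: a site hosted on the affected server
    for port in (80, 25):  # assumed defaults: HTTP (Apache) and SMTP (Exim)
        print(f"{host}:{port} -> {probe(host, port)}")
```

If the same port reports "open" from a machine inside the provider's network and "timeout" from outside, that would suggest the traffic is being dropped somewhere in between rather than by the services themselves.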
From that moment I assumed it was some kind of filter or bug on the hosting company's network, so I contacted them. Unfortunately they said there was no problem with their network, so it had to be something with my server.
After a couple of hours the server started responding normally again without any change from me, or from my hosting company (allegedly). I checked all my logs and they all indicate that those services were working with no problems, but that network traffic to those ports stopped during the downtime. As far as I can tell, there's no problem with the server.
The question is... Is it possible that something malfunctioned at my hosting company and that caused the downtime? Any idea what it might be? Or should I keep looking for something on my server?
Thanks!