Last Tuesday, my server seized up and went down; the NOC said it was due to a disk failure. I chose at that point to upgrade the server - the box is new, and I moved up to a RAID 5 array with SATA drives, also new. Everything was restored on Wednesday.
Thursday, some sites started having strange data problems. I have a backup disk, so the affected accounts were restored, but when I checked the server, "quotacheck -a" was running and the load was around 7. They took the server down to check the disks, said everything was fine, and put it back up. Twenty-four hours later the same thing happened. They did another check, put it back up, and there was no more quotacheck for a couple of days.
Meanwhile, I have several sites that get restored, work for a few hours or maybe even a day, and then their file permissions change, or there's junk where the HTML should be, the mail system on one domain suddenly seizes up, or PHP sites start serving blank pages - but only some of them. People find the problems, re-upload their sites, and a few hours later everything is screwed up again.
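One thing I'm planning to try, just to pin down exactly which files change and when, is a periodic checksum sweep over the docroots. Rough sketch below - the paths are placeholders for my layout, nothing control-panel specific:

#!/usr/bin/env python3
# Rough sketch: walk the docroots, record a checksum for every file,
# and report anything that changed since the last sweep.
import hashlib, json, os

DOCROOT = "/home"                  # placeholder: where the account files live
STATE = "/root/checksums.json"     # placeholder: where to keep the last sweep

def sweep(root):
    sums = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:   # reads whole file; fine for a sketch
                    sums[path] = hashlib.md5(f.read()).hexdigest()
            except OSError:
                pass                           # skip files that vanish or can't be read
    return sums

old = {}
if os.path.exists(STATE):
    with open(STATE) as f:
        old = json.load(f)

new = sweep(DOCROOT)
for path, digest in new.items():
    if path in old and old[path] != digest:
        print("CHANGED:", path)

with open(STATE, "w") as f:
    json.dump(new, f)

Run from cron every hour or so, that should at least tell me what's getting hit and roughly when.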
I am now a week into this on a managed server that supposedly has a 2-hour SLA for problems - and these people's sites cannot stay up. Fixing them is utterly pointless, because as fast as I fix the individual sites, something comes along behind me and screws them up again. I have another server I could move them all to, but I'm worried I would just wind up moving the problems with them.
My NOC has the following answer:
It looks like random files on the server are being corrupted with binary garbage. It may be there is a bad cable causing corruption or it may be the raid controller itself. I don't believe the drives are an issue as they are all brand new.
I would really like to solve this however I don't want to cause any down time at the moment, we will need to research what our options are.
In other words, "we have no idea why this is happening." They may as well have written back and said there were gremlins and there was nothing they could do until the gremlins got bored. I'm paying a fair penny for the server, and it's supposed to be managed; a week of problems, four straight days of total screw-ups, and a seeming total inability to even find the problem, much less fix it, is really starting to aggravate me - and, of course, to threaten the continued goodwill my customers have for me. I really like my NOC and they've been really good to me up until now, but this is starting to get very frustrating. My clients have been saints so far, since I have an awful lot of goodwill built up, but the natives are getting as restless as I am and that goodwill is starting to slide toward tilt.
I could hire someone else to take a look, but it's a leased server, and if it is a hardware problem, that would be money wasted, since I'm held hostage by the people who physically have my server. Because the problem is perplexing, they seem to just be putting me off and hoping it goes away. Which it's not.
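At the very least, I can verify their "binary garbage" claim myself by comparing what's on the live array against the copies on the backup disk. Again, just a sketch - the account paths are placeholders:

#!/usr/bin/env python3
# Sketch: walk the backup copy of an account and flag any file that is
# missing or differs byte-for-byte on the live array.
import filecmp, os

LIVE = "/home/someaccount"      # placeholder: account on the live array
BACKUP = "/backup/someaccount"  # placeholder: same account on the backup disk

for dirpath, _, files in os.walk(BACKUP):
    for name in files:
        backup_file = os.path.join(dirpath, name)
        live_file = os.path.join(LIVE, os.path.relpath(backup_file, BACKUP))
        if not os.path.exists(live_file):
            print("MISSING:", live_file)
        elif not filecmp.cmp(backup_file, live_file, shallow=False):
            print("DIFFERS:", live_file)

If a file differs only hours after a clean restore, that's corruption happening on the new array, not something a client uploaded wrong.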
Has anyone ever seen anything like this before? My NOC claims they haven't, which probably explains the cluelessness, but I can't believe I'm unique in a sea of web hosts.