Repeated Data Corruption, NOC at a Loss

jenlepp

Well-Known Member
Jul 4, 2005
116
2
168
Liberty Hill, TX
cPanel Access Level
DataCenter Provider
Last Tuesday, my server seized up and went down; the NOC said it was due to a disk failure. I chose at that time to upgrade the server - the box is new, and I moved up to a RAID 5 array with SATA drives, also new. It was restored on Wednesday.

Thursday, some sites were having strange data problems - I have a backup disk, accounts were restored, then when I checked my server, "quotacheck -a" was running and the load was around 7. They take the server down to check the disks, say everything is fine, and put it back up. 24 hours pass and the same thing happens again. They do another check, put it back up, and there's no more quotacheck for a couple of days.

Meanwhile, I have several sites that get restored, work for a few hours, maybe even a day, and then their file permissions change, or there's junk instead of html, the mail system on one domain suddenly seizes up, php sites start serving blank pages - but only some. Some people find problems, upload their sites, and a few hours later it's screwed up again.

I am now at a week with managed servers that supposedly have a 2-hour SLA for problems - and these people's sites cannot stay up. If I fix them, it's utterly pointless, because as fast as I fix the individual sites, something comes up behind me and screws them up again. I have another server I can move them all to, but I am worried I will just wind up moving the problems.

My NOC has the following answer:

It looks like random files on the server are being corrupted with binary
garbage. It may be there is a bad cable causing corruption or it may be the
raid controller itself. I don't believe the drives are an issue as they are
all brand new.

I would really like to solve this however I don't want to cause any down time
at the moment, we will need to research what our options are.


In other words, "we have no idea why this is happening." They may as well have written back and said there were gremlins and there was nothing they could do until they got bored. I'm paying a fair penny for the server and it's supposed to be managed, and a week of problems - four straight days of total screw-ups and a seeming total inability to even find the problem, much less fix it - is really starting to aggravate me, and, of course, threaten the continued goodwill my customers have for me. I really like my NOC and they've been really good to me up until now, but this is starting to get very frustrating. My clients have been saints so far, as I have an awful lot of goodwill built up, but the natives are getting as restless as I am and my goodwill is starting to slide toward tilt.

I could hire someone else to take a look, but it's a leased server, and if it is a hardware problem, that would be money wasted, as I am held hostage by the people who physically have my server. Since the problem is perplexing, they seem to just be putting me off and hoping it goes away. Which it's not.

Has anyone ever seen anything like this before? My NOC claims they haven't, which probably explains the cluelessness, but I can't believe that I am unique in a sea of web hosts.
 
Last edited:

chirpy

Well-Known Member
Verified Vendor
Jun 15, 2002
13,437
33
473
Go on, have a guess
It's not something I've come across. You don't mention which OS you're using, though. If it's not an enterprise OS (RHEL, CentOS), then using one of those may be a safer option.

Also, check that you're using an up to date kernel from the OS vendor.
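For anyone reading along later, a rough sketch of that check on a CentOS/RHEL box (assuming yum is the package manager in use) would be:

    # show the kernel currently running
    uname -r

    # see whether the OS vendor ships a newer kernel, then install it
    yum check-update kernel
    yum update kernel
    # (a reboot is needed to actually boot into the new kernel)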

Apart from that, I'd say that you've given your NOC their chance and maybe it's time to move elsewhere.
 

eliteds3

Registered
Jul 17, 2006
3
0
151
I would have them swap out the controller AND the RAM that is on the controller.

I have seen this occasionally in RAID 3 arrays, but not really in RAID 5 arrays.

Is it a 3ware controller or some other true hardware RAID card?

If you have a SATA RAID 5 configuration, it's not the cable - the RAID card will fail that cable/drive as soon as it detects errors. My guess is it's the RAM on the RAID controller, which may or may not be onboard.
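A rough way to gather evidence for that, assuming the drives are reachable via SMART (and, if it is a 3ware card, that its tw_cli utility is installed - the device names here are only examples):

    # SMART health and error log for a directly attached SATA drive
    smartctl -a /dev/sda

    # drives behind a 3ware card are usually queried through the controller device
    smartctl -a -d 3ware,0 /dev/twe0

    # 3ware CLI: list units and ports on controller 0 (syntax varies by model/firmware)
    tw_cli /c0 show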
 
Last edited:

jenlepp

Well-Known Member
Jul 4, 2005
116
2
168
Liberty Hill, TX
cPanel Access Level
DataCenter Provider
Finally got someone this morning who likes tackling new problems - we're copying data over to a new box and going back to RAID 1 instead of RAID 5, because he thinks it's the RAID 5 array. So, I'm crossing my fingers.

I'm on CentOS - if that RAM is onboard, hopefully the new box will fix it and I can lug my backup drive over and fix the issues.
 

jsnape

Well-Known Member
Mar 11, 2002
174
0
316
draknet said:
My NOC has the following answer:

.... I don't believe the drives are an issue as they are
all brand new.

That caught my attention. I would always suspect brand new hardware.
 

rpmws

Well-Known Member
Aug 14, 2001
1,787
10
318
back woods of NC, USA
jsnape said:
That caught my attention. I would always suspect brand new hardware.
Nothing like a brand new drive - LOOK OUT!!! It's the good old reliable ones with a few trips across the country that I trust. Proven mileage is what you want.
 

jenlepp

Well-Known Member
Jul 4, 2005
116
2
168
Liberty Hill, TX
cPanel Access Level
DataCenter Provider
Just wanted to post an update to end out the thread in case anyone else has this problem.

The short answer: no one knows what caused it, but we moved the backup drive to a new box, started the whole restore thing over, and it's mostly fixed.

The long answer is we (they, the NOC) tried to rsync disk to disk, because I was adamant about trying to accomplish this with no more downtime and really wanted to save the updates, and it wound up being a gigantic mistake. The rsync was like the hand of death to the old box - as we tried to copy, it literally peeled the whole thing apart, data-wise, like an onion. By the end, passwords wouldn't work, services were crashing like mad, and the whole thing was swiss cheese. I'd never seen such a mess in all of my life. After nine hours of what was supposed to be a three-hour copy, I spoke to a tech at the NOC who said it would take him a few hours to get services up on the old box before the new box could get his attention.
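For what it's worth, if you ever have to copy off a box you suspect of silent corruption, a checksummed rsync at least tells you when source and destination disagree. This is just a sketch with made-up paths and hostname, and it can't help if the hardware corrupts data as it's being read:

    # -a preserves permissions/ownership/times, -H keeps hard links,
    # --checksum compares file contents instead of just size and mtime
    rsync -aH --checksum /home/ backupuser@newbox:/restore/home/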

It was at that point - nine hours in - that I pulled the plug on it. It's midnight, we have six or seven hours before we can get a new box up, so just stop. Stop messing with the old box, let it die the death it so clearly wants, and just get the new box up as fast as you can. The relief I heard through the phone was palpable.

We lost two weeks' worth of data updates on sites, lost a few clients, and I have about 20 sites "missing" from the restore. I had to pay Chirpy, like, three times to do the MailScanner install (they must have thought I was nuts), but for the most part life has returned to somewhat normal, except for a pain-in-the-butt PHP problem - fix one thing, another site breaks; fix that site, a third site breaks; and so on. But it's better than it was.

As for leaving my NOC... that was a tough one. The first week, I was ready to fly over there and strangle them. I really did feel like it got passed around and no one wanted to say "I don't know", so I was left with "this will fix it, this will fix it" as catastrophe loomed large, and it wasn't a pleasant experience. I got assurances, I passed on assurances, and I looked like a liar. I *really* don't like that.

The second week, these guys really stepped up to the plate. I got hand-holding, coddling, immediate attention, several apologies, constant information, and a credit as an apology. I know a few of them stayed at work far, far longer than they were supposed to in order to deal with the problem, and between the staff costs that server incurred and the credit, they could crown me loss leader this month - and once they got on the ball, they still worked their tails off.

If anyone ever sees this on their server, my advice is to make immediate preparations to move, using backups that are not on the affected drive. Don't wait as things spiral downward. Just go. Make sure that you don't overwrite your backups, keep a clean backup stashed, and go.

Without a separate backup drive, I would have been seriously, seriously screwed. Two-week-old full backups were gold in this case. For those folks who think it can't happen: trust me, you always think that, right up until it does.
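One way to handle the "keep a clean backup stashed" part - just a sketch, with made-up paths and hostname - is to push date-stamped copies to a different machine, so a later backup run can't overwrite the one good copy:

    # copy the backup set into a fresh, dated directory on another box
    rsync -a /backup/ backupuser@otherbox:/stash/backup-$(date +%Y%m%d)/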

Thanks for the suggestions. :)
 

aspardeshi

Member
Oct 22, 2009
15
0
51
RAID 5 may be a big problem

RAID 5 may be a big problem as far as data replication is concerned in some cases, the reason being unknown.
 
