  1. #1
    Member jenlepp's Avatar
    Join Date
    Jul 2005
    Location
    Liberty Hill, TX
    Posts
    115

    Repeated Data Corruption, NOC at a Loss

    Last Tuesday, my server seized up and went down; the NOC said it was due to a disk failure. I chose at that time to upgrade the server - the box is new, moved up to a RAID 5 array with SATA drives, also new. It was restored on Wednesday.

    Thursday, some sites were having strange data problems - I have a backup disk, so accounts were restored, but when I checked my server, "quotacheck -a" was running and the load was around 7. They took the server down to check the disks, said everything was fine, and put it back up. 24 hours passed and the same thing happened again. They did another check, put it back up, and there was no more quotacheck for a couple of days.

    Meanwhile, I have several sites that get restored, work for a few hours, maybe even a day, and then their file permissions change, or there's junk instead of html, the mail system on one domain suddenly seizes up, php sites start serving blank pages - but only some. Some people find problems, upload their sites, and a few hours later it's screwed up again.

    I am now at a week with managed servers that, supposedly, have a 2 hour SLA for problems - and these people's sites cannot stay up. If I fix them, it's utterly pointless, because as fast as I fix the individual sites, that's how fast something comes up behind me and screws them up. I have another server I can move them all to, but I am worried I will just wind up moving the problems.

    My NOC has the following answer:

    It looks like random files on the server are being corrupted with binary
    garbage. It may be there is a bad cable causing corruption or it may be the
    raid controller itself. I don't believe the drives are an issue as they are
    all brand new.

    I would really like to solve this however I don't want to cause any down time
    at the moment, we will need to research what our options are.


    In other words, "we have no idea why this is happening." They may as well have written back and said there were Gremlins and there was nothing they could do until they got bored. I'm paying a fair penny for the server and it's supposed to be managed, and 1 week of total screw ups, the fourth day of total screw ups and a seeming total inability to even find the problem, much less fix it, is really starting to aggravate me - and, of course, threaten the continued goodwill my customers have for me. I really like my NOC and they've been really good to me up until now - but this is starting to get very frustrating. My clients have been saints so far as I have an awful lot of goodwill built up, but the natives are getting as restless as I am and my goodwill is starting to slide toward tilt.

    I could hire someone else to take a look, but it's a leased server and if it is a hardware problem, that would be money wasted as I am held hostage by the people that physically have my server. Since the problem is perplexing, they seem to just be putting me off and hoping it goes away. Which it's not.

    Has anyone ever seen anything like this before? My NOC claims they haven't, which is probably why the cluelessness, but I can't believe that I am unique in a sea of web hosts.
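
    For anyone trying to work out which files have already been hit by this kind of "binary garbage", here is a minimal sketch in Python. It is only a heuristic, not anything the NOC ran: it assumes that corrupted web files show up as NUL bytes inside files that should be plain text. The extension list and the function names are made up for illustration, and legitimate UTF-16 files would be flagged as false positives.

```python
import os

# Assumption: these extensions should only ever hold plain text on this server.
TEXT_EXTENSIONS = {".html", ".htm", ".php", ".css", ".js", ".txt"}


def looks_corrupted(path, sample_size=65536):
    """Return True if the first sample_size bytes of the file contain a NUL byte."""
    with open(path, "rb") as f:
        return b"\x00" in f.read(sample_size)


def find_suspect_files(root):
    """Walk root and list text-type files that contain NUL bytes or cannot be read."""
    suspects = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if os.path.splitext(name)[1].lower() not in TEXT_EXTENSIONS:
                continue
            full = os.path.join(dirpath, name)
            try:
                if looks_corrupted(full):
                    suspects.append(full)
            except OSError:
                # An unreadable file is also worth a look on a failing array.
                suspects.append(full)
    return sorted(suspects)
```

    Calling find_suspect_files on a document root (for example a hypothetical /home/someuser/public_html) gives a list of files to restore from backup. A clean result does not prove the disk is healthy, since garbage without NUL bytes would slip through.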
    Last edited by jenlepp; 08-01-2006 at 09:37 PM.

  2. #2
    Super Moderator This forum account has been confirmed by cPanel staff to represent a vendor. chirpy's Avatar
    Join Date
    Jun 2002
    Location
    Go on, have a guess
    Posts
    13,495

    It's not something I've come across. You don't mention which OS you're using, though. If it's not an enterprise OS (RHEL, CentOS) then using one of those may be a safer option.

    Also, check that you're using an up to date kernel from the OS vendor.

    Apart from that, I'd say that you've given your NOC their chance and maybe it's time to move elsewhere.
    Jonathan Michaelson

    Need your cPanel servers secured and tuned?
    cPanel Server Configuration, Security, Recovery and Antivirus/AntiSpam Services
    Developers of the most effective (and free) Firewall & Security Solution for cPanel Servers - csf
    http://www.configserver.com

  3. #3
    Registered User
    Join Date
    Jul 2006
    Posts
    3

    I would have them swap out the controller AND the RAM that is on the controller.

    I have seen this some in RAID 3 arrays but not really in RAID 5 arrays.

    Is it a 3ware controller or other true hardware RAID card?

    If you have a SATA RAID 5 configuration, it's not the cable; the RAID card will fail that cable/drive as soon as it detects errors. My guess is it's the RAM on the RAID controller, which may or may not be onboard.
    Last edited by eliteds3; 08-02-2006 at 08:20 AM.

  4. #4
    Member jenlepp's Avatar
    Join Date
    Jul 2005
    Location
    Liberty Hill, TX
    Posts
    115

    Finally got someone this morning who likes tackling new problems - we're copying data over to a new box and going back to Raid1 instead of Raid 5, because he thinks it's the Raid 5 array. So, I'm crossing my fingers.

    I'm on CentOS - if it is onboard, hopefully this will fix it and I can lug my backup drive to the new box and fix the issues.

  5. #5
    Member
    Join Date
    Mar 2002
    Posts
    175

    Quote Originally Posted by draknet
    My NOC has the following answer:

    .... I don't believe the drives are an issue as they are
    all brand new.


    That caught my attention. I would always suspect brand new hardware.

  6. #6
    Member rpmws's Avatar
    Join Date
    Aug 2001
    Location
    back woods of NC, USA
    Posts
    1,858

    Quote Originally Posted by jsnape
    That caught my attention. I would always suspect brand new hardware.
    nothing like a brand new drive LOOK OUT!!! ..it's them good old reliable ones with a few trips across country are the ones I trust. Proven mileage is what you want.
    Just keeping my "eye" on things....
    R. Paul Mathews
    RPMWS - diehard cPanel Nutcase

  7. #7
    Member jenlepp's Avatar
    Join Date
    Jul 2005
    Location
    Liberty Hill, TX
    Posts
    115

    Just wanted to post an update to end out the thread in case anyone else has this problem.

    The short answer: no one knows what caused what happened, but we moved the backup drive to a new box, started the whole restore thing over, and it's mostly fixed.

    The long answer is we (they, the NOC) tried to rsync disk to disk because I was adamant about trying to accomplish this with no more down time and really wanted to save the updates, and it wound up being a gigantic mistake. The rsync was like the hand of death to the old box, and literally just peeled the whole thing apart data-wise as we tried to copy, like it was an onion. By the end, passwords wouldn't work, services were crashing like mad, and the whole thing was swiss cheese. I'd never seen such a mess in all of my life. After nine hours of what was supposed to be a three hour copy, I spoke to a tech at the NOC who said it would take him a few hours to get services up on the old box before the new box could get his attention.

    It was at that point I pulled the plug on it. It's midnight, we have 6-7 hours before getting a new box up, just stop. Stop messing with the old box, let it die the death it so clearly wants, and just get the new box up as fast as you can. The relief I heard through the phone was palpable.

    We lost two weeks' worth of data updates on sites, lost a few clients, and I have about 20 sites "missing" from the restore. I had to pay Chirpy, like, three times to do the mailscanner install (they must have thought I was nuts), but for the most part life has returned to somewhat normal except for a pain in the butt PHP problem - fix one thing, another site breaks, fix that site, a third site breaks, etc. But it's better than it was.

    As for leaving my NOC... that was a tough one. The first week, I was ready to fly over there and strangle them. I really did feel like it got passed around and no one wanted to say "I don't know", so I was left with "this will fix it, this will fix it" as catastrophe loomed large, and it wasn't a pleasant experience. I got assurances, I passed on assurances, and I looked like a liar. I *really* don't like that.

    The second week, these guys really stepped up to the plate. I got hand holding, coddling, immediate attention, several apologies, constant information, and a credit as an apology. I know a few of them stayed at work far, far longer than they were supposed to in order to deal with the problem, and the staff costs that server incurred plus the credit meant they could crown me loss leader this month - and they still, once they got on the ball, worked their tails off even so.

    If anyone ever sees this on their server, my advice is to make immediate preparation to move with back ups not on the drive that's affected. Don't wait as things spiral downward. Just go. Make sure that you don't over-write your back ups, keep a clean back up stashed, and go.

    Without a separate drive backup, I would have been seriously, seriously screwed. Two week old full back ups were gold in this case. For those folks that think it can't happen, trust me - you always think that, up until it does.
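
    One way to act on the "keep a clean back up stashed" advice is to record checksums of the backup while it is known to be good, then compare the live copy against them. The Python sketch below is an illustration under stated assumptions, not what was done on this server: build_manifest and compare are hypothetical helper names, and the paths passed in are whatever your backup and live directories happen to be.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each regular file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            # Skip symlinks and special files; only regular files are hashed.
            if os.path.isfile(full) and not os.path.islink(full):
                manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def compare(clean, current):
    """Return (missing, changed) paths in current relative to the clean manifest."""
    missing = sorted(set(clean) - set(current))
    changed = sorted(p for p in clean if p in current and clean[p] != current[p])
    return missing, changed
```

    Running compare(build_manifest(backup_dir), build_manifest(live_dir)) lists files that vanished or differ from the backup. Files that customers legitimately updated will also show up as changed, so the result is a list to review, not proof of corruption. Hashing the backup itself on two different days and comparing the two manifests is a way to check that the backup drive is not degrading too.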

    Thanks for the suggestions.

  8. #8
    Member
    Join Date
    Oct 2009
    Posts
    15

    RAID 5 may be a big problem

    RAID 5 may be a big problem as far as data replication is concerned in some cases.

  9. #9
    Member
    Join Date
    Oct 2009
    Posts
    15

    RAID 5 may be a big problem

    RAID 5 may be a big problem as far as data replication is concerned in some cases. The reason is unknown.
