The Community Forums


What happened to this RAID system?

Discussion in 'Workarounds and Optimization' started by AndyB78, Dec 6, 2012.

  1. AndyB78

    Hi,

    A few days ago I had to restart one of our shared hosting servers. It failed to come back up, so the DC techs had to intervene (they probably ran a manual fsck, but I am not sure what exactly they ran). Right after that reboot, the IO performance on the server dropped dramatically.

    The HDDs are set up in RAID 1 behind an LSI MegaRAID 8704ELP (Version: 1.20).
    The HDDs themselves are ST3500320NS.
    The RAID array does NOT appear to be degraded (status optimal, no media errors).
    CPU is E5520 (quad core w. HT, 8MB cache)

    The problem is that IO wait is 5 times higher than it should be (probably more), and iostat shows pretty weird data. For comparison, here are the numbers from 2 identical servers:

    Code:
             rrqm/s  wrqm/s    r/s    w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
    OK one:   62.90   67.33  76.98  53.49  2469.66   985.81     26.48      0.10   3.83   0.22   2.84
    Bad one:  12.67   78.06  74.31  49.04  2318.72  1017.19     27.04      2.19  17.73   4.38  54.03
    
    r/s and w/s are about the same (I believe this means the two servers carry a similar load) and avgrq-sz is virtually the same. However, on the bad system rrqm/s is much lower, avgqu-sz is much higher (it gets to about 75 times higher under heavier load), await is also much higher (it gets to ~50 times higher under load), and so is the service time (svctm).
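As a rough sanity check on the sample above, the bad/OK ratios can be worked out directly from the table. The sketch below is only an illustration: the helper names (`ratios`, `derived_util`, `derived_avgrq`) are made up for this example, and it assumes the usual iostat relations, namely that %util is roughly IOPS times svctm and that avgrq-sz is sectors per second divided by requests per second.

```python
# Values copied from the iostat table in the post (one averaged sample per server).
ok = {"rrqm/s": 62.90, "wrqm/s": 67.33, "r/s": 76.98, "w/s": 53.49,
      "rsec/s": 2469.66, "wsec/s": 985.81, "avgrq-sz": 26.48,
      "avgqu-sz": 0.10, "await": 3.83, "svctm": 0.22, "%util": 2.84}
bad = {"rrqm/s": 12.67, "wrqm/s": 78.06, "r/s": 74.31, "w/s": 49.04,
       "rsec/s": 2318.72, "wsec/s": 1017.19, "avgrq-sz": 27.04,
       "avgqu-sz": 2.19, "await": 17.73, "svctm": 4.38, "%util": 54.03}


def ratios(bad_row, ok_row):
    """Return bad/OK for every column (hypothetical helper)."""
    return {key: bad_row[key] / ok_row[key] for key in ok_row}


def derived_util(row):
    """Assumed relation: %util ~= (r/s + w/s) * svctm[ms] / 1000 * 100."""
    return (row["r/s"] + row["w/s"]) * row["svctm"] / 10.0


def derived_avgrq(row):
    """Assumed relation: avgrq-sz ~= sectors per second / requests per second."""
    return (row["rsec/s"] + row["wsec/s"]) / (row["r/s"] + row["w/s"])


if __name__ == "__main__":
    for name, value in ratios(bad, ok).items():
        print(f"{name:9s} bad/OK = {value:6.2f}")
    for label, row in (("OK", ok), ("bad", bad)):
        print(f"{label}: derived %util = {derived_util(row):.2f}, "
              f"derived avgrq-sz = {derived_avgrq(row):.2f}")
```

In this sample the request rate and request size barely differ between the two servers, while the per-request service time is about 20 times higher on the bad one. Under the assumed relations, that service time alone accounts for the jump in %util.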

    Also, while kjournald is very discreet on the OK server, on the bad server 2 of the 4 kjournald processes sit at the top of the list, even after setting the ionice class to Idle for those 2 processes.

    So what makes rrqm/s go down and avgqu-sz, await and svctm go up on the bad system? Is it an HDD, is it the card itself, is it some rogue mount option? What is hurting the second server?

    Thanks in advance for any suggestion!
     
