Oct 7, 2003
A few days ago I had to restart one of our shared hosting servers. It failed to come back up, so the DC techs had to intervene (they probably ran a manual fsck, but I am not sure what exactly they ran). Right after that reboot, the IO performance on the server dropped dramatically.

The HDDs are set up in RAID 1 behind an LSI MegaRAID 8704ELP, version 1.20.
The HDDs themselves are ST3500320NS.
The RAID array does NOT appear to be degraded (status optimal, no media errors).
CPU is E5520 (quad core w. HT, 8MB cache)

The problem is that IO wait is at least 5 times higher than it should be (probably more), and iostat shows pretty weird data. For comparison, here are 2 identical servers:

         rrqm/s   wrqm/s   r/s    w/s   rsec/s     wsec/s   avgrq-sz avgqu-sz     await   svctm   %util
OK one:  62.90    67.33   76.98  53.49  2469.66    985.81    26.48     0.10       3.83    0.22     2.84
Bad one: 12.67    78.06   74.31  49.04  2318.72    1017.19   27.04     2.19       17.73   4.38     54.03
r/s and w/s are about the same on both (I believe this means they carry a similar load) and avgrq-sz is virtually the same. However, on the bad system rrqm/s is much lower, avgqu-sz is much higher (it gets to about 75 times higher under heavier load), await is much higher (about 50 times higher under load), and so is the service time (svctm).
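To make the comparison easier to read, here is a minimal Python sketch that divides each "bad" value by the matching "OK" value. The numbers are copied from the iostat sample above; the names `COLUMNS`, `OK_ROW`, `BAD_ROW` and `ratios` are just illustrative, not part of any tool.

```python
# Column names and values copied from the iostat sample above.
COLUMNS = ["rrqm/s", "wrqm/s", "r/s", "w/s", "rsec/s", "wsec/s",
           "avgrq-sz", "avgqu-sz", "await", "svctm", "%util"]
OK_ROW = [62.90, 67.33, 76.98, 53.49, 2469.66, 985.81,
          26.48, 0.10, 3.83, 0.22, 2.84]
BAD_ROW = [12.67, 78.06, 74.31, 49.04, 2318.72, 1017.19,
           27.04, 2.19, 17.73, 4.38, 54.03]


def ratios(ok, bad):
    """Return {column: bad / ok} for each iostat column."""
    return {name: b / o for name, o, b in zip(COLUMNS, ok, bad)}


if __name__ == "__main__":
    for name, ratio in ratios(OK_ROW, BAD_ROW).items():
        print(f"{name:>9}: {ratio:6.2f}x")
```

On this one sample, avgqu-sz comes out roughly 22 times higher, svctm roughly 20 times and await roughly 4.6 times, while rrqm/s is about a fifth of the OK server's value and r/s, w/s and avgrq-sz stay within a few percent.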

Also, while kjournald is very discreet on the OK server, on the bad server two of the four kjournald processes sit at the top of the list, even after I set the ionice class to Idle for those two processes.

So what makes rrqm/s go down and avgqu-sz, await and svctm go up on the bad system? Is it an HDD, is it the card itself, is it some rogue mount option? What is busting the second server?

Thanks in advance for any suggestion!