I am familiar with the crazy high load you have experienced.
It's a VERY RARE issue, but nonetheless very serious for those who experience it.
This is a known issue in the Linux kernel. There is a bug open for it on kernel.org and it's unresolved, as it appears to be a complex combination of certain disk controller device drivers and disk I/O scheduling.
Repeat - this is not a bug in R1Soft's product. Our product does a lot of disk I/O, and we can set it off under the right conditions (e.g. planets aligned: particular storage controllers, I/O schedulers, etc.).
Bug 12309 – Large I/O operations result in slow performance and high iowait times
It's most commonly, but not always, seen in OpenVZ or Virtuozzo kernels.
I have had two customers who experienced the issue on a couple of servers, and by migrating to different hardware the issue completely went away... same R1Soft rev, same kernel rev, etc. The only thing that changed was the hardware and storage controller driver.
Our customers have been able to reproduce the same issue WITHOUT R1Soft even loaded on the system.
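Since the trigger seems to depend on the storage controller driver and the I/O scheduler, it helps to record which scheduler each disk is using when comparing a server that fails against one that does not. Here is a minimal Python sketch, assuming the standard sysfs layout where /sys/block/<device>/queue/scheduler lists the available schedulers with the active one in square brackets. The function names are made up for illustration and are not part of any R1Soft tool.

```python
import glob
import os


def active_scheduler(text):
    """Return the scheduler name shown in [brackets], or None if none is marked."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def list_schedulers(sys_block="/sys/block"):
    """Map each block device to its active I/O scheduler (assumes the sysfs layout above)."""
    result = {}
    pattern = os.path.join(sys_block, "*", "queue", "scheduler")
    for path in sorted(glob.glob(pattern)):
        # Path looks like /sys/block/sda/queue/scheduler, so the device is third from the end.
        device = path.split(os.sep)[-3]
        with open(path) as f:
            result[device] = active_scheduler(f.read())
    return result
```

Calling list_schedulers() on both the failing and the working server gives a quick side-by-side of one of the variables involved. It does not tell you which storage controller driver is loaded.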
Here are notes from one of our customers who has done exhaustive research into this issue:
#######################
5/12/2009
Joe / David,
In our testing we were able to reproduce the same condition that has been happening with openvz/cdp scenarios. It seems to relate to accessing large files. In the case scenario where we could replicate the condition, we tar'd a directory of ~25GB of files in the 5-20MB size range. Near the end of this test, the system in question exhibited the exact same signs as our systems which failed during the backup process. In order to get the IO to spin out of control we also ran some other intensive tasks which kept io ops around ~1,000/sec -- at the point of failure sar had logged ~6,250/sec.
What's different about the failures with CDP is that sometimes the failures occurred at the very first stage of the backup, other times near the end of the backup.
There is some online chatter as of late regarding the linux kernel and iowait issues:
Bug 12309 – Large I/O operations result in slow performance and high iowait times
Hopefully this information assists you in tracking down how you are triggering the failure.
#######################
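For anyone trying to reproduce the condition described in the notes above, it is useful to watch iowait while the large-file test runs. The customer used sar. The sketch below is a different, simpler approach: it reads the aggregate "cpu" line from /proc/stat twice and reports what share of CPU time went to iowait in between. It assumes a 2.6 or later kernel, where iowait is the fifth counter on that line. The function names and the 5 second interval are illustrative choices only.

```python
import time


def parse_cpu_line(line):
    """Return the first eight counters of the aggregate 'cpu' line from /proc/stat."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Counters: user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest counters are skipped because they are already included in user/nice.
    fields = [int(x) for x in parts[1:9]]
    if len(fields) < 5:
        raise ValueError("no iowait counter on this kernel")
    return fields


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def sample_iowait(interval=5.0, path="/proc/stat"):
    """Take two samples 'interval' seconds apart and return the iowait percentage."""
    with open(path) as f:
        before = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.readline())
    return iowait_percent(before, after)
```

Calling sample_iowait() in a loop while the tar test runs, and logging the result with a timestamp, shows whether iowait climbs steadily or jumps suddenly near the point of failure.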
What I can also tell you is that we have done everything possible to work around this kernel bug, including rewriting our CDP device driver for our new CDP 3 product. By doing disk I/O a different way, we appear not to trigger the kernel scheduling issue.
And I repeat again: this is a kernel bug, not an R1Soft bug.
As for posting on the forums, we started screening new forum members some time ago to stop spammers. I will look into why you were not approved, and into the approval process itself.
Regards,
-David Wartell
R1Soft Founder