Strange "fake" CPU usage during Backup

Rogerio

Well-Known Member
Sep 26, 2016
78
15
8
Sao Paulo, Brazil
cPanel Access Level
Root Administrator
Hello,

I'm having a problem with "fake" CPU usage during the cPanel backup. It does not happen every time, only about once every 3-4 months and for no apparent reason, but when it does I have to reboot the server and I lose that day's backups.

Hard to explain, but... let's go.

The server is a VPS on RamNode, VDS package, with only 1 dedicated 3.4 GHz CPU and 4 GB RAM. The VPS is KVM (host).

The backup is set to run at 01:30. It runs every day.

At 03:00 AM local time, that is 00:00 UTC, I start receiving alerts from my Nagios saying that the CPU is high. When I run "top", the CPU shows 100% idle but the load average counters are crazy, like: 1.02 (1 min), 1.03 (5 min), 1.02 (15 min). In fact, nothing is using the CPU except the basic 2-3% for services and the OS.

But at 23:55 UTC the load is something like 0.02 0.03 0.02, so the 1.00 (15 min) figure cannot be a real average. I believe the counters simply jump to 1.0 1.0 1.0 at 00:00 UTC.

Logs...

[2018-10-05 01:35:47 -0300] Performing “Integration” component....
[2018-10-05 01:35:47 -0300] Completed “Integration” component.
[2018-10-05 01:35:47 -0300] Performing “AuthnLinks” component....
(...)
[2018-10-05 01:35:47 -0300] Completed “MailLimits” component.
[2018-10-05 01:35:47 -0300] Creating Archive ....Load watching resumed due to SIGUSR2
cpuwatch (Fri Oct 5 01:35:47 2018): System load is currently 0.88; waiting for it to go down below 0.88 to continue …
and there it gets stuck. Then, after a reboot, the next line is:

[2018-10-05 03:10:02 -0300] info [backup] Final state is Backup::Failure (HUP)
The "System load is currently 0.88" message is normal and occurs every day, but the backup normally proceeds after a few seconds.
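
For context on why the backup never resumes: the log line shows cpuwatch waiting for the load to drop below 0.88, so a 1-minute load average pinned at about 1.02 would keep it waiting indefinitely. The following is only a sketch of that kind of threshold check, not cPanel's actual cpuwatch code. The function name `load_below` and its optional file argument are made up for illustration.

```shell
# Hypothetical helper, NOT cpuwatch itself: succeed when the 1-minute load
# average is below a threshold. The optional second argument (a file in
# /proc/loadavg format) is an assumption so the check can be tried on sample data.
load_below() {
    threshold=$1
    file=${2:-/proc/loadavg}
    # The first field of /proc/loadavg is the 1-minute load average.
    awk -v t="$threshold" '{ exit !(($1 + 0) < (t + 0)) }' "$file"
}

# Example: poll every 5 seconds until the load drops below 0.88.
# while ! load_below 0.88; do sleep 5; done
```

With the counters stuck at 1.02, a loop like the commented one would never exit, which matches the hang seen in the log.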

Ideas: could it be something related to the daily cron? The server is 95% a basic cPanel install on a minimal CentOS install, with everything as recommended by the docs. I only installed a few other packages afterwards, like mrtg, sysstat, iotop and nrpe (Nagios).

Note: at first it appeared to be related to CentOS 7 or KVM, but the problem only happens on cPanel servers AND during the backup AND at 00:00 UTC (always). It is also not related to RamNode, because the problem happened once on another cPanel server (I don't remember whether it was KVM, Xen or OpenVZ).

So... any ideas? :(
 

cPanelLauren

Product Owner II
Staff member
Nov 14, 2017
13,266
1,301
363
Houston
Hi @Rogerio

Is this happening on multiple servers you have, or only one? This isn't behavior I've seen, to be honest, but if you're experiencing it on several servers I'd suggest opening a ticket using the link in my signature so that we can observe the behavior while it's occurring. Once it's open, please reply with the ticket ID here so that we can update this thread with the resolution once the ticket is resolved.


Thanks!
 

Rogerio

Hello Lauren,

The problem is rare, so it is hard to monitor. I don't know how to fix it other than by rebooting. Do you know any way to force Linux to "reload" the CPU average counters? Or a way to kill a stuck cPanel backup and force it to re-run?
 

cPanelLauren

It really sounds like I/O wait on the host node to me. This is a common occurrence, especially if you're running VPS servers.
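
One way to test the I/O wait theory the next time it happens: on Linux the load average counts tasks in uninterruptible sleep (state D, usually blocked on I/O) as well as runnable ones, so a single stuck D-state process can hold the load at about 1.00 while top shows the CPU idle. A minimal sketch follows. The function name `dstate_only` is made up for illustration.

```shell
# Keep only lines whose first column (the process state) starts with "D",
# i.e. uninterruptible sleep. Hypothetical helper name, for illustration only.
dstate_only() {
    awk '$1 ~ /^D/'
}

# Usage while the load is stuck: state, PID, kernel wait channel, command name.
# ps -eo state,pid,wchan:32,comm | dstate_only
```

If a process shows up there every time the load sits at 1.00, its wait channel is a good first hint of what it is blocked on.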

Do you know any way to force Linux to "reload" the CPU average counters?
Not that I am aware of, no.

Or a way to kill a stuck cPanel backup and force it to re-run?
If it truly is stuck, you can always kill the process and then manually restart it by running:
Code:
/usr/local/cpanel/bin/backup --force
 

LucasRolff

Well-Known Member
Community Guide Contributor
May 27, 2013
142
95
78
cPanel Access Level
Root Administrator
I can also recommend installing netdata (github.com/netdata/netdata) or possibly the Munin plugin in cPanel. It might be that you get a high load alert and, by the time you log in, the actual process causing the load is already "done", which would give you 100% idle CPU.

It's important to remember that the Unix load is relative and averaged over intervals of 1, 5 and 15 minutes, so if you had a big spike in load 50 seconds ago and nothing now, the 1-minute load will still be "high".

The benefit of netdata or Munin is that you have historical data to check. I personally prefer netdata because it goes down to a 1-second resolution for a whole lot of system metrics (it keeps them for 1 hour by default, and doesn't consume too much memory).

It will probably make narrowing down the problem a lot easier.
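
To put rough numbers on that: Linux load averages are exponentially damped, so once a burst ends the 1-minute figure shrinks by roughly a factor of e^(-t/60) after t idle seconds (e^(-t/300) and e^(-t/900) for the 5- and 15-minute figures). This is a back-of-the-envelope sketch that ignores the kernel's periodic sampling and fixed-point rounding. The function name `decayed_load` is made up for illustration.

```shell
# Approximate load average after `idle` seconds with nothing running,
# starting from `start`, for an averaging window of `window` seconds.
# Hypothetical helper: simplified exponential decay, not the kernel's exact math.
decayed_load() {
    start=$1
    idle=$2
    window=$3
    awk -v s="$start" -v t="$idle" -v w="$window" \
        'BEGIN { printf "%.2f\n", s * exp(-t / w) }'
}

decayed_load 1.00 50 60    # 1-minute figure, 50 seconds after a spike
decayed_load 1.00 50 900   # 15-minute figure over the same 50 seconds
```

This prints 0.43 and 0.95: the 1-minute figure is still clearly elevated 50 seconds after a spike, and the 15-minute figure has barely moved.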
 

Rogerio

Hello @cPanelLauren, thanks for the info.

@LucasRolff I already use Munin but, as I said, the problem is only the counters. The server was definitely idle until the cPanel backup. I also run MRTG every 5 minutes, and it shows no CPU usage until the problem starts. And I run NRPE (Nagios) too, which monitors CPU usage every 2 minutes and notifies me by SMS and Pushover.
 

cPanelLauren

If and when it happens again, I'd be curious to see some sysstat info, specifically the output of the following:

Code:
sar -p
Or, for historical usage (to pinpoint a specific time/date):

Code:
sar -p -f /var/log/sa/saXX
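
A small sketch for picking the right file, assuming the usual sysstat layout where saXX is the two-digit day of the month and that GNU date is available. The function name `sa_file_for_date` is made up for illustration. `sar -q` reports the run queue and load averages, and `-s`/`-e` limit the report to a time window.

```shell
# Map a calendar date to the sysstat daily data file.
# Assumes the sa<DD> naming (DD = day of month) and GNU date's -d option.
sa_file_for_date() {
    printf '/var/log/sa/sa%s\n' "$(date -d "$1" +%d)"
}

# Example for the night in the log above (times are server local time):
# sar -q -f "$(sa_file_for_date 2018-10-05)" -s 01:30:00 -e 03:15:00
```

Comparing the load average columns with the CPU report for the same window should show whether the load really jumps while the CPU stays idle.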
 