SOLVED [CPANEL-24832] Multiple service failure reports for cpsrvd, IMAP, and Exim

PeteS

Well-Known Member
Jun 8, 2017
168
32
28
Oregon
cPanel Access Level
Root Administrator
Hi @PeteS


Issues like this, especially intermittent ones, are easiest to troubleshoot with access while the issue is occurring. If possible, please open a ticket next time, as we don't have one open at this point.

I'm also curious why cpsrvd is restarting those services. Is anything noted in the chkservd logs?

/var/log/chkservd.log

or in the tailwatchd logs?
To be clear, the issue only has an event window of 20 minutes, after which all we have are log entries. There's no way anyone could catch the issue, open a ticket, and have a tech looking at it in that time frame.

I have data for an event (on 3/6/19) from the following logs:
/var/log/chkservd.log
/var/log/messages
/usr/local/cpanel/logs/tailwatchd_log
/usr/local/cpanel/logs/error_log

I can supply them (un-obfuscated) if you give me a method to send you the file. I won't post them here. (I will be out for the weekend so if I don't reply quickly, that's why.)

It would also seem prudent to look at changes in the cPanel versions I cited to see what may have affected this. I recognize the 11.76 update would have had a lot in it, but gotta start somewhere.

-Pete
 

Nicola Urbinati

Well-Known Member
Feb 1, 2017
73
10
8
44
Italy
www.dreamlordpress.it
cPanel Access Level
Root Administrator
Hi,

I'm experiencing nightly problems on Exim/Smtp/Cpsrvd.
I'm receiving mails about nightly failures and recoveries of those services, lasting about 15 minutes. Not every day, but most.

I find the following in the cPanel error log, which I attach.

Is there a way to properly debug/solve the problem?

Thank you.
 


PeteS

The event occurred again this morning.

Since I have time in the office today, I have opened a ticket (#11652337) and am currently #63 (and moving backward). It might be in your best interest to escalate it, but it's up to you.

I referenced this thread, provided detailed log info on my 3/6 event, and also noted the one from today.

-Pete
 

PeteS

Hi Pete,

It's possible this relates to the issue noted on the following thread:

In Progress - [CPANEL-24832] Services failing at least once/day for no apparent reason

I've added a note to your ticket to have our Technical Analysts review that case and see if the issue you have described is related. I'll update this thread with the outcome for anyone else facing the same issue.

Note that for the issue described in the above link, the temporary workaround is to browse to the Software tab in WHM >> Tweak Settings and disable cpsrvd for the Dormant services option.

Thank you.
Thanks. I had already reviewed and am following that thread, but it did not seem to be the same scenario (at first), though some of what the OP cited sounded like the issue discussed here. After getting a little more information though, I think you are correct.

The response on the ticket seems to indicate that a race condition may be occurring when cpsrvd unloads and reloads (via the dormant services setting). Therefore, unchecking cpsrvd in WHM » Server Configuration » Tweak Settings, Software tab *does* seem to be the workaround. I have made that change and will report back. I suspect that if no occurrences happen within 14 days, that confirms the cause.

If it is the unloading/reloading then my server being low volume would have more opportunities for the issue to manifest itself, as compared to a higher volume server where it would unload less often. It also makes sense that most of the time I am seeing this at off-peak hours.

I wonder if others posting here about this are in a similar situation (low volume server, and/or occurring at off-peak times).

-Pete
 

cPanelMichael

Technical Support Community Manager
Staff member
Apr 11, 2011
47,749
2,205
363
cPanel Access Level
DataCenter Provider
Hello Everyone,

I've merged multiple threads here so we can better track reports of this happening.

Internal case CPANEL-24832 is open to track an issue where Chkservd reports service failures for the cpsrvd, IMAP, and Exim services on a regular basis. I'll monitor this case and update this thread with more information as it becomes available.

In the meantime, the temporary workaround is to disable cpsrvd in the Dormant services section under the Software tab in WHM >> Tweak Settings.

Let us know if you have any questions.

Thank you.
 

The Old Man

Active Member
Feb 24, 2016
29
5
3
UK
cPanel Access Level
Root Administrator
Thank you for looking into this.

I have a managed Cloud VPS with Inmotion Hosting which has 4 websites (Server Version: Apache/2.4.38 (cPanel) OpenSSL/1.0.2q mod_bwlimited/1.4 mod_cpanel/1.4, Server MPM: event), and have been having this issue every few days since around the end of 2018. WHM is set to update each day, so I'm currently running 78.0.17.

cphulkd, httpd, apache_php_fpm, cpanellogd, crond, exim, and cpsrvd are all examples of the services I get notified are down.

I also find that if I set the off-peak times for CPBackup, Backup and UPCP to my preferred times, they switch back to other different times after a couple of days.

IMH Tech support thought it was a memory issue, but later confirmed I had not exceeded my allocation. I ended up with a 5-day trial on the next package up with double the memory (3GB RAM burstable as needed to 6GB) and the failure notifications seemed less frequent, so I upgraded permanently. However, every few days I still get notifications about services failing and my websites are still going down. It is so frustrating.

Support have forgotten about it now. The one thing they said was that a process had taken CPHulk down, but the process ID wasn't listed anywhere to tell them what it was.

With today's investigation, Chkservd (this service that sent the email, it's what makes sure other services are online) did correctly determine that cPHulk was offline, during that time period, while every other service was online. Since we only see that cPHulk failed, I checked /usr/local/cpanel/logs/cphulkd.log to try and find if this service logged why it was killed. We found the following:

[2019-02-15 05:20:18 +0000] info [cPhulkd] DB processor shutdown via SIGTERM with pid 6181
[2019-02-15 05:20:18 +0000] info [cPhulkd] processor shutdown via SIGTERM with pid 929
[2019-02-15 05:35:06 +0000] info [cPhulkd] processor startup with pid 7152
[2019-02-15 05:35:06 +0000] info [cPhulkd] DB processor startup with pid 7593

While it is normal for cPHulk's DB processor to be started and stopped, the processor itself should be remaining online. The above logs show that a process with an ID 929 was what killed cPHulk. Unfortunately, just an hour later and no such process is running as ID 929 any longer, meaning now we can't tell what external process had issued this SIGTERM and killed cPHulk.

OOM kills originate from the kernel, not as some process ID, so that rules the low memory/RAM theory out.

To give us better resources to help dig deeper into these service downtime events, I've temporarily installed some advanced logging via cPanel System Snapshot, which takes process logs every few minutes and keeps them for 24 hours. If you receive another service downtime email, reply to us again just like you did today, and hopefully with this more advanced logging, we can reach a conclusion and see a resolution.
When my sites and services go down, WHM Service Status says all the processes are running, yet when I click on Apache Status it says it is not responding.

If I manually restart Apache via WHM, it comes back online immediately, so it's frustrating cPanel can't achieve the same thing.
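
For anyone who wants to measure these gaps over a longer stretch of cphulkd.log, here is a minimal Python sketch. It assumes the log lines look like the four quoted above; the function name `downtime_windows` and the regular expression are my own for illustration and are not part of cPanel. Lines that don't match are skipped, and "DB processor" events are ignored since those are expected to start and stop.

```python
import re
from datetime import datetime

# Pattern assumed from the four cphulkd.log lines quoted in this post;
# anything that does not match is skipped.
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4})\] info \[cPhulkd\] "
    r"(?P<kind>DB processor|processor) (?P<event>shutdown|startup)\b"
)


def downtime_windows(lines):
    """Pair each main-processor shutdown with the next main-processor startup.

    Returns a list of (shutdown_time, startup_time, seconds_down) tuples.
    """
    windows = []
    down_since = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or match.group("kind") != "processor":
            continue  # unrelated line, or a DB processor event
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S %z")
        if match.group("event") == "shutdown":
            down_since = ts
        elif down_since is not None:
            windows.append((down_since, ts, (ts - down_since).total_seconds()))
            down_since = None
    return windows


# The four lines from the log excerpt above.
sample = [
    "[2019-02-15 05:20:18 +0000] info [cPhulkd] DB processor shutdown via SIGTERM with pid 6181",
    "[2019-02-15 05:20:18 +0000] info [cPhulkd] processor shutdown via SIGTERM with pid 929",
    "[2019-02-15 05:35:06 +0000] info [cPhulkd] processor startup with pid 7152",
    "[2019-02-15 05:35:06 +0000] info [cPhulkd] DB processor startup with pid 7593",
]

for start, end, seconds in downtime_windows(sample):
    print(f"down from {start} to {end} ({seconds:.0f} s)")
```

On the excerpt above this reports a single window of 888 seconds, a little under 15 minutes, which lines up with the outage lengths others have mentioned in this thread.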
 

cPanelNick

Administrator
Staff member
Mar 9, 2015
3,486
31
158
cpanel.net
cPanel Access Level
DataCenter Provider
We believe we have tracked the problem down to a race condition with systemd. The problem was less likely to occur on an older version of cPanel because cpsrvd startup was not as fast as it is on newer versions. If our initial analysis is correct, the issue will affect any systemd service that has Type=forking and KillMode=process. This means it can also cause seemingly random restart failures in cpgreylistd, cphulkd, dnsadmin, queueprocd, tailwatchd, etc.

Can everyone who has reported the issue also confirm that it is unique to CentOS 7/CloudLinux 7?

Additionally, the problem may be solved in systemd version 232 or newer. We are working on a workaround for the systemd version (219) that ships with CentOS 7/CloudLinux 7.
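
To check whether a given unit matches the combination described above (Type=forking together with KillMode=process), something like the following Python sketch may help. It is only an illustration: the function name `is_affected` and the sample unit text are hypothetical and not taken from any cPanel unit file. The fallback values used when a setting is absent (`simple` and `control-group`) are systemd's documented defaults.

```python
import configparser


def is_affected(unit_text):
    """Return True if the unit text sets Type=forking and KillMode=process."""
    # interpolation=None so systemd specifiers such as %n are left alone;
    # strict=False so repeated keys do not raise (the last value wins).
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(unit_text)
    if not parser.has_section("Service"):
        return False
    service = parser["Service"]
    unit_type = service.get("Type", "simple")
    kill_mode = service.get("KillMode", "control-group")
    return unit_type == "forking" and kill_mode == "process"


# Hypothetical unit file, for illustration only.
sample_unit = """\
[Unit]
Description=Hypothetical example service

[Service]
Type=forking
KillMode=process
ExecStart=/usr/local/bin/example-daemon
"""

print(is_affected(sample_unit))
```

The text of a real unit can be obtained with `systemctl cat <unit>` and passed to the function. Drop-in overrides appear in that output as additional `[Service]` sections, which is why the parser is created with `strict=False`.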
 

lorio

Well-Known Member
Feb 25, 2004
294
13
168
cPanel Access Level
Root Administrator
Interesting, I have this issue on one virtual server (Virtuozzo) with CentOS 7 too. I always thought it was an issue with Virtuozzo or the node setup. I made a parallel fresh installation on the same node and had no issues there. Same host, same CentOS 7 template, but a newer cPanel installation. Now I stumble upon this bug report.
 

lorio

Same issue here; the workaround of disabling cpsrvd in Dormant services doesn't work.
Did you try disabling all services in Dormant services? How often do you see issues? I can see no pattern. Sometimes the issue occurs a few times a day. Then no issues for days.
 

rallisf1

Registered
Apr 13, 2019
1
0
1
Athens, Greece
cPanel Access Level
Root Administrator
I have the same problem on my CentOS 7.6 VPS (systemd 219), and I have found numerous posts about it being buggy with more than just cPanel. My issue happens (almost) daily late at night (anywhere from 11pm to 6am) for about 20 minutes, which wasn't a huge deal for me or my clients, so I just waited for some cPanel update to fix it. Apparently after 5 months this is still an issue, and I just stumbled upon this thread now.

I am trying the Dormant services workaround for the moment but have also found this permanent fix by Facebook sysadmins for the more dauntless:
- Removed -

I can't really try this fix on a production server with dozens of clients so if anyone goes through with it please post an update.

P.S. I cannot post a ticket because the domain of my server has changed and my cpanel account doesn't work any more.
 

cPanelMichael

Hello @rallisf1,

This is fixed in cPanel & WHM version 80:

Fixed case CPANEL-24832: Workaround systemd race condition which can cause it to kill cpsrvd.

Version 80 is currently only published to the EDGE release tier as a development version, however there is an active request to backport the case to cPanel & WHM version 78. I'll continue to monitor the case and update this thread with more information as it becomes available.

Thank you.
 

luxmicro

Registered
Mar 30, 2013
2
2
78
cPanel Access Level
Root Administrator
Hello @The Old Man,

Can you see if disabling cpsrvd in the Dormant services section under the Software tab in WHM >> Tweak Settings addresses the issue? This should solve the problem until case CPANEL-24832 is published.

Thank you.
Thank you for this! Our server is running the latest version of cPanel/WHM and this problem has been driving me crazy for the last several months.

Issue is:

Failed: cpsrvd
"ip address"
The Service cpsrvd appears to be down

I can confirm it has done this at least once a day (sending emails and texts) for the last 4 months. Mine does have a time pattern and it always happens around 12 midnight.

So I did what your post asks by disabling cpsrvd in the dormant services section and it has finally stopped. I honestly had to look at my phone twice to see if this was for real. haha.

Thanks!
 