SOLVED [CPANEL-24832] Multiple service failure reports for cpsrvd, IMAP, and Exim

quanin

Well-Known Member
Aug 18, 2011
118
3
68
cPanel Access Level
Root Administrator
So this used to be a very occasional problem I thought could be attributed to memory issues, but while I've solved the memory problem and it appeared stable for a stretch, we're back to this problem again, but now more often. At least daily, sometimes once every couple days, I'll see alerts that multiple services have failed. CPSRVD is a usual suspect, as is the Bind nameserver. clamd, IMAP, and Exim sometimes follow shortly thereafter. The weird part is, all of the services are still running... it's just that occasionally, they like to throw mostly unhelpful errors.

What I believe is happening is something all of the services that fail commonly use, possibly for authentication, is failing and it's only after CPanel goes through the usual restart routines and stumbles on the thing that's failing that the issue actually self-corrects, until the next time. But until CPanel hits on the service that's failing, I'll see things like this.
dovecot: auth: Error: auth worker: Aborted PASSV request for [email protected]: Shutting down
dovecot: auth: Error: net_connect_unix(anvil-auth-penalty) failed: Permission denied

Or, from another affected service:
2019-02-19 21:38:07.244 [31472] 1gwHlt-0008BS-4Z == [email protected] R=virtual_user T=dovecot_virtual_delivery_no_batch defer (-44): LMTP error after RCPT TO:<[email protected]>: 451 4.3.0 <[email protected]> Temporary internal error

On the occasion I try to access a site on this server when these issues are on-going, I'll get a 503 error instead of what I'm trying to access. As said, this is all very temporary--usually resolved in about 5 minutes from when CPanel notices, but I'd be curious to know what causes it and if there's any way it can be avoided. The only thing my own digging has been able to uncover is that all of the affected services seem to rely on one common mechanism, and it's that mechanism rather than the individual services that has fallen sideways.
 

quanin

Well-Known Member
Aug 18, 2011
118
3
68
cPanel Access Level
Root Administrator
... I have no idea how I missed checking that earlier. I'm also not entirely sure why it doesn't happen more often--I see CPSRVD and sometimes clamd fail daily, but it's been 3 days since the last OOM issue.

Also, I thought I'd optimised the heck out of that server. I'm open to suggestions for things to tweak further.

[[email protected] ~]# free -ht
total used free shared buff/cache available
Mem: 3.7G 1.6G 291M 140M 1.9G 1.7G
Swap: 255M 255M 0B
Total: 4.0G 1.8G 291M
 

janipewter

Active Member
Jan 2, 2013
44
3
8
cPanel Access Level
Root Administrator
Since November 2018 I am regularly getting emails from my server saying various services have failed. I just checked my inbox and there are 158 unique "service failed" emails from cpanel monitoring since 11 November 2018.

It is always imap, exim, and cpsrvd. After 5-15 minutes the services will all recover but they are all unresponsive during this gap. If I ssh in and manually run /scripts/restartsrv_cpsrvd or whatever it is, immediately the problem is fixed.

I have just checked /var/log/chkservd.log and it is 20.2MB; 333186 lines long.

Many lines like this:

cpsrvd [Service check failed to complete
Unable to connect to port 2086 on 127.0.0.1: Connection refused: Died[check command:N/A][socket connect:-][socket failure threshold:3/5]]...
and also this:

exim: ** [535 Incorrect authentication data != 2]
: Died[check command:+][socket connect:-][socket failure threshold:4/3][fail count:2]Restarting exim....
and this:

<< A001 NO [UNAVAILABLE] Temporary authentication failure. [srv01.smileserv.co.uk:2019-02-20 20:49:30]
imap: ** [A001 NO [UNAVAILABLE] Temporary authentication failure. [srv01.smileserv.co.uk:2019-02-20 20:49:30] != A001 OK]
: Died[check command:+][socket connect:-][socket failure threshold:4/3][fail count:2]Restarting imap....
 

quanin

Well-Known Member
Aug 18, 2011
118
3
68
cPanel Access Level
Root Administrator
I have this same issue, though some of mine may be due to the fact I need to stop being cheap and add more resources. Still, usually it's IMAP, Exim, CPSRVD, and sometimes clamd that takes a nap. Like you, they all usually recover within 5-15 minutes.
 

janipewter

Active Member
Jan 2, 2013
44
3
8
cPanel Access Level
Root Administrator
Nice to know I am not the only one. But resources are not a problem for us, we have 2x Xeon E5-2620 and 32GB RAM. The server is not being heavily loaded
 

quanin

Well-Known Member
Aug 18, 2011
118
3
68
cPanel Access Level
Root Administrator
It has to be more than the OOM issues, as neither IMAP, nor Exim, nor CPSRVD were affected. Clamd was, but in the two most recent OOM kills in particular, it was HTTP or MySQL causing it, and usually only clamd and a MySQL process (sometimes Bind) are caught up in it.

Feb 19 12:22:36 quantum kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 19 12:22:36 quantum kernel: [<ffffffffbcfba4e4>] oom_kill_process+0x254/0x3d0
Feb 19 12:22:36 quantum kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Feb 19 12:22:36 quantum kernel: Out of memory: Kill process 13043 (clamd) score 145 or sacrifice child
Feb 19 12:22:36 quantum kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 19 12:22:36 quantum kernel: [<ffffffffbcfba4e4>] oom_kill_process+0x254/0x3d0
Feb 19 12:22:36 quantum kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Feb 19 12:22:36 quantum kernel: Out of memory: Kill process 12890 (named) score 68 or sacrifice child
Feb 22 14:11:13 quantum kernel: httpd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 22 14:11:13 quantum kernel: [<ffffffffbcfba4e4>] oom_kill_process+0x254/0x3d0
Feb 22 14:11:13 quantum kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Feb 22 14:11:14 quantum kernel: Out of memory: Kill process 7490 (clamd) score 145 or sacrifice child
Feb 22 14:11:14 quantum kernel: httpd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Feb 22 14:11:14 quantum kernel: [<ffffffffbcfba4e4>] oom_kill_process+0x254/0x3d0
Feb 22 14:11:14 quantum kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Feb 22 14:11:14 quantum kernel: Out of memory: Kill process 12056 (mysqld) score 87 or sacrifice child

Exim failed last on 2/20, along with IMAP and CPSRVD, and there was no OOM event there.
 

cPanelLauren

Forums Analyst II
Staff member
Nov 14, 2017
9,054
766
263
Houston
It has to be more than the OOM issues, as neither IMAP, nor Exim, nor CPSRVD were affected. Clamd was, but in the two most recent OOM kills in particular, it was HTTP or MySQL causing it, and usually only clamd and a MySQL process (sometimes Bind) are caught up in it.
'
I agree with you, please feel free to open a ticket as I suggested previously and let us know the ticket ID here once complete.
 

janipewter

Active Member
Jan 2, 2013
44
3
8
cPanel Access Level
Root Administrator
Hi @janipewter

That's a pretty long gap of time (3 months)

Are there any more recent failures, within the last week or two? If so what are they? As much detail as possible would be helpful.
What's a pretty long gap? There have been 158 service failures since November. It's happening almost daily, at least once every three days. Multiple services failing.
 

cPanelLauren

Forums Analyst II
Staff member
Nov 14, 2017
9,054
766
263
Houston
What's a pretty long gap? There have been 158 service failures since November. It's happening almost daily, at least once every three days. Multiple services failing.
Trying to troubleshoot service failures 3 months ago is not going to be extremely productive as not only have logs been most likely rotated the activity on the server is more than likely different than at that time. If you have some that are recent i.e., in the last few weeks please provide as much possible information on those and we'd be happy to assist you.

The advice provided here should be a good start to providing the most helpful information: Guide To Opening An Effective Forums Thread
 

MHFraser

Member
Feb 28, 2019
5
0
1
Australia
cPanel Access Level
Root Administrator
I Have the exact same problem on a brand new server.
I have dug into it and this is what I have found.

1) Unknown process shuts down cpsrvd
/var/log/messages
Code:
Feb 28 09:29:31 redback systemd: Stopping cPanel services...
2) chkservd notices cpsrvd is down but doesn't do anything about it. Also notices exim and dovecot are not authenticating but cant fix them (presumably due to cpsrvd being down)
/var/log/chkservd.log
Code:
[2019-02-28 09:32:44 +1100] Service check ....
cpsrvd [too soon after restart to check]...
imap [[socket_service_auth:1]TCP Transaction Log:
...
<< A001 NO [UNAVAILABLE] Temporary authentication failure. [redback.mudgee.host:2019-02-27 22:32:48]

exim [TCP Transaction Log:
...
<< 535 Incorrect authentication data
3) after 20 minutes, chkservd finally starts up cpsrvd, and exim & dovecot start working again
grep "cpsrvd \[" chkservd.log
Code:
cpsrvd [too soon after restart to check]...
cpsrvd [too soon after restart to check]...
cpsrvd [Service check failed to complete
cpsrvd [Service check failed to complete
cpsrvd [Service check failed to complete
cpsrvd [[http_service_auth:1][check command:N/A][socket connect:+]]...

For the life of me I cant find what is stopping cpsrvd. Its very regular, every 46 hours.

chkservd not taking any action for 20+ minutes isn't very efficient. it notices cpsrvd is down within a few minutes but sits on the information for ages.

Clobbering exim & dovecot for 20 minutes during working hours makes my users (hence me) cranky :)

Manually restarting cpsrvd immediately restores exim & dovecot
 

quanin

Well-Known Member
Aug 18, 2011
118
3
68
cPanel Access Level
Root Administrator
Logged ticket 11570883 for this. After recent updates I was going to drop it as it hadn't happened, but this morning changed my mind.
 

cPanelLauren

Forums Analyst II
Staff member
Nov 14, 2017
9,054
766
263
Houston
Hello,


I just checked in on this ticket and it would appear that the services IMAP, EXIM and cPsrvd are all being flagged as being down as a result of an internal case CPANEL-24832

The case is still in the "monitored" but critical status which means we've not identified specifically what's occurring though they have identified that removing these services from dormant mode appears tobe a workaround for the issue at this time.

I'll update this thread again if there's more
 
  • Like
Reactions: dirklammert

PeteS

Well-Known Member
Jun 8, 2017
178
34
28
Oregon
cPanel Access Level
Root Administrator
@cPanelLauren,

I can confirm this problem.

I have been tracking it since November. I hadn't found it reported here... until this post! Since it is self-healing I took my time, figuring that an update would fix it. Here's what I can report:
  • I first experienced it on 11/10/18
  • It has occurred 25 times to date
  • It occurs at random intervals, for me, ranging from 1-15 days
  • It has never been more than once a day
  • Time of day varies but is often about an hour or less later than the previous time (but not an identifiable pattern)
  • Most of the time it is every 1-6 days, but there was one interval each of 7, 9, 13, and 15 days
The cPanel version on 11/10/18 was 11.76.0.5 which was installed on 11/9/18. Version 11.74.0.10 was installed on 11/3/18. It must be something in one of these two (likely 11.76.0.5), and certainly not in 11.74.0.9 since it was installed on 10/4/18 and I did not have any issues during the 4 weeks is was running.

Here's the scenario:
  • p0f reports down, then 5 minutes later reports up, then 10 minutes later cpcrvd reports down, and reports up in 5 more minutes (it is actually down 20 minutes)
  • cpcrvd is always down for 20 minutes
  • cpsrvd can be manually started
  • p0f also reports down almost every time this occurs (a couple times exim and imap reported down instead)
Example from today (/var/log/messages):
Code:
Mar  6 22:26:52 host systemd: Stopping p0f passive fingerprinter...
Mar  6 22:26:52 host p0f: [!] WARNING: User-initiated shutdown.
Mar  6 22:26:52 host systemd: Stopping cPanel services...
Mar  6 22:26:52 host systemd: Stopped p0f passive fingerprinter.
Mar  6 22:26:52 host restartsrv_cpsrvd: Gracefully terminating process: systemctl with pid 11737 and owner root.
Mar  6 22:26:52 host restartsrv_cpsrvd: Gracefully terminating process: cpsrvd: with pid 11732 and owner root.
Mar  6 22:26:52 host restartsrv_cpsrvd: Gracefully terminating process: /usr/local with pid 11733 and owner root.
Mar  6 22:26:52 host restartsrv_cpsrvd: Waiting for 11737,11732,11733 to shutdown ....... terminated.
Mar  6 22:26:52 host systemd: Stopped cPanel services.
Mar  6 22:27:01 host systemd: Removed slice User Slice of root.
Mar  6 22:27:01 host systemd: Created slice User Slice of root.
Mar  6 22:27:01 host systemd: Started Session 41273 of user root.
Mar  6 22:27:19 host systemd: Starting p0f passive fingerprinter...
Mar  6 22:27:19 host p0f: --- p0f 3.09b by Michal Zalewski <[email protected]> ---
Mar  6 22:27:19 host p0f: [+] Closed 1 file descriptor.
Mar  6 22:27:19 host p0f: [+] Loaded 322 signatures from '/usr/local/cpanel/3rdparty/etc/p0f/p0f.fp'.
Mar  6 22:27:19 host p0f: [+] Intercepting traffic on interface 'any'.
Mar  6 22:27:19 host p0f: [+] Custom filtering rule enabled: less 400 and not dst port 80 and not dst port 443 and tcp[13] & 8==0
Mar  6 22:27:19 host p0f: [+] Listening on API socket '/var/cpanel/userhomes/cpanelconnecttrack/p0f.socket' (max 20 clients).
Mar  6 22:27:19 host p0f: [+] Privileges dropped: uid 988, gid 985, root '/var/cpanel/userhomes/cpanelconnecttrack'.
Mar  6 22:27:19 host p0f: [+] Daemon process created, PID 11819 (stderr kept as-is).
Mar  6 22:27:19 host p0f: Good luck, you're on your own now!
Mar  6 22:27:19 host systemd: Started p0f passive fingerprinter.
Mar  6 22:27:20 host pure-ftpd: ([email protected]) [INFO] New connection from 127.0.0.1
Mar  6 22:27:20 host pure-ftpd: ([email protected]) [INFO] Logout.
Mar  6 22:28:01 host systemd: Started Session 41274 of user root.
Mar  6 22:29:01 host systemd: Removed slice User Slice of root.
Mar  6 22:29:01 host systemd: Created slice User Slice of root.
Mar  6 22:29:01 host systemd: Started Session 41275 of user root.
Mar  6 22:29:55 host systemd: Created slice User Slice of petez.
Mar  6 22:29:55 host systemd-logind: New session 41276 of user petez.
Mar  6 22:29:55 host systemd: Started Session 41276 of user petez.
Mar  6 22:29:55 host dbus[4955]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Mar  6 22:29:55 host dbus[4955]: [system] Successfully activated service 'org.freedesktop.problems'
Mar  6 22:30:01 host systemd: Started Session 41277 of user root.
Mar  6 22:30:01 host systemd: Started Session 41279 of user root.
Mar  6 22:30:01 host systemd: Started Session 41278 of user root.
Mar  6 22:30:01 host systemd: Started Session 41281 of user root.
Mar  6 22:30:01 host systemd: Started Session 41282 of user root.
Mar  6 22:30:01 host systemd: Created slice User Slice of munin.
Mar  6 22:30:01 host systemd: Started Session 41283 of user munin.
Mar  6 22:30:01 host systemd: Started Session 41280 of user root.
Mar  6 22:30:10 host systemd: Removed slice User Slice of munin.
Mar  6 22:31:01 host systemd: Started Session 41284 of user root.
Mar  6 22:32:01 host systemd: Started Session 41285 of user root.
Mar  6 22:32:26 host pure-ftpd: ([email protected]) [INFO] New connection from 127.0.0.1
Mar  6 22:32:26 host pure-ftpd: ([email protected]) [INFO] Logout.
Mar  6 22:33:01 host systemd: Started Session 41286 of user root.
Mar  6 22:34:01 host systemd: Started Session 41287 of user root.
Mar  6 22:35:01 host systemd: Started Session 41288 of user root.
Mar  6 22:35:01 host systemd: Started Session 41289 of user root.
Mar  6 22:35:01 host systemd: Started Session 41292 of user root.
Mar  6 22:35:01 host systemd: Created slice User Slice of munin.
Mar  6 22:35:01 host systemd: Started Session 41293 of user munin.
Mar  6 22:35:01 host systemd: Started Session 41291 of user root.
Mar  6 22:35:01 host systemd: Started Session 41290 of user root.
Mar  6 22:35:09 host systemd: Removed slice User Slice of munin.
Mar  6 22:36:01 host systemd: Started Session 41294 of user root.
Mar  6 22:37:01 host systemd: Started Session 41295 of user root.
Mar  6 22:37:33 host pure-ftpd: ([email protected]) [INFO] New connection from 127.0.0.1
Mar  6 22:37:33 host pure-ftpd: ([email protected]) [INFO] Logout.
Mar  6 22:38:01 host systemd: Started Session 41296 of user root.
Mar  6 22:39:01 host systemd: Started Session 41298 of user root.
Mar  6 22:39:01 host systemd: Started Session 41297 of user root.
Mar  6 22:40:01 host systemd: Started Session 41301 of user root.
Mar  6 22:40:01 host systemd: Started Session 41300 of user root.
Mar  6 22:40:01 host systemd: Created slice User Slice of munin.
Mar  6 22:40:01 host systemd: Started Session 41299 of user munin.
Mar  6 22:40:01 host systemd: Started Session 41302 of user root.
Mar  6 22:40:01 host systemd: Started Session 41304 of user root.
Mar  6 22:40:01 host systemd: Started Session 41303 of user root.
Mar  6 22:40:10 host systemd: Removed slice User Slice of munin.
Mar  6 22:41:01 host systemd: Started Session 41306 of user root.
Mar  6 22:41:01 host systemd: Started Session 41305 of user root.
Mar  6 22:42:01 host systemd: Started Session 41307 of user root.
Mar  6 22:42:39 host pure-ftpd: ([email protected]) [INFO] New connection from 127.0.0.1
Mar  6 22:42:39 host pure-ftpd: ([email protected]) [INFO] Logout.
Mar  6 22:43:01 host systemd: Started Session 41308 of user root.
Mar  6 22:44:01 host systemd: Started Session 41309 of user root.
Mar  6 22:45:01 host systemd: Started Session 41310 of user root.
Mar  6 22:45:01 host systemd: Started Session 41311 of user root.
Mar  6 22:45:01 host systemd: Created slice User Slice of munin.
Mar  6 22:45:01 host systemd: Started Session 41312 of user munin.
Mar  6 22:45:01 host systemd: Started Session 41313 of user root.
Mar  6 22:45:09 host systemd: Removed slice User Slice of munin.
Mar  6 22:46:01 host systemd: Started Session 41314 of user root.
Mar  6 22:47:01 host systemd: Started Session 41315 of user root.
Mar  6 22:47:42 host pure-ftpd: ([email protected]) [INFO] New connection from 127.0.0.1
Mar  6 22:47:42 host pure-ftpd: ([email protected]) [INFO] Logout.
Mar  6 22:47:42 host systemd: Starting cPanel services...
Mar  6 22:47:42 host systemd: Stopping p0f passive fingerprinter...
Mar  6 22:47:42 host p0f: [!] WARNING: User-initiated shutdown.
Mar  6 22:47:42 host systemd: Starting mailman services...
Mar  6 22:47:42 host systemd: Stopped p0f passive fingerprinter.
Mar  6 22:47:42 host systemd: Starting p0f passive fingerprinter...
Mar  6 22:47:42 host p0f: --- p0f 3.09b by Michal Zalewski <[email protected]> ---
Mar  6 22:47:42 host p0f: [+] Closed 1 file descriptor.
Mar  6 22:47:42 host p0f: [+] Loaded 322 signatures from '/usr/local/cpanel/3rdparty/etc/p0f/p0f.fp'.
Mar  6 22:47:42 host p0f: [+] Intercepting traffic on interface 'any'.
Mar  6 22:47:42 host p0f: [+] Custom filtering rule enabled: less 400 and not dst port 80 and not dst port 443 and tcp[13] & 8==0
Mar  6 22:47:42 host p0f: [+] Listening on API socket '/var/cpanel/userhomes/cpanelconnecttrack/p0f.socket' (max 20 clients).
Mar  6 22:47:42 host p0f: [+] Privileges dropped: uid 988, gid 985, root '/var/cpanel/userhomes/cpanelconnecttrack'.
Mar  6 22:47:42 host p0f: [+] Daemon process created, PID 16487 (stderr kept as-is).
Mar  6 22:47:42 host p0f: Good luck, you're on your own now!
Mar  6 22:47:42 host systemd: Started p0f passive fingerprinter.
Mar  6 22:47:42 host restartsrv_mailman: (XID cpmtqy) The “mailman” service is not configured: there are no configured mailing lists
Mar  6 22:47:42 host systemd: PID file /usr/local/cpanel/3rdparty/mailman/data/master-qrunner.pid not readable (yet?) after start.
Mar  6 22:47:42 host systemd: Failed to start mailman services.
Mar  6 22:47:42 host systemd: Unit mailman.service entered failed state.
Mar  6 22:47:42 host systemd: mailman.service failed.
Mar  6 22:47:42 host restartsrv_cpsrvd: License is valid and has already updated recently.
Mar  6 22:47:42 host restartsrv_cpsrvd: Starting PID 16492: /usr/local/cpanel/libexec/cpsrvd-dormant
Mar  6 22:47:42 host systemd: Failed to read PID from file /var/run/cpsrvd.pid: Invalid argument
Mar  6 22:47:42 host systemd: cpanel.service: Supervising process 16492 which is not our child. We'll most likely not notice when it exits.
Mar  6 22:47:42 host systemd: Started cPanel services.
Mar  6 22:47:42 host systemd: Starting cPanel Greylisting Daemon...
Mar  6 22:47:42 host restartsrv_cpgreylistd: (XID ayvqp2) The “cpgreylistd” service is disabled.
Mar  6 22:47:42 host systemd: cpgreylistd.service: control process exited, code=exited status=2
Mar  6 22:47:42 host systemd: Failed to start cPanel Greylisting Daemon.
Mar  6 22:47:42 host systemd: Unit cpgreylistd.service entered failed state.
Mar  6 22:47:42 host systemd: cpgreylistd.service failed.
Mar  6 22:48:01 host systemd: Started Session 41316 of user root.
Mar  6 22:49:01 host systemd: Started Session 41317 of user root.
Mar  6 22:50:01 host systemd: Started Session 41322 of user root.
Mar  6 22:50:01 host systemd: Started Session 41321 of user root.
Mar  6 22:50:01 host systemd: Started Session 41319 of user root.
Mar  6 22:50:01 host systemd: Started Session 41323 of user root.
I have the applicable sections of logs from today saved.

I hope this helps.

I am not able to open a ticket at this time (I expect you'll ask) but can do so for a future occurance, if someone else hasn't already done so.

-Pete
 
Last edited by a moderator:
  • Like
Reactions: cPanelLauren

cPanelLauren

Forums Analyst II
Staff member
Nov 14, 2017
9,054
766
263
Houston
Hi @PeteS


Issues like this, especially intermittent issues are easiest to troubleshoot with access when the issue occurring. If possible next time please do open a ticket as we haven't one opened at this point.

I'm also curious why cpsrvd is restarting those services is anything noted in the chksrvd logs?

/var/log/chkservd.log

or in the tailwatchd logs?

/usr/local/cpanel/logs/tailwatchd_log