Apache restart loop & stuck PHP-FPM processes

APatchworkBoy

Active Member
Feb 26, 2018
27
7
3
West Yorkshire, UK
cPanel Access Level
Root Administrator
Morning folks... last night at 23:03 httpd decided to restart itself, and ended up in a loop.

Code:
[Fri Aug 24 07:12:57.922269 2018] [mpm_worker:notice] [pid 25315:tid 139892578699232] AH00297: SIGUSR1 received.  Doing graceful restart

[Fri Aug 24 07:12:58.520698 2018] [mpm_worker:warn] [pid 25315:tid 139892578699232] AH00317: MaxRequestWorkers of 80 is not an integer multiple of ThreadsPerChild of 25, decreasing to nearest multiple 75

[Fri Aug 24 07:12:58.554426 2018] [http2:info] [pid 25315:tid 139892578699232] AH03090: mod_http2 (v1.10.20, feats=CHPRIO+SHA256+INVHD+DWINS, nghttp2 1.32.0), initializing...

[Fri Aug 24 07:12:58.555398 2018] [mpm_worker:notice] [pid 25315:tid 139892578699232] AH00292: Apache/2.4.34 (cPanel) OpenSSL/1.0.2o mod_bwlimited/1.4 configured -- resuming normal operations

[Fri Aug 24 07:12:58.555434 2018] [core:notice] [pid 25315:tid 139892578699232] AH00094: Command line: '/usr/sbin/httpd'

[Fri Aug 24 07:17:58.963914 2018] [mpm_worker:notice] [pid 25315:tid 139892578699232] AH00297: SIGUSR1 received.  Doing graceful restart

[Fri Aug 24 07:17:59.469715 2018] [mpm_worker:warn] [pid 25315:tid 139892578699232] AH00317: MaxRequestWorkers of 80 is not an integer multiple of ThreadsPerChild of 25, decreasing to nearest multiple 75

[Fri Aug 24 07:17:59.498680 2018] [http2:info] [pid 25315:tid 139892578699232] AH03090: mod_http2 (v1.10.20, feats=CHPRIO+SHA256+INVHD+DWINS, nghttp2 1.32.0), initializing...

[Fri Aug 24 07:17:59.499476 2018] [mpm_worker:notice] [pid 25315:tid 139892578699232] AH00292: Apache/2.4.34 (cPanel) OpenSSL/1.0.2o mod_bwlimited/1.4 configured -- resuming normal operations

[Fri Aug 24 07:17:59.499506 2018] [core:notice] [pid 25315:tid 139892578699232] AH00094: Command line: '/usr/sbin/httpd'

[Fri Aug 24 07:23:00.332397 2018] [mpm_worker:notice] [pid 25315:tid 139892578699232] AH00297: SIGUSR1 received.  Doing graceful restart
A quick ps -aux revealed a ton of php_fpm processes all for the same user. Running /script/restartsrv_apache_php_fpm suddenly allowed httpd to come up and stay up, and as expected the php_fpm processes all vanished down to the usual 4 and all has been fine since.

A check of /opt/cpanel/ea-php71/root/usr/var/log/php-fpm/error.log shows php-fpm reaching max_children (40) at 20:41 last night, then no further log entries until I restart php-fpm this morning at 08:04 to allow httpd to recover.

Code:
[23-Aug-2018 19:45:48] NOTICE: [pool doma_in] child 6530 started
[23-Aug-2018 19:49:51] NOTICE: [pool doma_in] child 6444 exited with code 0 after 369.456144 seconds from start
[23-Aug-2018 19:49:51] NOTICE: [pool doma_in] child 6663 started
[23-Aug-2018 19:50:48] NOTICE: [pool doma_in] child 6468 exited with code 0 after 380.832191 seconds from start
[23-Aug-2018 19:50:48] NOTICE: [pool doma_in] child 6711 started
[23-Aug-2018 19:51:48] NOTICE: [pool doma_in] child 6526 exited with code 0 after 386.958746 seconds from start
[23-Aug-2018 19:51:48] NOTICE: [pool doma_in] child 6737 started
[23-Aug-2018 19:52:18] NOTICE: [pool doma_in] child 6530 exited with code 0 after 389.997204 seconds from start
[23-Aug-2018 19:52:18] NOTICE: [pool doma_in] child 6761 started
[23-Aug-2018 19:56:20] NOTICE: [pool doma_in] child 6663 exited with code 0 after 388.938428 seconds from start
[23-Aug-2018 19:56:20] NOTICE: [pool doma_in] child 6891 started
[23-Aug-2018 19:57:11] NOTICE: [pool doma_in] child 6711 exited with code 0 after 383.051672 seconds from start
[23-Aug-2018 19:57:11] NOTICE: [pool doma_in] child 6916 started
[23-Aug-2018 19:58:12] NOTICE: [pool doma_in] child 6737 exited with code 0 after 384.241633 seconds from start
[23-Aug-2018 19:58:12] NOTICE: [pool doma_in] child 6944 started
[23-Aug-2018 19:58:48] NOTICE: [pool doma_in] child 6761 exited with code 0 after 389.961665 seconds from start
[23-Aug-2018 19:58:48] NOTICE: [pool doma_in] child 6954 started
[23-Aug-2018 20:02:44] NOTICE: [pool doma_in] child 6891 exited with code 0 after 383.337425 seconds from start
[23-Aug-2018 20:02:44] NOTICE: [pool doma_in] child 7135 started
[23-Aug-2018 20:03:20] NOTICE: [pool doma_in] child 6916 exited with code 0 after 368.832361 seconds from start
[23-Aug-2018 20:03:20] NOTICE: [pool doma_in] child 7168 started
[23-Aug-2018 20:04:47] NOTICE: [pool doma_in] child 6944 exited with code 0 after 395.402790 seconds from start
[23-Aug-2018 20:04:47] NOTICE: [pool doma_in] child 7220 started
[23-Aug-2018 20:05:11] NOTICE: [pool doma_in] child 6954 exited with code 0 after 383.537759 seconds from start
[23-Aug-2018 20:05:11] NOTICE: [pool doma_in] child 7255 started
[23-Aug-2018 20:08:58] NOTICE: [pool doma_in] child 7135 exited with code 0 after 374.449430 seconds from start
[23-Aug-2018 20:08:58] NOTICE: [pool doma_in] child 7338 started
[23-Aug-2018 20:09:48] NOTICE: [pool doma_in] child 7168 exited with code 0 after 387.326123 seconds from start
[23-Aug-2018 20:09:48] NOTICE: [pool doma_in] child 7392 started
[23-Aug-2018 20:11:12] NOTICE: [pool doma_in] child 7220 exited with code 0 after 384.725260 seconds from start
[23-Aug-2018 20:11:12] NOTICE: [pool doma_in] child 7470 started
[23-Aug-2018 20:11:40] NOTICE: [pool doma_in] child 7255 exited with code 0 after 389.052525 seconds from start
[23-Aug-2018 20:11:40] NOTICE: [pool doma_in] child 7477 started
[23-Aug-2018 20:15:42] NOTICE: [pool doma_in] child 7338 exited with code 0 after 404.041951 seconds from start
[23-Aug-2018 20:15:42] NOTICE: [pool doma_in] child 7809 started
[23-Aug-2018 20:15:50] NOTICE: [pool doma_in] child 7392 exited with code 0 after 362.688153 seconds from start
[23-Aug-2018 20:15:50] NOTICE: [pool doma_in] child 7812 started
[23-Aug-2018 20:17:43] NOTICE: [pool doma_in] child 7470 exited with code 0 after 390.406430 seconds from start
[23-Aug-2018 20:17:43] NOTICE: [pool doma_in] child 7861 started
[23-Aug-2018 20:17:58] NOTICE: [pool doma_in] child 7477 exited with code 0 after 377.616078 seconds from start
[23-Aug-2018 20:17:58] NOTICE: [pool doma_in] child 7865 started
[23-Aug-2018 20:21:50] NOTICE: [pool doma_in] child 7809 exited with code 0 after 368.196082 seconds from start
[23-Aug-2018 20:21:50] NOTICE: [pool doma_in] child 7996 started
[23-Aug-2018 20:22:21] NOTICE: [pool doma_in] child 7812 exited with code 0 after 390.710030 seconds from start
[23-Aug-2018 20:22:21] NOTICE: [pool doma_in] child 8024 started
[23-Aug-2018 20:24:18] NOTICE: [pool doma_in] child 7861 exited with code 0 after 395.430634 seconds from start
[23-Aug-2018 20:24:18] NOTICE: [pool doma_in] child 8081 started
[23-Aug-2018 20:24:21] NOTICE: [pool doma_in] child 7865 exited with code 0 after 382.774186 seconds from start
[23-Aug-2018 20:24:21] NOTICE: [pool doma_in] child 8083 started
[23-Aug-2018 20:28:12] NOTICE: [pool doma_in] child 7996 exited with code 0 after 381.299786 seconds from start
[23-Aug-2018 20:28:12] NOTICE: [pool doma_in] child 8212 started
[23-Aug-2018 20:28:53] NOTICE: [pool doma_in] child 8024 exited with code 0 after 392.200477 seconds from start
[23-Aug-2018 20:28:53] NOTICE: [pool doma_in] child 8219 started
[23-Aug-2018 20:30:42] NOTICE: [pool doma_in] child 8083 exited with code 0 after 380.700935 seconds from start
[23-Aug-2018 20:30:42] NOTICE: [pool doma_in] child 8309 started
[23-Aug-2018 20:30:48] NOTICE: [pool doma_in] child 8081 exited with code 0 after 390.045318 seconds from start
[23-Aug-2018 20:30:48] NOTICE: [pool doma_in] child 8313 started
[23-Aug-2018 20:41:11] WARNING: [pool doma_in] server reached max_children setting (40), consider raising it
[24-Aug-2018 08:04:29] NOTICE: Terminating ...
[24-Aug-2018 08:04:29] NOTICE: exiting, bye-bye!
[24-Aug-2018 08:04:29] NOTICE: fpm is running, pid 32041
[24-Aug-2018 08:04:29] NOTICE: ready to handle connections
[24-Aug-2018 08:04:42] NOTICE: Terminating ...
[24-Aug-2018 08:04:42] NOTICE: exiting, bye-bye!
[24-Aug-2018 08:04:42] NOTICE: fpm is running, pid 32302
[24-Aug-2018 08:04:42] NOTICE: ready to handle connections
The domain_in in question's own php-fpm error log in /home/user/log/doma_in.php.error.log reveals nothing over that duration of time.

I _should_ have straced one of the processes, but didn't. There were 3x users who were sat on that website all night (which checks for notifications and other things periodically via ajax) and it would appear something got stuck somewhere, causing more and more child processes to be created til the server hit max_children. Eventually this took httpd down and chksrvd did it's job and attempted to restart itself, but it seems at no point did it attempt to restart php_fpm alongside so nothing could come back up without manual intervention.

The actual question: Is there anyway to modify whatever routine chksrvd uses to restart apache to also restart apache_php_fpm at the same time, whilst I examine the code on the website in question to find out the actual cause...? This would be enough to prevent the problem taking down all sites on the server. ('PHP-FPM service for Apache' is enabled and monitored in Service Manager but it would appear that if service manager finds a need to restart apache, it _only_ restarts apache, not apache_php_fpm)

I've noticed this issue of stuck php-fpm processes a few times with this particular server and site, mainly because it's also accessed from an internal network and if our IT guys reboot our proxy server (for instance) it causes the same problem for users with this particular web application sat open, until either they close and reopen their browser to restart their _session or I manually restart apache_php_fpm.
 

cPanelLauren

Product Owner II
Staff member
Nov 14, 2017
13,266
1,300
363
Houston
Hi @APatchworkBoy
While in this instance it looks like your issue was related to the PHP-FPM user pool being maxed out I do think there is a further issue here as it shouldn't enter into a restart loop like this. I believe that this issue warrants more in-depth investigation and because of that can you please open a ticket using the link in my signature? Once open please reply with the Ticket ID here so that we can update this thread with the resolution once the ticket is resolved.


Thanks!
 

APatchworkBoy

Active Member
Feb 26, 2018
27
7
3
West Yorkshire, UK
cPanel Access Level
Root Administrator
Evening @cPanelLauren (UK time)... ta for that - basic investigation = I suspect the root cause was an XMLRPC attack resulting in the segfault. Mod_security _was_ blocking this at the time so not sure why it resulted in segfaults (poss linked to opcache & php-fpm in use?). The IP has since been added as a deny rule in /etc/hosts.allow - details in the ticket. Ticket ID: 10200131

Normal day-to-day use, the server rarely gets anywhere _near_ hitting maxchildren.
 
Last edited by a moderator:

cPanelLauren

Product Owner II
Staff member
Nov 14, 2017
13,266
1,300
363
Houston
Hi @APatchworkBoy

I'm glad you opened a ticket I just checked in on it and it seems that you may have identified a coding issue as the cause for the problem as well?

Thanks!
 

APatchworkBoy

Active Member
Feb 26, 2018
27
7
3
West Yorkshire, UK
cPanel Access Level
Root Administrator
Sorry @cPanelLauren - been on leave for a week...!

Combination of issues - maxrequestworkers too low combined with a possible coding error resulting in a PHP Session race condition causing maxrequestworkers to be hit. Am waiting for recurrence to identify for certain, although I think the session race condition’s now been mitigated so may never know (nothing concrete in the logs to indicate it _was_ a race, but as I’ve said: ordinarily this box gets nowhere near hitting it’s limits). My segfault references in earlier post weren’t segfaults - misdiagnosis at my end!

Have tweaked config according to suggestions by (the ever-wonderful) Roman and will be monitoring from tomorrow when the server will be back under it’s 'usual' workload (educational establishment - all staff & students have been off for 6 weeks so server traffic was technically at it’s lowest when all this happened).
 

cPanelLauren

Product Owner II
Staff member
Nov 14, 2017
13,266
1,300
363
Houston
Hi @APatchworkBoy

I'm glad to hear that they were able to get you situated, it's easiest to identify an issue when you're able to access the server and see what's going in real time.

Thanks!
 

cPanelLauren

Product Owner II
Staff member
Nov 14, 2017
13,266
1,300
363
Houston
Hello @masterross

Without knowing why it failed it's hard to provide advice. If you check the error logs as listed previously in this thread and post relevant errors we might be able to assist better