Morning folks... last night at 23:03 httpd decided to restart itself, and ended up in a loop.
A quick ps -aux revealed a ton of php-fpm processes, all for the same user. Running /scripts/restartsrv_apache_php_fpm immediately allowed httpd to come up and stay up, and as expected the php-fpm processes dropped back to the usual 4. All has been fine since.
A check of /opt/cpanel/ea-php71/root/usr/var/log/php-fpm/error.log shows php-fpm reaching max_children (40) at 20:41 last night, then no further log entries until I restarted php-fpm this morning at 08:04 to allow httpd to recover.
The php-fpm error log for the domain in question, /home/user/log/doma_in.php.error.log, shows nothing for that period.
I _should_ have straced one of the processes, but didn't. There were 3 users sat on that website all night (it checks for notifications and other things periodically via ajax), and it appears something got stuck somewhere, causing more and more child processes to be created until the server hit max_children. Eventually this took httpd down; chkservd did its job and attempted to restart httpd, but at no point did it attempt to restart php-fpm alongside, so nothing could come back up without manual intervention.
The actual question: is there any way to modify whatever routine chkservd uses to restart apache so that it also restarts apache_php_fpm at the same time, whilst I examine the code on the website in question to find the actual cause? That would be enough to stop the problem taking down all sites on the server. ('PHP-FPM service for Apache' is enabled and monitored in Service Manager, but it appears that when Service Manager finds a need to restart apache, it _only_ restarts apache, not apache_php_fpm.)
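For what it's worth, the stopgap I have in mind if chkservd can't be made to do this is a small cron job along these lines. This is only a sketch, not anything chkservd provides. The threshold of 35 is a number I made up to sit just under max_children (40), the function names are mine, and it assumes the pool's workers show up in ps with the usual "php-fpm: pool <name>" process title.

```python
import subprocess

POOL = "doma_in"   # pool name as it appears in the php-fpm log
THRESHOLD = 35     # hypothetical trip point, just under max_children (40)
# The restart script I ran by hand; adjust the path if yours differs.
RESTART_CMD = ["/scripts/restartsrv_apache_php_fpm"]


def count_pool_children(ps_output, pool):
    """Count lines of `ps -eo args` output that are workers of the given pool."""
    title = f"php-fpm: pool {pool}"
    # Exact match after stripping, so pool "doma_in" does not match "doma_in2".
    return sum(1 for line in ps_output.splitlines() if line.strip() == title)


def should_restart(ps_output, pool=POOL, threshold=THRESHOLD):
    """True once the pool has at least `threshold` worker processes."""
    return count_pool_children(ps_output, pool) >= threshold


def main():
    # Assumption: procps `ps` is available and prints one command line per row.
    ps_output = subprocess.run(
        ["ps", "-eo", "args"], capture_output=True, text=True, check=True
    ).stdout
    if should_restart(ps_output):
        subprocess.run(RESTART_CMD, check=False)
```

Calling main() from a root cron entry every minute or so would do it. A high worker count is not proof that the workers are stuck, so this is a blunt instrument for buying time rather than a fix.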
I've noticed this issue of stuck php-fpm processes a few times with this particular server and site, mainly because it's also accessed from an internal network. If our IT guys reboot our proxy server (for instance), it causes the same problem for users who have this particular web application sat open, until either they close and reopen their browser to restart their session or I manually restart apache_php_fpm.
Code:
[Fri Aug 24 07:12:57.922269 2018] [mpm_worker:notice] [pid 25315:tid 139892578699232] AH00297: SIGUSR1 received. Doing graceful restart
[Fri Aug 24 07:12:58.520698 2018] [mpm_worker:warn] [pid 25315:tid 139892578699232] AH00317: MaxRequestWorkers of 80 is not an integer multiple of ThreadsPerChild of 25, decreasing to nearest multiple 75
[Fri Aug 24 07:12:58.554426 2018] [http2:info] [pid 25315:tid 139892578699232] AH03090: mod_http2 (v1.10.20, feats=CHPRIO+SHA256+INVHD+DWINS, nghttp2 1.32.0), initializing...
[Fri Aug 24 07:12:58.555398 2018] [mpm_worker:notice] [pid 25315:tid 139892578699232] AH00292: Apache/2.4.34 (cPanel) OpenSSL/1.0.2o mod_bwlimited/1.4 configured -- resuming normal operations
[Fri Aug 24 07:12:58.555434 2018] [core:notice] [pid 25315:tid 139892578699232] AH00094: Command line: '/usr/sbin/httpd'
[Fri Aug 24 07:17:58.963914 2018] [mpm_worker:notice] [pid 25315:tid 139892578699232] AH00297: SIGUSR1 received. Doing graceful restart
[Fri Aug 24 07:17:59.469715 2018] [mpm_worker:warn] [pid 25315:tid 139892578699232] AH00317: MaxRequestWorkers of 80 is not an integer multiple of ThreadsPerChild of 25, decreasing to nearest multiple 75
[Fri Aug 24 07:17:59.498680 2018] [http2:info] [pid 25315:tid 139892578699232] AH03090: mod_http2 (v1.10.20, feats=CHPRIO+SHA256+INVHD+DWINS, nghttp2 1.32.0), initializing...
[Fri Aug 24 07:17:59.499476 2018] [mpm_worker:notice] [pid 25315:tid 139892578699232] AH00292: Apache/2.4.34 (cPanel) OpenSSL/1.0.2o mod_bwlimited/1.4 configured -- resuming normal operations
[Fri Aug 24 07:17:59.499506 2018] [core:notice] [pid 25315:tid 139892578699232] AH00094: Command line: '/usr/sbin/httpd'
[Fri Aug 24 07:23:00.332397 2018] [mpm_worker:notice] [pid 25315:tid 139892578699232] AH00297: SIGUSR1 received. Doing graceful restart
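The loop shows up in that log as a graceful restart roughly every five minutes, presumably once per service check. A quick way to measure the gaps, as a rough sketch: the function name is mine, and it assumes the Apache error_log timestamp format shown in the lines above.

```python
import re
from datetime import datetime

# Matches the AH00297 "SIGUSR1 received" lines and captures the timestamp.
RESTART_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[mpm_\w+:notice\] .*AH00297: SIGUSR1 received"
)


def graceful_restart_gaps(lines):
    """Return the seconds between consecutive graceful restarts."""
    stamps = [
        datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S.%f %Y")
        for line in lines
        if (m := RESTART_RE.match(line))
    ]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
```

Fed the three AH00297 lines above, it gives gaps of about 301 seconds each.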
Code:
[23-Aug-2018 19:45:48] NOTICE: [pool doma_in] child 6530 started
[23-Aug-2018 19:49:51] NOTICE: [pool doma_in] child 6444 exited with code 0 after 369.456144 seconds from start
[23-Aug-2018 19:49:51] NOTICE: [pool doma_in] child 6663 started
[23-Aug-2018 19:50:48] NOTICE: [pool doma_in] child 6468 exited with code 0 after 380.832191 seconds from start
[23-Aug-2018 19:50:48] NOTICE: [pool doma_in] child 6711 started
[23-Aug-2018 19:51:48] NOTICE: [pool doma_in] child 6526 exited with code 0 after 386.958746 seconds from start
[23-Aug-2018 19:51:48] NOTICE: [pool doma_in] child 6737 started
[23-Aug-2018 19:52:18] NOTICE: [pool doma_in] child 6530 exited with code 0 after 389.997204 seconds from start
[23-Aug-2018 19:52:18] NOTICE: [pool doma_in] child 6761 started
[23-Aug-2018 19:56:20] NOTICE: [pool doma_in] child 6663 exited with code 0 after 388.938428 seconds from start
[23-Aug-2018 19:56:20] NOTICE: [pool doma_in] child 6891 started
[23-Aug-2018 19:57:11] NOTICE: [pool doma_in] child 6711 exited with code 0 after 383.051672 seconds from start
[23-Aug-2018 19:57:11] NOTICE: [pool doma_in] child 6916 started
[23-Aug-2018 19:58:12] NOTICE: [pool doma_in] child 6737 exited with code 0 after 384.241633 seconds from start
[23-Aug-2018 19:58:12] NOTICE: [pool doma_in] child 6944 started
[23-Aug-2018 19:58:48] NOTICE: [pool doma_in] child 6761 exited with code 0 after 389.961665 seconds from start
[23-Aug-2018 19:58:48] NOTICE: [pool doma_in] child 6954 started
[23-Aug-2018 20:02:44] NOTICE: [pool doma_in] child 6891 exited with code 0 after 383.337425 seconds from start
[23-Aug-2018 20:02:44] NOTICE: [pool doma_in] child 7135 started
[23-Aug-2018 20:03:20] NOTICE: [pool doma_in] child 6916 exited with code 0 after 368.832361 seconds from start
[23-Aug-2018 20:03:20] NOTICE: [pool doma_in] child 7168 started
[23-Aug-2018 20:04:47] NOTICE: [pool doma_in] child 6944 exited with code 0 after 395.402790 seconds from start
[23-Aug-2018 20:04:47] NOTICE: [pool doma_in] child 7220 started
[23-Aug-2018 20:05:11] NOTICE: [pool doma_in] child 6954 exited with code 0 after 383.537759 seconds from start
[23-Aug-2018 20:05:11] NOTICE: [pool doma_in] child 7255 started
[23-Aug-2018 20:08:58] NOTICE: [pool doma_in] child 7135 exited with code 0 after 374.449430 seconds from start
[23-Aug-2018 20:08:58] NOTICE: [pool doma_in] child 7338 started
[23-Aug-2018 20:09:48] NOTICE: [pool doma_in] child 7168 exited with code 0 after 387.326123 seconds from start
[23-Aug-2018 20:09:48] NOTICE: [pool doma_in] child 7392 started
[23-Aug-2018 20:11:12] NOTICE: [pool doma_in] child 7220 exited with code 0 after 384.725260 seconds from start
[23-Aug-2018 20:11:12] NOTICE: [pool doma_in] child 7470 started
[23-Aug-2018 20:11:40] NOTICE: [pool doma_in] child 7255 exited with code 0 after 389.052525 seconds from start
[23-Aug-2018 20:11:40] NOTICE: [pool doma_in] child 7477 started
[23-Aug-2018 20:15:42] NOTICE: [pool doma_in] child 7338 exited with code 0 after 404.041951 seconds from start
[23-Aug-2018 20:15:42] NOTICE: [pool doma_in] child 7809 started
[23-Aug-2018 20:15:50] NOTICE: [pool doma_in] child 7392 exited with code 0 after 362.688153 seconds from start
[23-Aug-2018 20:15:50] NOTICE: [pool doma_in] child 7812 started
[23-Aug-2018 20:17:43] NOTICE: [pool doma_in] child 7470 exited with code 0 after 390.406430 seconds from start
[23-Aug-2018 20:17:43] NOTICE: [pool doma_in] child 7861 started
[23-Aug-2018 20:17:58] NOTICE: [pool doma_in] child 7477 exited with code 0 after 377.616078 seconds from start
[23-Aug-2018 20:17:58] NOTICE: [pool doma_in] child 7865 started
[23-Aug-2018 20:21:50] NOTICE: [pool doma_in] child 7809 exited with code 0 after 368.196082 seconds from start
[23-Aug-2018 20:21:50] NOTICE: [pool doma_in] child 7996 started
[23-Aug-2018 20:22:21] NOTICE: [pool doma_in] child 7812 exited with code 0 after 390.710030 seconds from start
[23-Aug-2018 20:22:21] NOTICE: [pool doma_in] child 8024 started
[23-Aug-2018 20:24:18] NOTICE: [pool doma_in] child 7861 exited with code 0 after 395.430634 seconds from start
[23-Aug-2018 20:24:18] NOTICE: [pool doma_in] child 8081 started
[23-Aug-2018 20:24:21] NOTICE: [pool doma_in] child 7865 exited with code 0 after 382.774186 seconds from start
[23-Aug-2018 20:24:21] NOTICE: [pool doma_in] child 8083 started
[23-Aug-2018 20:28:12] NOTICE: [pool doma_in] child 7996 exited with code 0 after 381.299786 seconds from start
[23-Aug-2018 20:28:12] NOTICE: [pool doma_in] child 8212 started
[23-Aug-2018 20:28:53] NOTICE: [pool doma_in] child 8024 exited with code 0 after 392.200477 seconds from start
[23-Aug-2018 20:28:53] NOTICE: [pool doma_in] child 8219 started
[23-Aug-2018 20:30:42] NOTICE: [pool doma_in] child 8083 exited with code 0 after 380.700935 seconds from start
[23-Aug-2018 20:30:42] NOTICE: [pool doma_in] child 8309 started
[23-Aug-2018 20:30:48] NOTICE: [pool doma_in] child 8081 exited with code 0 after 390.045318 seconds from start
[23-Aug-2018 20:30:48] NOTICE: [pool doma_in] child 8313 started
[23-Aug-2018 20:41:11] WARNING: [pool doma_in] server reached max_children setting (40), consider raising it
[24-Aug-2018 08:04:29] NOTICE: Terminating ...
[24-Aug-2018 08:04:29] NOTICE: exiting, bye-bye!
[24-Aug-2018 08:04:29] NOTICE: fpm is running, pid 32041
[24-Aug-2018 08:04:29] NOTICE: ready to handle connections
[24-Aug-2018 08:04:42] NOTICE: Terminating ...
[24-Aug-2018 08:04:42] NOTICE: exiting, bye-bye!
[24-Aug-2018 08:04:42] NOTICE: fpm is running, pid 32302
[24-Aug-2018 08:04:42] NOTICE: ready to handle connections
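To spot this sooner next time, the max_children warning can be pulled out of the php-fpm log with something like the following. It is a minimal sketch: the function name is mine, and the pattern is written against the log format shown above, so it may need adjusting for other php-fpm versions.

```python
import re
from datetime import datetime

# Matches lines such as:
# [23-Aug-2018 20:41:11] WARNING: [pool doma_in] server reached max_children setting (40), consider raising it
WARNING_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] WARNING: \[pool (?P<pool>[^\]]+)\] "
    r"server reached max_children setting \((?P<limit>\d+)\)"
)


def find_max_children_hits(lines):
    """Return (timestamp, pool, limit) for every max_children warning."""
    hits = []
    for line in lines:
        m = WARNING_RE.match(line)
        if m:
            when = datetime.strptime(m.group("ts"), "%d-%b-%Y %H:%M:%S")
            hits.append((when, m.group("pool"), int(m.group("limit"))))
    return hits
```

Run over the log above, it returns a single hit for pool doma_in at 20:41:11 on 23 Aug with a limit of 40.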