Suspect Updates are crashing server, or server cannot cope...

JohnnyBgood

Member
Feb 6, 2015
14
0
51
cPanel Access Level
Root Administrator
Hi all,

We have been running an AWS server (t2.2xlarge) with CentOS 7 since October 2019. We use Cloudflare, and while trying to resolve this issue we enabled Cloudflare DDoS protection, just to rule an attack out.

Most of the time the site seems to run fine with load averages of: 0.51 0.51 0.50

The site gets 700-900k page views a day, and each day it peaks at 2500-4000 active users on the site (according to analytics). As mentioned, the server seems to handle these numbers fine, so we do not think it is struggling with the load.

Our server has crashed twice, and we haven't been able to work out why. The first time was in Sept 2020, and I asked about it here

Yesterday we had the same problem. It looks like this process was causing the issue:

Code:
 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid
Our load averages went to: 1120.90 911.14 570.19

We tried restarting the server, which did not help: the site came back, then started hanging and crashing again. We ran yum update, which also did not help.

We then ran /scripts/upcp --force, which we think started to resolve the issue once we had restarted Apache, MySQL, etc. in WHM. After that we restarted the server and ran yum clean all and rpm --rebuilddb. We then just kept killing new mysqld processes as they appeared.

Once the server was back, we started trying to find out what went wrong. So far we have found nothing.

Nothing stands out in the error_log, and all the other logs we've checked show only normal, typical entries.

The only thing we can find is in our mysqld.log. On both dates when the site crashed, we found this:

Code:
2020-11-20T18:55:00.531148Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 2808360152
2020-11-20T18:55:00.531156Z 0 [Note] InnoDB: Database was not shutdown normally!
2020-11-20T18:55:00.531162Z 0 [Note] InnoDB: Starting crash recovery.
But we cannot find why it crashed.

The site crashed at about 16:06 UTC. That morning we had an email from the server saying:

Code:
Cron <[email protected]> /usr/bin/kcarectl -q --auto-update

sysctl: cannot stat /proc/sys/fs/enforce_symlinksifowner: No such file or directory
sysctl: cannot stat /proc/sys/fs/symlinkown_gid: No such file or directory
But we cannot see whether the server tried to update again around the time the site went down. We are wondering whether something is trying to update or run automatically on the server and crashing the site.

If anyone can give us any help, we would be really grateful.
 

cPRex

Jurassic Moderator
Staff member
Oct 19, 2014
715
97
153
cPanel Access Level
Root Administrator
Hey there!

First of all, cool username :D

Sorry to hear about the issues with the site. The mysql process that you mention is just the main process, which runs on CentOS 7 systems. I see the same thing on a CentOS 7 test machine when I check with the "ps aux" command:

Code:
[[email protected] ~]# ps aux | grep mysql
mysql     1317  0.0  3.7 1036920 77220 ?       Sl   Nov20   1:22 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid
so I don't think that is the issue with the system, unless MySQL was using too many resources and causing stability issues.
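To put a number on whether MySQL is using too many resources, the ps aux output can be summarized per command. This is only a rough sketch: sum_usage is a hypothetical helper name, and it assumes the standard ps aux column layout (%CPU in column 3, %MEM in column 4, command in column 11).

```shell
# Sum %CPU and %MEM for every process whose command matches a pattern.
# Reads `ps aux` output on stdin and skips the header line.
# $1 = regular expression matched against the command column
sum_usage() {
    awk -v pat="$1" '
        NR > 1 && $11 ~ pat { cpu += $3; mem += $4; n++ }
        END { printf "%d procs, %.1f%% CPU, %.1f%% MEM\n", n, cpu, mem }
    '
}

# Example:
#   ps aux | sum_usage mysqld
```

A snapshot like this is most useful while the load is high; taken on an idle server it says little about what happens during a crash.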

The second thing you posted, which shows the InnoDB crash, is much more concerning. That indicates MySQL was not shut down cleanly, and there may be damage to the MySQL data on the system, which could keep things from working well and will cause issues. The guide we have posted in the link below explains how you can confirm this (although I would say the log you have provided is enough to confirm the issue) and also gives a link to some resources to help repair that:


The KernelCare update notification is likely not related. It's something that should still be investigated, but it seems like the MySQL service and crashed tables are more the core issue that needs to be dealt with.
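One way to investigate whether kcarectl or another scheduled job ran close to the crash is to filter the cron log by time. A minimal sketch, assuming the stock CentOS 7 cron log at /var/log/cron with syslog-style timestamps in the server's local timezone (which may not be UTC); cron_window is a hypothetical helper name.

```shell
# Print cron log entries whose timestamp starts with the given prefix.
# $1 = log file
# $2 = extended-regex timestamp prefix, for example 'Nov 20 1[56]:'
cron_window() {
    grep -E "^$2" "$1"
}

# Example (path and times are assumptions, adjust to your server):
#   cron_window /var/log/cron 'Nov 20 1[56]:'
```

If nothing was scheduled in the hour before the load spike, automatic updates become a less likely cause.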
 

JohnnyBgood

Hi cPRex! Thank you so much for your reply. I have followed the guide you provided, but could not find anything.

We did not have a /var/lib/mysql/HOSTNAME.err error log; the only log we had was /var/log/mysqld.log

I searched that log for the keywords in the guide ("corruption | failed | corrupt | deleted | moved"). It contained none of those words.
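The search described above can be sketched as a single grep. Note that the InnoDB lines quoted earlier contain none of the guide's keywords, so this sketch also adds "recovery" and "not shutdown normally" to the pattern; that extension and the helper name scan_mysql_log are assumptions, not part of the guide.

```shell
# List log lines matching the guide's keywords plus two crash-recovery
# phrases, case-insensitively, with line numbers.
# "corrupt" also covers "corruption", so it is listed only once.
# $1 = path to the MySQL log
scan_mysql_log() {
    grep -n -i -E 'corrupt|failed|deleted|moved|recovery|not shutdown normally' "$1"
}

# Example:
#   scan_mysql_log /var/log/mysqld.log
```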

When I ran the checks on the tables in MySQL - all the tables came back "ok" (see attached)

So I am not sure whether the MySQL data is corrupt.


The server copes with a peak of 3k concurrent users and 800k pageviews a day, but could it be struggling during a server update? Our daily process logs show we use 37% of memory each day. Is there any way to check this theory?

Do you, or anyone else, have any other ideas about what I could look into to resolve this issue?

Thanks again for your reply and help.
 


cPRex

Thanks for the additional details. The earlier log showed the InnoDB crash clearly, so you may want to search around that point in the log file. The hostname.err log is just the default location, but it can be changed on the system in the /etc/my.cnf file.
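A rough sketch of both steps: check whether /etc/my.cnf sets a custom error-log path, then print the lines leading up to each crash-recovery start. The helper names find_error_log and recovery_context are hypothetical, and the default of 30 lines of context is an arbitrary choice. Settings in files included from my.cnf are not covered by this check.

```shell
# Show any log-error / log_error setting in a MySQL config file.
# $1 = config file
find_error_log() {
    grep -E '^[[:space:]]*log[-_]error' "$1"
}

# Print the lines leading up to each "Starting crash recovery" entry.
# $1 = log file, $2 = number of lines of context before the match (default 30)
recovery_context() {
    grep -n -B "${2:-30}" 'Starting crash recovery' "$1"
}

# Example:
#   find_error_log /etc/my.cnf
#   recovery_context /var/log/mysqld.log 50
```

The entries just before the recovery message are where a shutdown request, a kill signal, or an error would show up if MySQL logged one.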

For the memory check, a daily summary is not really accurate; it's best to watch the usage in real time while you're experiencing the issues with the machine. It's possible that while updates affecting services are performed, such as an Apache restart or MySQL restart, there could be additional slowness on the system. If you need to, you can adjust root's crontab to change the time the updates run on the system, or you could manually run "/scripts/upcp --force" to see if you can reproduce the issues just by running that.
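To get closer to real time than a daily summary, one option is to record a timestamped sample of load and memory at a short interval, so there is data from the moment a problem starts. A minimal sketch, assuming a Linux /proc filesystem with a MemAvailable line in /proc/meminfo; sample_load and the log path in the comments are hypothetical.

```shell
# Print one timestamped sample of load averages and memory use.
# $1 = loadavg file (default /proc/loadavg)
# $2 = meminfo file (default /proc/meminfo)
sample_load() {
    loadavg_file=${1:-/proc/loadavg}
    meminfo_file=${2:-/proc/meminfo}
    # First three fields are the 1, 5 and 15 minute load averages.
    load=$(cut -d' ' -f1-3 "$loadavg_file")
    # Used memory as a whole percentage of total, based on MemAvailable.
    used_pct=$(awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2}
                    END {printf "%d", (t - a) * 100 / t}' "$meminfo_file")
    printf '%s load=%s mem_used_pct=%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$load" "$used_pct"
}

# Example: sample once a minute into a hypothetical log file.
#   while true; do sample_load >> /root/load_samples.log; sleep 60; done
# To see when the update job is scheduled in root's crontab:
#   crontab -l -u root | grep upcp
```

Comparing the samples against the scheduled update time, or against a manual "/scripts/upcp --force" run, would show whether the load climbs while updates are in progress.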