In Progress [CPANEL-32760] Amazon S3 backup transport failure (Amazon responded with 403 Forbidden)

Operating System & Version
Cloudlinux 7.8
cPanel & WHM Version
88.0.12

pwells

Member
Apr 28, 2015
20
2
53
Australia
cPanel Access Level
Root Administrator
I am facing an intermittent issue with backup transport to Amazon S3. The majority of accounts copy across just fine, however one or two accounts fail with the error message:
Code:
 Upload attempt failed: Amazon::S3: Amazon responded with 403 Forbidden
 at /usr/local/cpanel/3rdparty/perl/530/lib/perl5/cpanel_lib/Amazon/S3/Bucket.pm line 136.
After each failure the system retries the backup process for these accounts and fails each time. Every subsequent run of the backup process fails on the same accounts (though these are not the largest accounts). I can't work out specifically why these accounts are failing and others are succeeding.

I was able to trace this back to the MultiPart upload function `_do_multipart_upload` in `/usr/local/cpanel/Cpanel/Transport/Files/AmazonS3.pm`. It appears that one part of the MultiPart upload is failing for some reason, which is crashing the entire upload.

Below is an excerpt from one of the backup transport logs. This shows a successful transfer, followed by a failed transfer. This particular one recovered on the third try, however this is not always the case.
Code:
[2020-07-07 01:46:19 +1000] info [cpbackup_transporter] Upload attempt #1 starting for /backup/weekly/2020-07-07/accounts/agedcareservices.tar.gz to /weekly/2020-07-07/accounts/agedcareservices.tar.gz for destination:  Articuno Backup
[2020-07-07 01:46:31 +1000] info [cpbackup_transporter] Successful transfer of /backup/weekly/2020-07-07/accounts/agedcareservices.tar.gz to /weekly/2020-07-07/accounts/agedcareservices.tar.gz for destination Articuno Backup
[2020-07-07 01:46:31 +1000] info [cpbackup_transporter] The backup has been successfully uploaded at least once, now we will delete the local copy (/backup/weekly/2020-07-07/accounts/agedcareservices.tar.gz) since keeplocal (0) is disabled.
[2020-07-07 01:46:40 +1000] info [cpbackup_transporter] cpbackup_transporter - Processing next task
[2020-07-07 01:46:40 +1000] info [cpbackup_transporter] Instantiating Object
[2020-07-07 01:46:41 +1000] info [cpbackup_transporter] Starting a "copy" operation on the "Articuno Backup" destination ID "FM*****redacted*****BP8".
[2020-07-07 01:46:41 +1000] info [cpbackup_transporter] Validating destination path /weekly/2020-07-07/accounts/
[2020-07-07 01:46:42 +1000] info [cpbackup_transporter] Path exists
[2020-07-07 01:46:42 +1000] info [cpbackup_transporter] Uploading account backup /backup/weekly/2020-07-07/accounts/johnson1.tar.gz to /weekly/2020-07-07/accounts/johnson1.tar.gz (from weekly/2020-07-07/accounts/johnson1.tar.gz)
[2020-07-07 01:46:42 +1000] info [cpbackup_transporter] Attempting to upload /backup/weekly/2020-07-07/accounts/johnson1.tar.gz to /weekly/2020-07-07/accounts/johnson1.tar.gz for destination:  Articuno Backup
[2020-07-07 01:46:42 +1000] info [cpbackup_transporter] Upload attempt #1 starting for /backup/weekly/2020-07-07/accounts/johnson1.tar.gz to /weekly/2020-07-07/accounts/johnson1.tar.gz for destination:  Articuno Backup
[2020-07-07 02:18:49 +1000] warn [cpbackup_transporter] Upload attempt failed: Amazon::S3: Amazon responded with 403 Forbidden
 at /usr/local/cpanel/3rdparty/perl/530/lib/perl5/cpanel_lib/Amazon/S3/Bucket.pm line 136.
 at /usr/local/cpanel/Cpanel/LoggerAdapter.pm line 27.
    Cpanel::LoggerAdapter::warn(Cpanel::LoggerAdapter=HASH(0x17183e0), "Upload attempt failed: Amazon::S3: Amazon responded with 403 "...) called at /usr/local/cpanel/Cpanel/Backup/Queue.pm line 658
    Cpanel::Backup::Queue::transport_backup::attempt_to_upload_file(Cpanel::Backup::Queue::transport_backup=HASH(0x151a120), Cpanel::Transport::Files::AmazonS3=HASH(0x37bab98), "/backup/weekly/2020-07-07/accounts/johnson1.tar.gz", "/weekly/2020-07-07/accounts/johnson1.tar.gz", Cpanel::LoggerAdapter=HASH(0x17183e0)) called at /usr/local/cpanel/Cpanel/Backup/Queue.pm line 333
    Cpanel::Backup::Queue::transport_backup::process_task(Cpanel::Backup::Queue::transport_backup=HASH(0x151a120), Cpanel::TaskQueue::Task=HASH(0x3fc3078), Cpanel::LoggerAdapter=HASH(0x17183e0)) called at /usr/local/cpanel/Cpanel/TaskQueue.pm line 635
    eval {...} called at /usr/local/cpanel/Cpanel/TaskQueue.pm line 638
    Cpanel::TaskQueue::__ANON__() called at /usr/local/cpanel/Cpanel/StateFile.pm line 241
    eval {...} called at /usr/local/cpanel/Cpanel/StateFile.pm line 241
    Cpanel::StateFile::Guard::call_unlocked(Cpanel::StateFile::Guard=HASH(0x3fdfb40), CODE(0x303db28)) called at /usr/local/cpanel/Cpanel/TaskQueue.pm line 643
    Cpanel::TaskQueue::process_next_task(Cpanel::TaskQueue=HASH(0x25bf8f8)) called at /usr/local/cpanel/bin/cpbackup_transporter line 168
    eval {...} called at /usr/local/cpanel/bin/cpbackup_transporter line 166
[2020-07-07 02:18:51 +1000] info [cpbackup_transporter] Upload attempt #2 starting for /backup/weekly/2020-07-07/accounts/johnson1.tar.gz to /weekly/2020-07-07/accounts/johnson1.tar.gz for destination:  Articuno Backup
[2020-07-07 02:49:54 +1000] warn [cpbackup_transporter] Upload attempt failed: Amazon::S3: Amazon responded with 403 Forbidden
 at /usr/local/cpanel/3rdparty/perl/530/lib/perl5/cpanel_lib/Amazon/S3/Bucket.pm line 136.
 at /usr/local/cpanel/Cpanel/LoggerAdapter.pm line 27.
    Cpanel::LoggerAdapter::warn(Cpanel::LoggerAdapter=HASH(0x17183e0), "Upload attempt failed: Amazon::S3: Amazon responded with 403 "...) called at /usr/local/cpanel/Cpanel/Backup/Queue.pm line 658
    Cpanel::Backup::Queue::transport_backup::attempt_to_upload_file(Cpanel::Backup::Queue::transport_backup=HASH(0x151a120), Cpanel::Transport::Files::AmazonS3=HASH(0x37bab98), "/backup/weekly/2020-07-07/accounts/johnson1.tar.gz", "/weekly/2020-07-07/accounts/johnson1.tar.gz", Cpanel::LoggerAdapter=HASH(0x17183e0)) called at /usr/local/cpanel/Cpanel/Backup/Queue.pm line 333
    Cpanel::Backup::Queue::transport_backup::process_task(Cpanel::Backup::Queue::transport_backup=HASH(0x151a120), Cpanel::TaskQueue::Task=HASH(0x3fc3078), Cpanel::LoggerAdapter=HASH(0x17183e0)) called at /usr/local/cpanel/Cpanel/TaskQueue.pm line 635
    eval {...} called at /usr/local/cpanel/Cpanel/TaskQueue.pm line 638
    Cpanel::TaskQueue::__ANON__() called at /usr/local/cpanel/Cpanel/StateFile.pm line 241
    eval {...} called at /usr/local/cpanel/Cpanel/StateFile.pm line 241
    Cpanel::StateFile::Guard::call_unlocked(Cpanel::StateFile::Guard=HASH(0x3fdfb40), CODE(0x303db28)) called at /usr/local/cpanel/Cpanel/TaskQueue.pm line 643
    Cpanel::TaskQueue::process_next_task(Cpanel::TaskQueue=HASH(0x25bf8f8)) called at /usr/local/cpanel/bin/cpbackup_transporter line 168
    eval {...} called at /usr/local/cpanel/bin/cpbackup_transporter line 166
[2020-07-07 02:49:55 +1000] info [cpbackup_transporter] Upload attempt #3 starting for /backup/weekly/2020-07-07/accounts/johnson1.tar.gz to /weekly/2020-07-07/accounts/johnson1.tar.gz for destination:  Articuno Backup
[2020-07-07 04:50:09 +1000] info [cpbackup_transporter] Successful transfer of /backup/weekly/2020-07-07/accounts/johnson1.tar.gz to /weekly/2020-07-07/accounts/johnson1.tar.gz for destination Articuno Backup
[2020-07-07 04:50:09 +1000] info [cpbackup_transporter] The backup has been successfully uploaded at least once, now we will delete the local copy (/backup/weekly/2020-07-07/accounts/johnson1.tar.gz) since keeplocal (0) is disabled.
[2020-07-07 04:50:09 +1000] info [cpbackup_transporter] cpbackup_transporter - Processing next task
Note the timestamps. The backup appears to fail part-way through the transfer process, with the first attempt failing at 32 minutes, the second at 31 minutes and the final successful transfer taking 1:59.

I dug through the forums and feature requests and found this request: https://features.cpanel.net/topic/a...led-uploads-in-chunks-rather-than-entire-file which would help significantly here, though doesn't describe this exact issue.

I also logged by own feature request after digging though the code and discovering that transports are single-threaded, here: https://features.cpanel.net/topic/2...pport-concurrent-uploads-for-faster-transport Again, this won't resolve the problem, but would make it much faster.


Has anybody experienced a similar issue? Why are we seeing a 403 forbidden error on some of these requests?
 

f3cp

Registered
Jun 24, 2015
2
0
1
Australia
cPanel Access Level
Root Administrator
I have also been encountering similar intermittent transport failures recently.
This only started occurring since a few weeks ago however I have been getting failures almost daily, however the affected accounts are different each day.
Prior to this backups had been working without issue, and I haven't made any configuration changes recently.

The log usually contains "Upload attempt failed: Amazon::S3: Amazon responded with 403" the same as in the previous comment, but I have also noticed this failure message on certain attempts:

“Cpanel::Transport::Files::put” attempted to validate a file that does not exist: /backup/2020-07-16/accounts/AFFECTEDACCOUNT.tar.gz at /usr/local/cpanel/Cpanel/LoggerAdapter.pm line 27.

An account could fail on due to either or both reasons for subsequent attempts, and I haven't so far been able to identify a pattern as to why this is occurring.
Looking at the backup logs, the backups all appear to be generating with no failures, so i'm not really sure why they would return this error.
 

cPanelLauren

Product Owner
Staff member
Nov 14, 2017
13,296
1,271
313
Houston
It sounds as though a timeout is being reached though this isn't clear as to which or why.
Can you please open a ticket using the link in my signature? Once open please reply with the Ticket ID here so that we can update this thread with the resolution once the ticket is resolved.


Thanks!
 

pwells

Member
Apr 28, 2015
20
2
53
Australia
cPanel Access Level
Root Administrator
I have done some more digging into the Transporter Logs this morning and can confirm that the affected accounts are indeed random. Most accounts, however, recover after the first or second failure (it tries each account three times).

I have also noticed that this only affects two of our WHM Virtual Servers. Our other three Virtual Servers do have any 403 failures in the logs, even though they are in the same datacentre, same subnet and use the same AWS access keys to the same S3 account. One of these is even on the same physical server (Host) as one of the affected Virtual Servers.

I have opened a support ticket. ID: 93510870
 

cPanelLauren

Product Owner
Staff member
Nov 14, 2017
13,296
1,271
313
Houston
Hello,


I checked in on that ticket and found that there has been some work done it and it is currently awaiting your response. I'm following it now and will do my best to add any updates as they're available.
 

pwells

Member
Apr 28, 2015
20
2
53
Australia
cPanel Access Level
Root Administrator
I have an update.

I was able to determine (by modifying the output of the '_croak_if_response_error' in /usr/local/cpanel/3rdparty/perl/530/lib/perl5/cpanel_lib/Amazon/S3.pm to log the server response message rather than just the status-code) that the specific error message was: "RequestTimeTooSkewed": "The difference between the request time and the current time is too large.".

The server time is correct (within 4 seconds of Amazon's servers).

The log output is below:
Code:
[2020-07-21 11:48:17 +1000] info [cpbackup_transporter] Upload attempt #1 starting for /cluster/2020-07-21/accounts/runf$
[2020-07-21 13:48:23 +1000] warn [cpbackup_transporter] Upload attempt failed: Amazon::S3: Amazon responded with 403 Forbidden - <?xml version="1.0" encoding="UTF-8"?>
<Error><Code>RequestTimeTooSkewed</Code><Message>The difference between the request time and the current time is too large.</Message><RequestTime>20200721T033148Z</RequestTime><ServerTime>2020-07-21T03:48:24Z</ServerTime><MaxAllowedSkewMilliseconds>900000</MaxAllowedSkewMilliseconds><R$
at /usr/local/cpanel/3rdparty/perl/530/lib/perl5/cpanel_lib/Amazon/S3/Bucket.pm line 136.
at /usr/local/cpanel/Cpanel/LoggerAdapter.pm line 27.
Cpanel::LoggerAdapter::warn(Cpanel::LoggerAdapter=HASH(0x23ff440), "Upload attempt failed: Amazon::S3: Amazon responded with 403 "...) called at /usr/local/cpanel/Cpanel/Backup/Queue.pm line 658
The request was apparently made at 03:31:48 and Amazon's server received this at: 03:48:24. A 16 minute, 36 second difference. The log entry time confirms that our server time is only off by a few seconds (+10hrs due to timezone).

It seems that the upload request is stalling part-way through for some reason.
 
  • Like
Reactions: cPanelLauren

cPanelLauren

Product Owner
Staff member
Nov 14, 2017
13,296
1,271
313
Houston
I just double checked the ticket to make sure it was added there as well. Thanks for this and I am watching closely to see if it spawns a case. I am curious to understand the cause of this delay, it doesn't seem to make a lot of sense. Nice move though modifying the PM you may want to backup the modification in case an update overwrites it as well.
 

pwells

Member
Apr 28, 2015
20
2
53
Australia
cPanel Access Level
Root Administrator
So the cPanel Support were not able to determine why these requests are being delayed in-transit. They have given up on diagnosing this on the basis that this is likely a network issue rather than software issue - fair, but we are stuck without a functioning backup system.

This morning the issue has affected another of our servers (now affecting 3 of our 5 servers).

Is there any way to have this fix pushed through as a priority? https://features.cpanel.net/topic/a...led-uploads-in-chunks-rather-than-entire-file While this would not directly fix the network issue of packets being delayed in transit, it would ensure that the backup process does not fail because of this.
 

cPanelLauren

Product Owner
Staff member
Nov 14, 2017
13,296
1,271
313
Houston
That doesn't seem to be the sentiment from the analyst's notes in the ticket. They left notes indicating that you were going to perform some networking tests rather than have them perform more from our end. Maybe there was some misunderstanding in communication?
 

pwells

Member
Apr 28, 2015
20
2
53
Australia
cPanel Access Level
Root Administrator
Sorry, I should have mentioned that I did indeed complete the testing as recommended by the support technicians. I was unable to reply to the ticket as it was closed. I did not find anything noteworthy in these tests.

I don't believe that there is a major issue with the networking setup on our server. Any network stability issues appear to be external to our servers on the Internet somewhere between our server and Amazon's servers. I believe the major issue here is therefore that the backup transporter does not gracefully handle random request failures - instead restarting the entire transfer process.

It is essentially impossible to guarantee delivery of all requests over the Internet. In our case these are sending from our server in a Sydney Datacentre, to Amazon S3 in the US - a quick traceroute shows that this is 29 hops - so there is plenty of opportunity for something to go wrong.
We are seeing roughly 1 in 1000 PUT requests failing during the backup transport process (with each request having a payload of ~10MB). While this number is a bit high, it is not unheard of, and requests this large are basically a worst-case scenario.

The backup transporter should have been designed to handle these type of errors. Designing this to function only in a best-case scenario where there are no lost requests was reckless. This is why I am pushing for this fix: https://features.cpanel.net/topic/a...led-uploads-in-chunks-rather-than-entire-file what can we do to make this happen?
 

plague

Well-Known Member
Sep 22, 2006
76
17
158
Brasil
cPanel Access Level
Root Administrator
Twitter
One of our servers are experiencing the same behavior. The server have 4 accounts, the smallest makes a 9GB backup file.
The server is hosted in Brazil and in this case the error is not intermittent. Every upload attempt is failling after 20 to 40 minutes since July


Code:
[2020-08-12 02:24:21 -0300] warn [cpbackup_transporter] Upload attempt failed: Amazon::S3: Amazon responded with 403 Forbidden
 at /usr/local/cpanel/3rdparty/perl/530/lib/perl5/cpanel_lib/Amazon/S3/Bucket.pm line 136.
 at /usr/local/cpanel/Cpanel/LoggerAdapter.pm line 27.
        Cpanel::LoggerAdapter::warn(Cpanel::LoggerAdapter=HASH(0x11ef890), "Upload attempt failed: Amazon::S3: Amazon responded with 403 "...) called at /usr/local/cpanel/Cpanel/Backup/Queue.pm line 658
        Cpanel::Backup::Queue::transport_backup::attempt_to_upload_file(Cpanel::Backup::Queue::transport_backup=HASH(0xff1200), Cpanel::Transport::Files::AmazonS3=HASH(0x20f7dd8), "/backup/2020-08-12/accounts/vendasgoverno.tar.gz", "/2020-08-12/accounts/vendasgoverno.ta$
        Cpanel::Backup::Queue::transport_backup::process_task(Cpanel::Backup::Queue::transport_backup=HASH(0xff1200), Cpanel::TaskQueue::Task=HASH(0x208ebe0), Cpanel::LoggerAdapter=HASH(0x11ef890)) called at /usr/local/cpanel/Cpanel/TaskQueue.pm line 635
        eval {...} called at /usr/local/cpanel/Cpanel/TaskQueue.pm line 638
        Cpanel::TaskQueue::__ANON__() called at /usr/local/cpanel/Cpanel/StateFile.pm line 241
        eval {...} called at /usr/local/cpanel/Cpanel/StateFile.pm line 241
        Cpanel::StateFile::Guard::call_unlocked(Cpanel::StateFile::Guard=HASH(0x208e478), CODE(0x1c4e220)) called at /usr/local/cpanel/Cpanel/TaskQueue.pm line 643
        Cpanel::TaskQueue::process_next_task(Cpanel::TaskQueue=HASH(0x208e538)) called at /usr/local/cpanel/bin/cpbackup_transporter line 168
        eval {...} called at /usr/local/cpanel/bin/cpbackup_transporter line 166
 

cPanelLauren

Product Owner
Staff member
Nov 14, 2017
13,296
1,271
313
Houston
Hi guys,

based on the discussion in the ticket and some notes added to it that you're unable to see, there is a suggestion that this may be related to an internal case CPANEL-32760 - S3 Compatible have a default region of us-east-1, this can fail for some custom regions.

I believe this suggestion is correct and I believe it to be related to the issues you're both having most likely. I sent the tech support supervisor a message indicating this and my assumptions and I'm awaiting her response. Until then, I let her know that I'm going to go ahead and mark this thread as related to the internal case. I'll let you know what comes of this when I have more information.

In the internal case there is a suggested workaround for this which I don't particularly like but it's as follows:

Known work around(s)
Modify the code and disable updates. The /usr/local/cpanel/3rdparty/perl/530/lib/perl5/cpanel_lib/Amazon/S3.pm file is part of the cpanel-perl-530-Amazon-S3-0.45-1.cp1186.noarch RPM which cannot be excluded without excluding all of cpanel-perl-530.