sneader

Well-Known Member
Aug 21, 2003
1,195
68
178
La Crosse, WI
cPanel Access Level
Root Administrator
Googlebot is putting a heavy load on one of my servers by hitting some strange URLs for one particular customer (poorly written shopping cart). A new cart is being investigated. In the meantime, I thought we could simply try to catch these bad URLs and redirect them to the home page or something.

However, the "gotcha" is that these are HTTPS URLs, and as far as I can tell, %{REQUEST_URI} does not match on HTTPS requests.

For example, here's a bad URL it's trying to hit:

https://www.example.com/cart/https://www.example.com/cart/checkout/selectAddressshop/Blow-Out-Deal!-Extra-Loud-Alarm-Clock-with-Green-LED-3-for-19-99-Shipped.207Acer-KG-UXH1P-Dual-Band-VHF-Plus-200-MHZ-Handheld-220-Special!-129-95-Shipped-With-Programming-Cable-and-Software!.137shop/Accessories.23YT34010X3-SMA-FEMALE-to-UHF-female-Fits-Sony-and-more.221acer.info.htmlorder?returnPath=

If this weren't HTTPS, I'd do something like:

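RewriteEngine On
# Match the bare or www host; dots are escaped so they are literal
RewriteCond %{HTTP_HOST} ^(www\.)?example\.com$ [NC]
RewriteCond %{REQUEST_URI} ^/cart/https
# Send the bogus request to the home page; the trailing ? drops the query string
RewriteRule ^ http://www.example.com/? [R=301,L]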

The syntax may not be right, but what I'm trying to say is: if anyone tries to go to a URL that starts with /cart/https, that URL is bogus, so redirect them.

But %{REQUEST_URI} doesn't seem to work with HTTPS.
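
One thing that might be worth trying (an untested sketch, not a confirmed fix) is to skip %{REQUEST_URI} and match the path in the RewriteRule pattern itself, using %{HTTPS} to limit the rule to SSL requests. This assumes the rules sit in an .htaccess file in the document root, that the SSL virtual host serves that same document root with overrides allowed, and that the home page is an acceptable redirect target:

```apache
RewriteEngine On
# Only act on requests that arrived over SSL
RewriteCond %{HTTPS} =on
# Per-directory (.htaccess) patterns see the path without its leading slash
# The trailing ? on the target drops the original query string
RewriteRule ^cart/https https://www.example.com/? [R=301,L]
```

In .htaccess context the pattern is compared against the path with the leading slash removed, which is why it reads ^cart/https rather than ^/cart/https.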

Any ideas, either to solve this, or where to go for a "consultant" to help figure out a workaround?

- Scott
 

sneader

Well, the problem with using robots.txt is that you'd have to enter something like:

Disallow: /cart/https://www.example.com/

I'll try it, but it just doesn't look like something it will understand, does it?

- Scott
 

sneader

In Webmaster Tools, you can test your robots.txt file. I have "Disallow: /cart/https://www.example.com/" in robots.txt. When I feed the tester this URL:

It says "Not in Domain".

When I feed it the same URL but change the beginning from https to http, it says "blocked by robots.txt".

So... I'm sunk. It appears there is NO WAY to control what Google spiders if it decides to use HTTPS to hit your site.

That just seems wrong. How do I contact this Matt Cutts guy? :)

- Scott