Google Crawler Issue

In late February 2011, Google made a number of changes to its Googlebot web crawler. I do not know the exact dates because I have not been able to find any documentation with useful details. However, it is likely that these problems were caused by the Farmer/Panda update on Feb 23, 2011.

By the way, I did report both issues in the help forums.

As explained below, this may actually be a GoDaddy problem .. however, since the web developers are not at fault, Google should be smart enough to fix errors caused by ISPs.

Verification Failure | Googlebot's bogus urls | More bogus Googlebot urls

Verification Failure

Google offers Webmaster Tools to help manage sites. I use this to monitor search performance and to find certain types of HTML errors.

I don't log on very often, but I did today (03-15-2011) to check a new site. I noticed a dropdown list that shows all the sites I have access to. When I clicked it, I noticed that one of my sites was not in the list. When I tried to access it from the main page, it said that I wasn't verified.

Weird! I have been using that site for years. I checked - the verification file was still there.

The earliest known Verification Failure was 2/14/11 - 28 days ago. It may have been earlier since the log does not go back any farther.

At any rate, the old verification method (creating a file with the required file name) has been deprecated .. you must now use a file with the same name, but with the contents required by Google. I really don't have a problem with that, but the page that reports the failed verification should have a note on it explaining the change. In addition, the change should be included in the FAQ.

Googlebot's bogus urls

Once I could access my site, I went to the dashboard and saw the following

Huh! That should be under 300.

I later noticed a message

with numerous problem links similar to the following. (This is a single URL, wrapped to be readable.) However, this is a composite of several valid URLs. This was not a problem previously. I also got the following. The value should be under 300. Apparently, since Google now thinks that my site is some kind of spammer (or something), my search ranks have tanked and my Alexa ranking has tanked.

This appears to also be associated with my site no longer being "verified". (I have fixed that, but an explanation on the site would have been nice.)

Also, once this is fixed, how do I clear the robots cache so that only valid data is displayed?

I don't know if these are related to the new software, but .. starting Feb 27, Googlebot began finding the bogus urls shown above.

The Crawl Stats show a volume increase starting at the end of January, a double peak in the middle of February, and a very large peak at the end of February. I interpret this to indicate 3 separate software releases. In December and the first part of January, the pages crawled per day are indistinguishable from zero - the table says a low of 13. After the software change, the peak goes to 80,567 .. average 11,763.

The number of bytes transferred jumped from a few thousand per day to a peak of 1,959,620 kilobytes. (Almost 2 gigabytes.) For people who have to pay for extra bandwidth, this could be very expensive.
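The "almost 2 gigabytes" figure can be checked with a little arithmetic. The sketch below assumes the Crawl Stats number is in kilobytes, as stated above, and shows both the binary (1024-based) and decimal (1000-based) conversions, since the report does not say which one it uses. The function name is only illustrative.

```python
# Peak daily download reported by Crawl Stats, in kilobytes (from the text above).
PEAK_KILOBYTES = 1_959_620


def kilobytes_to_gigabytes(kilobytes, binary=True):
    """Convert kilobytes to gigabytes.

    binary=True  uses 1 GB = 1024 * 1024 KB
    binary=False uses 1 GB = 1000 * 1000 KB
    """
    factor = 1024 * 1024 if binary else 1000 * 1000
    return kilobytes / factor


print(round(kilobytes_to_gigabytes(PEAK_KILOBYTES, binary=True), 2))
print(round(kilobytes_to_gigabytes(PEAK_KILOBYTES, binary=False), 2))
```

Either way, the peak is a little under 2 gigabytes in a single day.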

At any rate, it is fairly obvious that Googlebot contains (contained) a very serious bug.

On the other hand, this may be a very old problem (2009) that was fixed and has now come back. I love how they blame the person who develops a web page for the failings of Googlebot. There was another post in Oct 2008 (requires a login to read).

It is also possible that Googlebot is not the problem. If some other site has all these links, then Googlebot may simply be following them.

GoDaddy is part of the problem

On 01-25-11, I noticed that GoDaddy had changed the 404 not found settings to show their custom page. I have no idea when that actually happened. Since I considered their page to be particularly hideous, (it is a stupid graphic and does not even say "404 not found"), I changed it to use my home page. Presumably, that is what caused the problems. For the sake of argument, I have decided to create my own 404 page with no links to anywhere.

At GoDaddy, under Hosting Control Center / Settings / 404 Error Behavior, there are only 3 options

I would have preferred "none of the above" and let the server simply send a 404 error. However, that was not an option.

I have verified GoDaddy's response - when a page is not found, it substitutes the selected 404 page and returns a status code of 200 OK. This is a major violation of the HTTP specification. If a bad url is used, the server MUST return 404 not found. So .. part of this problem is GoDaddy's fault. However, the primary fault is still Google's .. they should have never even tried the tens of thousands of bogus URLs. NEVER!
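For anyone who wants to run the same check on their own host, here is a minimal sketch using only the Python standard library. It requests a page name that almost certainly does not exist and reports the status code the server sends back. The function names and the example.com address are my own placeholders, not anything provided by GoDaddy or Google.

```python
import uuid
import urllib.error
import urllib.request
from urllib.parse import urljoin


def random_missing_url(site_root):
    """Build a URL under site_root that should not exist on any real site."""
    return urljoin(site_root, "no-such-page-" + uuid.uuid4().hex + ".html")


def fetch_status(url):
    """Return the final HTTP status code for url (redirects are followed)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises for 4xx/5xx responses; the code is still what we want.
        return error.code


def describe(status):
    """Classify the status code returned for a page that does not exist."""
    if status in (404, 410):
        return "real 404: the server reports the missing page correctly"
    if 200 <= status < 300:
        return "soft 404: the server answered a missing page with success"
    return "other: status %d needs a closer look" % status


def check_site(site_root):
    """Probe site_root with a bogus page name and describe the result."""
    return describe(fetch_status(random_missing_url(site_root)))
```

Calling `check_site("http://example.com/")` with your own site in place of the placeholder address should report a real 404 on a correctly configured server. Because of the delay mentioned in the note below, a change made in the control panel may not show up right away.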

Note: After changing the response, it takes up to 30 minutes for the change to be seen. This makes debugging and testing extremely slow.
It turns out that if you select a custom file .. and then rename the file .. you will get a real 404 not found error. Score - that is what I really want.

GoDaddy is all of the problem

Get this, from GoDaddy's To Set Up Your Custom 404 Page

You would think that something that important would be placed on the page where you select the type of 404 page you want. Particularly since most home pages are full of relative links. I think I'll take back what I said above - this appears to be 100% GoDaddy's fault.

That said, there should be tens of thousands of sites with this problem .. and Google should be able to fix it.

Soft 404 errors

As usual, once you understand an issue, you can find more information about it.

In Farewell to soft 404s, Google explains the problem. Basically, if your own 404 page is returned with a 200 OK status, the robots will actually think that a real page has been found and then link thousands of bogus URLs to that page.

I could solve this problem if there was an HTML tag that let me set the HTTP status code. Since there isn't, I have to rely on the ISP to do the right thing. Unfortunately, GoDaddy has selected the wrong answer.
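To show what "the right thing" looks like, here is a small sketch of a server that sends a custom error page and still reports 404 in the status line. It uses Python's built-in http.server module purely for illustration. It is not how GoDaddy's IIS hosting is configured, and the page contents, class name, and port are made up.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical site content: the only page that really exists.
KNOWN_PAGES = {"/": b"<html><body>Home page</body></html>"}

# A custom error page with no links to anywhere.
NOT_FOUND_BODY = b"<html><body><h1>404 not found</h1></body></html>"


class FriendlyNotFoundHandler(BaseHTTPRequestHandler):
    """Serve known pages with 200 and the custom error page with a real 404."""

    def do_GET(self):
        body = KNOWN_PAGES.get(self.path)
        if body is None:
            # Custom page for the visitor, correct status code for the robot.
            self._send(404, NOT_FOUND_BODY)
        else:
            self._send(200, body)

    def _send(self, status, body):
        self.send_response(status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep the console quiet for this example.
        pass


def serve(port=8000):
    """Run the example server on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), FriendlyNotFoundHandler).serve_forever()
```

Running `serve()` and requesting any unknown path returns the custom page together with a 404 status, so a crawler would not treat the error page as real content.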

My "solution" was

With complete stupidity, GoDaddy won't allow you to select a "404 message file" unless it first exists. That is why I have to first create a file and later remove (rename) it.

More bogus Googlebot urls

In August 2012, I discovered yet another Googlebot error. In this case, forward slashes have been replaced with URL-encoded backslashes (%5C). The web server (IIS 6) has no problem with these. However, when the robot tries to follow the relative links on those pages, it ignores the fact that %5C must be interpreted as a backslash and maps all the links back to the root directory. As a result, every page on my site produces a bogus 404 error.

There are currently about 4,000 of these bogus 404 errors. I do not know where the base URLs are coming from. I have checked .. and they are not from my pages. It is possible that Google's robot is generating them, or they may be coming from someone's mirror of my site. It really does not matter - in my opinion

By the way, both Firefox and IE also have trouble with these links. They are able to read the pages with the %5C (backslash) codes with no problem since IIS is locating the correct files. However, when any of the relative links are clicked, both browsers are dropping the directory information and producing 404 errors .. just like Googlebot.
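The directory-dropping behavior can be reproduced with the standard relative-link resolution in Python's urllib.parse. The URLs below are hypothetical stand-ins, not the actual addresses from my site. Only a literal forward slash counts as a path separator when a relative link is resolved, so a page reached through a %5C url looks like it sits in the root directory.

```python
from urllib.parse import urljoin

# Hypothetical URLs for illustration only.
GOOD_BASE = "http://example.com/blogs/page.html"
BAD_BASE = "http://example.com/blogs%5Cpage.html"


def resolve(base, relative_link):
    """Resolve a relative link against the URL of the page that contains it."""
    return urljoin(base, relative_link)


good = resolve(GOOD_BASE, "other.html")
bad = resolve(BAD_BASE, "other.html")

print(good)  # http://example.com/blogs/other.html
print(bad)   # http://example.com/other.html
```

The second result has lost the blogs directory, which matches the 404 errors that Googlebot and both browsers produce.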

At any rate, the Google webmaster tools are useless with this many bogus errors .. and my site performance (number of visits) has been very poor since these problems started. (To be clear, it was much better before the first series of 24,000 bogus 404 errors, and has never recovered.)

Author: Robert Clemenzi
URL: http:// / blogs / Google_Crawler_Issue.html