Google Crawler Issue
03-15-2011

In late February 2011, Google made a number of changes to its Googlebot web crawler. I do not know the exact dates because I have not been able to find any documentation with useful details. However, it is likely that these problems were caused by the Farmer/Panda update on Feb 23, 2011.

By the way, I did report both issues in the help forums.

As explained below, this may actually be a GoDaddy problem .. however, since the web developers are not at fault, Google should be smart enough to fix errors caused by ISPs.

Verification Failure | Googlebot's bogus urls


Verification Failure

Google offers Webmaster Tools to help manage sites. I use this to monitor search performance and to find certain types of html errors.

I don't log on very often, but I did today (03-15-2011) to check a new site. I noticed a dropdown list that shows all the sites I have access to. When I clicked it, I noticed that mc-computing.com was not in the list. When I tried to access it from the main page, it said that I wasn't verified.

Weird! I have been using that site for years. I checked - the verification file was still there.

The earliest known Verification Failure was 2/14/11 - 28 days ago. It may have been earlier since the log does not go back any farther.

At any rate, the old verification method (creating a file with the required file name) has been deprecated .. you must now use a file with the same name, but with the contents required by Google. I really don't have a problem with that, but the page that says that the site is failing verification should have a note on it explaining the change. In addition, the change should be included in the FAQ.


Googlebot's bogus urls

Once I could access my site, I went to the dashboard and saw the following

Huh! That should be under 300.

I later noticed a message

with numerous problem links similar to the following.

(This is a single url, wrapped to be readable.) However, this is a composite of several valid urls. This was not a problem previously.

I also got the following

The value should be under 300.

Apparently, since Google now thinks that my site is some kind of spammer (or something), my search ranks have tanked and my Alexa ranking has tanked.

This appears to also be associated with my site no longer being "verified". (I have fixed that, but an explanation on the site would have been nice.)

Also, once this is fixed, how do I clear the robots cache so that only valid data is displayed?

I don't know if these are related to the new software, but .. starting Feb 27, Googlebot began finding the bogus urls shown above.

The Crawl Stats show a volume increase starting at the end of January, a double peak in the middle of February, and a very large peak at the end of February. I interpret this to indicate 3 separate software releases. In December and the first part of January, the pages crawled per day are indistinguishable from zero - the table says a low of 13. After the software change, the peak goes to 80,567 .. average 11,763.

The number of bytes transferred jumped from a few thousand per day to a peak of 1,959,620 kilobytes. (Almost 2 gigabytes.) For people who have to pay for extra bandwidth, this could be very expensive.

At any rate, it is fairly obvious that Googlebot contains (contained) a very serious bug.

On the other hand, this may be a very old problem (2009) that was fixed and has now come back. I love how they blame the person who develops a web page for the failings of Googlebot. That was another post in Oct 2008 (requires a login to read).

It is also possible that Googlebot is not the problem. If some other site has all these links, then Googlebot may simply be following them.


GoDaddy is part of the problem

On 01-25-11, I noticed that GoDaddy had changed the 404 not found settings to show their custom page. I have no idea when that actually happened. Since I considered their page to be particularly hideous (it is a stupid graphic and does not even say "404 not found"), I changed the setting to use my home page. Presumably, that is what caused the problems. For the sake of argument, I have decided to create my own 404 page with no links to anywhere.

At GoDaddy, under Hosting Control Center / Settings / 404 Error Behavior, there are only 3 options

I would have preferred "none of the above" and let the server simply send a 404 error. However, that was not an option.

I have verified GoDaddy's response - when a page is not found, it substitutes the selected 404 page and returns a status code of 200 OK. This is a major violation. If a bad url is used, the server MUST return 404 not found. So .. part of this problem is GoDaddy's fault. However, the primary fault is still Google's .. they should have never even tried the tens of thousands of bogus URLs. NEVER!

Note: After changing the response, it takes up to 30 minutes for the change to be seen. This makes debugging and testing extremely slow.
It turns out that if you select a custom file .. and then rename the file .. you will get a real 404 not found error. Score - that is what I really want.
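
The check described above (ask for a page that cannot exist, then look at the status code that comes back) can be sketched in a few lines of Python. This is only a rough sketch, not the method I actually used: the function names classify_status and probe_missing_page are made up for this example, and the random file name is just a convenient way to get a url that should not exist.

```python
import urllib.error
import urllib.request
import uuid


def classify_status(status: int) -> str:
    """Label the status code returned for a url that should not exist."""
    if status in (404, 410):
        return "hard 404 (correct)"
    if 200 <= status < 300:
        return "soft 404 (error page served as success)"
    return f"other ({status})"


def probe_missing_page(site_root: str) -> str:
    """Request a random, almost certainly nonexistent page and classify the reply.

    site_root is a hypothetical parameter such as "http://example.com".
    """
    bogus_url = site_root.rstrip("/") + "/" + uuid.uuid4().hex + ".html"
    try:
        with urllib.request.urlopen(bogus_url, timeout=10) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx replies; the code is what we want.
        status = err.code
    return classify_status(status)
```

Calling probe_missing_page with the root of a site should report a hard 404 on a correctly configured server. With the home page selected as the 404 page, I would expect it to report a soft 404.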


GoDaddy is all of the problem

Get this, from GoDaddy's To Set Up Your Custom 404 Page

You would think that something that important would be placed on the page where you select the type of 404 page you want. Particularly since most home pages are full of relative links. I think I'll take back what I said above - this appears to be 100% GoDaddy's fault.

That said, there should be tens of thousands of sites with this problem .. and Google should be able to fix it.
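
To see why relative links matter, here is a small Python sketch of what I believe is happening. It is an assumption about the mechanism, not something Google or GoDaddy has confirmed. The urls and link names are invented for the example. When the home page is returned (with 200 OK) for a bad url in a subdirectory, every relative link on it resolves against that subdirectory, which produces new bad urls, and each of those returns the home page again.

```python
from urllib.parse import urljoin


def resolve_links(page_url, relative_links):
    """Resolve relative links the way a crawler would, against the page's own url."""
    return [urljoin(page_url, link) for link in relative_links]


def simulate_crawl(start_url, relative_links, rounds):
    """Follow links for a fixed number of rounds, assuming every url
    returns the same page (the soft 404 case). Returns all urls seen."""
    seen = {start_url}
    frontier = [start_url]
    for _ in range(rounds):
        next_frontier = []
        for page in frontier:
            for url in resolve_links(page, relative_links):
                if url not in seen:
                    seen.add(url)
                    next_frontier.append(url)
        frontier = next_frontier
    return seen


# Hypothetical relative links found on a home page.
HOME_LINKS = ["blogs/index.html", "notes/page.html"]

# One bad url in a subdirectory is enough to start the chain.
BAD_URL = "http://example.com/blogs/missing.html"

first_round = resolve_links(BAD_URL, HOME_LINKS)
# first_round[0] is "http://example.com/blogs/blogs/index.html"
```

With only two links on the page, the number of bogus urls roughly doubles with every round, so a home page with dozens of relative links could plausibly account for tens of thousands of crawled urls.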


Soft 404 errors

As usual, once you understand an issue, you can find more information about it.

In Farewell to soft 404s, Google explains the problem.

I could solve this problem if there was an html tag that let me set the header code. Since there isn't, I have to rely on the ISP to do the right thing. Unfortunately, GoDaddy has selected the wrong answer.
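
For comparison, this is roughly what "the right thing" looks like on a server I control. It is a minimal sketch using Python's built-in http.server module, and it has nothing to do with how GoDaddy's hosting is actually implemented. The page table, the error page text, and the port are all invented for the example. The point is that the status code is set by the server in the response header, while the custom error page is just the body.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical site content: path -> html body.
PAGES = {"/": "<html><body>Home page</body></html>"}
NOT_FOUND_BODY = "<html><body>404 not found</body></html>"


def build_response(path):
    """Return (status_code, html_body) for a requested path."""
    if path in PAGES:
        return 200, PAGES[path]
    # The custom error page is still sent, but with status 404, not 200.
    return 404, NOT_FOUND_BODY


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = build_response(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)  # this is the header code html cannot set
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def run(port=8000):
    """Serve on localhost until interrupted (port is an arbitrary choice)."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

A visitor sees the same custom page either way. The only difference is the status line, and that is the part a crawler pays attention to.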


Author: Robert Clemenzi
URL: http://mc-computing.com/blogs/Google_Crawler_Issue.html