Google Crawler Issue

03-15-2011

In late February 2011, Google made a number of changes to its Googlebot web crawler. I do not know the exact dates because I have not been able to find any documentation with useful details. However, it is likely that these problems were caused by the Farmer/Panda update on Feb 23, 2011.

By the way, I did report both issues in the help forums.

As explained below, this may actually be a GoDaddy problem .. however, since the web developers are not at fault, Google should be smart enough to fix errors caused by ISPs.

Verification Failure | Googlebot's bogus urls | More bogus Googlebot urls

Verification Failure

Google offers Webmaster Tools to help manage sites. I use this to monitor search performance and to find certain types of HTML errors.

I don't log on very often, but I did today (03-15-2011) to check a new site. I noticed a dropdown list that shows all the sites I have access to. When I clicked it, I noticed that mc-computing.com was not in the list. When I tried to access it from the main page, it said that I wasn't verified.

Weird! I have been using that site for years. I checked - the verification file was still there.

The earliest known Verification Failure was 2/14/11 - 28 days ago. It may have been earlier since the log does not go back any farther.

At any rate, the old verification method (creating a file with the required file name) has been deprecated .. you must now use a file with the same name, but with the contents required by Google. I really don't have a problem with that, but the page that reports the verification failure should include a note explaining the change. In addition, the change should be included in the FAQ.

Googlebot's bogus urls

Once I could access my site, I went to the dashboard and saw the following

Restricted by robots.txt   23,918

Huh! That should be under 300.

I later noticed a message

Googlebot found an extremely high number of URLs on your site: http://mc-computing.com/

with numerous problem links similar to the following. (This is a single url, wrapped to be readable.)

http://mc-computing.com/Databases/MySQL/Parasites/Science_Facts/Lapse_Rate
        /blogs/ISPs/WordPress/qs/Global_Warming/NewsPapers/Lohachara

However, this is a composite of several valid urls.

http://mc-computing.com/Databases/MySQL
http://mc-computing.com/Science_Facts/Lapse_Rate
http://mc-computing.com/blogs
http://mc-computing.com/ISPs/WordPress
http://mc-computing.com/qs/Global_Warming/NewsPapers/Lohachara.html

This was not a problem previously. I also got the following

Restricted by robots.txt (24,468)

The value should be under 300. Apparently, since Google now thinks that my site is some kind of spammer (or something), my search ranks have tanked and my Alexa ranking has tanked.

This appears to also be associated with my site no longer being "verified". (I have fixed that, but an explanation on the site would have been nice.)

Also, once this is fixed, how do I clear the robots cache so that only valid data is displayed?

I don't know if these are related to the new software, but .. starting Feb 27, Googlebot began finding the bogus urls shown above.

The Crawl Stats show a volume increase starting at the end of January, a double peak in the middle of February, and a very large peak at the end of February. I interpret this to indicate 3 separate software releases. In December and the first part of January, the pages crawled per day are indistinguishable from zero - the table says a low of 13. After the software change, the peak goes to 80,567 .. average 11,763.

The number of bytes transferred jumped from a few thousand per day to a peak of 1,959,620 kilobytes. (Almost 2 gigabytes.) For people who have to pay for extra bandwidth, this could be very expensive.
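As a quick check of the "almost 2 gigabytes" figure, the conversion can be sketched in Python. Whether the Crawl Stats report counts a gigabyte as 1,000,000 or 1,048,576 kilobytes is not stated, so both conventions are shown. The function name is only illustrative.

```python
def kilobytes_to_gigabytes(kilobytes, binary=True):
    """Convert kilobytes to gigabytes.

    binary=True assumes 1 GB = 1024 * 1024 KB (an assumption);
    binary=False assumes 1 GB = 1000 * 1000 KB.
    """
    divisor = 1024 * 1024 if binary else 1000 * 1000
    return kilobytes / divisor


peak_kb = 1_959_620  # peak daily transfer quoted from Crawl Stats

print(round(kilobytes_to_gigabytes(peak_kb, binary=True), 2))   # 1.87
print(round(kilobytes_to_gigabytes(peak_kb, binary=False), 2))  # 1.96
```

Either way, the peak is a little under 2 gigabytes in a single day.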

At any rate, it is fairly obvious that Googlebot contains (contained) a very serious bug.

On the other hand, this may be a very old problem (2009) that was fixed and has now come back. I love how they blame the person who develops a web page for the failings of Googlebot. There was another post in Oct 2008 (requires a login to read).

It is also possible that Googlebot is not the problem. If some other site has all these links, then Googlebot may simply be following them.

GoDaddy is part of the problem

On 01-25-11, I noticed that GoDaddy had changed the 404 not found settings to show their custom page. I have no idea when that actually happened. Since I considered their page to be particularly hideous (it is a stupid graphic and does not even say "404 not found"), I changed it to use my home page. Presumably, that is what caused the problems. For the sake of argument, I have decided to create my own 404 page with no links to anywhere.

At GoDaddy, under Hosting Control Center / Settings / 404 Error Behavior, there are only 3 options

Use home page
Use custom page
Use GoDaddy.com, Inc.'s Default 404 Error Page

I would have preferred "none of the above" and let the server simply send a 404 error. However, that was not an option.

I have verified GoDaddy's response - when a page is not found, it substitutes the selected 404 page and returns a status code of 200 OK. This is a major violation. If a bad url is used, the server MUST return 404 not found. So .. part of this problem is GoDaddy's fault. However, the primary fault is still Google's .. they should have never even tried the tens of thousands of bogus URLs. NEVER!
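The check described above can be sketched in Python: request a page that almost certainly does not exist and look at the status code. This is a minimal sketch, not the procedure actually used here. The names fetch_status and is_soft_404 and the random probe file name are assumptions made for illustration.

```python
import urllib.error
import urllib.request
import uuid


def fetch_status(url):
    """Return the HTTP status code the server sends for url."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; the code is still the answer.
        return err.code


def is_soft_404(site_root, fetch=fetch_status):
    """True if a page that should not exist is answered with 200 OK."""
    # Random name, so the probe is very unlikely to match a real page.
    probe = site_root.rstrip("/") + "/no-such-page-" + uuid.uuid4().hex + ".html"
    return fetch(probe) == 200
```

Calling is_soft_404("http://mc-computing.com") would return True while the host substitutes a page with 200 OK, and False once a real 404 is sent. The fetch parameter exists only so the logic can be exercised without a network connection.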

Note: After changing the response, it takes up to 30 minutes for the change to be seen. This makes debugging and testing extremely slow.
It turns out that if you select a custom file .. and then rename the file .. you will get a real 404 not found error. Score - that is what I really want.

GoDaddy is all of the problem

Get this, from GoDaddy's To Set Up Your Custom 404 Page

NOTE: Links in the 404 page must be absolute, e.g. http://www.yoursite.com/page.html or else they will break when clicked from the 404 page.

You would think that something that important would be placed on the page where you select the type of 404 page you want. Particularly since most home pages are full of relative links. I think I'll take back what I said above - this appears to be 100% GoDaddy's fault.
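To illustrate why relative links matter here, the following Python sketch uses the standard urljoin function to resolve links the way a crawler would. It assumes the home page is returned (with 200 OK) for a missing page, and the two relative links are hypothetical examples, not copied from the actual home page.

```python
from urllib.parse import urljoin

# Hypothetical relative links, as they might appear on a home page.
HOME_PAGE_LINKS = ["Science_Facts/Lapse_Rate/index.html", "blogs/index.html"]


def follow_relative_links(start_url, links):
    """Resolve each link against the URL reached by the previous one."""
    url = start_url
    for link in links:
        url = urljoin(url, link)
    return url


# A missing page for which the server returns the home page.
missing = "http://mc-computing.com/Databases/MySQL/no-such-page.html"

print(follow_relative_links(missing, HOME_PAGE_LINKS))
# http://mc-computing.com/Databases/MySQL/Science_Facts/Lapse_Rate/blogs/index.html

# An absolute link resolves to the same place no matter where it is found.
print(urljoin(missing, "http://mc-computing.com/blogs/index.html"))
# http://mc-computing.com/blogs/index.html
```

Each resolved URL is itself a missing page that returns the home page again, so the directory names keep piling up, which would produce composite URLs like the one Googlebot reported.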

That said, there should be tens of thousands of sites with this problem .. and Google should be able to fix it.

Soft 404 errors

As usual, once you understand an issue, you can find more information about it.

In "Farewell to soft 404s", Google explains the problem. Basically, if you provide your own 404 page, the robots will actually think that a real page has been found and then link thousands of bogus URLs to that page.

I could solve this problem if there were an HTML tag that let me set the HTTP status code. Since there isn't, I have to rely on the ISP to do the right thing. Unfortunately, GoDaddy has selected the wrong answer.
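For comparison, here is a minimal sketch of what "the right thing" looks like when the server side is under your control: the custom page is sent as the body, but the status line still says 404. It is written as a Python WSGI application purely for illustration. It is not how GoDaddy's hosting is configured, and the page contents and names (PAGES, NOT_FOUND_PAGE, app) are assumptions.

```python
# Hypothetical site content, keyed by request path.
PAGES = {"/": b"<html><body>Home page</body></html>"}

# Custom error page with no links to anywhere.
NOT_FOUND_PAGE = b"<html><body><h1>404 not found</h1></body></html>"


def app(environ, start_response):
    """WSGI application: custom error body, but a real 404 status."""
    path = environ.get("PATH_INFO", "/")
    body = PAGES.get(path)
    if body is None:
        status, body = "404 Not Found", NOT_FOUND_PAGE
    else:
        status = "200 OK"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]
```

To try it locally, the standard library's wsgiref.simple_server.make_server("", 8000, app).serve_forever() will serve it on port 8000.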

My "solution" was

To create the 404 page described above
Configure GoDaddy to use that page
Rename 404.html to 404xx.html so that it won't be found

With complete stupidity, GoDaddy won't allow you to select a "404 message file" unless it first exists. That is why I have to first create a file and later remove (rename) it.

More bogus Googlebot urls

In August 2012, I discovered yet another Googlebot error. In this case, forward slashes have been replaced with %5C, the URL percent-encoding for a backslash. The web server (IIS 6) has no problem with these. However, when the robot tries to follow the relative links on those pages, it ignores the fact that %5C must be interpreted as a backslash and maps all the links back to the root directory. As a result, every page on my site produces a bogus 404 error.

Bad url - it simply does not exist
  http://mc-computing.com/Basics.html

Google says that it is linked from
  http://mc-computing.com/Languages%5CActionScript%5Cindex.html

It is actually linked from
  http://mc-computing.com/Languages/ActionScript/index.html

This is the url Google should have used
  http://mc-computing.com/Languages/ActionScript/Basics.html

There are currently about 4,000 of these bogus 404 errors. I do not know where the base URLs are coming from. I have checked .. and they are not from my pages. It is possible that Google's robot is generating them, or they may be coming from someone's mirror of my site. It really does not matter - in my opinion

This is clearly a Googlebot design error.

By the way, both Firefox and IE also have trouble with these links. They are able to read the pages with the %5C (backslash) codes with no problem since IIS is locating the correct files. However, when any of the relative links are clicked, both browsers are dropping the directory information and producing 404 errors .. just like Googlebot.
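The mapping described above can be reproduced with Python's standard urljoin function, which applies the generic URL resolution rules: only a literal "/" separates path segments, so a %5C stays inside a single segment. This is only a sketch of that rule, not a claim about how Googlebot or the browsers are implemented, and the helper name resolve is an assumption.

```python
from urllib.parse import urljoin


def resolve(base, link="Basics.html"):
    """Resolve a relative link against the page it was found on."""
    return urljoin(base, link)


# Base URL with %5C in place of the slashes: the directory information is lost.
print(resolve("http://mc-computing.com/Languages%5CActionScript%5Cindex.html"))
# http://mc-computing.com/Basics.html

# Base URL with real slashes: the link stays in the ActionScript directory.
print(resolve("http://mc-computing.com/Languages/ActionScript/index.html"))
# http://mc-computing.com/Languages/ActionScript/Basics.html
```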

At any rate, the Google webmaster tools are useless with this many bogus errors .. and my site performance (number of visits) has been very poor since these problems started. (To be clear, it was much better before the first series of 24,000 bogus 404 errors, and has never recovered.)

Author: Robert Clemenzi