By the way, I did report both issues in the help forums.
As explained below, this may actually be a GoDaddy problem .. however, since the web developers are not at fault, Google should be smart enough to fix errors caused by ISPs.
Verification Failure
I don't log on very often, but I did today (03-15-2011) to check a new site. I noticed a dropdown list that shows all the sites I have access to. When I clicked it, I noticed that mc-computing.com was not in the list. When I tried to access it from the main page, it said that I wasn't verified.
Weird! I have been using that site for years. I checked - the verification file was still there.
The earliest known Verification Failure was 2/14/11 - 28 days ago. It may have started earlier, since the log does not go back any farther.
At any rate, the old verification method (create a file with the required file name) has been deprecated .. you must now use a file with the same name, but with the contents required by Google. I really don't have a problem with that, but the page that reports the verification failure should include a note explaining the change. In addition, the change should be documented in the FAQ.
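For anyone hitting the same surprise, a quick sanity check is to fetch the verification file and look at what comes back. This is only a sketch - the file name below is a made-up placeholder, and the expected contents (a single "google-site-verification:" line naming the file) are my assumption about the current format, so adjust both to whatever Webmaster Tools actually shows for your site.

    # Sketch: fetch the Google verification file and show what the server returns.
    # FILENAME is a made-up placeholder - use the name Webmaster Tools assigned.
    # EXPECTED assumes the "google-site-verification: <filename>" format; check
    # that against what Webmaster Tools tells you to put in the file.
    import urllib.error
    import urllib.request

    SITE = "http://mc-computing.com"
    FILENAME = "google1234567890abcdef.html"            # placeholder, not real
    EXPECTED = "google-site-verification: " + FILENAME  # assumed format

    url = SITE + "/" + FILENAME
    try:
        with urllib.request.urlopen(url) as response:
            status = response.status
            body = response.read().decode("utf-8", errors="replace").strip()
    except urllib.error.HTTPError as err:
        status, body = err.code, ""

    print("HTTP status :", status)
    print("Contents    :", body)
    print("Looks right :", status == 200 and body == EXPECTED)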
Googlebot's bogus URLs
    Restricted by robots.txt    23,918

Huh! That should be under 300.

I later noticed a message

    Googlebot found an extremely high number of URLs on your site: http://mc-computing.com/

along with samples like these

    http://mc-computing.com/Databases/MySQL/Parasites/Science_Facts/Lapse_Rate/blogs/ISPs/WordPress/qs/Global_Warming/NewsPapers/Lohachara

    http://mc-computing.com/Databases/MySQL
    http://mc-computing.com/Science_Facts/Lapse_Rate
    http://mc-computing.com/blogs
    http://mc-computing.com/ISPs/WordPress
    http://mc-computing.com/qs/Global_Warming/NewsPapers/Lohachara.html

Note that the first sample looks like the paths of several real pages strung together. The count kept climbing:

    Restricted by robots.txt    24,468
This appears to also be associated with my site no longer being "verified". (I have fixed that, but an explanation on the site would have been nice.)
Also, once this is fixed, how do I clear the robots cache so that only valid data is displayed?
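While waiting for Google's copy to refresh, the live robots.txt can at least be checked directly. Here is a sketch using Python's standard robot parser - the two URLs are simply paths mentioned above, and whether they should come back allowed or blocked depends on rules I have not reproduced here.

    # Sketch: test a few URLs against the *live* robots.txt, to compare with
    # whatever cached copy Webmaster Tools is reporting from.
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://mc-computing.com/robots.txt")
    rp.read()  # fetches and parses the current robots.txt

    # Sample URLs taken from the listings above.
    urls = [
        "http://mc-computing.com/Databases/MySQL",
        "http://mc-computing.com/qs/Global_Warming/NewsPapers/Lohachara.html",
    ]

    for url in urls:
        verdict = "allowed" if rp.can_fetch("Googlebot", url) else "blocked"
        print(verdict, url)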
I don't know whether these are related to the new software, but .. starting Feb 27, Googlebot began finding the bogus URLs shown above.
The Crawl Stats show a volume increase starting at the end of January, a double peak in the middle of February, and a very large peak at the end of February. I interpret this to indicate three separate software releases. In December and the first part of January, the pages crawled per day are indistinguishable from zero - the table says a low of 13. After the software change, the peak goes to 80,567 pages per day .. average 11,763.
The amount of data transferred jumped from a few thousand kilobytes per day to a peak of 1,959,620 kilobytes (almost 2 gigabytes). For people who have to pay for extra bandwidth, this could be very expensive.
At any rate, it is fairly obvious that Googlebot contains (contained) a very serious bug.
On the other hand, this may be a very old problem (2009) that was fixed and has now come back. I love how they blame the person who develops a web page for the failings of Googlebot. There was another post in Oct 2008 (requires a login to read).
It is also possible that Googlebot is not the problem. If some other site has all these links, then Googlebot may simply be following them.
GoDaddy is part of the problem
At GoDaddy, under Hosting Control Center / Settings / 404 Error Behavior, there are only 3 options
I have verified GoDaddy's response - when a page is not found, it substitutes the selected 404 page and returns a status code of 200 OK. This is a major violation of the HTTP specification. If a bad URL is requested, the server MUST return 404 Not Found. So .. part of this problem is GoDaddy's fault. However, the primary fault is still Google's .. they should never have even tried the tens of thousands of bogus URLs. NEVER!
Note: After changing the 404 setting, it takes up to 30 minutes for the change to be seen. This makes debugging and testing extremely slow.
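Checking which status code actually comes back only takes a few lines. This is a generic sketch, not anything GoDaddy-specific - the path below is simply a page that should not exist, so a correctly configured server would answer 404, while the behavior described above answers 200.

    # Sketch: request a page that should not exist and report the status code.
    # A proper configuration returns 404; a "soft 404" setup returns 200 OK.
    import urllib.error
    import urllib.request

    def status_for(url):
        try:
            with urllib.request.urlopen(url) as response:
                return response.status      # 200 here means a soft 404
        except urllib.error.HTTPError as err:
            return err.code                 # a real 404 (or other error) lands here

    # Made-up path, chosen only because it should not exist.
    print(status_for("http://mc-computing.com/no-such-page-test.html"))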
Get this, from GoDaddy's "To Set Up Your Custom 404 Page" help page
    NOTE: Links in the 404 page must be absolute, e.g. http://www.yoursite.com/page.html or else they will break when clicked from the 404 page.
That said, there should be tens of thousands of sites with this problem .. and Google should be able to fix it.
Soft 404 errors
In Farewell to soft 404s, Google explains the problem. Basically, if your server returns your custom 404 page with a 200 status code, the robots will think that a real page has been found and then link thousands of bogus URLs to that page.
I could solve this problem if there were an HTML tag that let me set the HTTP status code. Since there isn't, I have to rely on the ISP to do the right thing. Unfortunately, GoDaddy has selected the wrong answer.
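For reference only - my pages are plain static HTML, so this is not something I can use - a host that runs server-side scripts could set the status code from the error page itself. A minimal sketch using the CGI convention; whether GoDaddy would actually route 404s through a script like this is an assumption, not something I have tested.

    #!/usr/bin/env python3
    # Sketch: a custom "not found" page served by a CGI script.  Under the CGI
    # convention, the "Status:" header sets the HTTP status code, so the friendly
    # page and the correct 404 code can be returned together - which a plain
    # static HTML error page cannot do.

    print("Status: 404 Not Found")
    print("Content-Type: text/html")
    print()  # blank line ends the headers
    print("<html><body>")
    print("<h1>Page not found</h1>")
    print('<p>Try the <a href="http://mc-computing.com/">home page</a> instead.</p>')
    print("</body></html>")

Note that the link in it is absolute, in line with GoDaddy's own note above.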
My "solution" was
More bogus Googlebot URLs
Bad URL - it simply does not exist:
    http://mc-computing.com/Basics.html
Google says that it is linked from:
    http://mc-computing.com/Languages%5CActionScript%5Cindex.html
It is actually linked from:
    http://mc-computing.com/Languages/ActionScript/index.html
This is the URL Google should have used:
    http://mc-computing.com/Languages/ActionScript/Basics.html
There are currently about 4,000 of these bogus 404 errors. I do not know where the base URLs are coming from. I have checked .. and they are not from my pages. It is possible that Google's robot is generating them, or they may be coming from someone's mirror of my site. It really does not matter - in my opinion,

This is clearly a Googlebot design error.
By the way, both Firefox and IE also have trouble with these links. They are able to read the pages with the %5C (backslash) codes with no problem since IIS locates the correct files. However, when any of the relative links on those pages is clicked, both browsers drop the directory information and produce 404 errors .. just like Googlebot.
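The same behavior can be reproduced outside a browser. Here is a sketch using Python's standard library: because %5C is an encoded backslash rather than a path separator, the whole path counts as a single segment, so a relative link resolves against the site root - producing exactly the kind of bogus URL reported above.

    # Sketch: show how a relative link resolves against the two forms of the URL.
    from urllib.parse import urljoin

    backslash_base = "http://mc-computing.com/Languages%5CActionScript%5Cindex.html"
    normal_base    = "http://mc-computing.com/Languages/ActionScript/index.html"

    # %5C is an encoded backslash, not a path separator, so the entire path is
    # treated as one segment and the relative link falls back to the root.
    print(urljoin(backslash_base, "Basics.html"))
    # -> http://mc-computing.com/Basics.html  (the bogus 404)

    print(urljoin(normal_base, "Basics.html"))
    # -> http://mc-computing.com/Languages/ActionScript/Basics.html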
At any rate, the Google Webmaster Tools are useless with this many bogus errors .. and my site performance (number of visits) has been very poor since these problems started. (To be clear, it was much better before the first series of 24,000 bogus 404 errors, and has never recovered.)
Author: Robert Clemenzi