HTML Examples - Site Maintenance
There is more to maintaining a web site than just making
regular updates.
In general, the tasks described below need to be done for all sites;
the specific commands and paths apply only to this site.
Basically, with Apache on a unix server,
the statistics are found by examining
various log files.
In each case, I use a
cgi shell script.
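Every one of these scripts follows the same basic pattern -
print an HTTP Content-type header, then a blank line, then the html.
A minimal skeleton (the body shown here is just a placeholder):

#!/bin/sh
echo Content-type: text/html
echo                                # a blank line ends the http headers
echo "<html><body> ... </body></html>"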
404 Errors
I consider this to be the most important statistic because
it tracks users having problems finding your pages.
I have successfully used this data to find
various typos and several browser design problems.
On a unix server
(an operating system with case-sensitive filenames)
the most common typo is having the wrong case.
Since I do most of my testing in Windows
(a case-preserving operating system which
ignores case in filenames), there were a few errors
which slipped through.
A regular review of the 404 errors easily identifies these.
Even after all the case-sensitivity problems were fixed,
I noticed that there were still quite a few case-related
404 errors.
Apparently, users are typing in the urls and,
since they are not aware that this site is case-sensitive,
they generate numerous errors.
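One way to spot these automatically is to compare each failing path
against the actual directory listing, ignoring case.
The following is only a sketch, not part of this site's setup -
the log path is an assumption, and it relies on Apache's standard
"File does not exist" error message.

#!/bin/sh
# Sketch only - the log path is an assumption
grep 'File does not exist' /l/apache/logs/error_log |
    sed 's/^.*File does not exist: //' |
    while read f ; do
        dir=`dirname "$f"` ; base=`basename "$f"`
        # report the path when a file with the same name,
        # differing only in case, actually exists
        ls "$dir" 2>/dev/null | grep -ix "$base" > /dev/null && echo "$f"
    done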
I also noticed that several urls with no obvious problems
were failing.
After a little checking, I discovered that these
were browser-dependent - i.e., they worked fine
with IE 4.72 but failed with Netscape.
I went through and fixed these so that they now work with
both browsers.
The following is the code for
404.cgi.
#!/bin/sh
#
# 404.cgi by Robert Clemenzi 1-13-00
#
echo Content-type: text/html
echo
echo "<html><head><title>404 Errors</title></head>"
echo "<body>"
echo "<h1>404 Errors related to clemenzi pages</h1>"
echo "<xmp>"
/bin/tail -50000 /l/apache/logs/error_log | grep clemenzi |
    egrep -v "(ico|ICO|\.\.)"
echo "</xmp>"
echo "</body></html>"
Notes on the code
- The actual path name was modified for security reasons.
- The tail command limits the amount of data returned to
only the last few days of entries, rather than everything in the log file.
- The deprecated xmp tag is necessary because
some of the errors contain greater-than and/or less-than
characters.
The xmp tag allows these to be displayed literally;
without it, the browser would interpret them as markup
and the errors would not be displayed.
- egrep -v is used to limit the number of errors.
IE 5 automatically searches the site for .ico files
and generates numerous errors.
I obviously don't care about these.
Some site grabbers (programs which automatically download
entire sites) have design problems which generate
thousands of 404 errors. These errors are easily identified
by .. in the url, and the \.\. pattern removes them.
I do have a few relative links which legitimately contain
a .. sequence, but I would rather lose that data than be
overwhelmed with irrelevant errors.
How Users Find Your Site
How do users find your site?
Are they using search engines, links from other pages,
or bookmarks?
Well, the Apache server records the referring page in its log files,
and a
cgi script
runs a query and displays the results.
It turns out that most of my pages are found via search engines -
mostly www.altavista.com.
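The query itself is easy to sketch. Assuming the combined log format,
where the referring page is the fourth double-quote-delimited field,
something like the following works - the log path and script name
are assumptions, not this site's actual code.

#!/bin/sh
#
# referers.cgi - a sketch only, not this site's actual script
#
echo Content-type: text/html
echo
echo "<html><head><title>Referers</title></head><body>"
echo "<xmp>"
# splitting on double quotes, $4 is the Referer field
/bin/tail -50000 /l/apache/logs/access_log |
    awk -F'"' '{print $4}' |
    grep -v clemenzi |
    sort | uniq -c | sort -rn | head -40
echo "</xmp>"
echo "</body></html>"

The grep -v clemenzi drops links between my own pages, and the
final sort lists the most common referring pages first.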
What Pages Link to Your Site
It's neat to discover who is linking to your site.
The trick is to search for pages which link to you and
exclude your own pages (which presumably link to each other).
Unfortunately, the syntax and results depend on which search engine
you are using.
Altavista
+link:cpcug.org/user/clemenzi -url:user/clemenzi
Page Count - What Pages People Use
Tracking usage per page tells you which pages people
find most often, which, perhaps,
indicates which pages are most useful.
This data can help you determine which pages to spend the
most time updating and which are just using space.
In my case, the pages I write cover information that
I am interested in.
However, if a page gets more hits than others,
then I know that time spent improving that page
is well spent.
There are many ways to count the number of people using your site
- Use an embedded counter (script)
- Replace every page with a link to a unique cgi script
- Route all links through a single program (ASP, cgi, ...)
- Let the server collect statistics and process these
I prefer the last choice because it is transparent
and does not require me to modify my site.
I mean, having 100 pages and having to support
an equal number of cgi files makes no sense at all.
Embedded counters are executed every time your page is accessed.
This not only slows down page access, but is also browser-dependent.
The only thing that makes any sense is to let the server
collect the statistics.
A few rules when counting hits
- Never count images
- Never count embedded content (style sheets, Java classes, ...)
- Ignore frames which simply provide menu functions
(because they have no content)
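Applied to the raw access log, those rules boil down to a couple of
filters. This is only a sketch - the log path, the image extensions,
and the menu filename are all assumptions.

#!/bin/sh
# Sketch only - path, extensions, and menu filename are assumptions
# $7 is the requested url in Apache's common log format
awk '{print $7}' /l/apache/logs/access_log |
    egrep -v '\.(gif|jpg|png|css|class)' |
    grep -v 'menu\.html' |
    sort | uniq -c | sort -rn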
I run a
cgi script
as a cron job every day around 3:00 AM.
In unix, a cron job is a program that you can schedule
to run automatically at a specific time.
I picked 3:00 AM because I assumed that the server usage
would be low at that time.
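For example, the crontab entry might look something like this
(the script path is hypothetical):

# minute hour day-of-month month day-of-week command
0 3 * * * /web/cgi-bin/pagecount.cgi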
The
following script
processes the log file
and generates an html file which displays the data
in a table.
#!/bin/sh
grep 'Last updated' /web/statistics/index.html > stats.htm
echo 'Current stats for ...<br>' >> stats.htm
echo 'Because the log file ...' >> stats.htm
echo '<table border>' >> stats.htm
grep clemenzi/t /web/statistics/index.html | _
    egrep -v "(\.cgi|\.\.)" | _
    awk '{print "<tr><td>"$4"</td><td>" substr($6,16,80) "</td></tr>"}' _
    >> stats.htm
echo '</table>' >> stats.htm
Notes:
- The code is reformatted to be readable
(Some strings were shortened and _ indicates continued on next line)
- The paths to index.html and stats.htm were modified
- $4 is the count
- $6 is the full url
- substr($6,16,80) is the part of the url I want to see
(up to 80 characters, starting at character 16)
- egrep -v can be used to remove stuff you don't want
Author: Robert Clemenzi -
clemenzi@cpcug.org