HTML Examples - Site Maintenance
There is more to maintaining a web site than just making
regular updates.
In general, the tasks described below need to be done for all sites;
the specific commands and paths apply only to this site.
Basically, with Apache on a unix server,
the statistics are found by examining
various log files.
In each case, I use a
cgi shell script.
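Every one of these scripts follows the same basic pattern -
print an HTTP Content-type header, then a blank line, then the html.
A minimal skeleton (the body shown here is just a placeholder):

#!/bin/sh
echo Content-type: text/html
echo                                # a blank line ends the http headers
echo "<html><body> ... </body></html>"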
404 Errors
I consider this to be the most important statistic because
it tracks users having problems finding your pages.
I have successfully used this data to find
various typos and several browser design problems.
On a unix server
(an operating system with case-sensitive filenames)
the most common typo is having the wrong case.
Since I do most of my testing in Windows
(a case-preserving operating system which
ignores case in filenames), there were a few errors
which slipped through.
A regular review of the 404 errors easily identifies these.
Even after all the case-sensitivity problems were fixed,
I noticed that there were still quite a few case-related
404 errors.
Apparently, users are typing in the urls and,
since they are not aware that this site is case-sensitive,
they generate numerous errors.
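One way to spot these automatically is to compare each failing path
against the actual directory listing, ignoring case.
The following is only a sketch, not part of this site's setup -
the log path is an assumption, and it relies on Apache's standard
"File does not exist" error message.

#!/bin/sh
# Sketch only - the log path is an assumption
grep 'File does not exist' /l/apache/logs/error_log |
    sed 's/^.*File does not exist: //' |
    while read f ; do
        dir=`dirname "$f"` ; base=`basename "$f"`
        # report the path when a file with the same name,
        # differing only in case, actually exists
        ls "$dir" 2>/dev/null | grep -ix "$base" > /dev/null && echo "$f"
    done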
I also noticed that several urls with no obvious problems
were failing.
After a little checking, I discovered that these
were browser-dependent - i.e., they worked fine
with IE 4.72 but failed with Netscape.
I went through and fixed these so that they now work with
both browsers.
The following is the code for
404.cgi.
#!/bin/sh
#
# 404.cgi by Robert Clemenzi 1-13-00
#
echo Content-type: text/html
echo
echo "<html><head><title>404 Errors</title></head>"
echo "<body>"
echo "<h1>404 Errors related to clemenzi pages</h1>"
echo "<xmp>"
/bin/tail -50000 /l/apache/logs/error_log | grep clemenzi |
    egrep -v "(ico|ICO|\.\.)"
echo "</xmp>"
echo "</body></html>"
Notes on the code
- The actual path name was modified for security reasons.
- The tail command limits the amount of data returned to
only the last few days of entries, rather than everything in the log file.
- The deprecated xmp tag is necessary because
some of the errors contain greater-than and/or less-than
characters.
The xmp tag allows these to be displayed literally;
without it, the browser would interpret them as markup
and the errors would not be displayed.
- egrep -v is used to limit the number of errors.
IE 5 automatically searches the site for .ico files
and generates numerous errors.
I obviously don't care about these.
Some site grabbers (programs which automatically download
entire sites) have design problems which generate
thousands of 404 errors. These errors are easily identified
by .. in the url, and the \.\. pattern removes them.
I do have a few relative links which legitimately contain
a .. sequence, but I would rather lose that data than be
overwhelmed with irrelevant errors.
How Users Find Your Site
How do users find your site?
Are they using search engines, links from other pages,
or bookmarks?
Well, the Apache server records the referring page in its log files,
and a
cgi script
runs a query and displays the results.
It turns out that most of my pages are found via search engines -
mostly www.altavista.com.
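The query itself is easy to sketch. Assuming the combined log format,
where the referring page is the fourth double-quote-delimited field,
something like the following works - the log path and script name
are assumptions, not this site's actual code.

#!/bin/sh
#
# referers.cgi - a sketch only, not this site's actual script
#
echo Content-type: text/html
echo
echo "<html><head><title>Referers</title></head><body>"
echo "<xmp>"
# splitting on double quotes, $4 is the Referer field
/bin/tail -50000 /l/apache/logs/access_log |
    awk -F'"' '{print $4}' |
    grep -v clemenzi |
    sort | uniq -c | sort -rn | head -40
echo "</xmp>"
echo "</body></html>"

The grep -v clemenzi drops links between my own pages, and the
final sort lists the most common referring pages first.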
What Pages Link to Your Site
It's neat to discover who is linking to your site.
The trick is to search for pages which link to you and
exclude your own pages (which presumably link to each other).
Unfortunately, the syntax and results depend on which search engine
you are using.
Altavista
+link:cpcug.org/user/clemenzi -url:user/clemenzi
Page Count - What Pages People Use
Tracking usage per page tells you which pages people
find most often, which, perhaps,
indicates which pages are most useful.
This data can help you determine which pages to spend the
most time updating and which are just using space.
In my case, the pages I write cover information that
I am interested in.
However, if a page gets more hits than others,
then I know that time spent improving that page
is well spent.
There are many ways to count the number of people using your site
- Use an embedded counter (script)
- Replace every page with a link to a unique cgi script
- Route all links through a single program (ASP, cgi, ...)
- Let the server collect statistics and process these
I prefer the last choice because it is transparent
and does not require me to modify my site.
I mean, having 100 pages and having to support
an equal number of cgi files makes no sense at all.
Embedded counters are executed every time your page is accessed.
This not only slows down page access, but is also browser-dependent.
The only thing that makes any sense is to let the server
collect the statistics.
A few rules when counting hits
- Never count images
- Never count embedded content (style sheets, Java classes, ...)
- Ignore frames which simply provide menu functions
(because they have no content)
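Applied to the raw access log, those rules boil down to a couple of
filters. This is only a sketch - the log path, the image extensions,
and the menu filename are all assumptions.

#!/bin/sh
# Sketch only - path, extensions, and menu filename are assumptions
# $7 is the requested url in Apache's common log format
awk '{print $7}' /l/apache/logs/access_log |
    egrep -v '\.(gif|jpg|png|css|class)' |
    grep -v 'menu\.html' |
    sort | uniq -c | sort -rn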
I run a
cgi script
as a cron job every day around 3:00 AM.
In unix, a cron job is a program that you can schedule
to run automatically at a specific time.
I picked 3:00 AM because I assumed that the server usage
would be low at that time.
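For example, the crontab entry might look something like this
(the script path is hypothetical):

# minute hour day-of-month month day-of-week command
0 3 * * * /web/cgi-bin/pagecount.cgi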
The
following script
processes the log file
and generates an html file which displays the data
in a table.
#!/bin/sh
grep 'Last updated' /web/statistics/index.html > stats.htm
echo 'Current stats for ...<br>' >> stats.htm
echo 'Because the log file ...' >> stats.htm
echo '<table border>' >> stats.htm
grep clemenzi/t /web/statistics/index.html | _
    egrep -v "(\.cgi|\.\.)" | _
    awk '{print "<tr><td>"$4"</td><td>" substr($6,16,80) "</td></tr>"}' _
    >> stats.htm
echo '</table>' >> stats.htm
Notes:
- The code is reformatted to be readable
(Some strings were shortened and _ indicates continued on next line)
- The paths to index.html and stats.htm were modified
- $4 is the count
- $6 is the full url
- substr($6,16,80) is the part of the url I want to see
(up to 80 characters, starting at character 16)
- egrep -v can be used to remove stuff you don't want
Author: Robert Clemenzi -
clemenzi@cpcug.org