Steve Johnston
The Google Blog of a Google Consultant
September 20, 2005
The original question related to tracking the visits of search engine spiders when the web stats being relied upon are using JavaScript tagging. These tags are not processed by the Lynx-type software bots from the search engines, so the possibility of using an image 'get' as a tracking method was suggested. Sadly this won't work.
September 15, 2005
September 14, 2005
Most web site owners want to know the difference between real and automated visitors to their web sites. Using log file analysis tools alone can make this a tricky proposition. Some of the industry organisations who care deeply about the accuracy of web reports - usually because advertising sales depend on them - such as the ABCe, depend on a 'recognised exclusions' approach. This is a notoriously tricky area, and 'recognised exclusions' are never likely to prove accurate or responsive enough to what is going on in the real world.
Also, you have to bear in mind that not everyone wants to exclude all scrapers from their sites, however perverse that may seem. And not all bad bots are equally bad. Once a bot has been identified by User Agent or IP address in some official list, it will simply mutate and carry on - just like a virus.
Good bots should do as they are told - see WebmasterWorld's exclusion list for how scary managing this issue is.
Bad bots ignore the robots.txt file and frequently spoof their real identity, so identifying them and their usually-hopefully-not-human behaviour is a matter of diligent log file analysis, followed by patient .htaccess modification and monitoring to exclude what YOU want excluded. Don't forget that they won't appear in web stats based on JavaScript tagging, because they don't process the JavaScript.
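To make the log file analysis a little more concrete, here is a rough sketch of one check: listing the clients that requested paths your robots.txt disallows. It assumes the server writes the Apache "combined" log format, and the names `find_robots_violators`, `SAMPLE_LOG` and the `/private/` prefix are made up for illustration. A hit is a hint rather than proof, since a human can follow a link into a disallowed area too.

```python
import re
from collections import defaultdict

# Apache "combined" log format (an assumption about the server's configuration).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def find_robots_violators(log_lines, disallowed_prefixes):
    """Map (ip, user agent) to the disallowed paths that client requested."""
    violators = defaultdict(list)
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        path = match.group("path")
        if any(path.startswith(prefix) for prefix in disallowed_prefixes):
            violators[(match.group("ip"), match.group("agent"))].append(path)
    return dict(violators)


# Hypothetical sample data; the IPs come from documentation-only ranges.
SAMPLE_LOG = [
    '192.0.2.10 - - [14/Sep/2005:10:00:00 +0000] "GET /private/report.html HTTP/1.1" 200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '192.0.2.10 - - [14/Sep/2005:10:00:01 +0000] "GET /private/prices.html HTTP/1.1" 200 4096 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '198.51.100.7 - - [14/Sep/2005:10:00:05 +0000] "GET /index.html HTTP/1.1" 200 2048 "-" "ExampleBot/1.0"',
]

if __name__ == "__main__":
    for (ip, agent), paths in find_robots_violators(SAMPLE_LOG, ["/private/"]).items():
        print(ip, agent, paths)
```

Note that the first client in the sample claims to be a browser yet walks straight through the disallowed area. That is the spoofing problem in miniature, and it is why the User Agent string alone is not enough to go on.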
If your site has unique content that forms part of your commercial differentiation, you can guarantee it is already splattered over the web, helping some spammy directory compete with you in the search results. For most sites the horse has already bolted, sadly.
Unless you have people actually doing the above, in anger, day-to-day, the information will be garbage-in, garbage-out: a cursory glance at your WebTrends, WebAbacus or IndexTools stats will not give the percentages needed for the analysis, because the bad bots are pretty well hidden.
This is also a big issue for popular sites, because voracious bots can generate a whole lot of expensive bandwidth usage.
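For a first look at who is eating the bandwidth, something along these lines would total the bytes served per User Agent. Again, this is only a sketch: it assumes Apache "combined" format logs, and `bytes_by_agent` and `SAMPLE_LOG` are illustrative names, not part of any stats package.

```python
import re
from collections import Counter

# Only the response size and User Agent fields are captured here.
# The Apache "combined" log format is an assumption.
LOG_PATTERN = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (?P<size>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def bytes_by_agent(log_lines):
    """Total the response bytes served to each User Agent string."""
    totals = Counter()
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        size = match.group("size")
        # Apache logs "-" when no response body was sent.
        totals[match.group("agent")] += 0 if size == "-" else int(size)
    return totals


# Hypothetical sample data; the IPs come from documentation-only ranges.
SAMPLE_LOG = [
    '198.51.100.7 - - [14/Sep/2005:10:00:05 +0000] "GET /index.html HTTP/1.1" 200 2048 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [14/Sep/2005:10:00:06 +0000] "GET /archive.zip HTTP/1.1" 200 1000000 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [14/Sep/2005:10:00:07 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '192.0.2.10 - - [14/Sep/2005:10:00:08 +0000] "GET /index.html HTTP/1.1" 304 - "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

if __name__ == "__main__":
    for agent, total in bytes_by_agent(SAMPLE_LOG).most_common():
        print(total, agent)
```

Because bad bots spoof their User Agent, grouping by IP address as well is probably worth doing before drawing any conclusions.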





