In my Web Analytics class, we’re beginning to analyze Apache log files to extract analytics data. Today, I pulled down a raw access log from this site to see what I could learn. I also run AWStats to build reports on server access. As I’ve been digging through my access log, I’ve noticed that comment spammers account for a large share of the requests to my server.
I have found that a comment spambot will request a page on my blog, scrape it for the comment form, and then post spam comments to the form’s target URL. According to AWStats, close to 50% of the requests to my site come from unknown operating systems. This leads me to believe that roughly half of my access log data is pollution from spambots.
Luckily, spambots don’t usually download and execute my Google Analytics JavaScript the way a normal browser does, so my Google Analytics data is much cleaner than the raw access log.
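The pattern described above (fetch a page, then POST to the comment form) is something you can look for in the log itself. Here is a rough Python sketch of how I might do that. It assumes the log is in Apache’s "combined" format, and the comment endpoint `/wp-comments-post.php` is only a placeholder I made up for illustration, so swap in whatever your form actually posts to. The function names are my own, too. Treat it as a heuristic, not a spam detector: a real visitor who leaves a comment would be flagged as well.

```python
import re

# Apache "combined" format (assumed):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

# Hypothetical comment form target; replace with your blog's real one.
COMMENT_PATH = "/wp-comments-post.php"


def parse(lines):
    """Yield one dict per log line that matches the expected format."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            yield match.groupdict()


def suspected_spam_hosts(entries, comment_path=COMMENT_PATH):
    """Return the hosts that sent at least one POST to the comment form."""
    return {
        e["host"]
        for e in entries
        if e["method"] == "POST" and e["path"].startswith(comment_path)
    }


def split_log(lines, comment_path=COMMENT_PATH):
    """Return (entries from unflagged hosts, set of flagged hosts)."""
    entries = list(parse(lines))
    flagged = suspected_spam_hosts(entries, comment_path)
    clean = [e for e in entries if e["host"] not in flagged]
    return clean, flagged
```

Calling `split_log(open("access.log"))` (the filename is just an example) gives back the entries from hosts that never posted to the comment form, plus the set of flagged hosts, so you can compare the flagged share against the "unknown operating system" share that AWStats reports.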
[...] A post from Jimmy Zimmerman highlights one issue with log-file analysis. Blogs and other types of websites that encourage user feedback have forms for posting comments, etc. These comment forms are heavily targeted by comment spammers, who use them as a source of free advertising. Since these spam submissions consist of valid request/response pairs, they pollute your log files with false data. If you use log-file analysis to track statistics on your site, be aware of this and take steps to cleanse your logs to produce accurate stats. [...]