In announcing new data on the growth of blogs as tracked by Technorati – over 7.8 million at the end of February, double the number tracked just last October – Technorati CEO Dave Sifry highlights a disturbing fact:
Part of the growth of new weblogs created each day is due to an increase in spam blogs – fake blogs that are created by robots in order to foster link farms, attempted search engine optimization, or drive traffic through to advertising or affiliate sites. […] Most of this fake blog spam comes from hosted services or from specific IP addresses. […] Right now, about 20% of the aggregate pings Technorati receives are from spam blogs, so you won’t see that in these numbers – these statistics show only "cleaned" data.
The new data shows that between 30,000 and 40,000 new blogs are being created every single day – more than double the number (15,000 per day) from October. (See Sifry’s post for full details.)
Fake spam blogs are one thing – comment and trackback spam on your own blog is another.
Judging from the growing volume of reports about such blog spam, this will be a major and ever-increasing problem for anyone who runs a public blog, whether an individual or an organization. I know of a good half-dozen business bloggers who have been hit by comment and/or trackback spam attacks in recent months, badly enough to cause major disruption to their blogs.
I’m not talking about the odd spam comment here – these are concerted attacks with hundreds of automated spam messages blitzed to a blog.
In January, Google announced the "rel=nofollow" tag as a means to combat comment spam in blogs. That can’t prevent spam, it can only try to address it after the event (and it doesn’t seem to have caught on). Six Apart has a practical guide to spam prevention for users of Movable Type.
These are great, but if you use a hosted service like TypePad, as I do, there is nothing you can do to prevent spam – see this TypePad knowledge base article – other than not allow commenting or trackbacks on your blog at all. While that might be a practical solution, it’s hardly what you want to do.
I commented in January about trying to deal with blog spam on hosted blogs from the preventative point of view. To prevent comment spam, there are tools: Captcha, for instance, and Six Apart’s TypeKey authentication system.
I don’t know why no one has yet found a way to implement these preventative measures on hosted blog services. Surely it’s just a matter of time until a service like Blogger or TypePad is brought to its knees through a massive spam attack.
Or is it that everyone thinks it’s just too remote a possibility?
In January, Google announced the “rel=nofollow” tag as a means to combat comment spam in blogs. That can’t prevent spam, it can only try to address it after the event (and it doesn’t seem to have caught on).
Actually, nofollow wasn’t designed to combat comment spam, but rather to tag content that’s not created by a site’s author so it’s not indexed by certain spiders. It wasn’t designed to prevent spam, but to remove one of the incentives, and it’s caught on fantastically, faster than any similar proposal since the adoption of robots.txt in the early days of the web.
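To make the mechanism concrete, here is a minimal sketch of how a blog tool might mark links in visitor-submitted comments with rel="nofollow" before publishing them. The function name `add_nofollow` and the regex approach are assumptions for illustration only, not how Movable Type, TypePad, or Google actually implement it. A production system would more likely use a proper HTML parser.

```python
import re

# Hypothetical helper for illustration; not taken from any blog platform.
# Matches an opening <a ...> tag and captures its attributes.
_A_TAG = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)
# Matches an existing rel attribute (quoted or unquoted value).
_REL_ATTR = re.compile(r"""\srel\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE)


def add_nofollow(comment_html: str) -> str:
    """Return comment_html with rel="nofollow" set on every <a> tag."""
    def rewrite(match):
        # Drop any rel value the commenter supplied, then add our own.
        attrs = _REL_ATTR.sub("", match.group(1))
        return f'<a{attrs} rel="nofollow">'
    return _A_TAG.sub(rewrite, comment_html)


if __name__ == "__main__":
    print(add_nofollow('Nice post! <a href="http://example.com">my site</a>'))
```

The link still appears and still works for readers. The attribute only tells participating search engines not to give the link credit, which is why it removes an incentive rather than blocking the comment itself.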
Also, it’s actually often easier to detect spam attacks on hosted services, since that gives those of us hosting the sites a much wider base of data to judge with. (If someone’s leaving the exact same comment on 100 blogs, it’s unlikely they’re a legitimate human.)
At that point, it becomes a process of preventing server attacks based on behaviors, in the same way we defend against denial of service attacks and scripting attacks and other similar issues. That’s an area that’s been well-researched by server administrators for some time now.
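As a rough illustration of the "same comment on 100 blogs" idea, a hosted service could group incoming comments by their text and flag any text that turns up on many distinct blogs. This is only a sketch under stated assumptions: the function names, the `(blog_id, comment_text)` input shape, and the default threshold of 100 are made up for the example and do not describe how any hosting service actually does it.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(text.lower().split())


def find_mass_posted(comments, min_blogs=100):
    """Return the normalized comment texts seen on at least min_blogs blogs.

    comments: iterable of (blog_id, comment_text) pairs (hypothetical shape).
    """
    blogs_by_text = defaultdict(set)
    for blog_id, text in comments:
        blogs_by_text[normalize(text)].add(blog_id)
    # Count distinct blogs, so one chatty reader on one blog is not flagged.
    return {text for text, blogs in blogs_by_text.items() if len(blogs) >= min_blogs}


if __name__ == "__main__":
    sample = [(f"blog{i}", "Buy  CHEAP pills") for i in range(100)]
    sample.append(("blog1", "Nice post"))
    print(find_mass_posted(sample))
```

An individual blogger only ever sees one copy of the comment, which is why this kind of check is practical only for whoever hosts many blogs at once.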
Thanks, Anil.
Re the “nofollow” tag, I stand corrected. But Google’s announcement re the tag said “This is called comment spam, we don’t like it either, and we’ve been testing a new tag that blocks it.” So maybe I’ve misunderstood what that means.
I understand your points re detecting spam attacks on hosted services and preventing server attacks. So what does that mean to someone like me, with a blog on a hosted service (TypePad)? How do I *prevent* spam reaching my blog? That question probably ought to be rephrased as: What do I need to do to lessen the chances of spam reaching my blog? Or do I rely wholly on what you might implement at the server end as a robust preventative measure?
I’ll ask you another question: Why can’t I implement TypeKey on my TypePad blog? Both are your products.