Spam + Blogs = Trouble

Splogs are the latest thing in online scams – and they could smother the Internet.

I am aware that spending a lot of time Googling yourself is kind of narcissistic, OK? But there are situations, I would argue, when it is efficiently – even forgivably – narcissistic. When I published a book last year, I wanted to know what, if anything, people were saying about it. Ego-surfing was the obvious way to do that. Which is how I stumbled across Some Title.

Some Title identified itself as a blog but obviously wasn't one. Here, reprinted in its entirety, is the paragraph from the site that mentioned me:

Show Disputed Vinland Map Was Made Half Century Before Columbus Trip Audio/Video Columbus: Secrets From The Grave quot;The Last Voyage of Columbus quot;: An Epic Tale Charles Mann's quot;1491 quot; (Audio

In orthodox bloggy style, the paragraph linked to another Web page. When I clicked on the link, I was confronted with more gibberish: "Below," it stated, "you will find some grave robbing in ventura california 1985 news that's relevant for today."

Blogs like Some Title are known as "splogs" – spam blogs. Like email spam, splogs use the most wonderful features of networked communication – its flexibility, easy access, and low cost – in the service of sleazy get-rich-quick schemes. But whereas email spammers try to induce recipients to buy products, sploggers and other Web spammers make most of their money by getting viewers to click on ads that run adjacent to their nonsensical text. Web page owners – the spammer, in this case – get paid by the advertiser every time someone clicks on an ad.

Some Title's creator had almost certainly assembled the site by using software that hops from Web page to Web page, automatically copying text that includes potential search terms. (My name and my book's title had been included incidentally, because they appeared in a review or blog that happened to contain keywords sought by the spammer.) Sploggers don't care if the resulting Web pages are garbled; the point is to churn them out chockablock with terms that people might use in search queries, leading them to visit the pages and click (ka-ching!) on the ads.

Just as the proliferation of email spam constantly threatens to inundate email providers, the explosion of blog spam is a besetting problem for the blog industry. Like most people who poke around the blogosphere, I had occasionally encountered splogs before. But over the months that I monitored the reaction to my book, they seemed to be rising in number. More and more of the blogs and Web sites that mentioned my book – or any other topic, for that matter – were spam. Some 56 percent of active English-language blogs are spam, according to a study released in May by Tim Finin, a researcher at the University of Maryland, Baltimore County, and two of his students. "The blogosphere is growing fast," Finin says. "But the splogosphere is now growing faster."

To Jason Goldman, product manager for Google's Blogger hosting service, "the ever-increasing number of splogs is a significant problem that we have to combat." No search engine wants users looking for information about, say, auto repair to click on a promising link and end up on a page filled with jabberwocky or a collection of advertisements. Nor does any blog host want to waste its resources and trash its reputation by providing a home to spammers. A recent survey by Mitesh Vasa, a Virginia-based software engineer and splog researcher, found that in December 2005, Blogger was hosting more than 100,000 sploggers. (Many of these are likely pseudonyms for the same people.)

Google, Goldman promises, is paying serious attention to the problem. It should be: The pay-per-click advertising that accounts for most of Google's income (and, increasingly, for the incomes of Yahoo and MSN Search, the two other big search engines) has become an irresistible magnet for hucksters, con artists, and chiselers. "The three main search engines are gateways to a huge percentage of the US and world economy," says Anil Dash, a vice president of the blog-hosting company Six Apart. "If your Web site appears high up on their results, thousands or millions of people will go to it." If even a small fraction of those people click on the ads on that site, "you're going to make a lot of money" – and sploggers are going after it.

Because the ad money is effectively available only to Web sites that appear in the first page or two of search results, spammers devote enormous efforts to gaming Google, Yahoo, and their ilk. Search engines rank Web sites in large part by counting the number of other sites that link to them, assigning higher placement in results to sites popular enough to be referred to by many others. To mimic this popularity, spammers create bogus networks of interconnected sites called link farms. Blogs – most of which are in essence little more than collections of links with commentary – are particularly useful elements in them. The result, Dash says, "is what you'd expect: The blogosphere is increasingly polluted by spam."
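To see why a link farm works, it helps to reduce the ranking idea to its crudest form. The sketch below simply counts how many distinct sites link to each target; real search engines weigh far more than this, and every site name here is a made-up placeholder.

```python
from collections import defaultdict

def inbound_link_counts(links):
    """Count the distinct sites linking to each target (self-links ignored)."""
    sources = defaultdict(set)
    for source, target in links:
        if source != target:
            sources[target].add(source)
    return {site: len(linkers) for site, linkers in sources.items()}

def rank_by_inbound_links(links):
    """Order targets from most-linked to least-linked, ties broken by name."""
    counts = inbound_link_counts(links)
    return sorted(counts, key=lambda site: (-counts[site], site))

# Hypothetical data: two real blogs link to a genuine auto-repair guide...
organic = [
    ("blog-a.example", "repair-guide.example"),
    ("blog-b.example", "repair-guide.example"),
]
# ...while five interlinked splogs all point at one spam portal.
farm = [(f"splog-{i}.example", "sportal.example") for i in range(5)]

print(rank_by_inbound_links(organic + farm))
```

Under this naive count the portal outranks the genuine guide, which is exactly the popularity that spammers are trying to mimic.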

The mess may have consequences beyond the blogosphere, though. Blogs are the leading edge of what is often called Web 2.0, the vision of the Internet as a bottom-up, communal platform for data of all sorts that is generated and continually updated by its users: the image-sharing sites Flickr and YouTube, the social bookmarking destination del.icio.us, the collaborative online encyclopedia Wikipedia, the user-generated Slashdot rival digg, and publicly viewable online calendars like Kiko and CalendarHub. Unfortunately, the very openness and ease of use that make these Web 2.0 sites popular will inevitably make them perfect targets for spammers, says Matt Mullenweg, developer of the popular WordPress blogging system. "Extreme vulnerability to spam," he says, is a defining characteristic of Web 2.0, and splogs are its first manifestation.

People in the industry disagree about how to beat back spam, or whether it can even be done. But there's no dispute that if the blogosphere and the rest of Web 2.0 can't find a way to stop the sleazeballs who are enveloping the Net in a haze of babble and cheesy marketing, then the best features of Web 2.0 will be turned off, and it will go the way of Usenet, which was driven to desuetude by spam.

Some Title, the splog that commandeered my name, was created by Dan Goggins, the proud possessor of a 2005 master's degree in computer science from Brigham Young University. Working out of his home in a leafy subdivision in Springville, Utah, Goggins, his BYU friend and partner, John Jonas, and their handful of employees operate "a few thousand" splogs. "It's not that many," Goggins says modestly. "Some people have a lot of sites." Trolling the Net, I came across a PowerPoint presentation for a kind of spammers' conference that details some of the earnings of the Goggins-Jonas partnership. Between August and October of 2005, they made at least $71,136.89.

People in Goggins' business bristle at the term "splogger." They like to be called "search engine marketers." And they don't simply throw a mass of blogs online. Instead they build complex networks of Web sites, entire online ecosystems of sleaze, twaddle, and gobbledygook. The main goal is to lure unsuspecting blog readers and other Internet users to spam portals. Sportals, as they are known, are Web pages consisting almost entirely of pay-per-click links, all of which shunt netsurfers to legitimate commercial Web sites, collecting money along the way for the spammers. Examples of these doorway pages include debts.com, lasvegasvacations.com, and 90210.com, all owned by industry pioneer Marchex of Seattle; another is photography.com, run by NameMedia, based in the Boston suburb of Waltham.

Naturally, sportal owners want their properties to appear prominently in search engine results. The precise algorithms that search engines use to rank pages are as closely guarded – and as valuable – as upcoming Intel chip designs. Spammers figure them out by stuffing individual splogs and spam pages with potential search terms and links to one another, observing how high the sites appear in search results, then tweaking their pages and trying again. Repeated over time, Six Apart's Dash says, "the brute-force approach is effective. Not that the Google guys aren't smart, but the sheer relentlessness on the side of spammers is formidable, and they eventually get what they want."

In June, a Romanian tech-blogger who calls himself Ionut Alex. Chitu discovered that Googling "pizza sauce recipe" turned up a spam page within the first 10 results – prime Web real estate. The page, he and others quickly learned, was a tiny island in a massive archipelago of spam, millions of pages in size, erected by scammers supposedly in Romania or Argentina, depending on which fake registration data one chose to believe. Having discovered a lacuna in Google's indexing system, the spammers shoved search terms into their Web sites' domain names, which the search algorithm regarded as a sign of their importance. The pizza-sauce inquiry, for instance, directed viewers to the Web address 1059.pizza.eiqz2q.org, which automatically bounced to a sportal; one incautious mouseclick later and the viewer would be stuck in a near-endless loop of ad sites. (Google started deleting the spam colossus from its index within hours of its discovery.)

In addition to creating massive numbers of phony blogs, sploggers sometimes take over abandoned real blogs. More than 10 million of the 12.9 million profiles on Blogger surveyed by splog researcher Vasa in June were inactive, either because the bloggers had stopped blogging or because they never got started. (The huge mass of dead blogs is one reason to maintain a healthy skepticism toward the frequently heard claims about the vast growth of the blogosphere.) "Nobody is watching or moderating the comments and posts on those abandoned blogs," says Tim Mayer, director of product management for Yahoo search. As a result, he says, scammers are looking for ways to hack the interface of these blogs to post to them and take advantage of their inbound links to increase the ranking of spam sites. For obvious reasons, it is difficult for a Google or a Yahoo to discern when a previously valuable site and its links slip over to the dark side and become part of a spam empire.

Not only do sploggers create fake blogs or take over abandoned ones, they use robo-software to flood real blogs with bogus comments that link back to the splog. ("Great post! For more on this subject, click here!") Statistics compiled by Akismet, a system put together by WordPress developer Mullenweg that tries to filter out blog spam, suggest that more than nine out of 10 comments in the blogosphere are spam. Partly as a result, prominent blogs like Instapundit, The Corner, and Talking Points Memo simply refuse to turn on commenting.

Almost as pervasive is trackback spam. Trackbacks are the familiar mechanism – "see which blogs are talking about this post" – by which Blogger A's discussion of a post by Blogger B can be automatically linked to Blogger B's site. Spammers will claim on a real blog that their splogs continue the thread of discourse, hoping to lure the real blog's readers to visit them and click on ads. "All of a sudden a well-known blogger is linking to a spam blog," says Natalie Glance, a senior researcher at Nielsen BuzzMetrics, the rating company's blog-analysis arm. Because of the link, the splog gets a boost in the search engine ranks – even as its link tarnishes the real blog.

The avalanche of spam places an increasing burden not only on blog hosts but on another equally vital component of the blogosphere: blog-search engines like Technorati and IceRocket, and the so-called ping servers they depend on. Unlike Google or Yahoo, blog-search firms operate in real time, so as to keep pace with ongoing discussions. Every time bloggers make posts, their software automatically alerts the network of ping servers that track the blogosphere for blog-search engines. These tap into the feed from ping servers to refresh their indexes and visit the sites that have new material. Syndication services that use delivery formats like RSS and Atom also use the ping servers to know when they need to deliver content to subscribers.

Unfortunately, splogs generate content faster than real blogs – no surprise, given that the text is churned out by robo-software, with no need for the splogger to write or think. Maryland researcher Finin and his students found that splogs produce about three-quarters of the pings from English-language blogs. Another way of saying this is that the legitimate blogosphere generates about 300,000 posts a day, but the splogosphere emits 900,000, inundating the ping servers. "It's not enough to weed out splogs on the level of the search engines; you also have to get rid of them in the ping servers," Glance says. "It's a whole second front."
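The two ways of stating the figure agree, as a quick calculation shows. The function name below is illustrative; the daily counts are the ones quoted above.

```python
def spam_share(legit_per_day, spam_per_day):
    """Fraction of all daily posts (and thus pings) that come from splogs."""
    total = legit_per_day + spam_per_day
    if total == 0:
        raise ValueError("no posts to measure")
    return spam_per_day / total

share = spam_share(legit_per_day=300_000, spam_per_day=900_000)
print(f"{share:.0%}")  # prints 75%
```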

Splogs are annoying but not illegal. Still, blog-hosting firms like Six Apart, Blogger, MSN Spaces, and Xanga desperately want to get rid of them. And blog-search companies would like just as much to eliminate them from their results.

For that to happen, though, the companies must identify the splogs they want to weed out – a harder task than it may seem. Take Some Title, the splog that mentioned me. Any human reader can tell instantly, as I did, that the site is tripe. But even if hosting services and search engines hired armies of people, the blogosphere is simply too big to sift through blogs one by one. Computers are faster but notoriously unable to distinguish sense from nonsense – they can't tell Some Title from Shakespeare.

The way out of this dilemma is to find mechanisms for computers to identify splogs without reading them, says David Sifry, founder of Technorati, the largest blog-search firm. The key to doing this, in his view, is to understand that real blogs are a form of expression, but "spam blogs are built essentially to fool search engines." They have different characteristics than blogs – characteristics that computers can identify and thus use to eliminate spam from their search results. "If we see 10,000 pings within 60 seconds, and all the blogs point to the same Web site, it's really easy to recognize that as a link farm," Sifry says.
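The kind of check Sifry describes can be sketched as a sliding-window count. This is an illustration only, not Technorati's actual system: the record layout (timestamp, blog, linked site) and the function name are assumptions, and the defaults simply mirror the numbers in his example.

```python
from collections import defaultdict, deque

def find_ping_bursts(pings, window_seconds=60, threshold=10_000):
    """Return the sites that at least `threshold` pinging blogs pointed to
    within any span shorter than `window_seconds`.

    pings: iterable of (timestamp_seconds, blog_url, linked_site) tuples.
    """
    recent = defaultdict(deque)  # linked_site -> timestamps inside the window
    flagged = set()
    for timestamp, _blog, site in sorted(pings, key=lambda ping: ping[0]):
        window = recent[site]
        window.append(timestamp)
        # Drop pings that have aged out of the window.
        while timestamp - window[0] >= window_seconds:
            window.popleft()
        if len(window) >= threshold:
            flagged.add(site)
    return flagged
```

With hypothetical data, 50 pings arriving a tenth of a second apart and all pointing at one site would be flagged at a threshold of 40, while the same number of pings spread 100 seconds apart would not.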

There are other ways to spot offenders. Like most blogs, Some Title consists of a number of 50- to 100-word posts (incoherent ones, in this case), all with hyperlinks to other Web sites. In real blogs, the hyperlinks' anchor text – the word or phrase users click on – is generally something innocuous like "previous post" or "interesting discussion." Splogs, by contrast, often have search terms in the anchor text; the anchor text for one Some Title link, for instance, was "grave digger freestyle." The links in ordinary blogs usually take users to well-known sites like Flickr and YouTube or prominent blogs like Talking Points Memo and Boing Boing. By contrast, each link in Some Title takes the user to a spam Web page or another splog.
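A rough version of that anchor-text test can be written with Python's standard HTML parser. The phrase list and the allowlist of well-known hosts below are tiny illustrative assumptions, not anyone's production rules.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

INNOCUOUS_ANCHORS = {"previous post", "interesting discussion"}  # assumed list
WELL_KNOWN_HOSTS = {"flickr.com", "youtube.com"}  # assumed allowlist

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from a page."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.anchors.append((self._href, text))
            self._href = None

def suspicious_link_fraction(html):
    """Share of links with non-innocuous anchor text AND an unfamiliar host."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    if not collector.anchors:
        return 0.0
    suspicious = 0
    for href, text in collector.anchors:
        host = urlparse(href).hostname or ""
        if host.startswith("www."):
            host = host[4:]
        if text.lower() not in INNOCUOUS_ANCHORS and host not in WELL_KNOWN_HOSTS:
            suspicious += 1
    return suspicious / len(collector.anchors)
```

A page on which nearly every link scores as suspicious looks like Some Title; a page on which few do looks like an ordinary blog.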

These sites, moreover, often have odd-looking, superlong URLs that are packed with keywords, because search engines tend to award high ranks to Web sites with keywords in their title, and sploggers are constantly looking for ways to increase their visibility in search engines. One LiveJournal splog that mentioned me, for example, was called New-york-agency-direct-mail-insurance-marketing. The grave-robbing Web site had the absurd address www.1michaelgraves7.info/conducting-from-the-grave/grave-robbing-in-ventura-california-1985.html. "If it's a Blogspot blog with more than two dashes, it's spam," Mullenweg says. Simply checking for dashes and search terms in links, in other words, will eliminate many splogs.

Another giveaway: Both Some Title and the grave-robbing page it links to had Web addresses in the .info domain. Spammers flock to .info, which was created as an alternative to the crowded .com, because its domain names are cheaper – registrars often let people use them gratis for the first year – which is helpful for those, like sploggers, who buy Internet addresses in bulk. Splogs so commonly have .info addresses that many experts simply assume all blogs from that domain are fake.
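The dash rule and the .info rule are simple enough to sketch in a few lines. The signal names are invented for illustration, and applying the two-dash limit to the path as well as to Blogspot names is an assumption that goes slightly beyond Mullenweg's rule of thumb.

```python
from urllib.parse import urlparse

def url_spam_signals(url, max_dashes=2):
    """Return the crude spam signals that a Web address trips."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = (parsed.hostname or "").lower()
    signals = []
    # Mullenweg's rule of thumb: a Blogspot name with more than two dashes.
    if host.endswith(".blogspot.com") and host.count("-") > max_dashes:
        signals.append("dashed-blogspot-name")
    # Addresses in the .info domain.
    if host.endswith(".info"):
        signals.append("info-domain")
    # Assumed extension: a keyword-stuffed path full of dashes.
    if parsed.path.count("-") > max_dashes:
        signals.append("dashed-path")
    return signals

print(url_spam_signals(
    "www.1michaelgraves7.info/conducting-from-the-grave/"
    "grave-robbing-in-ventura-california-1985.html"
))  # prints ['info-domain', 'dashed-path']
```

Checks this blunt will also catch some legitimate sites, so they are better treated as evidence to be weighed than as a verdict.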

But even if blog-search firms use these techniques to identify and remove splogs, the struggle against them will never end. "The sploggers always adjust," says Nielsen's Glance. "As soon as companies like Google and ourselves get better, the spammers get better." Every so often, Google revamps its search algorithms, partly to outwit spammers and sploggers. The update sets off a "Google dance," in which legitimate Web site owners and scammers both race to maintain high positions in search engine results.

Dismayingly, this endless arms race may actually be a best-case scenario. The interactivity of the blogosphere – and of the rest of Web 2.0 – means that sploggers will always have multiple ways of infiltrating the system, explains Gilad Mishne, a computer scientist at the University of Amsterdam who focuses on splogs and Web spam. And for those other paths, he says, "we're really in trouble."

On June 19, Six Apart's Anil Dash blogged about his experience beta-testing Microsoft Office 2007. His positive review attracted considerable attention, with many other bloggers linking to it – so many, in fact, that Dash was surprised to discover soon after that his post was listed second when people Googled "Office 2007." (It was 10th in MSN Search and 17th in Yahoo.) If the post's position doesn't change, Dash says, "a year from now it could be a gateway for tens of thousands of software sales, maybe hundreds of thousands. My wife told me to quit my job and focus on exploiting my search rank." It'd be easy, he jokes. All he'd have to do is stick up some big pay-per-click ads for Office 2007 and watch visitors click through, collecting a fee each time they did.

The possibility was more than theoretical. Dash's posts have attracted attention before, and the attention has sometimes been followed by an email asking if he would, for a fee, tuck a new link somewhere on his site. The emails were from sploggers. They wanted to add Dash's highly ranked post to a link farm. "I get these offers about once a week," he says. "I've always been a little leery of trying to contact them to find out exactly how much they'd pay."

The emails, Dash believes, exemplify the fundamental difficulty in fighting splogs and Web spam. With the rise of pay-per-click advertising, the big search engines have, in effect, created a kind of currency: ranking in search results. Put up the right Web site, with the right collection of links and keywords, and – ka-ching! This cash is available to anyone on earth who can manipulate search engines' site-ranking systems. Little wonder that the entire world's supply of spammers is trying to seize the opportunity. They are combing through the Net so assiduously that they are attempting to capitalize on individual blog posts about products that won't even appear for months to come. No single company, Dash believes, can withstand that much collective rapacity. As a result, he says, "there's going to be a reckoning with the economy that's building up around search engine rankings, one way or another." Something fundamental will have to change, either in the search engine world or the blogosphere, because things can't continue the way they are now.

Technorati's Sifry is more optimistic. Yes, he says, spam is a problem. But the people who are crying doom are not taking a wide-angle view. "You have to recognize that spam is doubling every six months," he says. With this in mind, Technorati's engineers designed its spam filters to be scalable right from the start. Its continually escalating defenses, Sifry believes, will eventually tame the onslaught of spam, though not eliminate it.

But many researchers also fear that an eventual solution will reduce the openness, ease, and accessibility that are at the heart of the blog world and Web 2.0. They note that one method by which the blog-search firms weed out spam is by not trying to include comments and trackbacks in their searches. The result is to strip out bloggy interactivity – getting rid of spam by treating Web 2.0 as if it were Web 1.0.

"The whole purpose of Web 2.0 is user-generated content," Mullenweg says. "To make that happen, you want the system as easy and transparent as possible. But that just lets the spammers in. So you put in hurdles for them to jump over. They jump over them, so you put in more hurdles. And at the end of the day, you have a system that's not nearly as easy and open and transparent."

One example: Blogger and other blog-hosting sites now require users to prove they are not spambots before posting comments by identifying a series of distorted letters and numbers. The protective codes are called Captchas, which stands for "completely automated public Turing tests to tell computers and humans apart." In theory, sploggers' autoposting software can't figure out the distorted images, thus reducing the flow of spam. But Captchas also make commenting harder. "It's a big pain for legitimate users," Blogger's Goldman says, "and there are many visually impaired people who can't do it at all." (Google recently introduced an audio-based form.) Nor are Captchas completely effective. Sploggers are believed to be hiring squads of low-paid people to type through the tests. "We're seeing Captchas solved in bursts, which suggests they are working in shifts," Goldman says.

Mullenweg thinks he has come up with a better approach. He got serious about fighting spam, he says, when his mother started to blog. "I went through the last hundred or so people who had pinged WordPress with comments and trackbacks, and it was all spam," he says. "Mortgages and Viagra, pills and porn."

Embarrassed and revolted, Mullenweg decided to fight back. In his view, even the smartest companies – the Six Aparts and Technoratis – represent single points of failure, something that spammers can target and outwit. But the bad guys, Mullenweg says, can't beat the "collective, distributed intelligence" of the blogosphere. When bloggers install his Akismet software, it submits all comments and trackbacks to a Web service that tests them for spamminess, quarantines the bogus ones, and posts the rest. If any of those are spam, bloggers report them to Akismet, which uses the feedback to improve its filter. Almost 300,000 bloggers use the software, Mullenweg says, and their input improves the filter every day. "Essentially what we're doing is working together. All the kids that got hit by bullies in school have discovered there's strength in numbers. I like to believe that, anyway."
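The feedback loop can be illustrated with a toy filter that learns only from what bloggers report. To be clear, this is not Akismet's algorithm or its API, neither of which is described here; the class, its word-voting rule, and the sample comments are all assumptions made for the sake of the sketch.

```python
from collections import Counter

class SharedSpamFilter:
    """Toy shared filter: every blogger's report updates the same word counts."""

    def __init__(self):
        self.spam_words = Counter()
        self.ham_words = Counter()

    @staticmethod
    def _words(text):
        return set(text.lower().split())

    def report(self, comment, is_spam):
        """Record one blogger's verdict on a comment."""
        counts = self.spam_words if is_spam else self.ham_words
        counts.update(self._words(comment))

    def is_spam(self, comment):
        """Flag a comment when more of its words lean spam than lean legitimate."""
        words = self._words(comment)
        spam_votes = sum(1 for w in words if self.spam_words[w] > self.ham_words[w])
        ham_votes = sum(1 for w in words if self.ham_words[w] > self.spam_words[w])
        return spam_votes > ham_votes

shared = SharedSpamFilter()
shared.report("cheap pills and mortgages", is_spam=True)        # one blogger's report
shared.report("thanks for the thoughtful post", is_spam=False)  # another's
print(shared.is_spam("cheap pills here"))  # prints True
```

The point of the design is that a report from any one blog immediately protects all the others, which is the strength in numbers Mullenweg is describing.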

Despite the effort and expertise behind such technical fixes, Dash doesn't think any of them will work in the long run. "They're making money on beating you, and you're losing money fighting them," he says. "The economics are on their side." Ultimately, he thinks, "the solution is going to be accountability. You have to know that somebody is who they say they are." Six Apart's TypePad blogging service enforces accountability on its bloggers in one of the simplest ways possible: It charges them at least $4.95 a month to host their blogs. Not only is the token payment enough to discourage scammers who want to operate thousands of blogs at once, but it also establishes bloggers' identities by tying them to a bank account.

Because not all companies will follow Six Apart's template, Dash says, there will have to be some kind of global identifier – an Internet Social Security number, so to speak. Everyone could select a personal URL, he says, such as their blog address. "If you use your URL as identification, you can use that to get higher search engine placement." He employs a similar system on his own blog, he says. "Anybody that signs in and authenticates themself or provides their URL can immediately comment. Anybody that wants to be anonymous, I have to approve it."

Dash concedes that such global identifiers would alarm privacy activists. But the other solutions are even worse. For example, search engines could auction off their search results, thus making people pay for ranking. Just as monthly payments for blogs would be the death of splogs, paying for search rank would be the death of link farms. But, says Dash, "a blog like mine wouldn't have a chance of being the second result for 'Office 2007.'" What is not possible, he believes, is to continue to muddle through. "The spammers are too good."

Asked what impact he thinks splogging will have on the future of the Web, Some Title creator Goggins pauses. "I'm just making my living," he says. "I guess I don't think about that kind of thing very much."

Contributing editor Charles C. Mann (www.charlesman.org) wrote about click fraud in issue 14.01.

Decoding Splogs

This site on Keywordblogger.net, one of many I discovered last year that contained my name, combines many of the techniques scammers use to drive up their rankings in users' search queries. Wired annotates the essential components. - C.C.M.