Posted by randfish
First off, let me just say that there are a lot of people working in Google’s Search Quality division, and specifically on the Spam team, who are smarter and more experienced at scalably attacking web spam than I am. However, as a search enthusiast, a Google fan and an SEO, it seems to me that, with all due respect, they’re getting played – hard.
Word is, the Spam team’s key personnel had some time off working on other projects, and supposedly they’re coming back together to renew the fight. I hope that’s the case, because the uproar about black/gray hat SEO gaming the results is worse than ever, and deservedly so. It’s getting bad enough that I actually worry early adopters might stop using Google for commercial queries and start looking for alternatives because of how manipulative the top results feel. That behavior often trickles down over time.
Thus, I’m going to expound a bit on a tactic I discussed in my interview with Aaron for fighting what I see as a large part of the manipulation of results in Google – the abuse of anchor text rich links.
The basic problem is that if you want to rank well in Google for a high value, commercial search query like discount printer cartridges or home security camera systems, getting links whose anchor text contains those words, preferably as exact matches, is invaluable to rankings. Unfortunately, natural, editorially given links are extremely unlikely to use anchor text like that. They’re more likely to use the business or website name, possibly a single relevant word or two, but finding dozens or hundreds of domains that will link with this kind of anchor text without push-marketing intervention from an SEO is next to impossible.
That means sites that earn the natural, editorial links fall behind, while those who find ways to grab the anchor text match links and evade Google’s spam detection systems nab those top spots. It’s been going on like this for 10 years, and it’s insane. It needs to stop. Just as Google’s said they’ll be taking a hard look at exact match domain names, they need to take a hard look at precise matches for commercial anchor text links.
Here’s the methodology I like:
Step 1: Create a list of oft-spammed, commercially-directed anchor text. With Google’s resources, this won’t be hard at all. In fact, a good starting point might be some of the top AdSense keyword lists (this one was readily available).
Just a sample of some of the 3,400+ phrases in one file I found
I suspect Google’s Webspam team would have no trouble compiling hundreds of thousands of phrases like this that have a high potential for gaming and are found in large quantities of anchor text links.
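To make Step 1 concrete, here’s a minimal sketch of how a phrase list might be assembled from keyword export files. The file format, the `min_words` filter, and the sample data are all my own illustrative assumptions – Google would obviously draw on query logs and ad data far richer than a CSV.

```python
# Hypothetical sketch of Step 1: build a normalized set of high-value,
# commercially-shaped phrases from a keyword export (e.g., a top AdSense
# keywords file). The CSV format and filters here are assumptions.
import csv
import io

def load_phrases(csv_text, min_words=2):
    """Read phrases from a CSV keyword export, keeping multi-word
    phrases and normalizing case and whitespace."""
    phrases = set()
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        # Lowercase and collapse internal whitespace for exact matching later
        phrase = " ".join(row[0].lower().split())
        if len(phrase.split()) >= min_words:
            phrases.add(phrase)
    return phrases

# Tiny illustrative sample; a real list would run to hundreds of thousands
sample = "Discount Printer Cartridges\nhome security camera systems\ninsurance\n"
```

Single-word terms like "insurance" get dropped here only as a simplifying heuristic – multi-word commercial phrases are far less likely to appear in anchor text by accident.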
Step 2: Locate any page on the web containing 3+ links with any of these anchor text phrases linking to different sites. An obvious example might look something like this:
But, any collection of exact-match anchor, followed links to pages on multiple domains could be flagged by the system.
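Step 2 could be sketched roughly as follows. This is a toy version operating on raw HTML with a hardcoded phrase set; the parser, threshold, and phrase list are my assumptions – Google would run the equivalent query against its link index rather than re-parsing pages.

```python
# Sketch of Step 2: flag pages containing 3+ followed links whose exact
# anchor text matches a spam-prone phrase and which point to different
# domains. Simplified assumption: we parse raw HTML with the stdlib.
from html.parser import HTMLParser
from urllib.parse import urlparse

SPAM_PHRASES = {"discount printer cartridges", "home security camera systems",
                "mesothelioma lawyers"}

class AnchorCollector(HTMLParser):
    """Collect (href, anchor_text) pairs for followed <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._nofollow = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            a = dict(attrs)
            self._href = a.get("href")
            self._nofollow = "nofollow" in (a.get("rel") or "")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href and not self._nofollow:
            self.links.append((self._href, "".join(self._text).strip().lower()))
        if tag == "a":
            self._href = None

def is_flagged(html, phrases=SPAM_PHRASES, threshold=3):
    """True if 3+ followed, exact-match anchors point at distinct domains."""
    p = AnchorCollector()
    p.feed(html)
    domains = {urlparse(href).netloc
               for href, text in p.links if text in phrases}
    return len(domains) >= threshold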
Step 3: Have manual spam raters spot check a significant sample of the pages flagged by this filtration process (maybe 5,000-10,000) and record the false positives (pages where Google would legitimately want to count those links).
Step 4: If the false positives follow some easily identifiable pattern, write code to exclude them and their ilk from the filtration system. If the pattern is tougher to detect, machine learning could be applied to the sample, running across the positives and false positives to identify features that give an accurate algorithmic method for filtration.
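For the machine learning side of Step 4, here’s a hedged, minimal sketch: a tiny logistic regression trained on rater-labeled pages. The three features and the toy data are purely illustrative assumptions on my part – the real system would draw on vastly richer signals.

```python
# Sketch of Step 4: learn to separate true link spam from rater-identified
# false positives. Features per page (all assumptions for illustration):
# [exact-match anchor ratio, distinct target domains / 10, outbound link density]
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, epochs=2000, lr=0.5):
    """Plain logistic regression via gradient descent.
    labels: 1 = confirmed spam, 0 = legitimate (false positive)."""
    n = len(samples[0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y
            for i in range(n):
                w[i] -= lr * err * x[i]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """True if the page should have its links devalued."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5

# Toy labeled sample standing in for rater judgments
spam_pages  = [[0.9, 0.8, 0.7], [0.8, 0.9, 0.6], [0.95, 0.7, 0.8]]
legit_pages = [[0.1, 0.2, 0.3], [0.2, 0.1, 0.2], [0.05, 0.3, 0.1]]
w, b = train(spam_pages + legit_pages, [1, 1, 1, 0, 0, 0])
```

The re-consideration loop in Step 6 fits naturally here: each confirmed false positive becomes a new labeled example for retraining.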
Step 5: Devalue the manipulative links by applying the equivalent of a rel="nofollow" on them behind the scenes.
Step 6: Create a notification in Webmaster Tools saying "we’ve identified potentially manipulative links on pages on your site and have removed the value these links pass." Add this notification to 60-75% of the sites engaged in this activity AND write a blog post saying "we’ve applied this to 65% of the sites we’ve found engaging in this activity." If webmasters send re-consideration requests arguing that the filter caught false positives, those cases can be sent back through Step 4 for evaluation and refinement.
Step 7: Create a flag in the PageRank toolbar for these same 60-75%, making the PR bar appear red on all the pages of the site. Announce this on the Webmaster Blog as well, noting that "65% of the sites we know about have been flagged with this."
That’s gonna scare a lot of webmasters
Step 8: Watch as search quality improves from the algorithmic filtration of manipulative link power and less spam is created as link buyers and spammers realize their efforts are going to waste.
Is this plan foolproof? No. Are there loopholes and messiness and ways clever spammers will work around it? Absolutely. But the folks I’ve talked to about this agree that for a huge quantity of the most "obvious" webspam via link manipulation, this could have a big, direct, fast and scalable impact. The addition of steps 6 and 7 would also send a much needed message that site owners and content creators would hear and feel loud and clear, while creating enough uncertainty about the value of the non "marked" sites to cause a behavioral shift.
Maybe Google’s already thought of this and shot it down, maybe they’ve already implemented it and we just think all those anchor text rich links are helping, but maybe this thing has legs, and if it does, I hope Google does something. I’m bombarded so often with the questions "isn’t Google irrelevant now?" and "hasn’t SEO ruined Google?" that I’m fairly certain action’s needed. This type of manipulation seems to me the most obvious, most painful and most addressable.
Looking forward to your comments, suggestions and ideas – undoubtedly my concept is riddled with holes, but perhaps with your help, we can patch it up.
p.s. Yes, conceptually we could create a metric like this with Linkscape and show it in the mozBar and via Open Site Explorer and/or the Web App, but I’m not sure how accurate we could be, nor do I think it’s the best way to help web marketers through software (given our dozens of priorities). However, the fact that our engineering team thinks it’s relatively simple to build means it must be equally (if not more) simple for Google.