Google’s Mechanical Turk

You can’t run a web search index as massive as Google’s without algorithms, scale, and heavy automation.

You also can’t run it without editorial judgments. And “computers alone” do not determine what ranks.

You can’t ever describe a composite concept like “relevance” or “quality” fully and “accurately,” because these are inherently subjective qualities. All you can do, as a scientist, is develop measuring tools that capture a version of them. And in so doing, the scale used to measure becomes, for all intents and purposes, synonymous with that quality, at least for the purposes of the study, and to those consuming its outcomes.

Think of proving that “religiosity is highly correlated with a gene that also causes you to have green hair.” You could isolate that gene and figure out if this was true, but hold on: what the heck is religiosity, anyway?

It turns out the definition would be arbitrary. A group of social scientists (maybe using past literature, an expert panel, or other means) would create a weighted scale to measure it from a composite of factors, based on discoverable facts or askable questions: “How many times did you attend church or a religious institution in the past month?” “On a scale of 1 to 10, how important is it that your choice of employment leaves time for religious practices?” For the sake of rigor you might create a long list of factors; if you didn’t care too much about methodology, you would settle for a rough approximation of what we generally see as religiosity, and the scale itself would have to do in terms of being true to that concept.
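To make the mechanics concrete, here is a minimal sketch of such a composite scale. The factors, weights, and normalization are invented for illustration and are not drawn from any real survey instrument:

```python
# Hypothetical illustration: a composite "religiosity" score built from
# weighted survey answers. Factors and weights are invented to show the
# mechanics of a composite scale, nothing more.

FACTORS = {
    "attendance_last_month": 0.5,   # count of visits, capped at 10
    "importance_of_practice": 0.3,  # self-rated 1-10
    "daily_prayer": 0.2,            # self-rated 1-10
}

def religiosity_score(answers: dict) -> float:
    """Weighted sum of normalized answers, scaled to 0-100."""
    total = 0.0
    for factor, weight in FACTORS.items():
        value = min(answers.get(factor, 0), 10) / 10.0  # normalize to 0-1
        total += weight * value
    return round(total * 100, 1)

print(religiosity_score({
    "attendance_last_month": 4,
    "importance_of_practice": 8,
    "daily_prayer": 6,
}))  # -> 56.0
```

Whatever weights the scientists choose, the resulting number becomes, for the study’s purposes, what “religiosity” means.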

So what about the “quality” of the content on web pages in relation to informational queries?

As many know, Google has long employed human raters to make notes on specific pages, to help it better design the algorithm and assess where certain websites stand in relation to generally accepted notions of quality, spam-or-not, and so on.

Quality raters are generally asked to answer simple questions, but the work gets as sophisticated as knowing the difference between “thin” and “thick” affiliates. Presumably, then, they could be asked to take on still more nuanced judgments.

There is nothing to say, then, that Google doesn’t run additional pilot projects using human raters to crack down on certain areas where poor quality is creating generalized malaise and spam complaints.

And what additional qualities might they look for? In the case of the latest “Farmer” update, intended to lower the overall rankings of companies dubbed “content farms” (allegedly, producers of lower-quality, SEO-friendly written content that falls below the editorial standards of “true” editorial organizations), you could begin to feed human raters questions like: on a scale of one to five, how “content farmy” does this page feel? (I guess I’m paraphrasing.) That data could then in turn be fed back into algorithm design.
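For illustration only, here is one way such rater labels could be turned into training data. The feature names, the 1-to-5 ratings, and the simple averaging step are all assumptions made for the sketch, not a description of Google’s actual pipeline:

```python
# A sketch of how human ratings might feed back into ranking experiments.
# Feature names, ratings, and the averaging step are assumptions.

from statistics import mean

# Each record: page features plus several raters' 1-5 "content farmy" scores.
rated_pages = [
    {"url": "example.com/how-to-x", "word_count": 320,
     "outbound_citations": 0, "ratings": [4, 5, 4]},
    {"url": "example.org/guide-to-x", "word_count": 1800,
     "outbound_citations": 7, "ratings": [1, 2, 1]},
]

# Collapse rater disagreement into a single label, then emit (features, label)
# pairs a learned ranking component could be trained or evaluated against.
training_set = [
    ({"word_count": p["word_count"],
      "outbound_citations": p["outbound_citations"]},
     mean(p["ratings"]))
    for p in rated_pages
]

for features, label in training_set:
    print(features, "->", round(label, 2))
```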

Presumably, web search algorithms evolve over time to encompass many arbitrary but important factors to ensure that users find content that is as relevant and useful as possible.

But much like trends in geopolitics that result in a shift in coverage emphasis on 24-hour news channels, specific themes and trends can increasingly pose themselves as the proverbial “growing threat” to the integrity of search results. Often these are composite, hard-to-define-with-laser-precision efforts to subvert the “spirit” of search algorithms by publishers seeking to tailor content to the “letter” of those algorithms.

If, for example, it turns out you can get an unfair advantage in search engine rankings just by studying the frequencies of many highly specific “question and answer” style search queries, and you fill the “answer” void with tailored but untrustworthy, slapped-together answers, then you just might create a business around that. And a person (not a search engine) might argue that there are better sources of that information generally, even if they cover it indirectly or are harder to find, and that a search engine’s algorithm should be tweaked to give the latter a fighting chance over the former.

In that process, human quality raters could be getting new questions about how deep, rich, or authoritative a piece is. New factors could come into play around answer length, external validation, and just about anything you or a scientist could dream up.

Those raters might also be asked simply to do a specific job of “calling out” a page or website. Similar to asking a quality rater to say whether something is “spam” or not, and defining a “thin affiliate” as “spam” for those purposes, you could ask them to judge whether an answer or how-to article is “low quality” or “original,” with further instructions stating that if something looks “content farmy,” it may be low quality or unoriginal.

That human input could then be used as a shortcut for downgrading whole websites, or pages with certain qualities from those websites, at least on an “all else being equal” basis (the downgrades could be overruled by other strong quality and relevance signals).
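A toy sketch of that “all else being equal” logic might look like the following; the flagged-domain list, penalty factor, and override threshold are invented purely to illustrate the idea of a site-wide demotion that strong page-level signals can overrule:

```python
# A toy "all else being equal" demotion rule. The flagged domains, the
# penalty multiplier, and the override threshold are assumptions.

FLAGGED_DOMAINS = {"examplefarm.com"}  # domains raters flagged as low quality

def adjusted_score(domain: str, base_score: float,
                   page_quality_signal: float) -> float:
    """Demote pages from flagged domains unless page-level signals are strong."""
    if domain in FLAGGED_DOMAINS and page_quality_signal < 0.8:
        return base_score * 0.5   # site-wide downgrade
    return base_score             # strong signals overrule the demotion

print(adjusted_score("examplefarm.com", 10.0, 0.4))  # 5.0, demoted
print(adjusted_score("examplefarm.com", 10.0, 0.9))  # 10.0, rescued by page signals
```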

On the whole, this paints a picture of a fallible process that still moves in vaguely the right direction, as far as “quality” and “relevance” are concerned. It’s not perfect, but if the goal is to plug certain opportunistic SEO loopholes and do a better job of highlighting great content, it’s a nice way to reward the producers of original, “real” content and give them the courage to stick with that process for the long term, rather than worrying about being outranked in the short term by cynical opportunists.

Like “religiosity,” the process of determining “relevance” and “quality” is not entirely subjective, because you can create and refine the scale and the composite measuring stick that tells you who has more of it.

But certainly there is an arbitrariness to it, so that some will feel wronged by the process. Like the “highly religious” person who scores lower on the official scale of “religiosity” because they don’t live near a church and don’t have a car, some useful web pages might not fare well when the algorithm finds them sharing qualities in common with other low-quality sites. With the Web being the scale it is, there are bound to be many shortcomings in ranking algorithms, especially on “long tail,” infrequently-searched terms.

If the situation gets bad enough, the search engines need to take big shortcuts to make fewer obvious mistakes. One enormous shortcut involves looking globally at “domain authority” as a major factor in measuring the trustworthiness or quality of a website’s content. While it would be nice to think that the algo can adjust to accurately assess the quality and helpfulness of specific pages of content, it’s fair to say that it can’t do a great job of that in many cases. So huge shortcuts are taken; some sites are effectively greenlighted until they turn too greedy for their own good; and so the dance continues. One day, certain sites (like The Huffington Post or Squidoo) will be flying high. The next day, they take a big hit across the board. The day after: who knows.
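One plausible (and entirely hypothetical) way to express that shortcut is to blend a site-wide prior with a page-level estimate, trusting the page-level signal only to the degree you are confident in it. The signal names and blend weights here are assumptions for illustration:

```python
# A sketch of "domain authority as a shortcut": when page-level quality
# estimates are unreliable, lean on a site-wide prior instead.
# Signal names and weights are invented for illustration.

def blended_quality(domain_authority: float, page_quality: float,
                    page_confidence: float) -> float:
    """Trust page-level quality only to the degree we are confident in it."""
    page_confidence = max(0.0, min(1.0, page_confidence))
    return (page_confidence * page_quality
            + (1 - page_confidence) * domain_authority)

# A mediocre page on a high-authority site still scores well...
print(blended_quality(domain_authority=0.9, page_quality=0.3,
                      page_confidence=0.2))  # 0.78
# ...until the domain itself takes an across-the-board hit.
print(blended_quality(domain_authority=0.4, page_quality=0.3,
                      page_confidence=0.2))  # 0.38
```

Under a scheme anything like this, a site-wide swing in the prior moves thousands of pages at once, which is exactly what those abrupt changes of fortune look like from the outside.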

How “algorithmic” are those shifting fortunes? What exact mechanisms are leading to site-wide and brand-wide promotions and demotions? Many people are curious to know the details. Google is unlikely to provide specifics.

Not without cause, good-sites-gone-bad are often “slapped” with an across-the-board downgrade in response to open scheming and boasting by their owners or partners (or by SEOs who find ways of becoming publishers on those sites) about how great they are at leveraging their site’s high trust to generate advertising or affiliate revenues, even for lower-quality or tailored pages. High trust where? In Google.

The parts of those sites that actually play by the rules — the real Squidoo lenses, the good articles on HuffPo — then need to be preserved so the “slap” doesn’t throw the baby out with the bathwater. Hard work, even for Google.

This puts every content-driven, advertising-driven business in a perilous position. Google is careful to describe certain directed “anti-spam,” “anti-low quality” initiatives in such a way that demonizes the “offenders”. Usually, this is in line with consumer protection, but every so often, you worry that just about any site could find itself on the wrong end of a new definition of what counts as “original” “quality” “content”.

Certainly, Yelp isn’t feeling too confident about its relationship with Google, despite its obvious leadership status in local business reviews.

And certainly, these “growing threat” sweeps should never be undertaken by first asking if “someone could build an entire business” around “getting unfairly high rankings in search engines.” This, of course, is subjective, and could apply to anyone.

Imagine if you built a database of filmographies and biographical information pertaining to motion pictures and their stars. Imagine if you built a user-constructed encyclopedia that had nearly definitive information on an incredible range of subjects. Imagine if you built the first major website that offered comprehensive information and user reviews of every travel destination in the world. That would give you great organic search traffic! But is it wrong? I hope not!

Of course, by any algorithmic test you can imagine (for now), these kinds of sites would rank well in Google, across tens of thousands of pages of useful content. Whole businesses could be built around them (and have been).

But in a heartbeat, this can change. Computers don’t make that decision — not on their own.

That’s the scary part.
