Testing and Statistical Confidence: The Three Bears
The primary danger in taking actions in response to data (or even structured tests) in our industry appears to be overconfidence in overt data points and a weak grasp of randomness and statistical confidence levels, leading to a flurry of actions and tweaks that get us more lost in the woods, not less. (I pointed this out here, last week.)

According to cognitive scientists like Daniel Kahneman and Amos Tversky, humans — even professional statisticians — are “poor intuitive statisticians.” (Predictably, some up-and-coming scholars in the same field have argued just the opposite.)

Translated into the terms of our daily work in performance-based ad testing (etc.), it's very common for novices — or even good workers who feel under the gun from bosses/clients who push us to "test more, do more" — to respond frequently to random bits of data. For example, ads get paused in favor of the one or two that are "winning," even though the statistical confidence on the win (if you took the trouble to run it through a calculator) sits below 70%, and even though a suboptimal attribution model means some segments get "last click credit" for "converting" while others do not, often for no rhyme or reason beyond pure randomness. "What's working better" in a combination of ads, keywords, queries, landing pages, inter-family buying dynamics, and medium-to-long consideration cycles is never as easy as it looks. Tweaking in response to random data is, arguably, tantamount to concluding tests before they're finished. In other words, you set up a whole bunch of experiments, then shut them down prematurely. Wasted resources and insufficient learning/takeaways.
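To make the "below 70%" point concrete, here is a minimal sketch (my illustration, not any particular calculator's method) of an exact check. The assumption: two ad variants get equal traffic, so under the null hypothesis that they perform identically, each conversion is a fair coin flip between them; confidence is then one minus the one-sided binomial p-value.

```python
# Sketch: how confident is a "win," really? Assumes equal traffic to
# both variants, so under the null each conversion is a fair coin flip.
from math import comb

def win_confidence(wins_a: int, wins_b: int) -> float:
    """Confidence that variant A truly beats B, via an exact
    one-sided binomial test on the conversion split."""
    n = wins_a + wins_b
    # P(wins_a or more of n conversions land on A under a fair coin)
    p_value = sum(comb(n, k) for k in range(wins_a, n + 1)) / 2 ** n
    return 1 - p_value

# An 11-to-9 "win" looks decisive on a dashboard...
print(round(win_confidence(11, 9) * 100, 1))  # ~58.8% confidence
```

Even a 12-to-8 split only reaches roughly 75% confidence by this measure, which is why pausing the "loser" after a handful of conversions is usually just reacting to noise.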

Too much random tweaking, we might say, is the action of the Impetuous Bear.

At the other end of the spectrum is the Inertia Bear. That bear once saw an A/B test that was rigged to be exactly the same ad competing with itself. 9 conversions accrued to the "winning" version, and only one to the "losing" (yet identical) version. Consulting the math experts, that outcome (in the case of a truly fair coin flip) happens only ten times out of 1024, so it's less than 1% likely to happen. And yet it happened! From this, the Inertia Bear decides to insert a lot of these "placebo tests" into the testing rotation as a way to guard against acting on purely random results. Over time, though, the paranoia about some results necessarily being random or impossible to explain (or correlate with the triggers being tested) begins to creep into a general mistrust of testing. That leads to a broader trend away from building anything new. That works fine, until it doesn't.
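The Inertia Bear's arithmetic checks out, assuming ten total conversions split between the two identical ads: the chance of a fair coin landing 9-1 in one specific direction really is ten in 1024.

```python
# Placebo-test arithmetic: probability that 9 of 10 conversions
# land on one *specific* copy of an identical ad, by pure chance.
from math import comb

trials = 10
p_nine_of_ten = comb(trials, 9) / 2 ** trials  # exactly 9 "heads"
print(p_nine_of_ten)  # 10/1024, just under 1%
```

Note that if you count a 9-1 split in either direction (plus the 10-0 cases), the chance of a result at least that lopsided is 22/1024, a bit over 2% — still rare, but rare things happen constantly when you run many tests.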

So is the answer to simply be “Moderate Bear” and chart a path in between the two? Well, certainly you want to avoid either of these two extremes.

But in addition to that, you probably should be Curious Bear or Creative Bear, the kind of bear that fashions new things to see what might come of them, regardless of what the data say. If you're purely driven by spreadsheets, provable outcomes, and "what's best for the shareholders," your output is bound to be less interesting. (See If Steve Ballmer Ran Apple.) And we see the end result of that all around us. It's why, despite not being an Apple guy either, I have been somewhat intrigued by the Microsoft Surface tablet but never bothered to actually buy one. If someone didn't demonstrably pour some passion into the conception and development of the product, then why would I line up to buy it? Probably, I eventually will. Maybe. (Is that level of consumer intent even worth testing around?)