Last time, we talked about certain time-honored concepts in the advertising and marketing world. I mentioned how some have been given new definitions by people who didn’t understand the original concept, or who were simply looking to put their own unique stamp on it. It would appear that’s also true of A/B testing, and it may be causing you problems when you try to implement it.
While A/B testing seems to be thought of as the exclusive domain of marketing and advertising, that’s far from true. I mean, it’s not uncommon for Hollywood to engage in its own version. We’ve all heard of “test screenings”. Studios have completely redone movie endings based on the practice. There are a few examples out there where studios essentially tested two versions at the same time. On the same night. At the same multiplex.
It goes beyond the movie world. Magazines test different covers, or even release the same issue with multiple covers. In recording studios, audio producers often engage in A/B testing by listening through a composition while enabling and disabling certain effects or filters as they go along, to hear whether a given track sounds better in the overall mix with or without the filter activated.
Restaurants use A/B testing to narrow down specific menu items. I even saw one that offered two wildly different menu descriptions of the same item. Another gave different names to two items that were identical except for a single ingredient, such as garlic.
Very simply, you’re doing A/B testing when you compare one version of something to another, and measure the results to refine the next iteration.
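To put some numbers behind “measure the results,” here’s a minimal sketch in Python. The visitor and conversion counts are invented, and the simple z-score comparison is just one common way of asking whether the difference between two versions is real or just noise.

```python
# A minimal sketch of "compare two versions and measure the results."
# The visitor and conversion counts below are invented for illustration.
from math import sqrt

def two_proportion_z(conversions_a, visitors_a, conversions_b, visitors_b):
    """Rough z-score for the difference between two conversion rates."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    return (rate_b - rate_a) / std_err

# Version A: original headline.  Version B: new headline, everything else identical.
z = two_proportion_z(conversions_a=120, visitors_a=5000,
                     conversions_b=150, visitors_b=5000)
print(f"z = {z:.2f}")  # |z| above roughly 1.96 suggests a real difference, not noise
```

With those made-up numbers, the z-score comes out a little under 1.96, the usual 95 percent threshold. In other words, Version B looks promising but hasn’t proven itself yet, and the honest move is to keep collecting data.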
Where A/B Testing Goes Wrong
I laugh when I see a guru lauding these concepts as somehow new or revolutionary business practices, when in fact the practice has been around for more than a century. If you were here last time, you might remember me mentioning the name Claude Hopkins as one of the legends I studied voraciously while I was learning to be a copywriter. Claude Hopkins was leaning on A/B testing as one of his core processes by 1910. The Hopkins model was simple: make a newspaper ad with a coupon in the bottom corner. For Version B, use the same ad but with a different headline. Run multiple tests with multiple coupons in multiple markets, and the data will lead the way to the next iteration.
Hopkins, however, knew the secret to effective testing: one change. That’s it. Sticking to a single change is what determines whether you get any benefit at all from one of the most valuable tools in history.
You used different copy for the headline in Version B, but you also made Version B’s headline red instead of black? That’s no longer A/B testing. That’s a guessing game. What if Version B comes back showing improvement? You no longer know whether it was the words or the color that made the difference. Maybe people loved the new headline text, but the red turned them off. You don’t know. And you wasted an entire test gathering useless data. A test that cost somebody money.
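If it helps to make the one-change rule concrete, here’s a hypothetical sketch in Python. It isn’t tied to any real testing tool; it just describes each version as a set of elements and refuses to run the test unless exactly one of them differs.

```python
# Hypothetical sketch: describe each version as a plain dict of elements,
# then refuse to run the test unless exactly one element differs.
# Assumes both versions list the same set of elements.
VERSION_A = {"headline": "Original headline", "headline_color": "black", "image": "hero.jpg"}
VERSION_B = {"headline": "New headline",      "headline_color": "black", "image": "hero.jpg"}

def changed_fields(a: dict, b: dict) -> list:
    """List every element whose value differs between the two versions."""
    return [field for field in a if a[field] != b[field]]

diffs = changed_fields(VERSION_A, VERSION_B)
if len(diffs) != 1:
    raise ValueError(f"Not a clean A/B test: {len(diffs)} elements differ ({diffs})")
print(f"Testing a single change: {diffs[0]}")
```

The point isn’t the code, it’s the discipline: if you can’t name the single thing that differs between the two versions, your results can’t either.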
Is It Big Enough to Warrant a Test?
The problem with many of today’s gurus is that they’ll see a suggestion like mine as “old school”. They don’t see an issue with changing the headline, changing the color, and changing multiple other elements all at once. They’ll say that something as simple as changing the color of the headline isn’t worthy of its own test. My response: if you don’t think it will have a measurable impact, then why are you changing it?
Forgive the echoes of my previous work, but consider the client who’s asked for different background music in the radio commercial I produced for them. What’s that change based on? Personal taste? Or did someone do A/B testing on versions with no music, the current music and the proposed music, to see which one performed better? And if I’m going to propose music at all, it needs to be because experience has shown me that for this specific commercial, music will work better.
Make Sure It’s a Fair Fight
You have to level the playing field as much as possible to get accurate test results. I saw a comedian once who was A/B testing different punch lines for a reference in one of his jokes. He was trying to illustrate that the person sitting in front of him at a Yankees game was wearing a big hat. In one version he said the hat was “twice the size of Manhattan”. In another, he said the hat was “twice the size of Schenectady”, a small city northwest of Albany. But for that to be an effective test, he had to try the line out in front of big crowds and small crowds, lively crowds and hostile crowds, after a joke that had done well and after one that had bombed. (And yes, there are comedians who get this granular about their material.)
After a ton of performances, he settled on the Schenectady version because it got the better response. But there was another possibility: as many comedians will tell you, punch-line words that contain a ‘K’ sound are perceived as funnier. (A Google search on the subject makes for fascinating reading.) That raised a different question: what if he replaced ‘Schenectady’ with the name of another New York locale that also has a ‘K’ sound, like Yonkers? And another round of A/B testing begins.
In the web world, it’s the same. Testing a new website design needs to give every version an equal chance of success across browsers, operating systems, connection speeds, geographies, languages, screen sizes, and dozens of other variables.
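One common way to keep that fight fair is to assign each visitor to a version by hashing a stable visitor ID, so the split has nothing to do with their browser, device, or location. The sketch below, in Python, is an illustration under my own assumptions (the experiment name and the 50/50 split are made up), not a prescription for any particular tool.

```python
# Hypothetical sketch: split visitors 50/50 by hashing a stable visitor ID,
# so assignment is independent of browser, OS, location, screen size, etc.
# The experiment name and the 50/50 split are illustrative assumptions.
import hashlib

def assign_version(visitor_id: str, experiment: str = "homepage-redesign") -> str:
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100          # a stable number from 0 to 99
    return "A" if bucket < 50 else "B"      # same visitor always gets the same version

print(assign_version("visitor-12345"))
```

Because the assignment depends only on the visitor ID, every browser, operating system, and region lands in both versions in roughly the same proportion, and a returning visitor always sees the same one.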
A lot of data is a good thing. A lot of random data that isn’t granular enough to be crystal-clear is a waste of time.
Sorry if wanting data that’s actually useful is too “old-school” of me.