Playing Baseball without a Bat – a great example of effective statistical visualisation

Came across a very interesting and persuasive video on baseball via Kottke.org today. It’s a great example of what an interesting question, effective visualisation, and some statistical knowledge can do.

The question the video seeks to answer is the following: what would happen if baseball player Barry Bonds, who happened to play one of his greatest (if not the greatest) baseball seasons ever in 2004, played without a baseball bat?

I’m not a baseball fan, and frankly quite a number of the things that were mentioned in the video were lost on me. But I’m a fan of interesting statistics and great visualisations, and this definitely had both.

And despite having a few doubts about its conclusion (the results seem too good to be true – watch to the end!), it is convincing and definitely worth a watch if you’re either into baseball or statistical visualisations.

Business Experimentation

Imagine for a moment that you want to implement a new sales initiative that you think will transform your business. The problem is, you’re not too sure if it’d work.

You decide, prudently, that maybe a pilot test would be good: let’s roll out the initiative to just a small subset of the company, the pilot group, and see how it performs.

If it performs well, great, we roll it out to the rest of the company. If it performs badly, no drama – we simply stop the initiative at the pilot stage and don’t roll it out to the rest of the company. The cost of the pilot would be negligible compared to the full implementation.

After consulting with your team, you decide that your pilot group would be based on geography. You pick a region you know well, with relatively homogeneous customers who are extremely receptive to your idea.

You bring your idea to your boss, who likes it and agrees to be the project sponsor. However, he tells you in no uncertain terms that in order for the initiative to go beyond a pilot, you need to show conclusively that it has a positive sales impact. You have no doubt that it does, and you readily agree, “of course!”

Knowing that measurement is a little outside your area of expertise, you consult your resident data scientist on the best way to “show conclusively” that your idea works. He advises you that the best way to do that would be through an A/B test.

“Split the customers in your pilot group, the region you’ve picked, randomly into two,” your data scientist says. “Let one group be the ‘control’ group, on which you do nothing, and the other be the ‘test’ group, on which you roll out the initiative. If your test group performs statistically better than the control group — I’ll advise you later on how to measure that — you know you’ve got a winning initiative on your hands.”
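
(As an aside: a minimal sketch of the kind of comparison the data scientist is describing might look like the following. The group sizes, sales numbers, and the choice of a two-sample t-test are my own illustration, not anything prescribed here.)

```python
# Hypothetical illustration of the A/B comparison described above.
# All numbers are simulated; swap in real per-customer sales figures.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated monthly sales per customer (in $) for the two randomly assigned groups
control_sales = rng.normal(loc=1000, scale=200, size=500)  # no initiative
test_sales = rng.normal(loc=1050, scale=200, size=500)     # initiative rolled out

# Welch's two-sample t-test: is the difference in means unlikely to be chance?
t_stat, p_value = stats.ttest_ind(test_sales, control_sales, equal_var=False)

print(f"Average uplift per customer: ${test_sales.mean() - control_sales.mean():.0f}")
print(f"p-value: {p_value:.4f}")  # conventionally, p < 0.05 counts as "statistically better"
```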

You think about it, but have your doubts. “But,” you say, “wouldn’t that mean that I would only impact a portion of the pilot group? I can’t afford to potentially lose out on any sales – can’t I roll it out to the whole region and have some other group, outside the pilot, be the control?”

Your data scientist thinks about it for a moment, but doesn’t look convinced.

“You can, but it wouldn’t strictly be A/B testing if you were to do that. Your pilot group was based on geography. Customers in other geographies won’t have exactly the same characteristics as customers in your pilot geography. If they were to perform differently, it could be down to a host of other factors: environmental differences, cultural differences, or perhaps even differences in sales budgets.”

You’re caught in two minds. On the one hand, you want this to be scientific and prove beyond a doubt the efficacy of the initiative.

On the other hand, an initiative that brings in an additional $2 million in revenue looks better than one that brings in an additional $1.5 million because part of the region was held out as a control group you couldn’t impact.

Why would you want to lose $500,000 when you know your idea works?

What do you do?

A Culture of Experimentation

Without a culture of experimentation, it’s extremely difficult for me to recommend that you actually stick by the principles of proper experimentation and go for the rigorous A/B route. There’s a real agency problem here.

You, as the originator of the idea, have a stake in trying to make sure the idea works. Even though it’d have just been a pilot, having it fail means you’d have wasted time and resources. Your credibility might take a hit. In a way, you don’t want to rigorously test your idea if you don’t have to. You just want to show it works.

Even if it means an ineffective idea is stopped before more funds are channeled to an ultimately worthless cause, there really is no benefit for you. Good for the company; bad for you.

In the end, I think it takes a very confident leader to go through with the proper A/B testing route, especially in a culture not used to proper experimentation. It’s simply not easy to walk away from potential revenue gains by holding out a control group, or to scrap a project because of poor results in the pilot phase.

But it is the leader who rigorously tests his or her ideas, who boldly assumes and cautiously validates, who will earn the respect of those around them. In the long run, it is this leader who will not be busy fighting fires, attempting to save doomed-to-fail initiatives.

Without these low-value initiatives on this leader’s plate, there will be more resources that can be channeled to more promising ventures. It is this leader who will catch the Black Swans, projects with massive impacts.

I leave you with a passage from a Harvard Business Review article I really enjoyed, The Discipline of Business Experimentation. It’s a great example of a business actually following through and scrapping an initiative after a business experiment produced poor results:

When Kohl’s was considering adding a new product category, furniture, many executives were tremendously enthusiastic, anticipating significant additional revenue. A test at 70 stores over six months, however, showed a net decrease in revenue. Products that now had less floor space (to make room for the furniture) experienced a drop in sales, and Kohl’s was actually losing customers overall. Those negative results were a huge disappointment for those who had advocated for the initiative, but the program was nevertheless scrapped. The Kohl’s example highlights the fact that experiments are often needed to perform objective assessments of initiatives backed by people with organizational clout.

Can you imagine if they decided not to do a proper test?

What if they thought, “let’s not waste time; if we don’t get on the furniture bandwagon now our competitors are going to eat us alive!” and jumped in with both feet, skipping the “testing” phase?

Or what if the person who proposed the idea felt threatened that, should the initiative fail, it would make him or her look bad, and decided to cherry-pick examples of stores for which it worked well? (An only too real and too frequent possibility when companies don’t conduct proper experiments.)

It would, I have little doubt, have led to very poor results.

And now imagine if this happened with every single initiative the company came up with, large or small. No tests, just straight from dream to reality.

Disastrous.

But unfortunately, in so many companies, that’s just the case.

My thoughts on (sales) forecasting and predictive models

I need to have a data-dump on the sales forecasting process and forecasts.

On optimistic and pessimistic forecasting:

  • When forecasts are (consistently) too low: this is a well-known issue that even has a name: sandbagging. You forecast lower to temper expectations; when you then get better results than the forecast, you look like a hero.
  • When forecasts are (consistently) too high: a quick Google search shows that this is almost as prevalent as sandbagging. It seems salespeople are by nature over-optimistic about their chances of closing deals. My question, though: if you consistently fail to deliver on those high expectations, doesn’t that dent your confidence? I’m not a salesperson, but if I were one I’d probably be a sandbagger (note: this actually reminds me of IT teams, where sandbagging is so prevalent because of the high variability of project outcomes).
  • If the above is true, that we consistently under- and over-estimate our abilities to deliver, would a range of forecasts be a better bet? But I don’t hear sales leaders saying “don’t give me a number, give me a range.”
  • Would a range solve the sandbagging and over-optimism problem? In a way, it might, since it forces an alternative view that would be hidden should only a single number hold sway.
  • Sandbaggers would be forced to say, “well yes, if EVERYTHING went to plan we might get a 20% increase in sales this month,” while the over-optimistic’ers would be forced to say, “fine, you are right that there is quite a bit of risk. A 10% drop wouldn’t be impossible.”
  • The problem with a range is that it is, well, a range. Oftentimes a single number is preferred, especially if it’s to be communicated to other parties. It’s easier to tell a story with a single number, and its precision (note that I did not say accuracy) is seductively convincing.
  • One way around this would be to explicitly ask for a range, but at the same time ask also for the “highest probability” or “expected” value. This forces thinking about the best and worst case scenarios while giving you the benefit of a single number. And if you were tracking these forecasts, you might actually find that you can systematically take the optimistic forecasts of known sandbaggers and the pessimistic forecasts of known over-optimistic’ers (see the sketch just below).
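
A rough sketch of that last idea, assuming you’ve been collecting each person’s range and you already know who tends to sandbag and who tends to over-promise. The names, ranges, and bias labels below are entirely invented:

```python
# Hypothetical sketch: leaning on the end of each forecaster's range that
# counteracts their known bias. Names, ranges, and labels are made up.
ranges = {
    # name: (pessimistic, expected, optimistic) forecast for next month, in units
    "sandbagger_sam":  (80, 100, 130),
    "optimistic_olga": (150, 200, 240),
}
known_bias = {"sandbagger_sam": "sandbagger", "optimistic_olga": "over-optimist"}

def working_number(name: str) -> float:
    """Pick the end of the range that offsets the forecaster's historical bias."""
    low, expected, high = ranges[name]
    if known_bias[name] == "sandbagger":
        return high       # take the sandbagger's optimistic number
    if known_bias[name] == "over-optimist":
        return low        # take the over-optimist's pessimistic number
    return expected

for name in ranges:
    print(name, working_number(name))
```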

On the granularity of forecasting

  • When forecasting, the more granular the forecast the more noise you’ll find. I find it easiest to think about this in terms of coin flips.
  • A fair coin gives a 50/50 chance of being heads or tails.
  • If I flipped a coin, there would be a 50/50 chance of it being heads or tails, but I couldn’t tell you with any certainty if the next flip was going to be heads or tails.
  • However, if you flipped a coin a thousand times, I could tell you with near certainty that the proportion of heads would be close to 50%, which is the nature of a fair coin.
  • But let’s say I flipped a coin ten times. Could I tell you with certainty that the number of heads would be close to 50%? Well, no.
  • With just 10 flips (or “trials”, in statistical parlance), the probability of getting exactly 5 heads is actually only about 24.6%, which means you have roughly a 75.4% chance of getting something other than 5 heads/tails.
  • As we increase the number of trials, the proportion of heads gets ever closer to 50%. Every additional trial reduces the variability, and you get closer and closer to the “nature of the coin” (see the sketch after this list).
  • In sales forecasting, there are times when you are asked to forecast very specific things, so specific in fact that you might have only 10 historical data points from which to extrapolate. But with just 10 trials, what’s the chance that those 10 reflect the “nature of the thing being predicted”?
  • From Arthur Conan Doyle’s Sherlock Holmes: “while the individual man is an insoluble puzzle, in the aggregate he becomes a mathematical certainty. You can, for example, never foretell what any one man will do, but you can say with precision what an average number will be up to. Individuals vary, but percentages remain constant.”
  • One way around this is to aggregate upwards. You can, for example, ask yourself “what category does this thing I’m trying to predict fall into?” and lump this with those other similar things in the same category.
  • Say you have 10 related products that have sold about 10 units each, similar to each other though not identical. Though you could attempt to predict them individually, the small sample sizes per product would give you so much variance your prediction would likely be not much better than chance. It would be better to group these products together into a single category and perform predictions on this larger category.
  • Variations/predictive noise at the individual product level cancel each other out, giving you a cleaner picture.
  • Though forecasting at the individual-product level feels more precise, it doesn’t add to predictive accuracy.
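
Here’s the sketch referenced above: the 24.6% figure, and how the observed proportion of heads tightens around 50% as the number of flips grows. The specific flip counts and the 45%–55% band are arbitrary choices of mine.

```python
# Sketch: exact binomial probabilities for a fair coin.
from scipy import stats

# Probability of exactly 5 heads in 10 flips
print(stats.binom.pmf(k=5, n=10, p=0.5))  # ~0.246

# Probability that the observed proportion of heads lands between 45% and 55%
for n in (10, 100, 1000, 10000):
    lo, hi = int(0.45 * n), int(0.55 * n)
    prob = stats.binom.cdf(hi, n, 0.5) - stats.binom.cdf(lo - 1, n, 0.5)
    print(f"{n:>6} flips: P(45% <= heads <= 55%) = {prob:.3f}")
```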

On Building Great Predictive Models

  • The greatest amount of time spent on developing good predictive models is often in data preparation.
  • Give me perfect data, and I could build you a great predictive model in a day.
  • A predictive model that is 80% accurate may not be “better” than a model that is 70% accurate. It all depends on the context (if this was in the business domain, we’d say it depends on the business question).
  • Let’s say I build a model that is so complex it’s impossible for all but the most technical minds to understand, or which uses a “black box” algorithm (i.e. you let the computer do its predictive thing, but you have no hope of understanding what it did, e.g. a neural network). It predicts correctly 8 out of 10 times (or 80%).
  • Concurrently, I also build a model using a simple linear regression method, which is algorithmically transparent – you know exactly what it does and how it does it, and it’s easily explainable to most laypersons. It performs a little worse than the more complex model, giving me the correct answer 7 out of 10 times (or 70%).
  • Is giving up control and understanding worth those additional 10 percentage points of accuracy? Maybe, but in a business context (as opposed to a hackathon), chances are good that after the seventh time you spend an hour explaining why the model does what it does, you’ll want to opt for the more easily understandable model at the expense of a little accuracy (a rough sketch of this trade-off follows this list).
  • Business understanding is an important aspect of model building.
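
Here’s that rough sketch. The dataset is synthetic, and the models (a small neural network standing in for the “black box”, plain logistic regression for the transparent one) are my own choices; any accuracy gap you see is illustrative, not a general truth.

```python
# Hypothetical sketch: a "black box" model vs. a transparent one on a toy dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "neural net (black box)": MLPClassifier(hidden_layer_sizes=(64, 64),
                                            max_iter=500, random_state=0),
    "logistic regression (transparent)": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: accuracy = {model.score(X_test, y_test):.2f}")

# With the transparent model you can at least point to the coefficients
# when someone asks why it predicted what it did.
print(models["logistic regression (transparent)"].coef_)
```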

Overfitting a model and “perfect” models

  • Finally, I want to talk about overfitting models. Have you heard about overfitting?
  • When we build predictive models, we build them based on past data. In machine learning we call this data “training data”, i.e. data we use to “train” our model.
  • Overfitting happens when we “train” our model so thoroughly on the training data that it becomes specific to that data and cannot generalise to predict new data.
  • I find it akin to learning a new language. Sometimes you get so fixated on the grammar and syntax and structure that you miss the woods for the trees: though your speech may be grammatically correct, it can sound awkward or unnatural (e.g. overly formal, which is often the case if we learn to speak the way we write).
  • When somebody speaks to you conversationally in a language you’re just picking up, you try to process it using your highly formalised syntax and grammar, and realise that though you know all the words individually, strung together they make as much sense as investment-linked insurance plans.
  • Overfitting often happens when we try to predict at increasingly granular levels, where the amount of data becomes too thin.
  • In the end the model becomes VERY good at predicting data very close to what was used to build it, but absolutely DISMAL at predicting anything that deviates even slightly from it (see the sketch after this list).
  • If tests show you’ve got a model performing at too-good-to-be-true levels, it probably is. Overfitted models perform very well in test environments, but very badly in production.
  • Sometimes when a model performs “badly” in a test environment, ask yourself: (1) is it performing better than chance? (2) is it performing better than the alternatives?
  • If your answer to both (1) and (2) is yes, that “bad” model is a “good” one, and should be used until a better one comes along.
  • Unless, of course, the resources it takes to carry out predictions, in terms of monetary cost, time, or both, are higher than the benefits it brings. Sometimes a simple model with above-average performance that can be run in a minute is far more valuable than one with superb predictive performance but a turnaround time longer than the time in which decisions are made.
  • I know of some people who look at predictive models and dismiss them simply because they aren’t perfect; or worse, because they’re too simple — as if being “simple” was bad.
  • But models have value as long as they perform better than the alternatives. If they’re simple, quick to run, and require no additional resources to build or maintain, all the better.
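
And the sketch referenced in the list above: a deliberately over-flexible model (a high-degree polynomial) fitted to a small, noisy dataset. The degrees, noise level, and sample size are arbitrary; the point is the gap between training and test error.

```python
# Sketch: overfitting with polynomial regression on noisy data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=80)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # The flexible model tends to show a lower training error but a higher test error.
    print(f"degree {degree:>2}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")
```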

So many ideas – have to expand on some of these one of these days.

KFC and the Representative Survey

I had KFC (Kentucky Fried Chicken) for breakfast yesterday. Chicken rice porridge and a “breakfast” wrap (that oddly enough didn’t seem to contain any chicken).

It was decent, and I liked it.

So I was quite excited when I saw that the receipt had a link to an online customer satisfaction survey, for which I would get a free piece of chicken if I completed it. It was a pretty good deal, I thought.

But I couldn’t help but wonder about how useful it was to KFC.

Surely survey responses would be heavily over-represented by people who like their food (and service, to a certain degree)? If I hated their food, and/or hated their service, and swore never to go back there again, what good would the offer of a free piece of chicken do?

These are the people you probably most want to hear from, and yet they have absolutely no incentive to complete such a survey (and in all likelihood, being normal people like us, they’d vote with their dollars and just not patronise the store again, instead of submitting feedback).

It would, in short, be far from a representative survey.
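
A quick, entirely made-up simulation of that selection effect: assume happier customers are more likely to respond (the distributions and response probabilities below are invented), and the survey average comes out well above the true average.

```python
# Hypothetical simulation of response bias in a satisfaction survey.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

true_satisfaction = rng.uniform(1, 5, size=n)          # 1 = hated it, 5 = loved it
response_prob = 0.02 + 0.10 * (true_satisfaction - 1)  # happier -> more likely to respond
responded = rng.random(n) < response_prob

print(f"True average satisfaction:        {true_satisfaction.mean():.2f}")
print(f"Average among survey respondents: {true_satisfaction[responded].mean():.2f}")
```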

I just hope that those interpreting the results, and those on the receiving end of said interpretation, understand the limitations of such a survey and discount the very likely amplified, far-too-positive results.

And if the results are lukewarm instead of three-Michelin-stars-worthy? Then oh dear.

Big Data and Personality

Andrew McAfee posted about a very intriguing study on personality, gender, and age in relation to language. In essence, the study looked at the correlations between people’s Facebook statuses and their personality, gender, and age.

You’ll know why I say it’s intriguing when you take a look at some of the findings. Especially interesting are the word maps.

Here’s one showing the words used by people by extraversion/introversion and by emotional stability (i.e. personality). Neurotic people are sad, angry, and existential. Emotionally stable people are… hmm… outdoorsy/active? As McAfee mentioned in his post, it’s an interesting correlation between those sorts of activities and emotional stability, but one for which cause and effect are difficult to determine. Does physical activity lead to a more emotionally stable personality, or do emotionally stable people just tend towards physical activity?

[Image: Facebook status updates by personality]

I’m pretty much a 60/40 introvert (60% introvert, 40% extravert), so I’m always intrigued by studies on introversion, and I just couldn’t ignore the huge “anime” (and its related terms, like “Pokemon”) popping up in the introversion word map. I do wonder how much of a part cultural influence (i.e. a person’s country of origin/residence) plays. And did you notice the number of emoticons in that map? Me too :)

And here’s the word map for males vs. females. I love this one. It seems the biggest things on females’ minds are shopping and relationships, while for males it’s all about sex and games. As McAfee mentions on his blog, this “does not reflect well at all on my gender”.

[Image: Facebook status updates by gender]

And here’s one for age. My guess as to why daughters are talked about more by the 30-to-65 age groups is that women are the ones talking about them (men just talk about sex and sports). In the gender map, relationships dominate what women talk about (apart from chocolate and shopping), and from my experience of TV watching, women don’t really talk about sons because sons pretty much take care of themselves. Daughters, on the other hand, are always worth worrying about.

[Image: Facebook status updates by age]

I could imagine fiction writers using these to build character dialogue; or academics building ever more insightful anthropological maps; or marketers crafting targeted campaigns. It’s a really imaginative use of big data, and one that I think is brilliant.

Who says Big Data’s failed?

Risk vs. Uncertainty (Part I)

I can’t believe I didn’t write about it before today: the difference between uncertainty and risk.

I’d originally thought that uncertainty and risk were one and the same. If you’re uncertain about something, about taking some action, and you had to decide whether or not to take that action, it was a risky action to take.

But it’s not like that.

Risk involves known odds. Known probabilities. Known possible outcomes. Uncertainty does not.

Let’s say you have to throw a die, and its outcome determines whether you live or die. If it’s four or greater you live; if it’s three or less you die. It’s a risk. But it’s not uncertain, because the odds and outcomes are known.

If you were not given the conditions under which you’d live or die, so you don’t know which values determine which fate, things get pretty uncertain. You don’t know whether throwing any number from 1 through 6 will mean you live or die. Or whether living or dying is even among the outcomes you can expect.

To use another analogy, it’s like playing Russian Roulette without knowing how many bullets there are in the chambers and not knowing if the gun is real in the first place.

Under conditions of risk you’re making an informed decision.

Under conditions of uncertainty, however, there is no informed decision; there is only the decision to act despite the overhanging uncertainty. “I know the outcomes and odds are uncertain, but I’m going ahead anyway.”