I’m going to be talking a little about predictive analytics today, to give you a rough idea of what it is (and isn’t).
You might have read in the news about computers and algorithms churning out predictions of what might happen next, in fields as diverse as the financial markets and soccer betting.
You might have read how accurate (or inaccurate, as the case may be) they were, and how “analytics” (or more accurately, predictive analytics) is changing the nature of information. Analytics used to simply describe what happened (see “descriptive statistics”), but now it’s used almost as often to predict what’s going to happen (“predictive analytics”).
Perhaps you had in your mind a vision of data scientists in lab coats peering into a computer screen like a crystal ball, with the crystal ball telling them what is or is not going to happen in the future. If you did, then get a cloth and prepare to wipe that image out of your head, because other than the computer screen nothing else is true.
Predictive analytics isn’t rocket science. Not always, anyway.
The ingredients of predictive analytics
The ingredients that go into predictive analytics are quite straightforward. In most cases, what you’ll have is some historical data and a predictive model.
The historical data is often split into two: one part for building the predictive model, and the other for testing it. For example, suppose I have 100 days of sales history (my “historical data”). For the sake of simplicity, let’s assume my sales history contains just two pieces of information for each day: the number of visits to my website, and the number of widgets sold. I would split these 100 days of sales history randomly into two separate groups, one with, say, 70 records for building the model, and the remaining 30 for testing it.
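Here is a rough sketch of that random 70/30 split in Python. The sales figures are made up purely for illustration, and the function name `split_history` is my own invention, not something from a particular library.

```python
import random

def split_history(records, n_train, seed=0):
    """Randomly split records into a model-building part and a testing part."""
    shuffled = list(records)               # copy, so the original order is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    return shuffled[:n_train], shuffled[n_train:]

# Made-up sales history: one (visits, units_sold) pair per day, 100 days.
rng = random.Random(42)
history = []
for _ in range(100):
    visits = rng.randint(500, 3000)
    units = round(visits * 10.3 / 1000 + rng.uniform(-2, 2), 1)
    history.append((visits, units))

train, test = split_history(history, n_train=70)
print(len(train), len(test))  # 70 30
```

Every day ends up in exactly one of the two groups, so nothing used to build the model leaks into the test.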
Using the 70 records, I build a model that says that for every 1000 visitors, I’d sell approximately 10.3 units of my product. This is my “predictive model”. So if I had 2000 visitors, I’d sell approximately 20.6 (10.3 x 2) units; and if I had 3000 visitors I’d sell 10.3 x 3 or 30.9 units and so on.
In order to test my model, I’d run it on the 30 days of sales history I had put aside. So for each day of sales history, I’d use my formula of 10.3 units sold per 1000 visitors and compare that result against the actual sales results I had.
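To make the arithmetic concrete, here is one very simple way such a model could be built: divide total units sold by total visits. That method is my assumption for this sketch (real models are usually fitted with something more careful, such as regression), and the three training days are invented so that the rate comes out at 10.3.

```python
def fit_rate(records):
    """Estimate units sold per 1000 visitors from (visits, units_sold) records."""
    total_visits = sum(visits for visits, _ in records)
    total_units = sum(units for _, units in records)
    return 1000 * total_units / total_visits

def predict(rate_per_1000, visits):
    """Predicted units sold for a given number of visitors."""
    return rate_per_1000 * visits / 1000

# Invented training days, standing in for the 70 records.
train = [(1000, 10.0), (2000, 21.0), (3000, 30.8)]

rate = fit_rate(train)
print(round(rate, 2))                 # 10.3
print(round(predict(rate, 2000), 1))  # 20.6
print(round(predict(rate, 3000), 1))  # 30.9
```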
If I found that the model’s predicted results and what actually happened were very different, I’d know that the model needed tweaking and wasn’t suitable for real-world use (it’s a “bad fit”). On the other hand, if I found that the predicted and actual results were close, then I’d be happy to assume the model was correct and test it on current, ongoing data to see how it worked out.
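One common way to put a number on “very different” versus “close” is the average size of the miss, often called the mean absolute error. The sketch below assumes the 10.3-per-1000 model from earlier and uses three invented test days; how small the error needs to be before you trust the model is a judgment call that depends on your business.

```python
def mean_absolute_error(rate_per_1000, records):
    """Average gap between predicted and actual units sold."""
    errors = [abs(rate_per_1000 * visits / 1000 - units)
              for visits, units in records]
    return sum(errors) / len(errors)

# Invented held-out days, standing in for the 30 test records.
test_days = [(1000, 11.0), (2000, 20.0), (1500, 15.0)]

mae = mean_absolute_error(10.3, test_days)
print(round(mae, 2))  # 0.58
```

Here the model misses by a bit more than half a unit per day on average.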
You may be wondering why we don’t just use the data we built the model on to test the model. It is because we want to make sure that the model we built isn’t too specific to the data used to build it (i.e. the predictive model predicts with great accuracy the data it was built on, but not anything else). Testing the predictive model against the data that helped to build it would be inherently biased.
Think of it like the baby with a face only a mother could love, with the predictive model the baby and the dataset the mother. Just like you wouldn’t ask the baby’s mom to judge a baby contest her child was participating in, you wouldn’t want to test a predictive model against the data it was built from.
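To see how misleading that can be, here is a deliberately silly “model” that simply memorizes the days it was built on. It is an extreme case I made up for illustration, with invented numbers, but it shows why a perfect score on the building data tells you nothing.

```python
def build_memorizer(records):
    """An over-specific 'model': it just remembers each day it was built on."""
    lookup = {visits: units for visits, units in records}

    def model(visits):
        # It has nothing to say about visit counts it never saw, so it guesses 0.
        return lookup.get(visits, 0.0)

    return model

def rate_model(visits):
    """The simple 10.3-units-per-1000-visitors rule."""
    return 10.3 * visits / 1000

def mean_absolute_error(model, records):
    return sum(abs(model(v) - u) for v, u in records) / len(records)

train = [(1000, 10.0), (2000, 21.0), (3000, 30.8)]  # invented building data
test = [(1500, 15.2), (2500, 26.0)]                 # invented unseen days

memorizer = build_memorizer(train)
print(mean_absolute_error(memorizer, train))            # 0.0
print(round(mean_absolute_error(memorizer, test), 2))   # 20.6
print(round(mean_absolute_error(rate_model, test), 2))  # 0.25
```

Judged on its own building data the memorizer looks flawless; judged on days it has never seen, it is hopeless, while the plain rule holds up.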
Now that we’ve settled one half of the predictive analytics equation (i.e. the data portion), let’s get to the predictive model. You may be wondering what a predictive model is exactly. Or you may have guessed it already based on what was written above. Whatever the case, a predictive model is simply a set of rules, formulas, or algorithms that answers one question: given input [A], what will the output be?
This predictive model is something like a map. It aims to predict what will happen (the output) given a value (the input).
Let’s run with the map analogy for a bit. Let’s say that I have in my hands the perfect map (i.e. it models the real world perfectly). Using this map, I can predict that starting from where I am right now, if I walked straight for 100 meters, turned 90 degrees to my right, and walked straight for another 50 meters (the input), I should arrive at the mall (the output). And if I tested the map and actually followed its directions, I’d find the “prediction” to be right and I’d be at the mall.
But if I had in my hands an inferior map (i.e. a lousy representation of the real world) and I “asked” it what would happen if I followed the exact same directions as above (100 meters straight, turn 90 degrees right, 50 meters straight), it wouldn’t say the mall. And because it doesn’t say the mall, which so happens to be where I want to go, I “ask” the map what directions I need to take to get to the mall. The inferior map would provide some directions, but because it’s so different from the real world, even if I followed these directions to the exact millimeter I wouldn’t get there.
So the perfect predictive model will predict things to happen in the real world exactly as they will happen, given a set of inputs.
In a nutshell, that’s just what predictive analytics is: an input, a predictive model, and an output (the prediction). Though what I’ve written here is grossly simplified, it helps to have a concept in your head when you hear people talking about algorithms or computers predicting such-and-such.