Big Data and Personality

Andrew McAfee posted about a very intriguing study on how personality, gender, and age relate to language. In essence, the study looked at the correlation between people’s Facebook statuses and their personality, gender, and age.

You’ll know why I say it’s intriguing when you take a look at some of the findings. Especially interesting are the word maps.

Here’s one showing the words used by people according to their extraversion/introversion and their emotional stability (i.e. personality). Neurotic people are sad, angry, and existential. Emotionally stable people are… hmm… outdoorsy/active? As McAfee mentioned in his post, it’s an interesting correlation between the sorts of activity and emotional stability, but one where cause and effect are difficult to determine. Does physical activity lead to a more emotionally stable personality, or do emotionally stable people just tend towards physical activity?

Image of Facebook status updates by personality

I’m pretty much a 60/40 introvert (60% introvert, 40% extravert), so I’m always intrigued by studies on introversion, and I just couldn’t ignore the huge “anime” (and its related terms, like “Pokemon”) popping up in the introversion word map. I do wonder how much of a part cultural influence (i.e. a person’s country of origin/residence) plays. And did you notice the number of emoticons in that map? Me too 🙂

And here’s the word map for males vs. females. I love this one. Seems like the biggest things on females’ minds are shopping and relationships, while for males it’s all about sex and games. As McAfee mentions on his blog, this “does not reflect well at all on my gender”.

Image of Facebook status updates by gender

And here’s one for age. My guess as to why daughters are talked about more among the 30s to 65s is that women are the ones talking about them (men just talk about sex and sports). In the gender map, relationships dominate what women talk about (apart from chocolate and shopping), and from my experience of watching TV, women don’t really talk about sons because sons pretty much take care of themselves. Daughters, on the other hand, are always worth worrying about.

Image of Facebook status updates by age

I could imagine fiction writers using these to build character dialogues; or academics building ever more insightful anthropological maps; or marketers with targeted campaigns. It’s a really imaginative use of big data, and one that I think is brilliant.

Who says Big Data’s failed?

Risk vs. Uncertainty (Part I)

I can’t believe I didn’t write about it before today: the difference between uncertainty and risk.

I’d originally thought that uncertainty and risk were one and the same. If you’re uncertain about something, about taking some action, and you had to decide whether or not to take that action, it was a risky action to take.

But it’s not like that.

Risk involves known odds. Known probabilities. Known possible outcomes. Uncertainty does not.

Let’s say that you have to throw a die whose outcome determines whether you live or die. If it’s four or greater you live; if it’s three or less you die. It’s a risk. But it’s not uncertain, because the odds and outcomes are known.

If you were not given the conditions under which you’d live or die, so that you don’t know which range of values determines which fate, things get pretty uncertain. You don’t know whether throwing any given number from 1 through 6 will mean you live or die. Or even whether living or dying is among the outcomes you could expect.
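To make the die example concrete, here is a minimal Python sketch. It assumes a fair six-sided die, and the function name survival_probability is made up for illustration. Under risk the rule is known, so the odds can be computed. Under uncertainty the best you can do is list what the odds would be under every rule you can imagine, and even that only covers an unknown threshold, not the case where you don't know whether living or dying is on the table at all.

```python
from fractions import Fraction

FACES = range(1, 7)  # assumption: a fair six-sided die


def survival_probability(live_threshold):
    """P(live) when any roll >= live_threshold means you live."""
    favourable = sum(1 for face in FACES if face >= live_threshold)
    return Fraction(favourable, len(FACES))


# Risk: the rule is known (four or greater means you live),
# so the odds can be worked out before you throw.
risk_odds = survival_probability(4)
print("Known rule, P(live) =", risk_odds)

# Uncertainty: the rule is unknown. A threshold of 1 means you always
# live, a threshold of 7 means you never do, and you can't tell which
# of these rules (if any) is actually in force.
possible_odds = {t: survival_probability(t) for t in range(1, 8)}
for threshold, odds in possible_odds.items():
    print(f"If the rule were 'live on {threshold}+', P(live) = {odds}")
```

With the known rule there is exactly one number to weigh up. Without it, every value from certain death to certain survival is still in play.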

To use another analogy, it’s like playing Russian Roulette without knowing how many bullets there are in the chambers and not knowing if the gun is real in the first place.

Under conditions of risk you’re making an informed decision.

Under conditions of uncertainty, however, the only thing you’re informed about is the uncertainty itself. “I know the outcome and odds are uncertain, but I’m going ahead anyway.”

Confounding and the measurement of the MRT off peak initiative

The Singapore government announced a while back that they were going to start an initiative to try to reduce peak period crowds on our public rail system or MRT (Mass Rapid Transit). The initiative involved providing free and subsidised travel for passengers on selected trips during the morning off-peak period.

This initiative kicked off two days ago. Two days on, some people are wondering if it has made any difference – trains seem as packed as they were before, and those who were already taking trains during the free travel periods have found little to no difference in the number of passengers.

But before we draw any conclusions (and aside from the fact that it’s only day two, far too early to conclude anything), we have to realise that this initiative kicked in at a time filled with confounding variables.

What we’re trying to measure here is whether the government initiative has worked by reducing peak period travel. So we’re trying to see if there’s a relationship between the [Government Initiative] and [Fewer Peak Period Passengers]; or more precisely, whether [Government Initiative] caused [Fewer Peak Period Passengers].

A confounding variable is an additional variable (one we could and would rather do without) that obscures the relationship of the variables we’re trying to measure, because its introduction impacts the end result. Let me give you an example.

Let’s say there’s a group of people who are hard of hearing. You discover that they love listening to loud music and have, in fact, done so for at least the last five years. You might conclude that listening to loud music makes you hard of hearing.

But let’s say that you then discover that this group of people all used to operate jackhammers, and were subject to loud noises for most of their working lives. Would you be as confident of your conclusion now?

What if these people were in their 80s? Would that change your mind yet again?

Loud music, operating jackhammers, and age can all contribute to hearing loss. Drawing conclusions from this group to make predictions on hearing loss is going to be tough. You just can’t quite single out one cause for hearing loss.
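The hearing loss example can be sketched as a small simulation. Every number below is invented purely for illustration: I assume jackhammer operators are more likely to love loud music, that jackhammers add 40 percentage points to the chance of hearing loss, and that loud music adds only 10. Age is left out to keep it short.

```python
import random


def simulate(n=100_000, seed=0):
    """Return (jackhammer, music, hearing_loss) tuples for n made-up people."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        jackhammer = rng.random() < 0.5
        # Assumption: jackhammer operators are likelier to love loud music.
        music = rng.random() < (0.8 if jackhammer else 0.2)
        # Assumption: both contribute, but jackhammers matter far more.
        p_loss = 0.05 + (0.40 if jackhammer else 0.0) + (0.10 if music else 0.0)
        people.append((jackhammer, music, rng.random() < p_loss))
    return people


def loss_rate(rows):
    return sum(loss for _, _, loss in rows) / len(rows)


def music_gap(rows):
    """Hearing loss rate of music lovers minus that of everyone else."""
    lovers = [r for r in rows if r[1]]
    others = [r for r in rows if not r[1]]
    return loss_rate(lovers) - loss_rate(others)


people = simulate()

# Naive comparison: ignores who operated a jackhammer.
naive_gap = music_gap(people)

# Adjusted comparison: compare like with like inside each jackhammer
# group, then average (the two groups are roughly the same size here).
adjusted_gap = sum(
    music_gap([r for r in people if r[0] == j]) for j in (True, False)
) / 2

print(f"Naive music effect:    {naive_gap:.2f}")
print(f"Adjusted music effect: {adjusted_gap:.2f}")
```

The naive gap comes out far larger than the 10 points that music was given in the setup, because the music lovers are mostly jackhammer operators too. Comparing within each jackhammer group gets back close to what was built in, but that only works because the simulation recorded the confounder.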

So, as I was saying, any analysis of the MRT rides this week is definitely going to be badly confounded by (at least) the following:

  • People working from home due to the haze;
  • Parents bringing their children overseas as it’s the last week of the school holidays;
  • Children not taking the trains because it’s the school holidays, leading to;
  • People just trying out what it’s like to travel off peak, with the new initiative; and
  • People just trying out what it’s like to travel during peak periods, with the new initiative.

Confounding in business measurement

Confounding is a terrible thing to have when you’re trying to measure cause and effect. I remember having been involved in several performance measurement initiatives, all happening at the same time, designed to improve sales numbers.

The problem with such initiatives is that you could never really know how much of an impact a particular initiative had on the overall sales results. You could know the impact of all the initiatives put together, but any single one would probably have been affected by others because, as mentioned before, they were all happening at the same time.
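Here is a toy sketch of why simultaneous initiatives can't be told apart. The initiative names and all the figures are hypothetical, and I'm assuming the effects simply add up, which real initiatives rarely do.

```python
BASELINE = 100  # hypothetical weekly sales before any initiative


def observed_sales(effects, active):
    """Sales with the initiatives in `active` running (additive assumption)."""
    return BASELINE + sum(effects[name] for name in active)


# Two very different hypothetical worlds.
world_a = {"discounts": 25, "training": 5, "new_crm": 0}
world_b = {"discounts": 0, "training": 10, "new_crm": 20}

initiatives = ["discounts", "training", "new_crm"]

# Everything launched at once: both worlds show the same sales figure,
# so the data can't tell you which world you are in.
together_a = observed_sales(world_a, initiatives)
together_b = observed_sales(world_b, initiatives)
print("All at once:", together_a, "vs", together_b)

# One initiative at a time: the individual lifts become visible.
solo_a = {name: observed_sales(world_a, [name]) - BASELINE for name in initiatives}
solo_b = {name: observed_sales(world_b, [name]) - BASELINE for name in initiatives}
print("One at a time, world A:", solo_a)
print("One at a time, world B:", solo_b)
```

In one world the discounts do nearly all the work and in the other they do nothing, yet the combined number is identical.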

It’s difficult to get management to agree to put off trying initiatives simply because you want a more accurate measurement. It’s like telling a get-rich-quick addict to try only one get-rich-quick scheme at a time to know what really works. It just doesn’t happen.

But when you don’t know what initiative works and what doesn’t, you can’t afford to drop even a single one of them. And juggling all of them can get pretty expensive.

The Reliability of Internet Marketing Research

I was doing some secondary research on the web to try to gather some statistics on small businesses and websites when I realised that there just wasn’t much reliable data around, and that the majority of the statistics on the web were referencing one another. It goes like this: an article on website A points to 50% of small businesses not having websites in 2011, a statistic it obtained from an article written in 2009 on website B, which got its information from website C, which was incidentally quoting an unconfirmed “Internet Research expert” who wrote it on some tech forum, citing some old and unconfirmed piece he remembered reading a couple of years ago.

It reminded me of an article I read on how Wikipedia is subject to this sort of self-referencing too.