Getting Real by 37Signals

In what must be one of the most serendipitous moments of my recent life, I hit upon 37Signals’ Getting Real TOC page after searching for a string of text that randomly entered my mind. (The whole book’s available online for free? Wow!)

I’d first read Getting Real in its physical form (borrowed from a local library) and liked it so much I even quoted it in one of my earlier posts on making plans.

Though largely to do with software development and start-ups, I find its lessons thoroughly transferable to plenty of other domains in life.

Building an Antifragile System

I just completed testing a new program I wrote: 500 lines of well-commented code, making debugging easy if necessary. With this program, the reports we run every morning would take 10 minutes instead of the usual hour, fully automated too. Without any manual inputs, the potential failure points owing to manual input errors (so common when any sort of manual intervention is required) are no longer a threat, a large boon to data integrity.

And yet, I couldn’t quite feel at ease. Something was bothering me: I just couldn’t help thinking that it was all too automated. Too efficient. Too fragile.

Was I missing something? I thought about it a while, then realised that I’d forgotten to insure myself against possible data problems. Would I know if the data wasn’t correct? What alternatives did I have if the program failed? Key business decisions may be made based on these reports, after all, so it wasn’t a good idea to take any chances here, even at a slight hit to efficiency.

I’m going to allude to Taleb’s wonderful book Antifragile again (as I did in my previous post): the problem with lean and highly efficient systems is that they tend to be fragile, breaking easily in times of volatility (even the volatility of time itself) while simultaneously portraying an air of infallibility (they’re so efficient and wonderful that all the potential problems are ignored).

Fat-free, highly efficient systems break easily in times of volatility. Take, for example, a reporting system that extracts a specific piece of data from a spreadsheet on a daily basis. At its most efficient, you just take the data from a specific cell, without doing any checks or introducing any sort of flexibility. So perhaps you program the system to look for the cell that’s directly below the “Sales” header (say, for “Region X”), which is always in the second column from the left. One day, because a new sales region (“Region Y”) has been added, the cell that you’re looking for gets pushed to two rows below the “Sales” header. And since everything’s automated, you don’t realise that the number you’re now getting is for “Region Y”, not “Region X”, and you continue using the numbers as if everything were fine. If you were manually running the report, however, you’d have spotted the difference right away.
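To make the contrast concrete, here’s a minimal sketch in Python of the two approaches, assuming the openpyxl library and a hypothetical report.xlsx whose first column holds region labels and whose second column holds the sales figures (the file name and layout are invented for illustration):

```python
from openpyxl import load_workbook

ws = load_workbook("report.xlsx").active  # hypothetical daily report file

# Fragile: hard-code the cell position. This works until a new region
# shifts the rows, at which point you silently read the wrong number.
region_x_sales = ws["B2"].value

# Less fragile: find the row by its label rather than its position,
# and fail loudly if the label can't be found.
def sales_for(region, label_col=1, value_col=2):
    for row in ws.iter_rows(min_col=label_col, max_col=value_col):
        if row[0].value == region:
            return row[value_col - label_col].value
    raise LookupError(f"No row labelled {region!r}; has the layout changed?")

region_x_sales = sales_for("Region X")
```

The label lookup trades a few extra lines for a system that either returns the right number or complains, instead of quietly returning the wrong one.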

Or take task scheduling, for example. Let’s say you have two tasks you want your computer to carry out: Task A and Task B. Task B depends on the result of Task A, so Task A has to run first. Because you want to be as efficient as possible (i.e. completing the greatest number of tasks in the smallest amount of time), you set the starting time of Task B as close to Task A’s as you can (for the sake of this example, let’s assume you can’t set the scheduler to start Task B “after Task A” but have to give it a specific time).

Let’s say that after monitoring the times taken for Task A to run for a month, you find that it takes 30 minutes at most, so you schedule Task B to start 30 minutes after Task A. One day, due to an unscheduled system update halfway through Task A, Task A takes two minutes longer than usual to run (i.e. 30 + 2 minutes), causing Task B to fail since Task A’s output wasn’t ready on time. The system thus breaks because of this quest for “efficiency”. If a little redundancy had been included for “freak run times” at the expense of pure efficiency (or a less fragile process put in place, e.g. figuring out how to start Task B only after Task A completes, no matter how long Task A takes), it’d have been OK.
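One common way to get “start Task B after Task A” out of a scheduler that only accepts fixed start times is a sentinel file: Task A writes a small flag file as its final step, and Task B polls for it before doing any real work. A sketch in Python, with all names invented for illustration:

```python
import os
import time

SENTINEL = "task_a_done.flag"   # written by Task A as its very last step (made-up name)
POLL_SECONDS = 60
GIVE_UP_AFTER = 4 * 60 * 60     # fail loudly after 4 hours instead of hanging forever

def run_task_b():
    # Placeholder for Task B's real work on Task A's output.
    print("Running Task B")

waited = 0
while not os.path.exists(SENTINEL):
    if waited >= GIVE_UP_AFTER:
        raise TimeoutError("Task A never signalled completion; check it before running Task B")
    time.sleep(POLL_SECONDS)
    waited += POLL_SECONDS

run_task_b()
```

The polling loop costs a little efficiency, but Task B now starts whether Task A takes 10 minutes or 45, and it fails loudly rather than silently if Task A never finishes.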

Then there’s the unwarranted faith in, and air of infallibility around, automated systems. When I program something to automate a process, I test it thoroughly before deployment, making sure that it works as advertised. After that, I wash my hands of the program. The reason I automated it in the first place was that I didn’t want the process to be as involved as it was. If I were going to have to look over the program’s shoulder (i.e. if it had one), it’d defeat the purpose of automation, wouldn’t it? So I trust the system to work, and expect it to work until I know of changes that might break it.

I’m not sure about other developers, but I find that I tend to place a lot of trust in automated systems precisely because of the “no manual intervention” aspect. Most data discrepancies, in my experience, occur because of data-entry or procedural errors (i.e. skipping an action or doing things in the wrong order). As mentioned previously, when I automate something I do a lot of checks to make sure everything’s working correctly at the time of deployment. After that, though, anything goes. And even though I do carry out the occasional random check, those checks are random and may not happen until long after things have gone awry.

Forgetting that things can and do go wrong is one mistake we make in designing systems. Expecting things to eventually go wrong, and deciding in advance what to do when they do, is one way we can protect ourselves against such occurrences, mitigating the risk of large negative consequences. Take, for instance, the earlier example of the additional sales region in the reporting system. One way we could protect ourselves against using the wrong data while still enjoying the benefits of an automated system would be to create a “checksum” of sorts (as used to ensure the integrity of data after transmission or storage), where we check the reported total sales figure against the sum of the individual regional sales figures. If there’s a discrepancy, perhaps because of a missing or additional region, we flag it and prevent the report’s distribution.
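A minimal sketch of that check in Python, with illustrative numbers (the figures and names are made up):

```python
# Illustrative numbers only: the individually extracted regional figures
# must reconcile with the separately extracted grand total before the
# report is allowed out the door.
region_sales = {"Region X": 120_000, "Region Y": 80_000}
reported_total = 200_000

def validate(regions, total, tolerance=0.01):
    """Raise if the regional figures don't add up to the reported total."""
    regional_sum = sum(regions.values())
    if abs(regional_sum - total) > tolerance:
        raise ValueError(
            f"Checksum failed: regions sum to {regional_sum}, "
            f"report total is {total}; a region may be missing or extra"
        )

validate(region_sales, reported_total)  # silent on success, raises on a mismatch
```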

Creating great automated systems doesn’t necessarily entail making things as efficient as possible, even though efficiency might be the main benefit you were seeking in the first place. A highly efficient system that’s prone to error (and worse, errors you’re not aware of – the unknown unknowns) is worse than having no automated system at all. With such a system, you’d just be doing the wrong thing faster (like great management without an accompanying talent for leadership/vision!).

A system that can withstand shocks (one that’s “robust”), and that recovers stronger than before when they happen (one that’s “antifragile”), is the best system you can hope to have. Adding (some) redundancy for contingency and keeping in mind that even the best systems can fail is the key to protecting ourselves against the errors that happen. And they will happen. So, how antifragile’s your system?

On antifragility and new stuff

Taleb once again scores with me, this time with his book on “antifragility”. Like his books on randomness and black swans, this one has opened my mind to a concept that I’ve intuitively felt but never been able to put into words.

I wrote once about “destroying things” to love them more – making new things old because of the transience of “newness” and the lasting hold of “oldness”. I never realised that it could have been antifragility at work.

An object, when new, when perfect, is at its most fragile. At any moment a small bit of entropy – a scratch, a bump, or even the controlled might of time – might cause it to be no longer new; no longer perfect. The harder you try to keep it in pristine condition, the worse you’ll feel when it’s finally imperfect. Its value drops precipitously, and all the effort spent maintaining it goes to waste.

But the moment an object is old, the focus is no longer on its newness. It becomes more robust — a small scratch on an already scratched object brings no harm. And it might even be brought into the realm of the antifragile — where a scratch could bring along positive associations like memories or good feelings, making it better than it was before.

Analytics Adoption: Evolutionary vs. Revolutionary Technology

In this post about analytics adoption, I’d like to start with a short story.

The wife and I each got ourselves a Samsung Galaxy S4 over the weekend. Though it’s a great phone, we couldn’t help but feel that there was a distinct lack of a “wow” factor.

We both moved to the S4 from the S2. Back in its day (about two years ago) the S2 was the latest and the greatest, and though technology has come some way since then, it’s still a very capable phone. I remember when we got the S2… boy, did we feel like country bumpkins moving into the city. Everything was wow, wow, wow.

Even I, a self-professed can’t-go-a-day-without-the-computer nut, could go a day (a day! can you imagine??) without touching my computer, because everything I needed to do on it I could do on the phone. It was amazing to be able to send free text messages, and to access e-mail and Facebook on the go. I didn’t know what I was missing until I tasted the data-plan-backed mobile life.

But having had such a capable phone already, the S4 comes to us as an evolutionary, not revolutionary, move. Sure, things are snappier, brighter, faster, and larger. But that’s about all they are. There hasn’t been the same habit-changing killer app.

Evolutionary vs. Revolutionary Technology

Now, the S4 may be an evolutionary technology step for me, but for plenty of people who haven’t yet made the move to a relatively capable handset like the S2, it could well be a revolutionary one. The thing is, where a user is in the technology-adoption lifecycle makes a big difference to how that technology is perceived.

I would imagine that the biggest winners in analytics ROI (i.e. dollars returned per dollar invested) may well be those who have been avoiding analytics thus far. Because there’s such a huge gap between no analytics and some analytics, even the smallest investments could give huge returns. (Whether it scales or not is another story for another day.)

And with the recent improvements in analytical processes/methodology, software, and thinking (a very important point here), things are far cheaper — and not just in terms of money, but of time, executive buy-in, and ease of adoption.

User requests are like hunger pangs

“So, when is the [request] going to be ready?” he asks me, the fourth person to ask in a one-week period.

This, I think to myself, is probably real hunger.

“I’m working on it,” I reply, which means I’m waiting it out to determine how important the request really is. The moment I can confidently say it’s a valid “need”, a real hunger, I move it into my high-priority queue and start work on it.

It’s not that I don’t wish to help, but system/application/report requests have a tendency to come in hugely inflated, seemingly much more important than they really are. More a reaction to an itch than the life-saving need they’re thought to be.

I like to think of requests that come into my queue as a type of hunger. There is real hunger: the haven’t-eaten-for-days-and-starving hunger; and then there’s perceived hunger: the after-dinner craving for Pringles hunger.

When a request is of the “real” hunger variety, no matter how long you try to wait it out, it’ll always be there (and the people requesting it won’t let you forget it’s there!).

“Perceived” hunger requests, on the other hand, tend to go away like after-dinner cravings when you give them a little time.

One problem with giving in to these “perceived” hunger requests is that, like the aforementioned Pringles, “once you pop, you can’t stop” – these sorts of requests tend to come one after another. And it’s difficult to know when to say no, because each request isn’t really that different from those that came before it.

A precedent, once set, can bind you to a cycle of petty requests (“why did her request go through and not mine?”) for the life of the project.

So my advice is: wait and see. If it’s really important you’ll be sure to know.

Which reminds me, it’s time for supper.