I just completed the testing of a new program I wrote: 500 lines of well-commented code, which should make debugging easy if it’s ever necessary. With this program, the reports we run every morning would take 10 minutes instead of the usual hour, and would be fully automated too. Without any manual inputs, the potential failure points owing to manual input errors (so common when any sort of manual intervention is required) are no longer a threat, a large boon to data integrity.
And yet, I couldn’t quite feel at ease. Something was bothering me – I just couldn’t help thinking that it was all too automated. Too efficient. Too fragile.
Was I missing something? I thought about it a while, then realised that I’d forgotten about insuring myself against possible data problems. Would I know if the data wasn’t correct? What alternatives did I have if the program failed? Key business decisions may be made based on these reports, after all, so it wasn’t a good idea to take any chances here, even if guarding against them meant a slight hit to efficiency.
I’m going to allude to Taleb’s wonderful book Antifragile again (as I did in my previous post): the problem with lean and highly efficient systems is that they tend to be fragile, breaking easily in times of volatility (even the volatility of time itself) while simultaneously projecting an air of infallibility (the system is so efficient and wonderful that all the potential problems are ignored).
Fat-free, highly efficient systems break easily in times of volatility. Take for example a reporting system that extracts a specific piece of data from a spreadsheet on a daily basis. At its most efficient, you just take the data from a specific cell, without doing any checks or introducing any sort of flexibility. So perhaps you program the system to look for the cell that’s directly below the “Sales” header (say, for “Region X”), which is always in the second column from the left. One day, because a new sales region (“Region Y”) has been added, the cell that you’re looking for is pushed down to two rows below the “Sales” header. And since everything’s automated, you don’t realise that the number you’re now getting is for “Region Y”, not “Region X”, and you continue using the numbers as if everything was fine. If you were running the report manually, however, you’d have spotted the difference right away.
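A less fragile version of that lookup could find the row by its label instead of its position. The Python sketch below is illustrative only: the sheet layout (region names in the first column, a header row containing “Sales”), the sample figures and the function names are all assumptions made for this example, not the actual reporting system.

```python
def fragile_lookup(rows):
    """Positional lookup: the cell directly below the "Sales" header."""
    return rows[1][1]


def sales_for_region(rows, region, header="Sales"):
    """Return the value under `header` for the row labelled `region`.

    Raises ValueError instead of silently returning another region's number.
    """
    if not rows:
        raise ValueError("sheet is empty")
    header_row = rows[0]
    if header not in header_row:
        raise ValueError(f"no {header!r} column found")
    col = header_row.index(header)
    # Match on the label in the first column, not on the row's position.
    matches = [row for row in rows[1:] if row and row[0] == region]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one row for {region!r}, found {len(matches)}"
        )
    return matches[0][col]


# Hypothetical sheet before and after "Region Y" is added above "Region X".
before = [["Region", "Sales"], ["Region X", 1200]]
after = [["Region", "Sales"], ["Region Y", 300], ["Region X", 1200]]

print(fragile_lookup(after))                # 300, which is Region Y's number
print(sales_for_region(after, "Region X"))  # 1200
```

The labelled lookup costs a few extra lines, but when the layout changes in a way it doesn’t understand, it fails loudly rather than quietly reporting the wrong region.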
Or take task scheduling for example. Let’s say you have two tasks you want your computer to carry out: Task A and Task B. Task B depends on the result of Task A, so Task A has to be run first. Because you want to be as efficient as possible (i.e. completing the greatest number of tasks in the smallest amount of time), you set the starting time of Task B as close to Task A as you can (for the sake of this example, let’s assume you can’t set the scheduler to start Task B “after Task A” but have to give it a specific time).
Let’s say that after monitoring the times taken for Task A to run for a month, you find that it takes 30 minutes at most, so you schedule Task B to start 30 minutes after Task A. One day, due to an unscheduled system update halfway through Task A, Task A takes two minutes longer than usual to run (i.e. 30 + 2 minutes), causing Task B to fail since the result of Task A wasn’t ready on time. The system thus breaks because of this quest for “efficiency”. If a little redundancy had been included for “freak run times” at the expense of pure efficiency (or a less fragile process had been put in place, e.g. figuring out how to start Task B after Task A, no matter how long Task A took), it’d have been OK.
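One way to get “start Task B after Task A” even when the scheduler only accepts clock times is to have Task A leave a marker file when it finishes, and have Task B wait for that marker (up to a generous limit) before doing any work. What follows is a minimal sketch under that assumption; the marker-file convention, the function names and the default timings are hypothetical and not part of any particular scheduler.

```python
import time
from pathlib import Path


def wait_for_marker(marker, timeout_s, poll_s=5.0):
    """Return True once `marker` exists, or False if `timeout_s` elapses first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if Path(marker).exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def run_task_b(marker, task_b, timeout_s=3600, poll_s=5.0):
    """Run `task_b` only after Task A's completion marker has appeared."""
    if not wait_for_marker(marker, timeout_s, poll_s):
        # Fail loudly instead of running on incomplete input.
        raise TimeoutError(f"Task A did not finish within {timeout_s} seconds")
    return task_b()
```

Task B can still be scheduled 30 minutes after Task A; the difference is that a 32-minute run now costs two minutes of waiting instead of a failed report. Task A would need to delete any old marker when it starts and create a fresh one as its very last step.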
An unwarranted faith in, and an air of infallibility around, automated systems. When I program something to automate a process, I test it thoroughly before deployment, making sure that it works as advertised. After that, I wash my hands of the program. The reason I automated it in the first place was that I didn’t want the process to be as involved as it was. If I was going to have to be looking over the program’s shoulder (if it had one), it’d defeat the purpose of automation, wouldn’t it? So I trust the system to work, and expect it to keep working until I know of changes that might break it.
I’m not sure about other developers, but I find that I tend to place a lot of trust in automated systems because of their “no manual intervention” aspect. I find that most data discrepancies occur because of data-entry or procedural errors (i.e. missing out an action or doing things in the wrong order). As mentioned previously, when I automate something I do a lot of checks to make sure everything’s working correctly at the time of deployment. After that though, anything goes. And even though I do carry out the occasional check, these checks are random and may not happen until long after things have gone awry.
Forgetting that things can and do go wrong is one mistake we may make in designing systems. Expecting things to eventually go wrong, and planning what to do when they do, is one way we can protect ourselves against such occurrences, helping us to mitigate the risk of large negative consequences. Take for instance the example mentioned before of the additional sales region in the reporting system. One way we could protect ourselves against using the wrong data while still enjoying the benefits of an automated system would be to create a “checksum” of sorts (a value normally used to verify the integrity of data after transmission or storage), where we check the reported total sales number against the sum of the individual sales numbers. If there’s a discrepancy, perhaps because of a missing or additional region, we flag the report and prevent its distribution.
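A rough Python sketch of that checksum idea is given here. The function name, the tolerance and the sample figures are made up for illustration; the only thing taken from the example above is the comparison of the reported total against the sum of the regions the program knows about.

```python
import math


def check_sales_total(region_sales, reported_total, tolerance=0.01):
    """Return the summed regional sales if they match `reported_total`.

    Raises ValueError when the two differ by more than `tolerance`, so the
    caller can hold the report back instead of distributing it.
    """
    computed = math.fsum(region_sales.values())
    if abs(computed - reported_total) > tolerance:
        raise ValueError(
            f"regional sales add up to {computed}, "
            f"but the reported total is {reported_total}"
        )
    return computed


# Hypothetical figures: the program only knows about Region X, but the
# sheet's total now also includes a newly added Region Y.
extracted = {"Region X": 1200.0}
try:
    check_sales_total(extracted, reported_total=1500.0)
except ValueError as err:
    print(f"Report held back: {err}")
```

A check like this adds a little work to every run, which is exactly the kind of redundancy that trades a bit of efficiency for a system that tells you when something has changed underneath it.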
Creating great automated systems doesn’t necessarily entail making things as efficient as possible, even though efficiency might be the main benefit you were seeking in the first place. A highly efficient system that’s prone to error (and worse, errors you’re not aware of – the unknown unknowns) is worse than having no automated system at all. With such a system, you’d just be doing the wrong thing faster (like great management without an accompanying talent for leadership/vision!).
A system that can withstand shocks (one that’s “robust”) and recover stronger than before after they happen (one that’s “antifragile”) is the best system you can hope to have. Adding (some) redundancy for contingency and keeping in mind that even the best systems can fail is the key to protecting ourselves against the errors that happen. And they will happen. So, how antifragile’s your system?