On the perils of crying Wolf!

I sporadically hear those who create software expressing exasperation at the stupidity of users – and am generally inclined to stand up for the user. The chief reason is that, having worked in the software industry for a quarter of a century, I'm acutely aware that software (aside from generally being much harder to use than its authors realize) routinely expects users to think like programmers – and the programmer's mindset is, frankly, rare.

However, one particular kind of alleged stupidity brings out another reason for defending the much-maligned user: the software professional wonders how anyone could be so stupid as to go ahead with some action despite having been warned about the possibility of the horrible things that then ensued. The cause, usually, is that software is like the little boy in the fable, who habitually cried Wolf! for a lark and so was ignored when he came to report an actual wolf – software all too often pesters the user with confirmation dialogs warning of dire consequences, which the user soon learns to dismiss without thought.

Software professionals (whether programmers or support staff) marvel at the stupidity of dismissing dire warnings but fail to appreciate that nearly every time the user dismisses such a warning, no dire consequences follow. That is partly because modern purveyors of malware have learned to be less obtrusive than the attention-seeking script kiddies of the 90s, so that bad things happen without the user noticing (at least at the time); but it's also because programmers commonly make a crucial mistake in deciding when to hassle the user with a warning.

To take an example, many warnings relate to security issues; the innocent user browsing the internet can all too easily be tricked into doing unwise things, so the considerate programmer – on discovering a class of unwise thing the user might be tricked into doing – endeavours to devise some way for the program to test whether what the user is doing might fall into that class of folly. If the test usually spots the instances when the user is making a mistake and seldom hassles the user otherwise, the programmer uses the test to decide when to warn the user. This is indeed the right thing to do – provided that 'seldom' really is rare enough.

The problem lies in judging how rare is rare enough. If nine out of ten foolish acts trigger the warning and only one in ten non-foolish acts does so, all too often the programmer decides that's rare enough – and this is how we have trained users to dismiss warnings without reading them, let alone thinking about them. It's not good enough for the warning to fire less often in situations that don't need it than in situations that do; unwarranted warnings must be rare among the warnings the user actually sees – and situations that don't warrant a warning are far more common than situations that do.
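
For concreteness, here's that arithmetic as a quick sketch in Python. The nine-in-ten and one-in-ten figures come from the paragraph above; the one-in-a-hundred rate of genuinely foolish acts is purely an illustrative assumption.

    # Illustrative sketch only: the 1-in-100 base rate of genuinely foolish
    # acts is an assumption for the sake of the example, not a measurement.
    base_rate = 0.01        # proportion of acts that really are foolish
    detect_rate = 0.9       # nine out of ten foolish acts trigger the warning
    false_alarm_rate = 0.1  # one in ten non-foolish acts triggers it anyway

    true_warnings = base_rate * detect_rate               # 0.009 per act
    false_warnings = (1 - base_rate) * false_alarm_rate   # 0.099 per act
    print(false_warnings / (false_warnings + true_warnings))  # ~0.92

So, even with a test that catches nine foolish acts in ten, better than nine warnings in ten would be unwarranted.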

Let's go back to our security warning example. There's some class of context in which the test is applied; every time the program hits such a context, it uses the test to decide whether to warn the user. Now suppose that, one time in a thousand that such a context arises, the potential threat really is there; let's be optimistic and suppose that, whenever the threat is there, the test always raises a warning; but that, one time in twenty when there's no real problem, the test raises a warning anyway. So let's consider a thousand instances of the context in which the program runs the test: one of those really did need a warning, but roughly fifty of the others get warnings too. Now look at this from the user's point of view: it's almost certain that the first few times the user meets this warning it's misguided, so the user learns to ignore it. Roughly one time in fifty that's a mistake; the other forty-nine times out of fifty it was the right thing to do.
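
Spelled out as expected counts, the same example looks like this (a sketch of the arithmetic above; every figure comes from the text):

    # Expected counts for the thousand-context example above.
    contexts = 1000
    threats = contexts * (1 / 1000)   # one context in a thousand is a real threat
    harmless = contexts - threats     # the other 999 are harmless

    true_warnings = threats * 1.0         # the test never misses a real threat
    false_warnings = harmless * (1 / 20)  # one harmless context in twenty warns anyway

    print(true_warnings, false_warnings)  # 1.0 true warning, ~50 false ones
    print(true_warnings / (true_warnings + false_warnings))  # ~0.02: about 1 warning in 50 is real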

What this means is that, to judge whether the test is good enough to be used to trigger a warning, you need to assess what proportion of the occasions on which the test will be invoked really do involve a problem. When that proportion is small, it's not enough for the test to warn less often in harmless contexts than in dangerous ones: the ratio between those two warning rates must be small even compared to that proportion. In the example above, if the truly scary situations are one in a thousand and the test always detects them, then it can't afford to warn about the safe (enough) situations even as often as one time in a thousand – that's the point at which half the warnings the user sees are irrelevant.
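
In symbols, under the same assumptions: with a base rate p of real threats, a detection rate of one, and a false-alarm rate f on the harmless contexts, each context yields on average p true warnings and (1 − p)·f false ones, so the two match when f = p/(1 − p), which for small p is almost exactly p. A tiny helper makes that break-even point explicit; this is a sketch, not a recipe:

    # Break-even false-alarm rate: where false warnings equal true warnings,
    # i.e. half of all warnings shown are irrelevant. Assumes the test never
    # misses a real threat unless told otherwise.
    def break_even_false_alarm_rate(base_rate, detect_rate=1.0):
        """False-alarm rate at which false warnings match true warnings."""
        return base_rate * detect_rate / (1 - base_rate)

    print(break_even_false_alarm_rate(1 / 1000))  # ~0.001: about one in a thousand

For a warning the user can afford to take seriously, the false-alarm rate has to sit well below that break-even point; at one in twenty, as in the example, false warnings outnumber true ones by about fifty to one.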


Written by Eddy.