If you toss a coin ten times, what is the probability that you will get five heads and five tails? This isn’t a general question about how many heads and tails you might expect; it refers to that specific instance of the experiment: what is the probability of exactly five heads and five tails on that particular go? The answer isn’t a half; it is in fact less than half of a half, 0.246 to be precise. P(X = 5) = C(10,5) × 0.5⁵ × 0.5⁵ ≈ 0.246, for those who really wish to know.
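The calculation above can be checked in a few lines of Python (the function name here is just illustrative):

```python
from math import comb

# Probability of exactly k heads in n tosses of a coin with
# heads probability p: P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
def exact_heads_probability(n, k, p=0.5):
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(round(exact_heads_probability(10, 5), 3))  # 0.246
```

C(10,5) = 252, and 252/2¹⁰ ≈ 0.246, so the “obvious” five-and-five split happens in fewer than one run in four.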
The city of Boston has famously embraced big data as part of its ongoing program of regeneration and was highly regarded for its Street Bump initiative. The program involved smartphone users downloading an app that measured their car’s acceleration and deceleration in certain parts of the city, allowing the city to identify where potholes were occurring and repair was required. As Boston residents drove around the city, their smartphones were collecting small data, which city authorities collated into big data to keep roads smoother and safer.
The city proudly proclaims that the “data provides the City with real-time information it uses to fix problems and plan long term investments”.
Whilst the initiative is laudable, the outcome, when examined, is entirely predictable from statistical theory. Unmoderated, Street Bump strongly favours younger, more affluent areas where a greater proportion of residents own smartphones. The key insight is that the potholes detected by Street Bump-enabled smartphones are not all the potholes in the city.
This illustrates a key statistical challenge: avoiding sample bias. The other challenge is to ensure that the data set is large enough to give the experiment sufficient statistical power.
Statistical power is the probability that a statistical test will detect a difference between two values when the underlying difference is real. Going back to our coin, if we tossed it ten times it’s not inconceivable that we would get three heads and seven tails. Without considering the size of the data set and the power of the test, we might incorrectly conclude that tails is the dominant result for the coin. In the context of the A/B or multivariate tests on which many website improvement programs are based, this might lead us to recommend the “tails” option, which as we know would be incorrect.
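A small simulation makes the point about sample size concrete. This is a hedged sketch, not the test any particular platform uses: it assumes a genuinely biased coin (60% heads), applies a simple normal-approximation z-test at the 5% level, and estimates how often the test actually detects the bias at two sample sizes. All function names and numbers are illustrative.

```python
import random

def z_test_rejects(heads, n, p0=0.5, z_crit=1.96):
    # Two-sided z-test of "the coin is fair" at the 5% level.
    se = (p0 * (1 - p0) / n) ** 0.5
    z = (heads / n - p0) / se
    return abs(z) > z_crit

def estimated_power(n, true_p=0.6, trials=5000, seed=1):
    # Fraction of simulated experiments in which the real bias is detected.
    rng = random.Random(seed)
    rejections = sum(
        z_test_rejects(sum(rng.random() < true_p for _ in range(n)), n)
        for _ in range(trials)
    )
    return rejections / trials

print(f"n=10:   power = {estimated_power(10):.2f}")
print(f"n=1000: power = {estimated_power(1000):.2f}")
```

With only ten tosses the test almost never notices the bias; with a thousand it almost always does. An underpowered test that “finds nothing” tells you very little.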
The reason statistical power matters is that because we live in a random universe (we may not, but that is a discussion for another day when we have lots more time), tests will sometimes produce misleading results such as the one above. By common convention the probability of a false positive is fixed at 5%; power addresses the complementary risk of false negatives, so the closer a test’s power is to 80% or 90%, the more confident we can be that a real difference will not be missed.
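Those conventions translate directly into a required sample size before an A/B test starts. The sketch below uses the standard normal-approximation formula for comparing two proportions; the conversion rates and function name are assumptions for illustration, with z-values corresponding to a 5% two-sided significance level and 80% power.

```python
from math import ceil

def sample_size_per_variant(p_base, p_variant, z_alpha=1.96, z_power=0.84):
    # Standard two-proportion formula:
    # n = [z_a * sqrt(2*pbar*(1-pbar)) + z_b * sqrt(p1*q1 + p2*q2)]^2 / (p1-p2)^2
    p_bar = (p_base + p_variant) / 2
    effect = abs(p_variant - p_base)
    n = ((z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
          + z_power * (p_base * (1 - p_base)
                       + p_variant * (1 - p_variant)) ** 0.5)
         / effect) ** 2
    return ceil(n)

# Visitors needed per variant to detect a lift from a 5% to a 6%
# conversion rate at 80% power:
print(sample_size_per_variant(0.05, 0.06))
```

Detecting even a one-percentage-point lift demands thousands of visitors per variant, which is why scale matters so much in what follows.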
Web mega-brands such as eBay, Amazon and Google have built nearly their entire user experiences on A/B and multivariate tests, and we are right to emulate their approach to product design and improvement. It is said that we may never know what the “true” Google is because at any one time it is running up to 7,000 split tests in a bid to constantly improve and enhance life for the user. However, the great luxury these online behemoths enjoy is volume, which enables them to glean statistically meaningful insights very quickly and very regularly.
Let’s copy their focus on user behaviour and learn from their pioneering processes, but let’s remember that until we reach their scale we will have to be much more careful about avoiding statistical bias and about securing sample sizes large enough to support robust recommendations.
Veteran American sports broadcaster Vin Scully claims that “Statistics are used much like a drunk uses a lamppost: for support, not illumination”. It’s time for the UX industry to sober up and tackle lazy statistics.