Last week, I wrote an entire post on how Covid-19 was a great case study in the risks of bad data. It never really sat well with me once I had it in the can. The message was correct but it wasn’t right. The gist of it was that the data being thrown around in the media and on social media is known to be bad, but it’s the only data available, so people are stretching it as far as they can.
This is happening, without a doubt. Even so, the message doesn’t feel complete, because there are some very thoughtful people talking about the data issues daily. And realistically, bad data is the norm in most businesses, not the exception. Why call out bad data just because?
The reality is that bad data can still have good qualities because at least it is data. Let’s take the testing numbers in the US for Covid-19 as a great example of this. It’s a fairly established fact that testing started after the virus had already begun spreading through communities, and that testing volumes were never able to catch up enough to determine the real scope and scale of infection. Many have taken this to mean the data is crap and should be ignored.
Similarly, global death rates from the virus are notoriously bad. If you include the outliers just to get a range, they could be anywhere from 0.25% to 4%. Because we can’t nail it down, it gets written off as simply bad data.
However, the data is actually still really useful as a view into the future of where deaths will occur. Taken together, the two pieces of data above tell us one thing about EVERY country that has had infections break out: ~4% of confirmed positive tests will become fatalities, with fatalities trailing positive tests by about a week. Where there are positive tests, fatalities follow in fairly consistent numbers.
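To make that rule of thumb concrete, here is a minimal sketch in Python. It assumes the ~4% share and the roughly one-week lag are applied to each day's new confirmed positives, which is one reading of the rule rather than a fitted model. The function name `project_fatalities` and the daily counts are made up purely for illustration and are not real figures.

```python
from datetime import date, timedelta

# Both values are the rough figures from this post, not fitted parameters.
FATALITY_SHARE = 0.04       # ~4% of confirmed positive tests
LAG = timedelta(days=7)     # fatalities trail positive tests by about a week


def project_fatalities(new_positives):
    """Map each day's new confirmed positives to expected fatalities one lag later."""
    return {day + LAG: count * FATALITY_SHARE for day, count in new_positives.items()}


# Hypothetical daily counts of new confirmed positives (illustrative only).
new_positives = {
    date(2020, 3, 20): 1000,
    date(2020, 3, 21): 1500,
    date(2020, 3, 22): 2500,
}

projected = project_fatalities(new_positives)
for day, expected in sorted(projected.items()):
    print(f"{day}: ~{expected:.0f} expected fatalities")
```

Note how narrow this is: it says nothing about how many people are actually infected, only about what tends to follow the positives that did get counted.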
Two pieces of generally bad data come together to be highly predictive, but only if applied narrowly and correctly. Applied poorly, they can lead you astray quickly.
Data is data; it’s just a matter of how you choose to use it.