Yep, another Covid-19 post here. It’s a situation that keeps giving us new examples we can apply to how we do things once we start to return to some sort of normal.
Countries and companies are now beginning to look at returning some businesses to work, using the data to guide their decisions. Interestingly, the numbers quoted in the press, in government documents, and in other published pieces are the raw data published daily by government organizations.
I have a problem with the raw data, particularly in the Americas and Europe: it’s not right. I don’t mean that these regions are publishing misleading information; it’s just that their processes lead to bad data. For our purposes, let’s focus on new daily cases. The process is straightforward: a person is tested, the sample is submitted to a lab, a result is generated, and positive results are counted in the published daily numbers. Pretty simple.
Except it is not that simple. Numbers do not generate themselves; they require people to run the tests and submit the results, and that requires paperwork and working hours.
So what is happening? Friday, Saturday, and Sunday published numbers are naturally depressed because some of the paperwork people are taking much-needed time off; Sundays are the lowest. Mondays see the data return to normal, and Tuesdays see a catch-up from the weekend. This means the weekly peak lands on Tuesday through Thursday more often than it otherwise would, which can skew analysis.
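To make that artifact concrete, here is a minimal sketch in Python using pandas. The series is entirely synthetic (a flat 5,000 cases per day with the weekend lag baked in by hand, not real reporting data), and the specific offsets are my own illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Synthetic series (purely illustrative, not real reporting data) with the
# weekend artifact baked in: a flat 5,000 cases/day, under-reported on
# Saturday and Sunday, with the shortfall caught up the following Tuesday.
dates = pd.date_range("2020-03-02", periods=56, freq="D")  # 8 full weeks
counts = np.full(len(dates), 5000.0)
weekday = dates.dayofweek.to_numpy()  # Monday=0 ... Sunday=6
counts[weekday == 5] -= 2000  # Saturday: paperwork people are off
counts[weekday == 6] -= 3000  # Sunday: lowest of the week
counts[weekday == 1] += 5000  # Tuesday: the weekend backlog gets reported
cases = pd.Series(counts, index=dates)

# If reporting were steady, the mean per weekday would be flat. A weekend
# dip plus a Tuesday spike is the signature of reporting lag, not of the
# epidemic itself.
print(cases.groupby(dates.day_name()).mean())
```

A weekday-level average like this is a quick sanity check you can run on any daily-reported series before trusting its peaks and troughs.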
France is a great example of this. During its peak, the daily numbers of new reported cases were all over the place: days of 2,000, then 7,000, then 5,000, with one day of 23,000 thrown in that caught everything up and corrected prior days. At no point was any single day actually 12,000 new cases higher than the days around it; it was just a quirk in the reporting. But that number then gets used as if it were real. Smooth things out with a 7-day rolling average and the peak becomes ~8,400 daily new cases.
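The smoothing itself is a one-liner. Here is a minimal sketch, again in pandas; the daily figures below are illustrative, loosely echoing the uneven French numbers above rather than the actual reported series:

```python
import pandas as pd

# Illustrative daily new-case counts (not the actual French series): uneven
# day-to-day reporting with one large catch-up day mixed in.
daily = pd.Series([2000, 7000, 5000, 4000, 3000, 23000, 6000,
                   5000, 9000, 7000, 8000, 6000, 10000, 7500])

# A trailing 7-day rolling mean averages the past week of reports, so
# weekend dips and catch-up spikes largely cancel out.
smoothed = daily.rolling(window=7).mean()
print(pd.DataFrame({"raw": daily, "7-day avg": smoothed}))
```

A trailing window is what most dashboards use, since a centered window can’t be computed for the most recent three days, which are exactly the days people care about.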
There are many instances where it makes sense to use raw data. Anytime a process generates consistent and predictable data, the raw data is going to be extremely useful. Anytime the data is sensitive to small units of time (typically an hour or less), you are almost forced to use raw data. Financial analysis will always start from the raw data so that all the pennies can be added up appropriately. Using raw data also gives you cover: no one can question the numbers you are using.
But then there are times when raw data is not useful. This is especially the case when the “raw” data is actually an aggregate of aggregates, each assembled under processes with varying controls and rules. In reality the data is already skewed; it’s just not obvious to a casual observer.
Everyone knows the rule of “garbage in, garbage out.” What people struggle to understand is the definition of garbage, which is not a yes/no question. Many treat data that comes from the source as somehow pure because it started in the right place. But there is no such thing as perfect data. All data has, at a minimum, momentary blips, quirks of the system, built-in human behavior, seasonality, or some other facet that gives it a unique personality. Controlling for these personality traits makes an analysis stronger, not weaker.