We Are All Data Scientists
How the coronavirus outbreak has taught us the importance of data
The COVID-19 pandemic is unlike anything we have experienced in our lifetimes. This invisible global threat has damaged our economy and drastically changed the way we live, all in a matter of weeks.
As we were plunged into uncertainty, the general public grappled with different media sources to better their understanding. What are the reliable facts on the ground? What are the right questions to ask about those facts? How do we keep essential businesses open, and when do we reopen others? These are just some of the questions we have quickly had to find answers to.
As we are bombarded by the endless news cycle to stay on top of this awful pandemic, many concepts from the scientific process and the methods behind it have come to the forefront of our minds. From the bare basics to more difficult ideas, the general public has had to learn about data science in many different ways, or how to be a citizen data scientist. A typical data scientist’s workflow involves collecting data, analysing it in context, using it for modelling purposes and assessing how well the modelling works. With different aspects of the coronavirus explained in mass media, social media and online publications, everyone has had to become familiar with each of these steps. I will explore each in turn.
Started from the bottom, now we're here
Data collection & wrangling
aka garbage in, garbage out
Data collection is step one when it comes to data science. A lucky data scientist designs the data collection process to get “perfect” data. Most of the time, it is an imperfect process, with big gaps in the data, poor data quality, or, worse, a continuously changing process with the good intention of improving the dataset. At the best of times, establishing a sound collection and wrangling process is time-consuming. In the context of the pandemic, some numbers were needed to begin tracking, but at that early stage those numbers were difficult to find and subject to error. This is best seen in how cases were counted early on.
Questions around the reporting of daily cases and deaths were important to understanding the spread of coronavirus. Different countries have been reporting cases differently, making country-to-country comparisons difficult. Even within countries, keeping track of what the numbers actually represent is tricky and sensitive. Here in the UK, Public Health England appeared to report a surge in the total death count when it decided to start including deaths from care homes, attracting accusations of both under- and over-reporting.
And that is if we are looking only at officially reported cases. The actual number of people who have had coronavirus is likely much higher, because limited testing capacity meant a high proportion of infected cases could not be confirmed at the time. South Korea had more testing capacity than Japan and so initially appeared to have more cases despite its smaller population, as nicely visualised in this video from Vox (0:35). (Side note: this video covers a number of excellent visual analytics issues. However, we don't agree with the third issue, regarding normalising by population size. If one were to do that, we recommend using a more appropriate x-axis scale, e.g. % of population infected, as explained in the next section.) The size of the data collection can also be incredibly important. Joinzoe.com has over 4 million users (and growing) reporting their daily symptoms via a mobile app. Thanks to this effort, scientific papers such as this one in Nature argue that anosmia, a loss of smell and taste, might be the most indicative symptom of COVID-19, and it has now been included in the NHS list of symptoms.
Examining what decisions have been made when collecting data, in this case the case count and death rates, is important to understanding context. The scientific process is methodical and verifiable but that does not make it unbiased.
A picture is worth a thousand words
Visualising data to better understand it
In a matter of weeks, our lifestyle was overhauled. This was a necessary and monumental effort that would not have been possible without understanding what the consequences of carrying on as before would have been. Visualisations are a great way to instantly assimilate data without the need for a verbose explanation. Like in the meme at the top of this page, exponential growth is instantly understood, along with how quickly the coronavirus could spread without measures in place. Putting raw case projections in a linear scale as above justifies the measures the government put into place.
Animated visualisations can also be useful, like this one from the Guardian. It is quickly understood that, with social distancing in place, public transport for commuters returning to work would be reduced to only ~10% of capacity, with queues to board a train stretching back a couple of stations. Clearly, working from home will remain necessary for as long as social distancing lasts.
We have also been caught out by comparing similar-looking charts. As explained in the previous section, what data is used, and why, affects the visualisation. Eric Topol picked up on this in his tweet, where you can see that the trends in each of the graphs differ somewhat; there is no consensus.
This visualization of the cumulative case trajectory was made popular by John Burn-Murdoch of the Financial Times, and because it is so ubiquitous it warrants an in-depth look. Several key components explain why it has been used so much:
1: The y-axis has a logarithmic scale.
The slopes of the individual country/state curves represent their local epidemic growth rates. This enables us to compare countries/states that are at different stages in their outbreaks and get a view into the potential future trend of a country whose outbreak is at an earlier stage than another's. It is also much easier to visualise than on a linear scale (see the meme earlier).
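This property of the log scale can be sanity-checked in a few lines of Python (a hypothetical sketch with invented numbers, not code behind the FT chart): under constant exponential growth, the log-scale curve is a straight line whose slope encodes the growth rate.

```python
import math

# On a log scale, exponential growth is a straight line whose slope is the
# growth rate: log10(N_t) = log10(N_0) + t * log10(1 + r).
r = 0.26  # ~26% daily growth, roughly a 3-day doubling time (illustrative)
cases = [100 * (1 + r) ** t for t in range(10)]
log_cases = [math.log10(c) for c in cases]
day_to_day_slopes = [b - a for a, b in zip(log_cases, log_cases[1:])]

# Every day-to-day slope equals log10(1 + r): a straight line on a log axis.
print(all(abs(s - math.log10(1 + r)) < 1e-9 for s in day_to_day_slopes))  # → True
```

On a linear axis the same series shoots upward and hides the growth rate; on a log axis the constant slope makes it readable at a glance.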
2: The x-axis is normalized to start at the same number of cases.
All the countries/states start at their first 100 cases. This allows the countries/states to be compared in terms of their growth rates or doubling times. For example, in the graph below:
The US, UK and Brazil have similar trajectories for the first ~15 days (since 100 cases)
Brazil and the UK were at ~4 day doubling rate for these ~15 days
The US was at ~3 day doubling rate for some time beyond this - up to its first ~100K cases or ~20 days
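Doubling times like those above fall out of a one-line formula: if cases grow from N₀ to N over t days of steady exponential growth, the average doubling time is t·ln(2)/ln(N/N₀). A minimal sketch (the numbers are illustrative, not read off the chart):

```python
import math

def doubling_time(days, cases_start, cases_end):
    """Average doubling time (in days), assuming steady exponential growth."""
    return days * math.log(2) / math.log(cases_end / cases_start)

# Illustrative: growing from 100 to 12,800 cases in 21 days is seven
# doublings (12,800 / 100 = 2**7), i.e. a 3-day doubling time.
print(doubling_time(21, 100, 12_800))  # → 3.0
```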
Note that this comparison is invalid if one were to divide the cases by the population size. In that case the epidemics would be at different stages in their evolution and any comparison would not make sense. If one wanted to divide cases by population size, the x-axis would need to be <% of population infected> rather than <cases since 100>. Further, the x-axis would need to start at some % of population infected, e.g. 0.1%, or a number that would allow all countries to be scaled on the same graph.
3: The dotted angular lines show the reference doubling rates.
In the graph below, the steepest slope dotted line represents 1-day doubling and the shallowest slope represents 7-day doubling, making it easy to infer the doubling rates.
However, as the curves start to flatten, this can’t be inferred easily from the dotted reference lines, as the doubling rate is approximately the slope of the tangent to the flattening curve. For this reason, TIBCO’s COVID-19 app includes a bar chart of doubling times (see below). For example:
Sweden and India are similar for the first ~5 days, then they flatten at different rates
As of July 12, India is doubling every 24 days and Sweden every 71 days
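That tangent-slope idea can be expressed directly: use only the growth over the last few days rather than the whole history. The sketch below is a hypothetical illustration with an invented series, not the calculation in TIBCO's app.

```python
import math

def current_doubling_time(cumulative, window=7):
    """Doubling time implied by the average growth over the last `window`
    days, i.e. the slope of the tangent at the end of the log-scale curve."""
    growth = (cumulative[-1] / cumulative[-1 - window]) ** (1 / window)
    return math.log(2) / math.log(growth)

# Toy series (invented): a steady 1% daily growth, far along a flattened curve.
series = [100 * 1.01 ** day for day in range(30)]
print(round(current_doubling_time(series), 1))  # → 69.7
```

A country still doubling every 3 days would score ~3 here, while a flattened curve like the toy series reports a doubling time of about 70 days, matching the intuition behind the bar chart.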
One important caveat to interpreting the cumulative case trajectory graph is that it assumes testing is being carried out in similar proportion in the countries/states shown. This is one reason why the graphs might vary slightly from one media outlet to another.
TIBCO’s COVID-19 app showing cumulative case trajectory by country
Days until cases double for some countries from TIBCO’s COVID-19 app
“All models are wrong, but some are useful”
Modelling and its challenges
The above aphorism is attributed to the statistician George Box. Well known among data scientists, it is a useful concept to keep in mind when looking at predictions and projections. Models are never 100% correct (something will inevitably be unaccounted for in whatever model we use), but that doesn't mean the results should be discarded.
FiveThirtyEight combined nine different institutions’ projections of total deaths in the US into one view. Different modelling techniques are giving a wide range of results, from 111,000 up to 163,000, a spread of roughly 50,000. This is partly because of the different data sources, partly because of the different assumptions and techniques each institution uses. Yet the models average around the 120,000 mark, an awfully high number but one with some consensus despite the different techniques.
The Re number, or the effective reproduction number, is now being closely monitored worldwide. The public are now expected to understand and consume Re in everyday news and discussion. Here in the UK, Prime Minister Boris Johnson gave a speech on how the Re number will determine how strict the lockdown measures will be at any time. The higher the Re number, the higher the alert level, as the virus is deemed to be circulating more amongst the population.
TIBCO’s coronavirus visual analysis hub has been calculating the Re number daily since the outbreak, using Spotfire with scripts running under the hood. For the US and UK this calculation is down to the county level. You can see how Re has been increasing now with the second wave of the epidemic. For a detailed explanation on the methodology, read this blog by TIBCO’s CAO Michael O’Connell.
Effective R number on September 28 2020 for Islington, a borough in London
Re number across time for different counties in the UK. The green colour is where R < 1
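The methodology behind the app's Re calculation is covered in the linked blog. Purely to build intuition, a classic textbook approximation (not TIBCO's method, and with invented numbers) relates Re to the observed daily exponential growth rate r and the serial interval SI between successive infections: Re ≈ exp(r · SI).

```python
import math

def rough_re(daily_growth_rate, serial_interval_days=5.0):
    """Back-of-envelope Re: assumes pure exponential growth in cases and a
    fixed serial interval. A crude approximation, not a production method."""
    return math.exp(daily_growth_rate * serial_interval_days)

print(round(rough_re(0.05), 2))   # 5% daily case growth → ≈ 1.28, above 1
print(round(rough_re(-0.02), 2))  # cases shrinking 2%/day → ≈ 0.9, below 1
```

Even this crude version captures the headline behaviour: growing cases push Re above 1, shrinking cases pull it below 1.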
Can we rely on the lab results?
aka how good are your models, really?
Most people expect that when they go for some kind of health or medical check-up or test, the result is definitive. This isn’t exactly true. Usually the tests give you a very good idea, but on occasion someone may test positive yet actually be negative, or test negative yet actually be positive. The statistics for test accuracy are summarised in a “confusion matrix”. I like to think it is named as such because it can get confusing very quickly.
Confusion matrix might be less confusing with this pregnancy example
The scenario where someone tests negative but actually is positive is the feared one with respect to COVID-19. In a pandemic, we are most concerned with someone being told they have a negative result and thinking they are free to interact with others, all the while spreading the virus and potentially causing a flare-up or even another wave. When it comes to how well the tests work, there are two important concepts, clearly defined on Wikipedia:
Sensitivity (also called the true positive rate) measures the proportion of actual positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition).
Specificity (also called the true negative rate) measures the proportion of actual negatives that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).
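Both definitions are simple ratios of the counts in the confusion matrix. A minimal sketch, with invented numbers purely for illustration:

```python
def sensitivity(true_pos, false_neg):
    """Proportion of actual positives the test correctly flags."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Proportion of actual negatives the test correctly clears."""
    return true_neg / (true_neg + false_pos)

# Invented example: 100 truly infected and 1,000 truly healthy people.
print(sensitivity(true_pos=90, false_neg=10))   # → 0.9
print(specificity(true_neg=950, false_pos=50))  # → 0.95
```

Here 10 of the 100 infected people would be sent away with a false negative, which is exactly the feared scenario described above.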
Mona Chalabi, data editor at Guardian US known for making infographics with an artistic flair, put together the best explanation I have seen on this. It also includes a comparison of the different tests’ specificity.
Get it from the ground
Context & domain knowledge
aka the key to understanding the reality behind the numbers
Though I may be a data scientist myself, and comfortable with different mathematical modelling techniques, there are a lot of nuances that are best left to the experts. I have not spent my career focused on the study of microorganisms, epidemiology, or in any public health related sphere.
When we get confused, it is these experts we should turn to. Epidemiologists especially must be our first port of call when it comes to understanding what the data is telling us. They have the experience to tune those modelling parameters best, to understand the biological mechanisms behind the virus, and to inform government policy.
Sotiris Tsiodras, a professor, physician and specialist in infectology, led the Greek government’s response. Greece went into strict lockdown early on and has kept the number of deaths to fewer than 400 to date. His calm, almost dry delivery was also credited with preventing mass hysteria, making Greece exemplary in its handling of the coronavirus.
Harvard Business Review summarised it succinctly:
This pandemic has been studied more intensely in a shorter amount of time than any other human event. Our globalized world has rapidly generated and shared a vast amount of information about it. It is inevitable that there will be bad as well as good data in that mix. These massive, decentralized, and crowd-sourced data can reliably be converted to life-saving knowledge if tempered by expertise, transparency, rigor, and collaboration. When making your own decisions, read closely, trust carefully, and when in doubt, look to the experts. (emphasis is my own)
Having to learn how to be a citizen data scientist in a matter of weeks is no easy feat. Through this massive global effort, I believe everyone will come to understand the importance of data: collecting it, visualising it and using it for predictive purposes. As awful as this pandemic has been, with a very high death toll, it has taught us a lot in a short amount of time. It has shown us that working together, even if, ironically, we are working far apart, will be even more important in the future.
With special thanks to Michael O'Connell, Steven Hillion and Colin Gray for their editorial skills
Noora Husseini (she / her) is a data scientist in the TIBCO Data Science team, based in London. Her interest in data science and artificial intelligence runs the gamut, from the ethical implications of how we use data to natural language processing to experimenting with the latest open source libraries. She likes wearing and designing obscure fashion labels, making friends with animals (especially cats) and creating the best playlists to dance to.