COVID-19 : Visual Data Science Part 2 - Update & Methodologies
Follow along with this blog in Spotfire, and for live updates
This blog and Spotfire application are authored by the TIBCO Data Science team
Contact: Michael O’Connell, @MichOConnell
We are now in the midst of many COVID-19 regional outbreaks. Worldwide confirmed cases have topped half a million and are growing rapidly. Italy now has more confirmed cases than China; and Italy, Spain, Germany, the UK and the US are on an approximate 3-5 day doubling rate of cases. There have been more than 20 thousand deaths. Italy and Spain now have more deaths than China; and Italy, Spain, the UK and the US are on an approximate 3-5 day doubling rate for deaths.
Note that errors around any predictions of future cases are substantial - with exponential parameters come exponential prediction errors! It is only by modeling, visualizing and predicting emerging infections that everyone can understand the pandemic in their own region, assess the effects of preventive measures, and apply best protective practices in their local communities. And to understand our personal risk!
This paper provides an update on our analyses and some details on our modeling, simulation and analytics methodologies. This includes :-
COVID-19 Trajectories : interpretation and normalization; including auto-cluster of trajectories
Data Science Modeling : Rt progression
Compartment Modeling: epidemiology and statistical parameters
Healthcare resource requirements modeling
GeoSpatial Analysis: map layers, cartograms, choropleths
The analyses are presented using Spotfire visual analytics in a hosted environment. Figure 1 shows the Spotfire application Global Overview.
Figure 1. Spotfire application Global Overview. Shows worldwide cases, fatalities, recoveries and country-level stats. Includes slider for stepping through time by date.
The analyses refresh hourly, depending on availability of data sources. Spotfire apps and code will be made available for download. Links to various trusted data sources are provided. Collaboration is encouraged and Spotfire will be available for use by those who don’t have it. TIBCO customers who are struggling with data and analytics issues around COVID-19 effects, can contact the authors for more information and assistance.
Figures 2 and 3 show COVID-19 case trajectories by country and US states. Figure 4 shows a cluster analysis of case trajectories by country. Figures 5 and 6 show COVID-19 deaths by country and US states. All of these analyses update hourly as data permit, or by a refresh button click in the Spotfire apps.
For case trajectories, the y-axis is the cumulative number of confirmed cases on the log scale, and the x-axis is the time in days after the first <100> confirmed cases. The dashed lines are at slopes representing 1-day, 2-day, 3-day, 5-day and 7-day doubling.
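The dashed reference lines described above follow directly from the doubling-time arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the starting count of 100 simply mirrors the default <100> alignment threshold. On a log10 axis a constant doubling time of d days is a straight line with slope log10(2)/d per day.

```python
import math


def reference_line(days, doubling_days, start_cases=100):
    """Cumulative cases along a constant doubling-time reference line.

    start_cases is an assumption matching the default <100> alignment threshold.
    """
    return [start_cases * 2 ** (t / doubling_days) for t in days]


def log10_slope(doubling_days):
    """Slope of the reference line on a log10 y-axis, in log10 units per day."""
    return math.log10(2) / doubling_days


days = range(0, 15)
for d in (1, 2, 3, 5, 7):
    line = reference_line(days, d)
    print(f"{d}-day doubling: slope = {log10_slope(d):.3f} per day, "
          f"cases at day 14 = {line[-1]:,.0f}")
```

A country whose curve runs parallel to the 3-day line is doubling roughly every three days, regardless of where the curve sits vertically.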
Note that we use raw and cumulative cases rather than normalizing by total population. Normalized numbers are good at showing *relatively* how much strain a country is under, but they’re not suited to tracking the extent/state of a country’s outbreak, which spreads at approximately the same pace regardless of country size. Also note that cases are a function of the number of tests performed; this varies considerably by country. As such, the number of confirmed cases should not be interpreted as reflective of actual infections.
Figure 2. COVID-19 case trajectories by country. The y-axis is the number of confirmed cases (log scale), and the x-axis is the number of days after the first <100> confirmed cases. The <100>-case threshold aligns the curves to a common starting point in the epidemic outbreaks, and is configurable in the Spotfire application. The dashed lines indicate various doubling rates in days.
Figure 3. COVID-19 case trajectories by US state. The y-axis is the number of confirmed cases (log scale), and the x-axis is the number of days after the first <100> confirmed cases. The <100>-case threshold aligns the curves to a common starting point in the epidemic outbreaks, and is configurable in the Spotfire application. The dashed lines indicate various doubling rates in days.
Figure 4. COVID-19 case trajectories clustered by country. The y-axis is the number of confirmed cases (log scale), and the x-axis is the number of days after the first <100> confirmed cases. The <100>-case threshold aligns the curves to a common starting point in the epidemic outbreaks, and is configurable in the Spotfire application. The dashed lines indicate various doubling rates in days. Countries are clustered using the Hartigan–Wong algorithm in Spotfire, using the silhouette value for auto-selection of the number of clusters. Sequences longer than those available in the US are truncated.
Figure 5. COVID-19 fatality trajectories by country. The y-axis is the number of fatalities (log scale), and the x-axis is the number of days after the first <10> deaths. The <10>-death threshold aligns the curves to a common starting point in the epidemic outbreaks, and is configurable in the Spotfire application. The dashed lines indicate various doubling rates in days.
Figure 6. COVID-19 fatality trajectories by US state. The y-axis is the number of fatalities (log scale), and the x-axis is the number of days after the first <10> deaths. The <10>-death threshold aligns the curves to a common starting point in the epidemic outbreaks, and is configurable in the Spotfire application. The dashed lines indicate various doubling rates in days.
Modeling the Outbreaks - Visual Data Science
See O’Connell (18 March), for an outline of our analysis to date, and epidemiology modeling basics. In summary :-
The reproduction number R0 (pronounced R-nought) is the average number of people infected from a person with an infection, without any interventions in place. This is a crucial parameter in describing an epidemic. The effective reproduction number Re includes intervention effects. If Re is bigger than 1, the disease spreads. Conversely, if Re, or the time-varying reproduction number Rt, can be reduced below 1 over time, the disease can be contained.
The reproduction number R0 can be expressed as the product D*O*T*S (Kucharski), where :-
D = duration (number of days someone is infectious)
O = opportunities for transmission (number of person-person greetings / day)
T = probability of transmission
S = susceptibility (proportion of population susceptible)
Delamater et al. describe R0, its use and misuse.
For COVID-19, without intervention (per Kucharski, TED Interview) :-
D (number of days someone is infectious) is approx. 1-2 weeks, before isolation. This includes ~5-6 days incubation until symptoms, and often an additional ~2-5 days before isolation. Flu is slightly shorter e.g. ~3 days. STDs can be several months.
O (number of person-person greetings / day) is modeled as ~5-10 people/day (person-person greetings) under usual behavior
T (probability of the virus being transmitted in an interaction) is approx. 1/3. This is high compared to Flu and SARS.
S (proportion of population susceptible) is high i.e. 95-100%. Per Kucharski (TED Interview), based on early Wuhan data, ~95% of the initial population were still susceptible up to the end of January.
Kucharski describes R0 = 2 to 3 in uncontrolled outbreaks for COVID-19, compared with Flu where R0 = ~1.2.
The other key parameter is the Case fatality rate (CFR) - this measures the risk that someone who develops symptoms will eventually die from the infection.
For COVID-19, Kucharski (TED Interview) says this about the CFR: “I’d say on best available data, when we adjust for unreported cases and the various delays involved, we’re probably looking at a fatality risk of probably between maybe 0.5 and 2 percent for people with symptoms.” By comparison, the CFR for Flu is ~0.1%. Kucharski summarizes by stating that COVID-19 is ~10X+ more deadly than Flu. This is in line with other experts and studies e.g. Paul Atwater (Johns Hopkins) stated that "CFR is clearly going to be less than 2%, but at the moment we just don’t know what that number is".
Early estimates of CFR in epidemics are typically high, as the focus is on the sickest of the sick. The early CDC estimates were 3.5% in China, 4.2% across 82 countries, and 0.6% on a cruise ship. They suggested a wide range of 0.25%-3.0%.
Wu et al. estimate the CFR of COVID-19 in Wuhan at 1.4% (0.9–2.1%). This is a big dataset as Wuhan was the epicenter for the initial outbreak. They note that this is substantially lower than the corresponding naïve confirmed case fatality risk of 2,169/48,557 = 4.5%; and the approximator of deaths/(deaths + recoveries): 2,169/(2,169 + 17,572) = 11%, as of 29 February 2020. The risk of symptomatic infection increased with age, and those above 59 years were 5.1 (4.2–6.1) times more likely to die after developing symptoms, compared to those aged 30–59.
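The two crude estimators quoted from Wu et al. can be reproduced with a few lines of arithmetic. This is only a sketch of the naive calculations (the function names are hypothetical); it does not implement the adjusted estimate of 1.4%, which requires modeling under-reporting and the delay from onset to death.

```python
def naive_cfr(deaths, confirmed_cases):
    """Deaths divided by all confirmed cases; ignores outcomes still pending."""
    return deaths / confirmed_cases


def closed_case_cfr(deaths, recoveries):
    """Deaths divided by cases with a known outcome (deaths + recoveries)."""
    return deaths / (deaths + recoveries)


# Counts as of 29 February 2020, as quoted from Wu et al. in the text above.
deaths, confirmed, recoveries = 2169, 48557, 17572

print(f"naive CFR:       {naive_cfr(deaths, confirmed):.1%}")         # 4.5%
print(f"closed-case CFR: {closed_case_cfr(deaths, recoveries):.1%}")  # 11.0%
```

The gap between the two crude figures illustrates how sensitive the CFR is to the choice of denominator while many cases are still unresolved.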
Ruan summarizes a number of studies and shows wide variability in CFR by region (2·9% in Hubei vs 0·4% in other areas of China), in different phases of the outbreak (eg, 14·4% before Dec 31, 15·6% for Jan 1–10, 5·7% for Jan 11–20, 1·9% for Jan 21–31, and 0·8% after Feb 1), and by sex (2·8% for males vs 1·7% for females). They also quote the Chinese CDC reports that the case fatality ratio increases with age (from 0·2% for people aged 11–19 years, to 14·8% for people aged ≥80 years), and with the presence of comorbid conditions (10·5% for cardiovascular disease, 7·3% for diabetes, 6·0% for hypertension, 6·3% for chronic respiratory disease, and 5·6% for cancer).
Verity et al analyze deaths in mainland China and recoveries outside of China, estimating the mean duration from onset of symptoms to death to be 17·8 days (95% credible interval 16·9–19·2) and to hospital discharge to be 24·7 days (22·9–28·1). With adjustment for demography and under-reporting, they estimate case fatality rate in China of 1·38% (1·23–1·53), with substantially higher ratios in older age groups (0·32% [0·27–0·38] in those aged <60 years vs 6·4% [5·7–7·2] in those aged ≥60 years), up to 13·4% (11·2–15·9) in those aged 80 years or older. Estimates of case fatality rate from international cases stratified by age were consistent with those from China (parametric estimate 1·4% [0·4–3·5] in those aged <60 years [n=360] and 4·5% [1·8–11·1] in those aged ≥60 years [n=151]). These early estimates give an indication of the fatality ratio across the spectrum of COVID-19 disease and show a strong age gradient in risk of death.
It is tricky to calculate the CFR. The best way to calculate CFR would be to track a large group of people from the point when they develop symptoms until they later die or recover, and to then calculate the proportion of all these cases who had died. This is not possible in the real world. It is incorrect to just divide the total number of deaths by total number of cases as this does not account for unreported cases or the delay from illness to death.
It is widely recognized that there are many unreported cases, eg due to unavailable test kits. In the US analysis below, Bedford estimates an approx 10X under-reporting of cases on March 13. Regarding the time delay, consider 20 new people admitted to a hospital with confirmed COVID-19 infection on a given day -- that doesn’t mean the CFR is zero! We need to wait to see what happens to them. Conversely, any deaths that occur are people who showed symptoms some weeks before.
Fauci et al. state that "if one assumes that the number of asymptomatic or minimally symptomatic cases is several times as high as the number of reported cases, the case fatality rate may be considerably less than 1%. This suggests that the overall clinical consequences of Covid-19 may ultimately be more akin to those of a severe seasonal influenza (which has a case fatality rate of approximately 0.1%) or a pandemic influenza (similar to those in 1957 and 1968) rather than a disease similar to SARS or MERS, which have had case fatality rates of 9 to 10% and 36%, respectively."
Bendavid and Bhattacharya also write about the under-reporting and effects of limited testing, and suggest CFR could be more like 0.01 to 0.1% ie more in line with seasonal Flu or perhaps less deadly.
For all the reasons above, there is a wide range of both estimates and opinions on the CFR. It seems clear that CFR is higher in people older than 60 and in those with comorbid conditions. The CFR will become clearer as more people are tested and more people are followed from infection through recovery or death. It is important to get an accurate estimate of CFR soon, so as to best focus interventions at appropriate levels in regional and global communities.
Compartment models are a technique used to simplify the mathematical modelling of infectious disease. The population is divided into compartments, with the assumption that individuals in the same compartment have the same characteristics. The models are defined with ordinary differential equations (ODEs, deterministic), and can also be viewed in a stochastic framework, which is more realistic but more complex to analyze (Wikipedia: Compartmental Models).
Compartment models may be used to predict properties of how a disease spreads, for example the prevalence (total number of infected), reproduction number (average number of people infected from a person with an infection) and the duration of an epidemic. Also, the models enable understanding how different interventions may affect the outcome of the epidemic, and can be used to simulate various scenarios.
The SIR model is one of the simplest compartmental models, and many models are derivations of this basic form. With the SIR model, people transition from susceptible (S) to infected (I) to removed (R), with S+I+R = N (the total population size); where R can be recovered or dead. Because the numbers of susceptible, infected and removed individuals vary over time (even if the total population size remains constant), we make the precise numbers a function of t (time): S(t), I(t) and R(t). This model is reasonably predictive for infectious diseases which are transmitted from human to human, and where recovery confers lasting resistance, such as measles, mumps and rubella.
COVID-19 has a significant incubation period, with an estimated median of 5.1 days. This requires at least one additional compartment for modeling. In the SEIR model, people transition from susceptible (S) to exposed (E) to infected (I) to removed (R), with S+E+I+R = N (the total population size); where R can be recovered or dead.
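As a rough illustration of the SIR dynamics described above, the following sketch integrates the standard SIR equations with a simple Euler step. It is not the EpiModel implementation, and all parameter values (population size, beta, gamma) are hypothetical, chosen only so that beta/gamma = 2.5 falls in the R0 range of 2 to 3 quoted earlier.

```python
def simulate_sir(n, i0, beta, gamma, days, dt=0.1):
    """Deterministic SIR model integrated with a fixed-step Euler scheme.

    n: total population, i0: initial infected,
    beta: transmission rate per day, gamma: removal rate per day.
    Returns a list of (S, I, R) tuples, one per day, starting at day 0.
    """
    s, i, r = float(n - i0), float(i0), 0.0
    steps_per_day = round(1 / dt)
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n * dt  # flow S -> I
            new_removals = gamma * i * dt           # flow I -> R
            s -= new_infections
            i += new_infections - new_removals
            r += new_removals
        history.append((s, i, r))
    return history


# Hypothetical parameters: R0 = beta / gamma = 2.5
history = simulate_sir(n=1_000_000, i0=10, beta=0.5, gamma=0.2, days=120)
peak_day = max(range(len(history)), key=lambda d: history[d][1])
print(f"peak infected: {history[peak_day][1]:,.0f} on day {peak_day}")
print(f"never infected: {history[-1][0] / 1_000_000:.1%}")
```

An SEIR version would add an exposed compartment between S and I, with individuals leaving E at a rate equal to one over the incubation period.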
We can fit SIR and SEIR models in R, with packages such as EpiModel. Tim Churches has provided an excellent blog on fitting compartment models and modeling the epidemic trajectory and the effective reproduction number over time. For the purpose of simulating and forecasting healthcare resource scenarios we use models with additional compartments. Figure 5 shows one such model. We are developing approaches for healthcare resource planning using the work of Althaus (25 March). He presents a method for modeling and projections of the COVID-19 epidemic in Switzerland, using an SEIR model and the daily number of reported deaths. Althaus includes additional compartments for hospitalization and critical care (ICU). He assumes constant uncontrolled transmission until the lockdown that was set in Switzerland on 17 Mar 2020; and then varies a parameter kappa = Re/R0 as a measure of the effectiveness of the subsequent interventions.
These two modeling scenarios are covered in sections below.
Figure 5. Compartment model for studying and simulating scenarios of COVID-19 outbreaks.
The time sequence of virus and human host states is outlined in Figure 6. This shows a number of epidemiology parameters :-
The Latent Period is the time between the occurrence of infection and the onset of infectiousness (when the infected individual becomes infectious).
The Serial Interval is the duration of time between the onset of symptoms in a primary case and the onset of symptoms in a secondary case infected by the primary case.
The Incubation Period is the time between the occurrence of infection (or transmission) and the onset of disease symptoms.
Figure 6. Infection and transmission timeline of COVID-19. Based on supplement to: Anderson et al.. Lancet 2020.
Modeling the Effective Reproduction Number over time - Rt
R0 is a base rate, with no interventions in place and the population in an unmodified state. For COVID-19, R0 has been widely reported to be in the range 2-3. The effective reproduction number Re includes intervention efforts (drugs, non-drugs). If the effective reproduction number Re > 1, the disease spreads. Rt, the time-varying reproduction number, tracks Re over time. Our current non-pharmaceutical interventions (NPIs) are aimed at reducing Re. If Re can be reduced below 1 with interventions, the virus stops spreading.
We have been estimating the time-varying reproduction number Rt, at a state and county level, across the US and worldwide. While some data are thin, early results are encouraging, showing a downward trend in Re over time in some countries and states.
Figure 7 shows some results of modeling Rt for different countries and Figure 8 some results of this Rt modeling of different US states. Models are fit using the package EpiEstim. This package can be added to Spotfire via the TERR Tools menu, and configured to run via a Spotfire data function. User-selected markings on maps and other visuals then invoke the Rt estimates to run interactively, in context of exploratory visual data analysis.
EpiEstim (Cori et al, 2019) analyzes time series incidence data to estimate time-varying reproduction numbers as outlined in Cori et al 2013. EpiEstim incorporates uncertainty in the distribution of the serial interval - the time between the onset of symptoms in a primary case and the onset of symptoms in secondary cases.
There are five estimation methods in EpiEstim; these vary in the way the serial interval distribution is specified. In the first two methods, a unique serial interval distribution is considered, whereas in the last three, a range of serial interval distributions are integrated over:-
- "non_parametric_si": the user specifies the discrete distribution of the serial interval
- "parametric_si": the user specifies the mean and sd of the serial interval
- "uncertain_si": the mean and sd of the serial interval are each drawn from truncated normal distributions, with parameters specified by the user
- "si_from_data": the serial interval distribution is directly estimated, using MCMC, from interval censored exposure data, with data provided by the user together with a choice of parametric distribution for the serial interval
- "si_from_sample": the user directly provides the sample of serial interval distributions to use for estimation of R
Zhanwei et al. (CDC EID) estimate the distribution of serial intervals for 468 confirmed cases of COVID-19 reported in China as of February 8, 2020. They found a mean serial interval of 3.96 days (95% CI 3.53–4.39 days), and SD 4.75 days (95% CI 4.46–5.07 days).
We have been exploring all the above methods, following the logic and approach set out by Churches. Our live Spotfire application currently uses values for SI of mean 2.6 and standard deviation 1.5; and we are exploring mean 4.7 days and standard deviation 2.9 days. Churches' reasoning for these higher values is that they better account for transmission before the onset of symptoms, which results in shorter serial intervals than expected, possibly even shorter than the incubation period (see Figure 9). As we explore additional approaches eg by Abbott et al and with application to estimates of Rt on US states, we will update the Spotfire app. In particular, we are planning to expose these parameters to the R functions in Spotfire eg ranges (3.7,6.0) and (1.9,4.9). We are using a window length of 7 days. We are also planning to expose this and let people change the window length from (1,7) as a parameter in Spotfire.
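To make the roles of the serial interval and the window length concrete, here is a simplified sketch of the sliding-window estimator described in Cori et al 2013, in the spirit of the "parametric_si" method. It is not EpiEstim's code: the serial interval discretization below is a crude assumption (a gamma density evaluated at whole days and renormalized), the function names are hypothetical, and only the posterior mean is returned. The gamma prior with shape 1 and scale 5 is also an assumption.

```python
import math


def discretized_gamma_si(mean, sd, max_days=30):
    """Crude discrete serial interval: gamma pdf at days 1..max_days, renormalized.

    Assumption: EpiEstim uses a different (shifted) discretization.
    """
    shape = (mean / sd) ** 2
    scale = sd ** 2 / mean

    def pdf(x):
        return (x ** (shape - 1) * math.exp(-x / scale)
                / (math.gamma(shape) * scale ** shape))

    weights = [pdf(s) for s in range(1, max_days + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def estimate_rt(incidence, si_weights, window=7, prior_shape=1.0, prior_scale=5.0):
    """Posterior mean of Rt over a trailing window, under a gamma prior."""
    # Infection pressure on day t: past incidence weighted by the serial interval.
    pressure = [
        sum(incidence[t - s] * w
            for s, w in enumerate(si_weights, start=1) if t - s >= 0)
        for t in range(len(incidence))
    ]
    estimates = {}
    for t in range(window, len(incidence)):
        days = range(t - window + 1, t + 1)
        cases = sum(incidence[d] for d in days)
        total_pressure = sum(pressure[d] for d in days)
        estimates[t] = (prior_shape + cases) / (1 / prior_scale + total_pressure)
    return estimates


# Serial interval values under exploration in the text: mean 4.7, sd 2.9 days.
si = discretized_gamma_si(mean=4.7, sd=2.9)
growing = [round(10 * 1.1 ** t) for t in range(40)]  # synthetic incidence
rt = estimate_rt(growing, si, window=7)
print(f"Rt on final day: {rt[39]:.2f}")
```

A shorter window tracks changes in Rt faster but gives noisier estimates; a longer assumed serial interval gives a higher Rt for the same growth in cases.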
Figure 7. Rt modeling of countries as of March 26, highlighting France, Germany, Italy, the Netherlands, Spain and the UK. The colored bands show Rt < 1.0 (green), 1.0 < Rt < 2.0 (yellow), 2.0 < Rt < 3.0 (amber). The dark line is the Rt estimate and the gray lines are 95% credible intervals. The models use the R package EpiEstim invoked through a Spotfire data function.
Figure 8. Rt modeling of US states as of March 26, highlighting California, Connecticut, Louisiana, Massachusetts, Michigan and New York. The colored bands show Rt < 1.0 (green), 1.0 < Rt < 2.0 (yellow), 2.0 < Rt < 3.0 (amber). The dark line is the Rt estimate and the gray lines are 95% credible intervals. The models use the R package EpiEstim invoked through a Spotfire data function.
Note that Rt values can change quickly in response to non-pharmaceutical interventions (NPIs). For example, Kucharski et al. (March 11, 2020) found that the median daily Rt in Wuhan declined from 2·35 one week before travel restrictions were introduced on Jan 23, 2020, to 1·05 just one week after. The next section assesses the effects of various non-pharmaceutical interventions.
Associating and Interpreting Rt and Case Data
Italy and Spain have had major outbreaks of COVID-19 in March. We show estimates of Re over time for Italy and Spain through April 2nd in Figure 9 below.
During March, Italy and Spain aggressively adopted non-pharmaceutical interventions (NPIs) such as social distancing, so as to reduce the effective reproduction number and slow down the rate of spread. Results from the Rt estimates show that this appears to be working. This is not surprising, as we have seen Re change quickly in response to NPIs in other regions. As outlined above, Kucharski et al. (March 11, 2020) found that the median daily Rt in Wuhan declined from 2·35 one week before travel restrictions were introduced on Jan 23, 2020, to 1·05 just one week after.
Note that today's cases are people who were infected some 2-3 weeks ago. It takes time for the virus to go from one host to another, and for that person to get tested and to be confirmed as a case. So the estimates of Re reflect infections that were initiated some weeks prior. As such, the Re estimates around 1 on April 2nd indicate a solid effect of social distancing in March, and augur well for the future in terms of reducing cases and flattening the epidemic curve.
Figure 9. Rt modeling of Spain and Italy as of April 2. The colored bands show Rt < 1.0 (green), 1.0 < Rt < 2.0 (yellow), 2.0 < Rt < 3.0 (amber). The dark line is the Rt estimate and the gray lines are 95% credible intervals. The models use the R package EpiEstim invoked through a Spotfire data function.
We check the case counts in Italy over the latter half of March in Figure 10. As predicted from the Rt estimates presented in Figure 9, we see a drop in daily case counts in Italy over the last days of March. While there are many sources of error, and low numbers of tests, there is some comfort that these trends are now aligning and the epidemic curves are flattening.
Figure 10. Case Counts in Italy and Spain up until April 4th. Note the fall in cases over the prior 7 days.
We study the effects of individual NPIs in the next section. We have collated NPIs across all worldwide locations and are making these available on the TIBCO Community - COVID-19 Visual Data Science Headquarters. These can be referred to in the context of the previous 2 figures.
Effects of Interventions
The objective of any public health response during a pandemic is to slow or stop the spread of the virus by employing mitigation strategies that reduce Rt. Typical interventions include:
testing and isolating infected people
reducing opportunities for transmission (e.g. via social distancing, school closures)
changing the duration of infectiousness (e.g., through antiviral use)
reducing the number of susceptible individuals (e.g., by vaccination)
The initial focus of public health experts with COVID-19 has been on suppression i.e. reducing the effective reproduction number Re to below 1; by isolating infected people, reducing case numbers and maintaining this situation until a vaccine is available. This worked well for SARS, but is more challenging for COVID-19 because many infected people are asymptomatic and go undetected.
The current focus is on mitigation i.e. reducing Re to slow spreading :-
Opportunity parameter : to get Rt below 1, Kucharski (TED Interview) describes the need for everybody in the population to cut interactions by one-half to two-thirds. This can be achieved by initiatives such as working from home (WFH), school closures, reducing social dinners etc.
As a simple analogy, the chance of rolling at least one 6 in n rolls of a die is 1 - (⅚)^n. This is 84% for 10 rolls and reduces to 31% for 2 rolls. So you can reasonably expect to cut your odds by one-half to two-thirds by reducing usual social meetings from say 10 meetings to 2 meetings per day.
Measures such as hand-washing, reducing contacts with others and cleaning surfaces can reduce the Transmission probability.
Note that the fatality rate in people aged 60-70 is increased to ~5%, in people aged 70-80 to ~10% and in people older than 80 to 15-20%. People with comorbid conditions are at increased risk. So a key mitigation strategy to reduce deaths is to reduce interactions with the elderly.
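The die-rolling analogy in the list above can be checked with a short calculation. This is only the arithmetic of the analogy; it is not an epidemiological model, and the function name is hypothetical.

```python
def prob_at_least_one(p_single, n_trials):
    """Probability of at least one success in n independent trials."""
    return 1 - (1 - p_single) ** n_trials


p_ten = prob_at_least_one(1 / 6, 10)  # 10 meetings per day
p_two = prob_at_least_one(1 / 6, 2)   # 2 meetings per day

print(f"10 rolls: {p_ten:.0%}")                       # 84%
print(f" 2 rolls: {p_two:.0%}")                       # 31%
print(f"relative reduction: {1 - p_two / p_ten:.0%}")  # 64%
```

The relative reduction of roughly 64% sits inside the one-half to two-thirds range quoted for the Opportunity parameter.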
Ferguson et al. (Imperial College, 16 March, 2020) describe interventions such as case isolation, household quarantines, restricting large events, closing social gathering spots, closing schools and universities, encouraging individuals to stay at home, pausing sporting and arts events -- and how these NPIs can affect the rate of contact and hence R0.
|Intervention|Description|
|---|---|
|Case isolation in the home|Symptomatic cases stay at home for 7 days, reducing non-household contacts by 75% for this period. Household contacts remain unchanged. Assume 70% of households comply with the policy.|
|Voluntary home quarantine|Following identification of a symptomatic case in the household, all household members remain at home for 14 days. Household contact rates double during this quarantine period, contacts in the community reduce by 75%. Assume 50% of households comply with the policy.|
|Social distancing of those over 70 years of age|Reduce contacts by 50% in workplaces, increase household contacts by 25% and reduce other contacts by 75%. Assume 75% compliance with policy.|
|Social distancing of entire population|All households reduce contact outside household, school or workplace by 75%. School contact rates unchanged, workplace contact rates reduced by 25%. Household contact rates assumed to increase by 25%.|
|Closure of schools and universities|Closure of all schools, 25% of universities remain open. Household contact rates for student families increase by 50% during closure. Contacts in the community increase by 25% during closure.|
Table 1. Summary of NPI Interventions. Based on Ferguson et al. March 16
They also model these mitigation strategy scenarios for Great Britain (GB) to estimate hospital bed and critical care (ICU) requirements.
They predict that for R0 = 2.4, i.e. with a "do nothing approach", that 81% of the Great Britain and US populations would be infected over the course of the epidemic. They then show the effects of the interventions in Table 1 applied to this R0=2.4 scenario, in terms of critical care beds required. The resulting estimated effects are shown in Table 2.
|Non-Pharmaceutical Intervention (NPI)|Maximum critical care beds required|
|---|---|
|Closing schools and universities|240|
|Case isolation and household quarantine|130|
|Case isolation, home quarantine, social distancing of >70s|90|
Table 2. Predicted Effects of NPI Interventions on maximum critical care beds required (per 100,000 population). Based on Ferguson et al. March 16. NPI measures are described in Table 1.
Ferguson et al. suggest that the interventions remain in place for as much of the epidemic period as possible (they show April to July, 2020). They note that “Introducing such interventions too early risks allowing transmission to return once they are lifted (if insufficient herd immunity has developed); it is therefore necessary to balance the timing of introduction with the scale of disruption imposed and the likely period over which the interventions can be maintained.”
The Predictive Healthcare team at Penn Medicine recently released CHIME, a tool for COVID-19 hospital capacity planning. CHIME features an interface where users input parameters as follows :-
- number of days to project
- currently hospitalized COVID-19 patients
- doubling time before social distancing
- social distancing (% reduction in social contact)
- hospitalization % (total infections)
- ICU % (total infections)
- ventilated % (total infections)
- hospital length of stay
- ICU length of stay
- vent length of stay
- regional population
- currently known regional infections
Results of a CHIME run include projections for hospitalized, ICU and ventilated cases.
Draugelis and Hanish compare the Penn CHIME and Imperial College team’s ‘Do nothing’ scenarios; and analyze CHIME’s Social Distancing parameter with different scenarios from the Imperial College model.
In a similar base scenario, CHIME and Imperial College results are comparable (Table 3).
Peak Ventilated or Critical Care Census
Table 3: Comparison of CHIME and Imperial College results from a similar base scenario ie no social distancing.
They use a paper by Zhaoyang et al. (2018) on adult daily social interactions to do a rough conversion of Imperial College scenarios to CHIME social distancing scenarios. Table 4 shows the results of running CHIME with these roughly comparable parameters.
% reduction of social contact
Imperial College Scenario
Imperial College (their Table 3)
PC + noness_SD
PC + noness_SD + PASD
Table 4. Comparison of CHIME’s Social Distancing parameter settings with different scenarios from the Imperial College model.
While this comparison is rough, it is encouraging that the base scenario projections are similar (Table 3) and that the CHIME Social Distancing parameter scenarios (top-down overall % reduction in contact) can be lined up with the bottom-up estimates of the Imperial College scenarios.
Bottom line, in unmitigated exponential growth, health systems can be quickly overburdened. The NPI measures are designed to save hospital resources eg ICU beds to serve the patients in serious condition. Given that an ICU bed may be taken for 2 weeks, the protective measures need to be aggressive.
Modeling Required Healthcare Resources
In order to understand the application of compartment models to healthcare resource requirements, we are exploring a similar compartment modeling approach to CHIME. Althaus (25 March) presents a method for modeling and projections of the COVID-19 epidemic in Switzerland. He fits an SEIR transmission model to the daily number of reported deaths, with additional compartments for hospitalization and critical care (ICU). He assumes constant uncontrolled transmission until the lockdown that was put in place on 17 Mar 2020, and then varies the transmission rate relative to the epidemic spread before the lockdown.
A schematic of the extended SEIR model used by Althaus is depicted in Figure 11.
Figure 11. Schematic of the extended SEIR model from Althaus (25 March).
The parameters include :-
- S=Susceptible / E=Exposed / I=Infected / R=Recovered, H=Hospitalized / V=ICU / D=died
- C = the cumulative number of cases
- beta = (# contacts per person per time) * probability of infection per contact
- omega1 = 1/ hospital stay, days, for mild and severe cases
- omega2 = 1/ hospital stay, days, for critical cases
- epsilon1 = proportion of Infected patients needing hospitalization
- epsilon2 = proportion of Hospitalized patients moving to ICU
- epsilon3 = proportion of ICU patients who die
- gamma = 1/(duration of disease, days)
- sigma = 1/(incubation period, days)
Parameters that can be controlled with NPIs :-
- kappa = the NPI effectiveness multiplier; kappa in [0,1], where 1 = no intervention, 0 = max intervention
- Re = kappa * beta / gamma (the effective reproduction number)
- where beta / gamma = R0, the basic reproduction number
Althaus (25 March) varies kappa to reflect NPIs and show effects on hospitalization and ICU bed requirements.
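To make the role of kappa concrete, here is a minimal Python sketch of an extended SEIR model built from the parameter list above. It is our own simplified reading of the compartments, not the equations or code from Althaus: the flow structure (I to H, H to V, V to D, with everyone else recovering), the forward-Euler integration, and every numeric value except R0 = 2.99 and the 2-week ICU stay are illustrative assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Params:
    beta: float      # contacts per person per day * infection probability per contact
    sigma: float     # 1 / incubation period (days)
    gamma: float     # 1 / duration of disease (days)
    omega1: float    # 1 / hospital stay (days), mild and severe cases
    omega2: float    # 1 / hospital stay (days), critical (ICU) cases
    epsilon1: float  # proportion of Infected needing hospitalization
    epsilon2: float  # proportion of Hospitalized moving to ICU
    epsilon3: float  # proportion of ICU patients who die
    kappa: float = 1.0  # NPI multiplier: 1 = no intervention, 0 = max intervention


def effective_r(p):
    """Re = kappa * beta / gamma, as defined in the parameter list."""
    return p.kappa * p.beta / p.gamma


def simulate(p, population, initial_infected, days, steps_per_day=10):
    """Forward-Euler integration; returns one dict of compartment sizes per day."""
    dt = 1.0 / steps_per_day
    S, E, I = float(population - initial_infected), 0.0, float(initial_infected)
    H = V = R = D = 0.0
    C = float(initial_infected)  # cumulative cases
    daily = []
    for step in range(days * steps_per_day + 1):
        if step % steps_per_day == 0:
            daily.append(dict(S=S, E=E, I=I, H=H, V=V, R=R, D=D, C=C))
        infection = p.kappa * p.beta * S * I / population  # S -> E
        onset = p.sigma * E                                # E -> I
        leave_i = p.gamma * I                              # I -> H or R
        leave_h = p.omega1 * H                             # H -> V or R
        leave_v = p.omega2 * V                             # V -> D or R
        S -= dt * infection
        E += dt * (infection - onset)
        I += dt * (onset - leave_i)
        H += dt * (p.epsilon1 * leave_i - leave_h)
        V += dt * (p.epsilon2 * leave_h - leave_v)
        D += dt * p.epsilon3 * leave_v
        R += dt * ((1 - p.epsilon1) * leave_i
                   + (1 - p.epsilon2) * leave_h
                   + (1 - p.epsilon3) * leave_v)
        C += dt * onset
    return daily


def peak(daily, compartment):
    return max(day[compartment] for day in daily)


# Illustrative values only; beta is chosen so that beta / gamma = 2.99.
GAMMA = 1 / 5
base_params = Params(beta=2.99 * GAMMA, sigma=1 / 5, gamma=GAMMA,
                     omega1=1 / 8, omega2=1 / 14,
                     epsilon1=0.05, epsilon2=0.25, epsilon3=0.5)
POPULATION = 8_500_000  # hypothetical region size

base = simulate(base_params, POPULATION, initial_infected=10, days=365)
mitigated = simulate(replace(base_params, kappa=0.5), POPULATION,
                     initial_infected=10, days=365)

print("Re without NPIs:", round(effective_r(base_params), 2))
print("Peak ICU demand, kappa=1.0:", round(peak(base, "V")))
print("Peak ICU demand, kappa=0.5:", round(peak(mitigated, "V")))
```

Lowering kappa scales the transmission term only, so the same run shows how a given NPI strength changes both the size and the timing of peak hospital and ICU demand.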
Before the lockdown in Switzerland, Althaus reported the basic reproduction number R0 of COVID-19 at 2.99 (95% confidence interval: 2.54 - 3.59). We checked this using EpiEstim applied to the case data available from Althaus (the table swiss_covid_epidemic). The case data is plotted in Figure 10 (upper left), with the marking (orange line) indicating the data prior to lockdown. In the Spotfire analysis shown in Figure 10, we provide an input field for the serial interval (SI) distribution parameter (upper right). We found that using a mean SI of 5.0 days gives good agreement with the reproduction number reported in Althaus. Note that the Wuhan data analysis by Li et al. provided an estimate of 5.3 for the mean SI, so this is a reasonable value. Figure 12 shows Rt dropping from 4.0 to 2.2 over the time sequence prior to lockdown on 17 March, with an average of 2.8, i.e., close to the value of 2.99 (2.54-3.59) reported by Althaus. This indicates agreement in R0 estimates using different data (case data from the Althaus website) and a different method (EpiEstim) as compared to the Althaus compartment model.
Figure 12. Calibrating the Althaus model with estimated Rt from case data. Fit uses EpiEstim package on case data from Althaus (25 March)
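For readers who want to see the mechanics behind this kind of Rt check, the following Python sketch follows the sliding-window idea of Cori et al. (2013): Rt is estimated from the ratio of new cases to the "infection pressure" implied by earlier cases and the serial interval distribution. This is a simplified stand-in, not the EpiEstim code we ran. The serial interval standard deviation (3.0 days), the crude discretization, the 7-day window, the Gamma prior (shape 1, scale 5) and the synthetic incidence series are all assumptions for illustration.

```python
import math


def discretized_serial_interval(mean, sd, max_days=20):
    """Weights w[1..max_days]: a gamma density evaluated at whole days, normalized to sum to 1."""
    shape = (mean / sd) ** 2
    scale = sd ** 2 / mean
    density = [s ** (shape - 1) * math.exp(-s / scale) for s in range(1, max_days + 1)]
    total = sum(density)
    return [0.0] + [d / total for d in density]  # index 0 unused: no same-day transmission


def infectiousness(incidence, w, t):
    """Lambda_t = sum over s >= 1 of incidence[t - s] * w[s]."""
    return sum(incidence[t - s] * w[s] for s in range(1, min(t, len(w) - 1) + 1))


def estimate_rt(incidence, w, window=7, prior_shape=1.0, prior_scale=5.0):
    """Posterior mean of Rt for each sliding window, keyed by the window's last day."""
    estimates = {}
    for t_end in range(window, len(incidence)):
        days = range(t_end - window + 1, t_end + 1)
        cases = sum(incidence[t] for t in days)
        pressure = sum(infectiousness(incidence, w, t) for t in days)
        estimates[t_end] = (prior_shape + cases) / (1.0 / prior_scale + pressure)
    return estimates


# Synthetic check: generate incidence from a renewal process with a known R,
# then confirm the estimator recovers it.
w = discretized_serial_interval(mean=5.0, sd=3.0)
true_r = 2.0
incidence = [100.0]
for t in range(1, 40):
    incidence.append(true_r * infectiousness(incidence, w, t))

rt = estimate_rt(incidence, w)
print("Estimated Rt on the final window:", round(rt[39], 2))
```

Changing the mean passed to discretized_serial_interval plays the same role as the serial interval input field in the Spotfire analysis: a longer serial interval pushes the Rt estimate up for the same case curve, and a shorter one pulls it down.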
We are working on calibrating the kappa parameter from the Althaus model to the social distancing results from CHIME and Ferguson et al. as presented in Table 4. Our goal is to create an interactive application for modeling regional hospitals and healthcare systems in the US and other regions worldwide. In order to do the projections at a state / region level we need :-
1) Population size / census (including age distribution)
2) Hospital stats : hospital locations, beds and capacity stats
3) Case data
- Daily case data
- Daily fatality data
4) Epidemic data
- Duration of hospitalization for mild and severe cases
- Additional duration of hospitalization for critical cases
- Proportion of cases hospitalized
- Case fatality rate (use 1%)
- Reproduction number (variable)
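The inputs listed above can be gathered into one record per region before any projection is run. The sketch below is a hypothetical Python container, not the data model used in our Spotfire application: the class name, field names, validation rules and the example county are all assumptions, and only the 1% default case fatality rate comes from the list.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class RegionInputs:
    name: str
    population: int
    age_distribution: Dict[str, float]  # share of population per age band
    hospital_beds: int
    icu_beds: int
    daily_cases: List[int]
    daily_deaths: List[int]
    hospital_stay_days: float           # mild and severe cases
    extra_icu_stay_days: float          # additional stay for critical cases
    proportion_hospitalized: float
    reproduction_number: float          # varied across scenarios
    case_fatality_rate: float = 0.01    # "use 1%" from the list above

    def validate(self):
        """Return a list of human-readable problems; an empty list means usable inputs."""
        problems = []
        if self.population <= 0:
            problems.append("population must be positive")
        if abs(sum(self.age_distribution.values()) - 1.0) > 1e-6:
            problems.append("age band shares must sum to 1")
        if len(self.daily_cases) != len(self.daily_deaths):
            problems.append("case and fatality series must cover the same days")
        if self.hospital_stay_days <= 0 or self.extra_icu_stay_days <= 0:
            problems.append("durations must be positive")
        for label, value in (("proportion_hospitalized", self.proportion_hospitalized),
                             ("case_fatality_rate", self.case_fatality_rate)):
            if not 0.0 <= value <= 1.0:
                problems.append(label + " must be between 0 and 1")
        return problems

    def discharge_rates(self):
        """Daily rates (1 / duration) for hospital and additional ICU stays."""
        return 1.0 / self.hospital_stay_days, 1.0 / self.extra_icu_stay_days


# Entirely made-up example values for a hypothetical county.
example = RegionInputs(
    name="Example County",
    population=500_000,
    age_distribution={"0-19": 0.25, "20-64": 0.60, "65+": 0.15},
    hospital_beds=1_200,
    icu_beds=90,
    daily_cases=[3, 5, 9, 14, 22],
    daily_deaths=[0, 0, 0, 1, 1],
    hospital_stay_days=8.0,
    extra_icu_stay_days=14.0,
    proportion_hospitalized=0.05,
    reproduction_number=2.5,
)
print(example.validate())
print(example.discharge_rates())
```

Checking inputs up front like this keeps a what-if scenario from silently running on, say, a proportion entered as a percentage.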
Our interactive Spotfire application allows end-users to plug in their data for hospital capacity and generate scenarios for resource planning. We are calibrating our models against the Penn CHIME Hospital Impact Model. Our goal is to enable regional healthcare organizations to drill into their local area and interactively obtain live-update forecasts of hospital resources needed to meet emerging demand. The forecasts include kappa scenarios combined with user-entered data on hospital stats from the American Hospital Association and CDC age band risks, along with epidemic and case data selections. We are using census data in age bands, along with hospital data at the county level, to make the what-if scenarios more targeted.
GeoSpatial Data Science
Spotfire’s map charts display multiple layers of information - including points, lines, WKB objects like shapefiles and polylines, and TMS and WMS layers that show e.g. geology, live weather, or customized image, terrain, or other information. Map layers with points, lines, and WKB objects can be configured to respond to marking, and refreshed by Spotfire data functions including model fitting in R and Python. This provides a convenient means of injecting calculations and predictions into interactive map presentations e.g. interactive contour lines, heatmaps, polygons, territory calculations, and route optimization.
Figures 13 and 14 show US county level case data, with drill-down into hotspots in the NY area and the South East. The hotspot colorings are relative within the markings. The companion visuals show confirmed cases sorted by county, and combination daily cases and cumulative cases from the marking.
Figure 13. Interactive marking around the NY hotspot on 3 April. The hotspot coloring shows the NYC hotspot and surrounding area. The companion visuals show confirmed cases sorted by counties in the marking, and combination daily and cumulative cases. The combination chart shows no evidence of case flattening.
Figure 14. Interactive marking around the southeast on 3 April. The companion visuals show confirmed cases sorted by counties in the marking, including cities in IN, GA, DC, MO, TN. The combination chart of daily and cumulative cases shows no evidence of case flattening.
Figure 15 shows an area cartogram (Dorling 1996) of confirmed cases in the US. This is a set of non-overlapping regions with state areas proportional to the number of cases, using a rubber sheet distortion algorithm (Dougenik et al. 1985). The cartogram is invoked via a data function in Spotfire, with the R package Cartogram (Jeworutzki et al) run inside Spotfire on a mouse marking, using the built-in TIBCO Runtime for R engine.
Figure 15. Cartograms of COVID-19 confirmed cases from March 19 and March 30. This shows a shifting dominance of cases from WA and CA to NY.
Summary and Community Actions
Reading Adam Kucharski and other experienced epidemiologists, this COVID-19 virus is clearly highly contagious and deadly. However, from a statistical perspective, with exponential growth parameters there are similarly exponential errors on predictions, and many different scenarios could eventuate. The case fatality rate is particularly unclear, with estimates ranging from 0.01% up to 2%. That's an enormous range of outcomes, perhaps implying a range of total deaths from 50 thousand to 2 million.
When our predictive models are this uncertain, it is no wonder that we are seeing a wide range of human reactions - from terror to indifference. And the community measures that are being implemented have been shown to be effective. At one level we can think of communities and populations, where deaths in the thousands are certain, the economy is in turmoil and our life savings are under attack. At the other end of the spectrum there is us and our individual friends and families. If I/we assume a 10% risk of infection and a 0.1% mortality rate, my/our personal death rate is 1 in 10,000. Or perhaps better said my/our chance of being just fine is 9,999/10,000.
I guess what I’m saying is from a personal perspective it's ok to be afraid and take every measure to protect myself. But I'm not going to take these highly uncertain outcomes as events that are likely to happen. In the current situation, I take comfort in the uncertainty. We are moving forward with hope and confidence in the analytics and predictions - and uncertainties of the predictions - that are summarized in this paper.
The interventions are being driven by our medical and epidemiology experts (e.g., CDC and WHO) and these are measures we know to work since the Spanish Flu of 1918. It's clear that we have to all chip in, in our everyday lives to enforce these :-
Be aware of the path to infection: hand to face, etc.
- Stop things like handshakes
- Clean hands often
- Clean and disinfect surfaces
Practice Social Distancing
- Avoid gatherings
- Maintain distance between yourself and others
Think about old people and their high infection and mortality rates
- Cover coughs and sneezes
- If sick, stay home. If that is not possible wear a facemask
We are all in this together. Be kind. Watch out for others in your orbit. Educate others with the knowledge you have. Be generous to others in our lives who are struggling. Help keep the young ones away from the elderly and immuno-compromised. Good luck. We will be back with another visual data science update on COVID-19 soon.
This blog was updated on April 4 as follows:
- added new section on GeoSpatial Data Science
- updated the Compartment Model Section
- updated the section Modeling the Effective Reproduction Number over time - Rt
- updated the Modeling Healthcare Resources section
- added CDC reference on age band risk
- added references on Case Fatality Rate
- added Cartogram references
Future updates will likely appear in a new blog, including :-
- error sources - testing, case and fatality reporting
- COVID-19 risk in context of existing risk
- CFR and case reporting
- testing and diagnostics
- healthcare resource planning
- updates on visual and geospatial data science
- natural language generation
Appendix: TIBCO Analytics
Special thanks to the awesome TIBCO Data Science team who are working on these analyses using Spotfire (Visual Analytics; R, Python) : Neil Kanungo, Peter Shaw, Prem Shah and David Katz did the heavy lifting, and were well supported by Vinoth Manamala, Eric Hsu, Andrew Berridge, Heleen Snelting, Mike Alperin, Colin Gray and Dan Rope.
Blog contact author: Michael O’Connell, @MichOConnell
- Abbott, S, Hellewell, J, Munday, JD, Young Chun, J, Thompson, RN, Bosse, NI, Chan, YWD, Russell, TW, Jarvis, CI. Temporal variation in transmission during the COVID-19 outbreak, online March 14, 2020
- Althaus, C. Real-time modeling and projections of the COVID-19 epidemic in Switzerland. March 25, 2020
- Anderson RM, Heesterbeek H, Klinkenberg D, Hollingsworth TD. How will country-based mitigation measures influence the course of the COVID-19 epidemic? Lancet 2020, with appendices; published online March 6, 2020
- Becker M and Chivers C. Announcing CHIME, A tool for COVID-19 capacity planning. March 14, 2020.
- Bendavid E and Bhattacharya J. Is the Coronavirus as Deadly as They Say? WSJ March 27 2020
- CDC. Severe Outcomes Among Patients with Coronavirus Disease 2019 (COVID-19) — United States, February 12–March 16, 2020. MMWR Morb Mortal Wkly Rep 2020;69:343-346. DOI: http://dx.doi.org/10.15585/mmwr.mm6912e2
- Churches, T. Analyzing COVID-19 outbreak data with R - part 1. published online February 7, 2020
- Community mitigation guidelines to prevent pandemic influenza. https://stacks.cdc.gov/view/cdc/45220 United States, 2017
- Cori A, Cauchemez S, Ferguson NM, Fraser C, Dahlqwist E, Demarsh A, Jombart T, Kamvar ZN, Lessler J, Li S, Polonsky JA, Stockwin J, Thompson R, van Gaalen R. EpiEstim, 2019.
- Cori A, Ferguson NM, Fraser C, Cauchemez S, A New Framework and Software to Estimate Time-Varying Reproduction Numbers During Epidemics. Am J Epidemiology, 2013
- Delamater PL, Street EJ, Leslie TF, Yang T and Jacobsen KH. (2019). Complexity of the Basic Reproduction Number (R0). CDC Emerging Infectious Diseases, 25, 1 - January 2019
- Dorling, D. (1996). Area Cartograms: Their Use and Creation. In Concepts and Techniques in Modern Geography. Catmog, 59.
- Dougenik JA, Chrisman NR, Niemeyer DR. (1985). An Algorithm to Construct Continuous Area Cartogram. Professional Geographer, 37(1). 1985, 75-81.
- Draugelis M and Hanish A. CHIME comparison with Imperial College COVID-19 Publication March 18, 2020
- Du Z, Xu X, Wu Y, Wang L, Cowling BJ, and Meyers LA. Serial Interval of COVID-19 among Publicly Reported Confirmed Cases. CDC Emerging Infectious Diseases, Vol 26, 6 - June 2020.
- Fauci AS, Lane HC, Redfield RR. Covid-19 — Navigating the Uncharted. NEJM March 26, 2020; 382:1268-1269. DOI: 10.1056/NEJMe2002387
- Ferguson NM, Laydon D, Nedjati-Gilani G, Imai N, Ainslie K, Baguelin B, Bhatia S, Boonyasiri A, Cucunubá Z, Cuomo-Dannenburg G, Dighe A, Dorigatti I, Fu H, Gaythorpe K, Green W, Hamlet A, Hinsley W, Okell LC, van Elsland S, Thompson T, Verity R, Volz E, Wang H, Wang Y, Walker PGT, Walters C, Winskill P, Whittaker C, Donnelly CA, Riley S, Ghani AC. Impact of non-pharmaceutical interventions (NPIs) to reduce COVID19 mortality and healthcare demand. Imperial College, 16 March 2020
- Jeworutzki S, Giraud T, Lambert N, Bivand R, Pebesma E, Nowosad J, Cartogram R package. Version 0.2. CRAN 2019-12-07
- Jones, J. Notes on R0, Stanford University, 2007
- Jones, J. Models of Infectious Disease, Stanford Spring Workshop in Formal Demography, May 2008.
- Kucharski, Adam. The TED Interview, March 12, 2020
- Kucharski et al. Early dynamics of transmission and control of COVID-19: a mathematical modeling study, March 11, 2020
- Lauer et al. The Incubation Period of Coronavirus Disease 2019 (COVID-19) From Publicly Reported Confirmed Cases: Estimation and Application, Pubmed, March 10, 2020
- Interim pre-pandemic planning guidance : community strategy for pandemic influenza mitigation in the United States : early, targeted, layered use of nonpharmaceutical interventions. https://stacks.cdc.gov/view/cdc/11425, CDC, 2007
- O'Connell M. COVID-19 : A Visual Data Science Analysis and Review TIBCO Blog, 18 March 2020
- Ridenhour, B., Kowalik, J. and Shay, D. Unraveling R0: Considerations for Public Health Applications. Am J Public Health. Doi: 10.2105/AJPH.2013.301704. Published online February 2014
- Riou J, Hauser A, Counotte MJ, Althaus CL. Adjusted Age-Specific Case Fatality Ratio during the COVID-19 Epidemic in Hubei, China, January and February 2020, 3 March 2020, Preprint.
- Ruan S. Likelihood of survival of coronavirus disease 2019. March 30, 2020. DOI: https://doi.org/10.1016/S1473-3099(20)30257-7
- Spiegelhalter D. How much 'normal' risk does Covid represent? Medium
- Stanway, A. Real Time COVID-19 Tracking. Medium, March 14
- Verity R, LC Okell, I Dorigatti, P Winskill, C Whittaker, N Imai, GC Dannenburg, H Thompson, P Walker, H Fu, A Dighe, J Griffin, A Cori, Marc Baguelin, Sangeeta Bhatia, Adhiratha Boonyasiri, ZM Cucunuba, R Fitzjohn, KAM Gaythorpe, W Green, A Hamlet, W Hinsley, D Laydon, G Nedjati-Gilani, S Riley, S van-Elsand, E Volz, H Wang, Y Wang, X Xi, C Donnelly, A Ghani, N Ferguson. Estimates of the severity of COVID-19 disease. doi: https://doi.org/10.1101/2020.03.09.20033357
- Wilson N, Kvalsvig A, Barnard LT, Baker MG. Case-Fatality Risk Estimates for COVID-19 Calculated by Using a Lag Time for Fatality. CDC EID Journal. Volume 26, Number 6, June 2020.
- Wu JT, Leung K, Bushman M, Kishore N, Niehus R, de Salazar PM, Cowling BJ, Lipsitch M, Leung GM: Estimating clinical severity of COVID-19 from the transmission dynamics in Wuhan, China, Nature Medicine, March 19, 2020
- Zhaoyang R, Sliwinski MJ, Martire LM, Smyth JM. (2018). Age Differences in Adults’ Daily Social Interactions: An Ecological Momentary Assessment Study. Psychol Aging. 2018 Jun; 33(4): 607–618.
Websites with data updates
- 1Point3Acres: COVID-19 in US and Canada
- Johns Hopkins: Coronavirus Resource Center
- KCDC: Daily cases update from Korea
- Our World in Data: Coronavirus Testing – Source Data
- Wikipedia: Case data for US States
- World Health Organization: Coronavirus situation reports
- Trevor Bedford : @trvrb
- Nextstrain : @Nextstrain
- Hannah Ritchie : @_HannahRitchie
- Eric Topol : @EricTopol
- Adam Kucharski : @AdamJKucharski
- Sam Abbott : @seabbs
Michael O'Connell, Ph.D., is the chief analytics officer at TIBCO, where he helps clients with analytics software applications that drive business value. He has written a bunch of scientific papers and software packages on statistical methods. He also likes listening to electronic music; watching basketball, football and cricket; going to art galleries and walking around neighborhoods.
Neil Kanungo is a Data Scientist at TIBCO and specializes in data visualization and business analytics. He helps deliver unique solutions to industry’s biggest challenges. Neil takes a special interest in operationalizing analytics across organizations at multiple levels, and in fostering user engagement. In his free time, Neil enjoys hiking with his dog, live music, and playing pinball.
Peter Shaw is a data scientist in the TIBCO Data Science team, based in Seattle. His interests include computational geoanalytics, mapping, pattern recognition, optimization, time series and routing. He views data science as a contact sport, with the analyst, the data, and analytical models as the players. Other interests include photography, drawing, music, and partner dancing.
Prem Shah is a data scientist working in the Data Science Team at TIBCO based out of their Seattle office. He has a strong inclination to figure out data driven and automated solutions and wants to work with new technologies to get insights. He likes to play the keyboard in his spare time and usually is working on pet projects that involve combining deep learning with his interests.