Covid-19 – How is the outbreak growing? A deep dive analysis with Power BI

Introduction

With the rapid spread of the novel coronavirus Covid-19 across the globe, a massive amount of data is generated every day. 

Many organizations such as the WHO or the CDC have publicly shared datasets on the worldwide impact of COVID-19.

By now, we’ve probably seen hundreds of graphs and charts across the internet or on TV depicting the new confirmed cases or cumulative cases around the world.

Although those charts highlight important daily statistics, I feel that the data is often not analyzed in a way that provides real insight, and the charts can sometimes be misleading.

So using Power BI I will attempt to provide a more in-depth analysis of the outbreak and share the insights I found.

So what are the actual insights about the outbreak? 

What do we really want to know?

  • Where does the virus spread faster?
  • Which countries are the most affected?
  • Which countries have the most severe cases?
  • Which countries better handle the outbreak?
  • Is the epidemic slowing down? And where?
  • Has the curve flattened?
  • When will the peak be reached? Or has it already been reached?

Data sources

https://github.com/CSSEGISandData/COVID-19

The dataset contains time series data of the number of cases, deaths and recoveries across each country on a daily basis.

Terms of Use:
This GitHub repo copyright 2020 Johns Hopkins University, all rights reserved, is provided to the public strictly for educational and academic research purposes. Reliance on the Website for medical guidance or use of the Website in commerce is strictly prohibited.

Flatten the curve

Countries around the world are working on slowing the spread of the infection. “Flattening the curve” is a strategy to reduce the number of new cases from one day to the next, to prevent healthcare systems from being overwhelmed.

Most of the charts shown on the news represent the new daily cases or the total number of cases/fatalities over the past few weeks by country. These statistics are good at making the headlines but what does that tell us? What are the actual insights that we can take from it? How do we know if the curve is flattening?

Daily figures

The above chart shows the daily cases over time in Italy. It seems that the number of new confirmed cases has begun to plateau or even fall, but it is still not really obvious whether Italy has started to flatten the curve of cases.

However, if instead of looking at the new cases we look at the rate of change of today’s data versus yesterday’s data, or versus the average of the last 7 days, we start to get a sense of where the outbreak is heading.

Figure 1: Daily Cases in Italy – Data as of Tuesday 14th April 2020

So how can we spot a flattened curve in this chart?

A flattened curve will show a downward trend in the last 7 days avg, whereas an upward trend will indicate that the virus is still spreading rapidly.

We clearly observe a downward curve in the “Last 7 Days Avg” trend, so this is a good sign that Italy has managed to flatten the curve. Adding the last 7 days avg trend on top of the daily cases provides a much clearer view of whether the infection rate is slowing down or still growing rapidly.
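As a rough illustration of the “Last 7 Days Avg” idea, here is a minimal Python sketch (the report itself is built in Power BI, so this is not the measure behind the chart). The function name `trailing_average` and the sample numbers are made up, and I am assuming the average includes the current day.

```python
def trailing_average(values, window=7):
    """Average each value with up to `window - 1` values before it.

    Assumption: the current day is included in the window. Early days,
    which have fewer than `window` values available, use what exists.
    """
    averages = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Made-up daily new-case counts, for illustration only (not real data).
daily_cases = [100, 140, 120, 200, 180, 260, 240, 220, 230, 190, 210, 170, 180, 150]
smoothed = trailing_average(daily_cases)

# A flattening curve shows up as a falling smoothed value.
is_trending_down = smoothed[-1] < smoothed[-4]
```

The smoothed series reacts more slowly than the raw daily figures, which is exactly why it makes the direction of the trend easier to read.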

Analysis by country

Total cases, Mortality rate, Recovery rate

What does this chart tell us? Does it provide any insights?
Yes and no…
It tells us where things stand right now and how a country compares to other countries.

The US seems to be the most impacted country, as it has far more cases than any other country as well as more fatalities.
Italy has the highest mortality rate.
Russia has the lowest mortality rate.
China has the highest recovery rate.

I could go on and on listing the insights given by this chart, but hang on: how do we compare the US with Switzerland?
They have very similar mortality rates (US 4.25%, Switzerland 4.52%), but the US has 24 times more cases than Switzerland and the US population is about 38 times bigger than Switzerland’s.
So what can we infer from that?


If instead of looking at the number of cases we look at the number of cases per million inhabitants, the disparities between countries become much clearer.
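To make the US versus Switzerland comparison concrete, here is a small Python sketch of the per-million calculation. The helper name `per_million` is my own, and the only inputs are the two ratios quoted above (24 times the cases, about 38 times the population).

```python
def per_million(count, population):
    """Scale a raw count to a rate per 1,000,000 inhabitants."""
    return count * 1_000_000 / population


# Ratios quoted in the text: the US has 24 times more cases than
# Switzerland and about 38 times the population.
cases_ratio = 24
population_ratio = 38

# Per-capita case rate of the US relative to Switzerland.
per_capita_ratio = cases_ratio / population_ratio  # roughly 0.63
```

So despite having far more cases in absolute terms, the US comes out at roughly two thirds of Switzerland’s rate once population is taken into account.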

Which countries seem to better handle the outbreak?

Now let’s visualize Covid-19 cases and fatalities per million inhabitants:

Sorted by Cases for 1M:

Figure 3: Cases per 1M Inhabitants by Country – Data as of Tuesday 14th April 2020

Sorted by Deaths for 1M:

Figure 4: Deaths per 1M Inhabitants by Country – Data as of Tuesday 14th April 2020

Now, this chart provides a lot more insight: we clearly observe a significant difference between countries, and we get a better intuition about which countries are handling the pandemic better or have more resources to handle it.

Spain has the highest case rate, with 3,676 cases per 1M, followed by Switzerland and Belgium. The United States, the country with by far the most cases, still has a relatively low rate in comparison: 1,844 per 1M.

Germany has more than 130k cases, nearly 5 times as many as Belgium, but its fatality rate stands at only 2.5% compared to 13% in Belgium. Germany has 39.6 deaths per 1M whereas Spain has 384.7, which is about 10 times more.
Is that because Germany has been testing far more people than other countries? At the time of writing this post, I haven’t gathered any data about the number of tests by country. I’d be tempted to say yes but as I can’t back it up I won’t say it!

How can we better visualize disparities across countries?

We’ve seen above that using a ratio per 1M inhabitants gives a clearer view of the disparities between countries, but the raw numbers still do not provide an easy way to visualize them.
One chart I like to use when I want to compare two ratios is the scatter plot.

Figure 5: Top 15 Countries Ratio Cases/Deaths per 1M Inhabitants (14.04.2020)

How to read this chart:

  • Circle size represents the number of cases
  • The dotted line represents the ratio between Cases per 1M over Deaths per 1M
  • On the lower side of the chart (right symmetry), we assume that the greater the distance between a country and the dotted line, the better the country is handling the outbreak
  • On the upper side (left symmetry), we assume that the greater the distance between a country and the dotted line, the worse the country is handling the outbreak

Why do I use ratios and a scatter plot? Well, does the number of cases on its own tell us which country is the most impacted?

No, not if we assume that the population size of a country is associated with its ICU bed capacity and medical equipment such as ventilators. (I’m not saying it’s true.)

So now that we know how to read this chart, and supposing that in theory population size is associated with hospital bed capacity, let’s dive into this chart again.

Figure 6: Top 15 Countries Ratio Cases/Deaths per 1M Inhabitants (14.04.2020)

Which countries seem to do better and worse than others?

From the previous visuals, we started to get an intuition about which country was doing better than another, but it was still not obvious how Switzerland and the US differed.

To compare the ratio of each country side by side, the scatter plot is in my opinion by far the most appropriate visual.

So far we had identified that Germany was doing far better than other countries, but we had no clue that Switzerland was also standing out.

So among the most impacted countries, based on the ratio (Deaths/Cases per 1M inhabitants), Germany, Switzerland and the US seem to be handling the outbreak better than other countries, while Belgium, Italy and the UK have a higher Deaths/Cases ratio than other countries.

Now we’ve seen that depicting the relationship between the two ratios, Deaths and Cases per 1M inhabitants, gives us a clear picture of which countries are the most gravely affected. But how do we know where the virus spreads faster? What about countries where the infection has just started?

Don’t show the date on the X-axis

The number of cases isn’t going to start on the same day across all countries. Instead, the virus will tend to spread in a specific location then to nearby locations and then gradually all over the world.

So, in that scenario how does one country compare to another? 

The date is not relevant. In February China was the hotspot (it has since mostly contained the outbreak), then in March Italy was the hotspot, and now in April the US is the hotspot.

If we were to compare these 3 countries using the date scale it would look like this:

As the outbreak did not begin at the same time in these 3 countries, these charts do not provide actual insights and cannot answer the question “Where does the virus spread faster?”

Instead of using the date on the X-axis, we will use the number of days since 50 cases were first recorded, or since 10 deaths were first recorded. This brings all the countries to the same starting point.
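The re-indexing described here can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name `days_since_threshold` and the two sample series are hypothetical, and day 0 is taken to be the first day the cumulative count reaches the threshold.

```python
def days_since_threshold(cumulative, threshold=50):
    """Drop the days before the cumulative count first reaches `threshold`.

    The returned list is indexed by "days since the threshold was reached",
    so index 0 is the first day at or above the threshold.
    """
    for i, total in enumerate(cumulative):
        if total >= threshold:
            return cumulative[i:]
    return []  # the threshold has not been reached yet


# Hypothetical cumulative case counts for two countries whose outbreaks
# started on different calendar dates (made-up numbers).
country_a = [3, 12, 55, 90, 160, 300]
country_b = [0, 0, 0, 1, 20, 70, 150]

aligned_a = days_since_threshold(country_a)
aligned_b = days_since_threshold(country_b)
```

Once both series start at day 0, they can be plotted on the same X-axis and compared directly, whatever calendar date each outbreak began on.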

Where does the virus spread faster?

The two charts below allow us to compare how fast the number of confirmed cases increased after the outbreak has reached a similar stage in each country.

The first chart represents the cumulative number of cases across the top 10 most affected countries, by number of days since the 50th case was recorded (over 20 days).

Figure 9: Total cases by number of days since 50th case recorded (0-20 days)

We can see how strong the spread is in Turkey, which has around 3 times more cases than France and Italy within the 20 days after the infection began to spread.
As the virus began to spread later in Turkey, we wouldn’t have been able to visualize this using the date scale.
So using the “number of days since 50th case recorded” gives a much more accurate view of how rapidly the virus spread in each country.

This second chart represents the cumulative number of cases across the top 10 most affected countries, by number of days since the 50th case was recorded (over 40 days).

Figure 10: Total cases by number of days since 50th case recorded (0-40 days)

This time we see something interesting: the US was not among the 10 most-affected countries when looking at a period of 20 days since the infection started, but at 40 days its number of cases is far higher than any other country’s.
It looks like it’s only after 25 days that its number of cases started to grow exponentially.
Another interesting point is that at 20 days the virus seemed to spread much faster in Turkey than anywhere else in the world, but 5 days later Spain took over.
So did Turkey manage to slow down the infection, or did Spain perhaps have a sudden increase in cases?


How can we effectively compare the growth in cases of different countries?

Logarithmic scale

The log scale will help better visualize early exponential growth.
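A quick way to see why a log scale helps is to take the logarithm of a series that doubles every day: the steps between consecutive days become constant, so the curve turns into a straight line. The numbers in this Python sketch are made up and are not taken from the dataset.

```python
import math

# Made-up series that doubles every day, starting at 50 cases.
cases = [50 * 2 ** day for day in range(6)]  # 50, 100, 200, 400, 800, 1600

# On a log10 scale each doubling adds the same amount, log10(2).
log_cases = [math.log10(c) for c in cases]
steps = [later - earlier for earlier, later in zip(log_cases, log_cases[1:])]
```

On a linear axis the early values of such a series are squashed against the bottom of the chart, whereas on a log axis every doubling takes up the same vertical distance.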

So now we get even more insight on when and where the virus spread faster.
Turkey had early exponential growth in cases: at [...] days Turkey had [...] times more cases than Spain, [...] times more than France and the UK, and 50 times more than the US!
The exponential growth for France, Spain and the UK started at around the 10th day of the outbreak (since 50th case recorded) and it started around the 15th day for the US.
(Note: I use the term exponential growth to mean “really fast”, not to mean that cases double every day.)

 

We’ve now seen where the virus is spreading faster.

Most of the countries impacted by the rapid spread of the virus have ordered a lockdown in order to slow the epidemic.

So how can we track the effectiveness of the lockdown?

Tracking the effect of the lockdown period on the spread of the virus is an important indication of how well government responses around the world have worked.

How long does it take for the curve to flatten after lockdown starts?

The number of cases or fatalities in a country isn’t going to start flattening overnight. It can take up to two weeks for symptoms to appear, and even longer to go from being infected to unfortunately passing away.

Most estimates of the incubation period for COVID-19 range from 1-14 days, most commonly around five days.

Let’s visualize the effectiveness of lockdown on the new daily cases in a few different countries:

Spain

Lockdown seems to be quite effective: on average it takes 20 days to see the new daily cases slow down after the lockdown starts, as the recap table below shows.

Country | Lockdown | Curve starts to flatten | Duration
Italy | 09 March | 27 March | 18 days
Spain | 14 March | 01 April | 18 days
US | 19 March | 11 April | 23 days
UK | 24 March | 15 April | 22 days
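The durations in this table, and the 20-day average, can be checked with a few lines of Python. The dates are the ones from the table; the year 2020 comes from the figure captions, and the variable names are my own.

```python
from datetime import date

# (lockdown start, curve starts to flatten), as listed in the table above.
lockdowns = {
    "Italy": (date(2020, 3, 9), date(2020, 3, 27)),
    "Spain": (date(2020, 3, 14), date(2020, 4, 1)),
    "US": (date(2020, 3, 19), date(2020, 4, 11)),
    "UK": (date(2020, 3, 24), date(2020, 4, 15)),
}

# Number of days between the lockdown and the start of the flattening.
durations = {
    country: (flatten - start).days
    for country, (start, flatten) in lockdowns.items()
}

average_duration = sum(durations.values()) / len(durations)  # 20.25 days
```

This gives 18, 18, 23 and 22 days, for an average of 20.25 days, which matches the “on average 20 days” figure.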

Now let’s visualize the effectiveness of lockdown on the new daily deaths in a few different countries:

Spain

Again, lockdown seems to be quite effective: on average it takes 22 days to see the new daily deaths slow down after the lockdown starts, as the recap table below shows.

Country | Lockdown | Curve starts to flatten | Duration
Italy | 09 March | 03 April | 24 days
Spain | 14 March | 04 April | 21 days
US | 19 March | – ? – | 25 days+
UK | 24 March | 15 April ? | 22 days

However, for the US there seems to be an issue with the data that distorts the actual trend. It could indicate either explosive growth in fatalities, and thus that the lockdown is not effective, or just a change in how deaths are counted, as in France, where fatalities in nursing homes were excluded from official numbers until the beginning of April.

A sudden extreme rise or fall in the number of cases or fatalities is what we call in statistics an outlier. Outliers affect the mean value of the data and can make the trend harder to forecast. (We’ll see that in the next part.)

When will the peak be reached?

Just to be clear here.
The “linear regression” model that I will use is very basic and can by no means accurately predict the future outcome of the pandemic.
Even for experts, the future of the pandemic is still hard to predict and, as experts have said, no matter how much data we gather, models can’t predict human behaviour.

First, let me explain how I’ll try to predict when the peak will be reached.

To predict when the peak will be reached, I look at the daily rate of change in the number of cases. When a country has fewer new daily cases or fatalities than the previous day, the change rate is negative, and when there are more cases today than yesterday, the change rate is positive.
So if the change rate is positive, the outbreak is still in a growing phase and not yet under control. If the total number of cases is still growing but the change rate is negative, the outbreak is slowing down.
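Here is a minimal Python sketch of that daily change rate, expressed as a percentage of the previous day’s figure. It is only an illustration: the report computes this in Power BI, the function name `daily_change_rate` is hypothetical, and treating a previous day of zero as undefined is my own assumption.

```python
def daily_change_rate(daily_counts):
    """Percentage change of each day's count versus the previous day."""
    rates = []
    for previous, current in zip(daily_counts, daily_counts[1:]):
        if previous == 0:
            # Assumption: the rate is undefined when the previous day is 0.
            rates.append(None)
        else:
            rates.append((current - previous) * 100 / previous)
    return rates


# Made-up daily new-case counts, for illustration only.
new_cases = [100, 120, 90, 90]
rates = daily_change_rate(new_cases)  # +20%, -25%, then 0%
```

A positive value means more new cases than the day before, and a negative value means fewer.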

A picture is worth a thousand words so let’s visualize this:

Daily Cases Change Rate in Italy – 01 March – 14 April

How to read this chart:
Here we have the change rate of “Daily Cases Change” (yellow) and its trend “Estimated Cases Growth %” (dotted blue line) using a linear regression. 

If the rate is positive, the daily number of deaths/cases is growing, if it is negative, the daily number of deaths/cases is shrinking. 

If the linear trend is going up, the overall change rate is growing; if it’s going down, the overall change rate is decreasing. When the linear trend crosses 0, the peak has been reached, and if the linear trend approaches 0 in the future, we predict that the peak will be reached at that time.

So here in Italy, we see that the linear trend of the change rate reaches 0 on April 5th, which is when the model predicts the peak. We already know that the peak was actually reached a few days earlier.
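The idea of fitting a straight line to the change rate and reading off where it crosses 0 can be sketched in Python as follows. This uses an ordinary least-squares fit on made-up numbers. It is not the DAX behind the report, and the function names `linear_fit` and `zero_crossing` are my own.

```python
def linear_fit(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept). Needs at least two points.
    """
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def zero_crossing(slope, intercept):
    """Day index at which the fitted line reaches 0 (None if it is flat)."""
    if slope == 0:
        return None
    return -intercept / slope


# Made-up daily change rates (%) that shrink steadily.
change_rates = [30, 24, 18, 12, 6]
slope, intercept = linear_fit(change_rates)
peak_day = zero_crossing(slope, intercept)  # day index 5 for this series
```

With these sample values the fitted line loses 6 points per day and reaches 0 on day 5, which would be the predicted peak under this very simple model.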

Now let’s see how outliers can affect visuals as well as the forecast trend. In the visual below for the UK, there was only one confirmed case on the 15th March and 406 confirmed cases on the 16th March. This is a 40,500% increase in the daily change rate, and we can see how this outlier affects the visual.

It seems that all the other values, apart from the second outlier, are lying flat on the X-axis.

And if we were to forecast the daily change rate in the future while keeping the outliers, we would predict an increase of 212% for the 16th April, so about 3 times more cases than the day before…
Fortunately, this is completely wrong: my linear regression model has been skewed away from the true underlying relationship by the few outliers.

So what can we do about outliers?

In that scenario, outliers are likely to be associated with a change in reporting methods by public health bodies or governments, so experts will know what to do with them.
In my case I will just drop them, since missing values won’t impact my model. In other scenarios, however, we would probably have to cap them or assign them another value such as the mean or a percentile.

Anyway, after dropping the few outliers, this is what the UK Daily Cases Change rate looks like:

Now, after dropping the outliers, we predict a change rate of 0.04%, which is much closer to reality.
This last part was more to showcase what we can do in Power BI by using only DAX (no R or Python involved) rather than providing a real forecast.

I appreciate that it is very hard at this time to see into the future of Covid-19. However, I do believe that short-term estimates help countries prepare to combat the virus.
In this post, I’ve shown how we could predict when the peak will be reached, but many other things could also be estimated, such as the duration of the peak or the time when COVID-19 infections will fall below a certain threshold.

Final Thoughts

Thanks for sticking with me until the end!
I hope you feel that we have uncovered some useful insights and that I have demonstrated that using the appropriate visuals helps to analyse data more efficiently.
Let’s recap a few of them:

Daily figures:
Comparing daily figures with a rolling average, such as the last 7 days avg, helps to visualize when the curve has flattened.
By now most Western countries seem to have flattened the curve, especially those that have ordered a lockdown.


Different scales tell different stories:
Logarithmic scale helps to visualize exponential growth. Turkey had the earliest exponential growth in cases.

Tracking the effectiveness of lockdown:
It is crucial for public health bodies and governments to track how effective lockdowns are in stopping the spread of the virus.
Lockdown seems to have worked in most countries, but there are still a few countries where it seems to be a bit less effective, such as the US or even Belgium.

Using the number of days since the 50th case instead of the date helps visualize how rapidly the infection increases in each country.

Using deaths/cases per capita:
Knowing how many people have died compared to how many people live in that country is more insightful than showing the raw number of deaths or cases and helps to see which countries are the most affected and those which better handle the outbreak.

When will the outbreak peak?
As we’ve seen, it’s easy to know when we’ve reached the peak.
Once we start seeing the downward trend, it means the peak has been reached.
Knowing when the pandemic will peak is also useful: if we know that the virus will peak soon and start to be contained within a few weeks, panic and fear of the unknown can fade more quickly and restrictions can be lifted sooner.

Power BI Link

You can fully interact with the reports shown in this post here.
I plan to add more features and insightful visuals, so stay tuned.
