Visualising Deviation From Average in Power BI

Note that throughout this article I will use the margin measure to illustrate the deviation-from-average pattern, but the same pattern can be applied to any measure: sales volume, delivery time, profit, etc.

Margin is a key metric to assess the high-level performance of a company.
But sometimes we want to measure and compare the performance of a specific shop, department, or employee against the overall performance of the entire company, and this is where the deviation-from-average metric comes in.

The requirements are as follows:

  • Company overall margin over time (which is just the margin average)
  • Employee margin over time
  • Employee margin deviation from the company margin average over time
  • Employee margin deviation vs Last Year Employee margin deviation

And the final result should look like this:

[Image: final matrix visual]

Now let’s define our measures:


 

Margin and Overall Margin calculation

Margin:

Margin % = DIVIDE([Total Revenue] - [Total COGS], [Total COGS])

Overall Margin:

All Margin =
IF(
    ISBLANK([Margin %]),
    BLANK(),
    CALCULATE([Margin %], ALLEXCEPT('Fact', 'Date'))
)

ALLEXCEPT removes all context filters from the 'Fact' table except those applied through the 'Date' table, so the measure returns the company-wide margin for the period in context, regardless of the employee on the row.

Margin Deviation From Average measures

Margin Deviation From AVG:

Margin Deviation =
VAR _AllMargin = CALCULATE([Margin %], ALLEXCEPT('Fact', 'Date'))
RETURN
IF(ISBLANK([Margin %]),
   BLANK(),
   [Margin %] - _AllMargin
)

Margin Deviation CY vs LY (Hidden):

Margin Deviation CY vs LY (hidden) =
VAR _MarginDeviationLY = CALCULATE([Margin Deviation], SAMEPERIODLASTYEAR('Date'[Date]))
VAR _MarginDeviationCYvsLY = [Margin Deviation] - _MarginDeviationLY
RETURN
IF(ISBLANK(_MarginDeviationLY),
   BLANK(),
   IF(_MarginDeviationCYvsLY > 0,
      1,
      0
     )
  )

Margin Deviation CY vs LY:

Margin Deviation CY vs LY =
IF(ISBLANK([Margin Deviation CY vs LY (hidden)]),
   BLANK(),
   IF([Margin Deviation CY vs LY (hidden)] > 0,
      UNICHAR(9650),
      UNICHAR(9660)
     )
  )

UNICHAR is a great function that returns the Unicode character referenced by the numeric value; we can draw really cool things with it.
(UNICHAR(9650) is an up arrow, UNICHAR(9660) a down arrow.)


Visualising everything together

Now let’s put together our measures into a matrix visual:

  • Margin %: Margin by employee and by year
  • All Margin: Yearly margin of the entire company
  • Margin Deviation: This is simply the difference between the employee margin and the average margin of the entire company
  • Margin Deviation CY vs LY: Current Year Margin Deviation versus Last Year Margin Deviation
[Image: matrix visual combining the four measures]

How do we read this result?

Andrew Ma:
In 2013 the company's average margin was 46.9% and Andrew's margin was 35.46%, hence below the average margin by 11.44 points.
In 2014 the average margin is 74.32% and Andrew has now performed better than the average margin, by 8.42 points.
Note that Andrew has also increased his margin deviation by nearly 20 points (8.42 − (−11.44)), so he got an up arrow.

Tina Lassila:
In 2013 Tina realised a margin of 101.44% and beat the average margin by 54.54 points.
In 2014 Tina realised a margin of 116.67%, which is a nice increase compared to 2013!
However, it turns out that the whole company performed better in 2014…
So if we look at Tina's margin deviation compared to last year, we notice that she did not perform as well as in 2013 and thus got a down arrow.

A nicer visual

[Images: matrix visual with conditional formatting applied]

To get this visual we just need to keep the previous matrix, remove the first two columns, and then apply some conditional formatting using the Margin Deviation CY vs LY (hidden) measure.
By looking at this visual we can clearly observe that Tina's performance compared to the average margin has decreased by around 10 points.

In most cases the margin indicator is used to assess the performance of people, teams, or products.
However, by using the margin alone we miss the overall trend of the company. At first sight an increase of 15% seems great, but once we realise that the overall increase was actually 28% it doesn't seem so great anymore.

This is why margin deviation from average can provide really great insight into our data. Using it alone could also be misleading, though: even if Tina's margin deviation is not that good, she is still the best performer for 2014.
So I think combining the Margin % and margin-deviation-from-average measures together is always a good idea!

Here is the Power BI Customer Profitability sample that I used for this article.

T-Test: Dr. Semmelweis and the discovery of handwashing

This article only illustrates the use of the t-test on a real-life problem; it does not provide any technical detail on what a t-test is or how it works. I will go through the t-test in detail in another post and link it from here.

Intro

I was looking for a cool dataset to illustrate the use of the t-test and I found the DataCamp project “Dr. Semmelweis and the discovery of handwashing”. It is a straightforward project, but I really like the way they introduce it, and specifically how they show beyond doubt that statistics plays a vital role in the medical field.

Here is the discovery of Dr. Ignaz Semmelweis:
“In 1847 the Hungarian physician Ignaz Semmelweis makes a breakthrough discovery: he discovers handwashing. Contaminated hands were a major cause of childbed fever, and by enforcing handwashing at his hospital he saved hundreds of lives.”

1. Meet Dr. Ignaz Semmelweis

This is Dr. Ignaz Semmelweis, a Hungarian physician born in 1818 and active at the Vienna General Hospital. If Dr. Semmelweis looks troubled it's probably because he's thinking about childbed fever: a deadly disease affecting women who have just given birth. He is thinking about it because in the early 1840s at the Vienna General Hospital as many as 10% of the women giving birth die from it. He is thinking about it because he knows the cause of childbed fever: it's the contaminated hands of the doctors delivering the babies. And they won't listen to him and wash their hands!

In this notebook, we’re going to reanalyze the data that made Semmelweis discover the importance of handwashing. Let’s start by looking at the data that made Semmelweis realize that something was wrong with the procedures at Vienna General Hospital.

# Load in the tidyverse package
library(tidyverse)
library(ggplot2)
# Read datasets/yearly_deaths_by_clinic.csv into yearly
yearly <- read_csv("datasets/yearly_deaths_by_clinic.csv")
# Print out yearly
yearly
year births deaths clinic
1841 3036 237 clinic 1
1842 3287 518 clinic 1
1843 3060 274 clinic 1
1844 3157 260 clinic 1
1845 3492 241 clinic 1
1846 4010 459 clinic 1
1841 2442 86 clinic 2
1842 2659 202 clinic 2
1843 2739 164 clinic 2
1844 2956 68 clinic 2
1845 3241 66 clinic 2
1846 3754 105 clinic 2

2. The alarming number of deaths

The table above shows the number of women giving birth at the two clinics at the Vienna General Hospital for the years 1841 to 1846. You’ll notice that giving birth was very dangerous; an alarming number of women died as the result of childbirth, most of them from childbed fever.

We see this more clearly if we look at the proportion of deaths out of the number of women giving birth.

# Adding a new column to yearly with proportion of deaths per no. births
yearly$proportion_deaths<-yearly$deaths/yearly$births
# Print out yearly
yearly
year births deaths clinic proportion_deaths
1841 3036 237 clinic 1 0.07806324
1842 3287 518 clinic 1 0.15759051
1843 3060 274 clinic 1 0.08954248
1844 3157 260 clinic 1 0.08235667
1845 3492 241 clinic 1 0.06901489
1846 4010 459 clinic 1 0.11446384
1841 2442 86 clinic 2 0.03521704
1842 2659 202 clinic 2 0.07596841
1843 2739 164 clinic 2 0.05987587
1844 2956 68 clinic 2 0.02300406
1845 3241 66 clinic 2 0.02036409
1846 3754 105 clinic 2 0.02797017

3. Death at the clinics

If we now plot the proportion of deaths at both clinic 1 and clinic 2 we’ll see a curious pattern…

# Setting the size of plots in this notebook
options(repr.plot.width=7, repr.plot.height=4)
# Plot yearly proportion of deaths at the two clinics
ggplot(data=yearly, aes(x=year, y=proportion_deaths, group=clinic, color=clinic)) +
 geom_line() + geom_point()+
 scale_color_brewer(palette="Paired")+
 theme_minimal()

[Plot: yearly proportion of deaths at clinic 1 and clinic 2]

4. The handwashing begins

Why is the proportion of deaths constantly so much higher in Clinic 1? Semmelweis saw the same pattern and was puzzled and distressed. The only difference between the clinics was that many medical students served at Clinic 1, while mostly midwife students served at Clinic 2. While the midwives only tended to the women giving birth, the medical students also spent time in the autopsy rooms examining corpses.

Semmelweis started to suspect that something on the corpses, spread from the hands of the medical students, caused childbed fever. So in a desperate attempt to stop the high mortality rates, he decreed: Wash your hands! This was an unorthodox and controversial request; nobody in Vienna knew about bacteria at this point in time.

Let’s load in monthly data from Clinic 1 to see if the handwashing had any effect.

# Read datasets/monthly_deaths.csv into monthly
monthly <- read_csv("datasets/monthly_deaths.csv")

# Adding a new column with proportion of deaths per no. births
monthly$proportion_deaths<-monthly$deaths/monthly$births

# Print out the first rows in monthly
head(monthly)
date births deaths proportion_deaths
1841-01-01 254 37 0.145669291
1841-02-01 239 18 0.075313808
1841-03-01 277 12 0.043321300
1841-04-01 255 4 0.015686275
1841-05-01 255 2 0.007843137
1841-06-01 200 10 0.050000000

5. The effect of handwashing

With the data loaded we can now look at the proportion of deaths over time. In the plot below we haven’t marked where obligatory handwashing started, but it reduced the proportion of deaths to such a degree that you should be able to spot it!

ggplot(data=monthly, aes(x=date, y=proportion_deaths)) +
geom_line() + geom_point()+
scale_color_brewer(palette="Paired")+
theme_minimal()

[Plot: monthly proportion of deaths at Clinic 1]

6. The effect of handwashing highlighted

Starting from the summer of 1847 the proportion of deaths is drastically reduced and, yes, this was when Semmelweis made handwashing obligatory.

The effect of handwashing is made even more clear if we highlight this in the graph.

# From this date handwashing was made mandatory
handwashing_start <- as.Date('1847-06-01')

# Add a TRUE/FALSE column to monthly called handwashing_started
monthly$handwashing_started <- monthly$date >= handwashing_start

# Plot monthly proportion of deaths before and after handwashing
ggplot(data=monthly, aes(x=date, y=proportion_deaths, group=handwashing_started, color=handwashing_started)) +
geom_line() + geom_point()+
scale_color_brewer(palette="Paired")+
theme_minimal()

[Plot: proportion of deaths before and after handwashing]

7. More handwashing, fewer deaths?

Again, the graph shows that handwashing had a huge effect. How much did it reduce the monthly proportion of deaths on average?

# Calculating the mean proportion of deaths
# before and after handwashing.

monthly_summary <- monthly %>%
  group_by(handwashing_started) %>%
  summarise(mean_proportion_deaths = mean(proportion_deaths))

# Printing out the summary.
monthly_summary
handwashing_started mean_proportion_deaths
FALSE 0.10504998
TRUE 0.02109338

8. A statistical analysis of Semmelweis' handwashing data

It reduced the proportion of deaths by around 8 percentage points! From 10% on average before handwashing to just 2% when handwashing was enforced (which is still a high number by modern standards).
To get a feeling for the uncertainty around how much handwashing reduced mortality, we can look at a confidence interval (here calculated using a t-test).

# Calculating a 95% confidence interval using t.test
test_result <- t.test( proportion_deaths ~ handwashing_started, data = monthly)
test_result

9. The fate of Dr. Semmelweis

That the doctors didn’t wash their hands increased the proportion of deaths by between 6.7 and 10 percentage points, according to a 95% confidence interval. All in all, it would seem that Semmelweis had solid evidence that handwashing was a simple but highly effective procedure that could save many lives.

The tragedy is that, despite the evidence, Semmelweis’ theory — that childbed fever was caused by some “substance” (what we today know as bacteria) from autopsy room corpses — was ridiculed by contemporary scientists. The medical community largely rejected his discovery and in 1849 he was forced to leave the Vienna General Hospital for good.

One reason for this was that statistics and statistical arguments were uncommon in medical science in the 1800s. Semmelweis only published his data as long tables of raw numbers; he showed no graphs and no confidence intervals. Had he had access to the analysis we've just put together, he might have been more successful in getting the Viennese doctors to wash their hands.

 

Central Limit Theorem (example using R)

The Central Limit Theorem is probably the most important theorem in statistics.

The central limit theorem (CLT) states that, given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples drawn from that population will be approximately equal to the mean of the population.

Furthermore, the CLT states that as you increase the number of samples and the sample size, the distribution of the sample means will better approximate a normal distribution (a “bell curve”), even if the original variables themselves are not normally distributed.

Let’s break this down with some examples using R:

Original Population with a left skewed distribution

Let’s generate our left skewed distribution in R.

By using the rbeta function below, I generated 10,000 random values from a Beta(5, 2) distribution (which lie between 0 and 1), scaled them by 10, and deliberately chose the shape parameters to produce a distribution with negative skewness.
myRandomVariables <- rbeta(10000, 5, 2) * 10
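To check the population parameters quoted below, we can compute the mean and standard deviation directly. A minimal sketch (exact values vary slightly from run to run unless a seed is set; the theoretical values are µ ≈ 7.14 and σ ≈ 1.6):

```r
# Reproducible version of the population draw
set.seed(1)
myRandomVariables <- rbeta(10000, 5, 2) * 10

# Population parameters: mean (µ) and standard deviation (σ)
mean(myRandomVariables)  # approximately 7.13
sd(myRandomVariables)    # approximately 1.6

hist(myRandomVariables, breaks = 50, main = "Left-skewed population")
```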

[Plot: histogram of the left-skewed population]

The mean (µ) of the population is 7.13 and the standard deviation (σ) is 1.61.
We can see that the distribution has a longer tail on the left, with some data up to 4 standard deviations away from the mean, whereas the data on the right don't go beyond 2σ from the mean.

As the plot above shows, the standard deviation (σ) lets us see how far from the mean each data point is.
A small σ means that the values in a data set are, on average, close to the mean; a large σ means that they are, on average, farther away from it.
As σ is 1.61, all the data between 5.52 (7.13 − 1.61) and 8.74 (7.13 + 1.61) are close to the mean (less than 1σ away).
However, the data below 2.30 (7.13 − 3 × 1.61) are much farther from the mean: at least 3σ away.

To better illustrate this, let's see the same plot with the data scaled so that the mean equals 0 and the standard deviation equals 1.
The formula to normalise the data is (x − µ) / σ.
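As a quick sketch, this normalisation can be done in R either by hand or with the built-in scale function (assuming the myRandomVariables population generated earlier):

```r
# Normalise: subtract the mean, then divide by the standard deviation
normalised <- (myRandomVariables - mean(myRandomVariables)) / sd(myRandomVariables)

# scale() does the same thing (it returns a matrix, hence as.vector)
all.equal(as.vector(scale(myRandomVariables)), normalised)

hist(normalised, breaks = 50, main = "Normalised population")
```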

[Plot: histogram of the normalised population]

The distribution still has exactly the same shape; scaling just makes it easier to observe how close to or far from the mean the data are.

Using the Central Limit Theorem to Construct the Sampling Distribution

So how can we use the CLT to construct the sampling distribution? We'll use what we know about the population and our proposed sample size to sketch the theoretical sampling distribution.

The CLT states that:

  • Shape of the sampling distribution: as long as our sample size is sufficiently large (>= 30 is the most common threshold, but some textbooks use 20 or 50), we can assume the distribution of the sample means to be approximately normal, regardless of the shape of the original distribution.
  • The mean of the distribution (x̅): The mean of the sampling distribution should be equal to the mean of the original distribution.
  • The standard error of the distribution (σx): The standard deviation of the sample means can be estimated by dividing the standard deviation of the original population by the square root of the sample size.  σx = σ/√n

 

Let’s prove it then!

I will first draw 100 sample means from the original population, using the minimum sample size recommended by the CLT: 30.
Here is the code to generate the sample means:
[Image: R code generating the sample means]
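The original code was shown as a screenshot; a minimal R sketch that produces the sample means described below might look like this (assuming myRandomVariables is the population generated earlier):

```r
# Draw 100 samples of size 30 from the population
# and keep the mean of each sample
sample_size <- 30
n_samples <- 100
sample_means <- replicate(n_samples, mean(sample(myRandomVariables, sample_size)))

mean(sample_means)  # should be close to the population mean (~7.13)
sd(sample_means)    # standard error, should be close to sigma / sqrt(30) (~0.29)

hist(sample_means, breaks = 20, main = "Distribution of 100 sample means")
```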

So according to the CLT, the three following statements should be true:

  1. The mean of our sample means distribution should be around 7.13
  2. The shape of the sample distribution should be approximately normal
  3. The standard error (σx = σ/√n) should be equal to 0.29 (1.61/√30)

[Plot: distribution of 100 sample means (n = 30)]

  1. The mean is 7.15, hence nearly 7.13
  2. The shape is approximately normal, though still a little left-skewed
  3. The standard error is 0.3, hence nearly 0.29

The CLT also states that as you increase the number of samples, the distribution of the sample means will better approximate a normal distribution.

Let’s draw more samples.

Now I take 1,000 sample means and plot them.

[Plot: distribution of 1,000 sample means (n = 30)]

  1. The mean is still 7.15 and not exactly 7.13
  2. The shape is approximately normal but still a little bit left-skewed
  3. The standard error is equal to 0.29, as estimated by the CLT

Let’s take even more sample means.

This time I take 10,000 sample means and plot them.

[Plot: distribution of 10,000 sample means (n = 30)]

  1. The mean is now exactly 7.13
  2. The distribution shape is definitely normal
  3. The standard error is equal to 0.29, as estimated by the CLT

Just for fun, let's do another example, this time with a different sample size, to see if we get the standard error right.
Using the CLT, σx should be 0.11 (1.61/√200).

[Plot: distribution of 10,000 sample means (n = 200)]

We have increased each sample size to 200, instead of 30 in the previous examples; hence the variance of the sample-means distribution has decreased and the standard error is now smaller.
This confirms that as we increase the sample size the distribution becomes more normal, and the curve becomes taller and narrower.
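The larger-sample experiment can be sketched the same way (again assuming the myRandomVariables population from earlier; 10,000 samples of size 200):

```r
# 10,000 sample means, each from a sample of size 200
sample_means_200 <- replicate(10000, mean(sample(myRandomVariables, 200)))

sd(sample_means_200)  # standard error, should be close to sigma / sqrt(200) (~0.11)

hist(sample_means_200, breaks = 30, main = "10,000 sample means, n = 200")
```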

Summary

The CLT confirms the intuitive notion that as we take more samples, or as we increase the sample size, the distribution of the sample means begins to approximate a normal distribution, even if the original variables themselves are not normally distributed. Its mean equals the mean of the original population (x̅ = µ), and its standard deviation equals the standard deviation of the original population divided by the square root of the sample size (σx = σ/√n).

 

In the next post I'll talk more in depth about the CLT, and we'll see how and where we can use it.