There is already an idea submitted to Microsoft to enable this capability; however, it does not have many votes, so it is unlikely to be added anytime soon. In the meantime, a workaround comes to the rescue!

According to the comments submitted on the “Microsoft Idea”, the main use case is to show the **result of a statistical test** or the **summary of a model**. *(And this is exactly what I’m using it for too.)*

Here are some comments submitted to Microsoft:

- “R Integration is fantastic… but seeing the console output in a visual is still needed **to see all the coefficients and model stats etc.**…”
- “Would be great to have this as a simple way to **get results of statistical tests**.”
- “That would be a fantastic possibility, I need to just **show** the **summary of a regression model!**”
- “…I’m working with **forecasts** as well as **predictive models** and need to be able **to print results**…”

The solutions that I will describe in this post are workarounds, so they are not as straightforward as simply outputting the result of a statistical test or a model summary. They require some text manipulation but are still pretty quick to implement.

Let’s say we want to run a t-test in Power BI using R and see the result in a visual.

If we write the R code in RStudio and run it, the output of the t-test looks like this:

```
# dataset <- unique(dataset)
res <- t.test(dataset$ColumnA, dataset$ColumnB, paired = TRUE,
              alternative = "less", conf.level = 0.95)
res
```

However, when we run the same code in an R visual in Power BI the output is as follows:

As we can see, we get the error message “*The R code didn’t result in creation of any visuals*”. Unfortunately, at the time of writing this post, it is not possible to display the R console output in Power BI.

According to the error message, we cannot simply print the result; however, **we can “plot” it**, since the visual renders as long as an image is created. So let’s do it!

ggplot2 is one of the most popular R libraries for creating graphics.

Here is what we need to do before plotting the t-test output:

- First, store the result of the statistical test or predictive model in a variable
- From this variable, retrieve each coefficient that we want to show
- Create a new text variable where we concatenate and indent all the information that we want to display
- Plot this variable using the ggplot2 library with the annotate function, and the magic happens

```
# dataset <- unique(dataset)
library(ggplot2)
res <- t.test(dataset$dem_percent_12, dataset$dem_percent_16, paired = TRUE,
              alternative = "less", conf.level = 0.95)
text <- paste("R result:", "\n",
              "method: ", res$method, "\n",
              "alternative: ", res$alternative, "\n",
              "T: ", res$statistic, "\n",
              "p-value: ", res$p.value, "\n",
              "Confidence Interval:", "95%", "\n",
              "CI Low: ", res$conf.int[1], "\n",
              "CI Up: ", res$conf.int[2], "\n")
ggplot() +
  annotate("text", x = 0.5, y = 0, size = 6, label = text, hjust = 0) +
  theme_void()
```

Another solution is to use the gridExtra library and its grid.table function; this solution is a bit tidier and neater.

I may be old-fashioned, but I actually prefer a simple text output using the ggplot2 approach above.

Here the approach is very similar, but instead of concatenating the t-test result variables into a text variable, we create a data frame with a row for each variable that we want to display.

```
library(gridExtra)
library(grid)
res <- t.test(dataset$dem_percent_12, dataset$dem_percent_16, paired = TRUE,
              alternative = "two.sided", conf.level = 0.95)
Name <- c("method", "alternative", "T-stat", "P-value")
Value <- c(res$method, res$alternative, res$statistic, res$p.value)
df <- data.frame(Name, Value)
tt <- ttheme_default(colhead = list(fg_params = list(parse = TRUE)))
grid.table(df, theme = tt)
```

And finally, the last solution that I can think of is to use the gtable library, which is very similar to the grid library but has the advantage of being more flexible, more customizable and even nicer-looking; however, it is less straightforward to implement.

```
library(gtable)
library(gridExtra)
library(grid)
res <- t.test(dataset$dem_percent_12, dataset$dem_percent_16, paired = TRUE,
              alternative = "two.sided", conf.level = 0.95)
Name <- c("method", "alternative", "T-stat", "P-value")
Value <- c(res$method, res$alternative, res$statistic, res$p.value)
df <- data.frame(Name, Value)
g <- tableGrob(df, rows = NULL)
g <- gtable_add_grob(g,
                     grobs = rectGrob(gp = gpar(fill = NA, lwd = 2)),
                     t = 2, b = nrow(g), l = 1, r = ncol(g))
g <- gtable_add_grob(g,
                     grobs = rectGrob(gp = gpar(fill = NA, lwd = 2)),
                     t = 1, l = 1, r = ncol(g))
grid.draw(g)
```

Here you can find all the details on how to use the grid and gtable libraries.

And here is how the R visuals render in Power BI; depending on the desired size of the visual, you may need to tweak the theme’s parameters.

As seen in this short post, it is not possible out of the box to show the text result of a statistical test, or the summary of a statistical model like a regression, in an R visual. But with this workaround, we can quickly retrieve the variables that we want to show and display them.

This may be a bit tedious for a large model with many variables to output, but I’m not sure that running a large predictive model in Power BI would be the right thing to do anyway.

A t-test is a type of inferential statistic that can be used to determine if the means of two groups of data are significantly different from each other.

In other words, it tells us if the differences in means could have happened by chance.

There are three types of t-test:

- An **Independent samples t-test** compares the means for two groups.
- A **Paired sample t-test** compares means from the same group at different times.
- A **One sample t-test** tests the mean of a single group against a known mean.

In this post, we will focus only on paired t-tests; I’ll soon write another post for the other types of t-test.

**A paired t-test** is used to compare two population means where we have two samples in which observations in one sample **can be paired** with observations in the other sample. We compare the two sample means at different times or under different conditions.

Examples of where we can use paired t-test:

- **Before-and-after**: Observations on the same students’ diagnostic test results before and after a particular module or course
- **Medicine**: Difference in cholesterol level or in blood pressure before and after treatment
- **Social research**: Determine whether there is a significant change in the scores of the same cases on the same variables over time, such as % turnout in presidential elections by state

A paired t-test relies on the following assumptions:

- Independence of the observations: measurements for one subject do not affect measurements for any other subject
- Each of the paired measurements must be obtained from the same subject
- The differences between pairs are normally distributed

In this section, I will break down every single step of implementing a paired t-test in Power BI: t-stat, p-value, standard error, confidence interval and critical value. I will explain the role of each statistical measure, what it is used for, and how to calculate it using DAX only.

- “d-bar” is the average difference between paired data
- “SE” is the standard error of “d-bar” (we’re going to cover it further down)
- “**δ**” (the Greek letter delta): since we’re using paired data samples, delta is equal to zero (we’ll cover it in the null hypothesis section)

Why implement the t-test using DAX only rather than an R or Python visual?

- Fully interactive visuals (custom visuals using R or Python are not)
- Security policy within your organisation: limitations on deploying Python or R scripts to the portal
- Forces you to better understand the test statistics process
- Easier to maintain: one language, one tool, easier integration and deployment, reusability
- Only DAX and some statistics knowledge required

The dataset contains data about the US presidential elections at a county level, with the percentage of votes that went to the Republican candidates in 2012 and 2016 (500 rows).

*I downloaded this dataset from DataCamp, but it is also publicly available at https://dataverse.harvard.edu/dataverse/. I chose this dataset because I needed a dataset with enough rows to make it easier to visualize the distribution.*

The model contains the main dataset that I described above and two parameter tables that I will describe later.

Here we want to compare and test the two paired samples (2012 vs 2016), and we can make three different hypotheses:

- **Two-tailed**: Is there any difference in means between the % of votes given to the Republican candidates between 2012 and 2016?
  - H0: μ2012 – μ2016 = 0
  - Ha: μ2012 – μ2016 ≠ 0
- **Left-tailed**: Was the % of votes given to the Republican candidates lower in 2012 compared to 2016?
  - H0: μ2012 – μ2016 >= 0
  - Ha: μ2012 – μ2016 < 0
- **Right-tailed**: Was the % of votes given to the Republican candidates greater in 2012 compared to 2016?
  - H0: μ2012 – μ2016 <= 0
  - Ha: μ2012 – μ2016 > 0

We will cover the null hypothesis “H0” and the alternative hypothesis “Ha” in more detail further down.

The first step in implementing our paired t-test in Power BI, to compare the two paired samples, is of course to calculate their difference in means.

So after calculating the difference between the two variables “dem_percent_12” and “dem_percent_16”, we can simply calculate the mean using the AVERAGE DAX function.

```
diff 2012 vs 2016 = dem_county_pres[dem_percent_12]-dem_county_pres[dem_percent_16]
mean_diff = AVERAGE(dem_county_pres[diff 2012 vs 2016])
```
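The two formulas above can be cross-checked outside Power BI. Here is a minimal Python sketch of the same two steps; the sample values are made up for illustration:

```python
from statistics import mean

# Hypothetical sample of paired percentages (2012 vs 2016)
dem_percent_12 = [52.1, 48.3, 60.2, 45.9, 55.0]
dem_percent_16 = [49.8, 44.1, 58.7, 40.2, 51.3]

# Equivalent of the calculated column "diff 2012 vs 2016"
diff = [a - b for a, b in zip(dem_percent_12, dem_percent_16)]

# Equivalent of the "mean_diff" measure (AVERAGE over the column)
mean_diff = mean(diff)
print(round(mean_diff, 2))  # 3.48
```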

The **standard deviation** (sd) is a measure of how spread out values are. A small standard deviation indicates that the values tend to be close to the mean, while a large standard deviation indicates that the values are spread out over a wider range. We will use the **SD** measure to calculate the **Standard Error**.

Luckily, there’s a built-in DAX function for the standard deviation. Here we’re using the **sample** standard deviation formula, since we’re working with a sample instead of a whole population.

The only difference between the **sample sd** and the **population sd** formula is the denominator: “n **- 1**” for the **sample sd** instead of “n” for the **population sd**. (The larger the sample, the closer the results of the two formulas will be.)

```
sd_diff = STDEV.S(dem_county_pres[diff 2012 vs 2016])
```

To put it simply, the **standard error** (**SE**, or **SEM** for the standard error of the mean, in our case) is the estimated standard deviation of the sample mean.

Its formula is the standard deviation (calculated above) divided by the square root of the sample size.

The difference between the **SD** and the **SEM** is that the standard deviation measures the dispersion of the individual values from the mean, while the standard error of the mean measures how far the sample mean of the data is likely to be from the true population mean.

```
SEM =
var __sd=[sd_diff]
var __n=[Size]
return
divide(__sd,sqrt(__n))
```
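As a quick sanity check, the SEM formula can be reproduced with Python’s standard library (the differences are the same made-up sample as before):

```python
from math import sqrt
from statistics import stdev  # sample standard deviation (n - 1 denominator)

# Hypothetical differences between the paired samples
diff = [2.3, 4.2, 1.5, 5.7, 3.7]

sd = stdev(diff)             # equivalent of STDEV.S
sem = sd / sqrt(len(diff))   # equivalent of the SEM measure
```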

The *t*-statistic (also called t-value or t-score) is used in a *t*-test to determine whether to support or reject the null hypothesis.

The larger the t-value is, the more likely the difference in means between the two samples will be statistically significant.

In order to support or reject the null hypothesis, we need to compare the t-stats result with the t-critical value given by the t-distribution table.

```
t_stat =
var __meandiff= [mean_diff]
var __sddiff= [sd_diff]
var __n=[Size]
var __parammudiff=0
var __se=[SE]
return
divide(__meandiff-__parammudiff,__se)
```
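The t_stat measure is just t = (d̄ − δ) / SE; the sketch below computes it with the Python standard library, using the same made-up differences as above:

```python
from math import sqrt
from statistics import mean, stdev

diff = [2.3, 4.2, 1.5, 5.7, 3.7]   # hypothetical paired differences

mean_diff = mean(diff)
sem = stdev(diff) / sqrt(len(diff))
mu_diff = 0                         # delta: hypothesised difference, 0 for paired data
t_stat = (mean_diff - mu_diff) / sem
```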

The t-critical value is **the cutoff between retaining or rejecting the null hypothesis**. If the t-statistic value is greater than the t-critical, meaning that it is beyond it on the x-axis, then the null hypothesis is rejected and the alternate hypothesis is accepted.

**How to calculate the t-critical value?**

Without a computer, calculating the critical value requires the use of a t-distribution table.

- **Step 1**: Calculate the **degrees of freedom** (**df**) –> **sample size - 1** (15 - 1 = 14 for the example above)
- **Step 2**: Choose the alpha level. The alpha level is the threshold used to judge whether a test statistic is statistically significant; we often use 0.05 (95% confidence), but it can vary according to the domain area (we used 0.05 in the above example)
- **Step 3**: Choose either the one-tailed or two-tailed distribution
  - One-tailed:
    - **Left-tailed**: the difference in means between the paired samples is strictly lower than 0
    - **Right-tailed**: the difference in means between the paired samples is strictly greater than 0 (example above)
  - Two-tailed: the difference in means between the paired samples is not equal to 0 (greater or lower, but not equal)
- **Step 4**: Look up the intersection of the **df**, the **alpha** level and the **one-tailed/two-tailed** column in the grid

Luckily, we don’t need to import the t-distribution table into Power BI and do the lookup ourselves, since we can use the built-in DAX function T.INV for a one-tailed t-test and T.INV.2T for a two-tailed t-test. The two parameters that we need to pass to the function are the probability and the degrees of freedom.

I manually entered some parameters into a table called “**CI**” to dynamically run different paired t-tests in Power BI. The Critical Value parameter is not needed, since we can compute it using the two functions mentioned above, but I like to keep it to quickly refer to if needed.

The above table is linked to the table “**Hypothesis**” which contains the hypothesis tail that we want to use for the test.

Here is the formula to calculate the critical value using DAX, dynamically interacting with the parameters (probability and one-tailed/two-tailed):

```
t_val =
var __df=[degree of freeedom]
return
switch(TRUE(),
FIRSTNONBLANK(Hypothesis[Tail],1)="Left-Tailed",abs(T.INV(FIRSTNONBLANK(CI[Probability],1),__df)),
FIRSTNONBLANK(Hypothesis[Tail],1)="Right-Tailed",abs(T.INV(FIRSTNONBLANK(CI[Probability],1),__df)),
T.INV.2T(1-FIRSTNONBLANK(CI[Probability],1),__df)
)
```

**Left critical region**

We have now calculated the t-critical value; we just need to combine it with the mean of the difference of our two samples, “d-bar”.

For a left-tailed test, we only look at the left critical region, so to reject the null hypothesis the t-stat must lie to the left of the critical value; in other words, it should be lower than the left critical value.

For a two-tailed test, the T-stats should either be lower than the Left-critical value or greater than the right-critical value.

The formula for the left critical value is: mean_diff – critical value

```
left cr =
var __crit = [t_val]
var __mudiff=[mean_diff]
return
switch(TRUE(),
FIRSTNONBLANK(Hypothesis[Tail],1)="Left-Tailed",__mudiff-__crit,
FIRSTNONBLANK(Hypothesis[Tail],1)="Right-Tailed",BLANK(),
__mudiff-__crit
)
```

**Right critical region**

As for the right-tailed test, the T-stat must be greater than the right-critical value to reject the null hypothesis.

The formula for the right-critical value is: mean_diff + critical value

```
right cr =
var __crit = [t_val]
var __mudiff=[mean_diff]
return
switch(TRUE(),
FIRSTNONBLANK(Hypothesis[Tail],1)="Left-Tailed",BLANK(),
FIRSTNONBLANK(Hypothesis[Tail],1)="Right-Tailed",__mudiff+__crit,
__mudiff+__crit
)
```

A p-value is used in hypothesis testing to help us support or reject the null hypothesis. The p-value is the evidence **against** a null hypothesis. The smaller the p-value, the stronger the evidence that we should reject the null hypothesis.

As we already know from the critical value section, the **critical value** is** a point beyond which we can reject the null hypothesis**. **P-value** on the other hand is defined as the **probability that an observed difference could have occurred just by random chance**. The benefit of using a **p-value** is that we can test the estimated probability at any desired level of significance by comparing this probability with the significance level “**Alpha**” without needing to recalculate the critical value each time.

To sum it up they both do the same thing: helping us to support or reject the null hypothesis in a test. They are two different approaches to the same result.

I personally tend to always use the p-value since I find it easier to calculate and interpret. (e.g. with a p-value of 0.06 we may fail to reject the null hypothesis however we can still observe moderate evidence)

The built-in DAX functions to calculate the p-value are T.DIST for the left-tailed test, T.DIST.RT for the right-tailed test and T.DIST.2T for the two-tailed test. These functions require two parameters: the **t-stat** and the **degrees of freedom**.

And here is the formula to dynamically interact with the different parameters.

```
p-value =
var __df=[degree of freeedom]
var __t_stat=[t_stat]
return
SWITCH(TRUE(),
FIRSTNONBLANK(Hypothesis[Tail],1)="Left-Tailed",T.DIST(__t_stat,__df,TRUE()),
FIRSTNONBLANK(Hypothesis[Tail],1)="Right-Tailed",T.DIST.RT(__t_stat,__df),
T.DIST.2T(abs(__t_stat),__df)
)
```
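These DAX functions can be sanity-checked numerically. The sketch below is not the DAX internals, just an illustrative cross-check: it integrates the Student-t density with Simpson’s rule (standard library only) to approximate the right-tail probability, i.e. what T.DIST.RT returns:

```python
from math import exp, lgamma, pi, sqrt

def t_pdf(x, df):
    """Density of the Student-t distribution with df degrees of freedom."""
    c = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_sf(t, df, upper=60.0, n=20000):
    """Approximate right-tail probability P(T > t) via Simpson's rule on [t, upper]."""
    h = (upper - t) / n
    s = t_pdf(t, df) + t_pdf(upper, df)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * t_pdf(t + i * h, df)
    return s * h / 3

# Right-tailed p-value for t = 2.1448 with 14 degrees of freedom;
# 2.1448 is the 97.5th percentile of t(14), so the tail area is ~0.025
p = t_sf(2.1448, 14)
```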

A Confidence Interval or **CI** is a **range of values** we are fairly sure our **true value** lies in.

In other words, the CI can answer the question of whether the result of our test is due to chance or not, within a certain degree of confidence.

The confidence level should be chosen before examining the data, a 95% confidence level is usually used. However, confidence levels of 90% and 99% are also often used depending on the domain area.

Note that a **one-tailed confidence interval** always extends from **minus infinity** to some value above the observed effect, or from some value below the observed effect to **plus infinity**.

**Lower**

Here is the DAX formula to calculate the CI Lower limit and dynamically interact with the different parameters.

```
lower =
var __meandiff=[mean_diff]
var __t=[t_val]
var __sd=[sd_diff]
var __n=[size]
return
switch(TRUE(),
FIRSTNONBLANK(Hypothesis[Tail],1)="Left-Tailed","-inf",
FIRSTNONBLANK(Hypothesis[Tail],1)="Right-Tailed",__meandiff-__t*divide(__sd,sqrt(__n)),
__meandiff-__t*divide(__sd,sqrt(__n))
)
```

**Upper**

And here is the DAX formula for the upper limit. For a right-tailed test, the upper limit extends to plus infinity.

```
upper =
var __meandiff=[mean_diff]
var __t=[t_val]
var __sd=[sd_diff]
var __n=[Size]
return
switch(TRUE(),
FIRSTNONBLANK(Hypothesis[Tail],1)="Left-Tailed",__meandiff+__t*divide(__sd,sqrt(__n)),
FIRSTNONBLANK(Hypothesis[Tail],1)="Right-Tailed","+inf",
__meandiff+__t*divide(__sd,sqrt(__n))
)
```
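For a two-tailed test, both measures reduce to mean_diff ± t × sd/√n. Here is a minimal Python cross-check; the critical value 2.7764 is t for 95% two-sided with df = 4, taken from a t-table, and the differences are the same made-up sample used earlier:

```python
from math import sqrt
from statistics import mean, stdev

diff = [2.3, 4.2, 1.5, 5.7, 3.7]   # hypothetical paired differences
t_crit = 2.7764                    # t for 95% two-sided, df = len(diff) - 1 = 4

mean_diff = mean(diff)
sem = stdev(diff) / sqrt(len(diff))

lower = mean_diff - t_crit * sem   # equivalent of the "lower" measure
upper = mean_diff + t_crit * sem   # equivalent of the "upper" measure
```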

Before visualising the outcome of our paired t-test in Power BI let’s define the null and alternative hypotheses.

The null hypothesis **H0** assumes that any difference between the two paired samples is due to chance.

- For a **two-tailed** test, the null hypothesis assumes the difference in means **is equal to 0**
- For a **left-tailed** test, the null hypothesis assumes the difference in means is **not lower than 0**
- For a **right-tailed** test, the null hypothesis assumes the difference in means is **not greater than 0**

As for the **alternate hypothesis Ha**, it is simply the direct opposite of the null hypothesis.

Here is the DAX measure to display the text result of our paired t-test:

```
Result =
var __lessAlt="The true difference in means is less than 0"
var __greaterAlt="The true difference in means is greater than 0"
var __twosidedAlt="The true difference in means is not equal to 0"
var __lessNull="The true difference in means is not less than 0"
var __greaterNull="The true difference in means is not greater than 0"
var __twosidedNull="There's no true difference in the means"
var __alpha=FIRSTNONBLANK(CI[Alpha],1)
var __Pval=[p-value]
return
switch(TRUE(),
FIRSTNONBLANK(Hypothesis[Tail],1)="Left-Tailed",
if(__Pval<__alpha,"We reject the null hypothesis and we accept the alternative hypothesis: " & __lessAlt, "We fail to reject the null hypothesis: " & __lessNull),
FIRSTNONBLANK(Hypothesis[Tail],1)="Right-Tailed",
if(__Pval<__alpha,"We reject the null hypothesis and we accept the alternative hypothesis: " & __greaterAlt, "We fail to reject the null hypothesis: " & __greaterNull),
if(__Pval<__alpha,"We reject the null hypothesis and we accept the alternative hypothesis: " & __twosidedAlt, "We fail to reject the null hypothesis: " & __twosidedNull)
)
```
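The decision logic in this measure boils down to a p-value vs alpha comparison per tail. A compact Python equivalent (illustrative only; the "Two-Tailed" key mirrors the DAX fallback branch, which has no named tail):

```python
def t_test_conclusion(p_value: float, alpha: float, tail: str) -> str:
    """Mirror of the DAX Result measure: reject H0 when p < alpha."""
    alternatives = {
        "Left-Tailed": "The true difference in means is less than 0",
        "Right-Tailed": "The true difference in means is greater than 0",
        "Two-Tailed": "The true difference in means is not equal to 0",
    }
    if p_value < alpha:
        return "We reject the null hypothesis: " + alternatives[tail]
    return "We fail to reject the null hypothesis"

print(t_test_conclusion(0.001, 0.05, "Left-Tailed"))
```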

Here we’re using a CI of 95% (or alpha = 0.05).

As we can see, the t-stat “30.30” is much greater than the critical value “8.79”; or, if we use the p-value approach, the p-value is extremely low and far below the alpha significance level, so we reject the null hypothesis in favour of the alternative hypothesis.

This time we’re running the paired t-test in Power BI with a 99% confidence interval and a right-tailed hypothesis.

In other words, we want to observe if the difference in the means is greater than 0.

Since we’re using a right-tailed test, we only use the right critical value, so the t-stat must be greater than the right critical value, which is the case, so we reject the null hypothesis. (The p-value is also below the alpha significance level.)

As for the confidence interval, we can say that we’re 99% confident that the true difference in means lies between 6.30 and +infinity.

Time to run a left-tailed test… Can we reject the null hypothesis?

Of course not! The **t-stat** “30.30” is far **greater** than the **left critical value** “5.18” and the **p-value is extremely large** (“1”), so without any doubt we **fail to reject** the null hypothesis.

To make sure that I correctly implemented the paired t-test in Power BI I added and displayed the result of the R “t.test” function and all results were accurate thanks to the great built-in DAX functions supported by Power BI. I hope that the PBI team will add even more statistical functions in the future.

It seems that R visuals do not render from time to time when using Publish to web, but if you click on “Focus mode” and then “Back to report”, the visual eventually appears after a few seconds. Another reason to avoid R and stick to DAX!

This post only covered the implementation of the **paired t-test in Power BI**, but as we saw, there are three types of t-test, so I will soon post the implementation of the other types.

Also, there are a few things that I did not cover on purpose, since I did not want this post to become too statistics-heavy but rather focus on the DAX implementation side. However, one thing that I need to mention is that I used **Welch’s t-test** by default (R uses it by default as well); there’s another t-test called the **Student’s t-test**, which assumes equal variances.

I previously wrote a post about A/B testing in Power BI using DAX only. In this other DAX statistics post, I wanted to show that implementing a paired t-test in Power BI using DAX only is perfectly feasible; it just requires a bit more work than typing two lines of code in R, but with the help of some built-in DAX functions the result is great.

If you’d like to implement your own paired t-test in Power BI and test your result I’d recommend using the following t-test calculator: http://www.sthda.com/english/rsthda/paired-t-test.php

Here is the “**Paired t-test in Power BI**” published to the web where we can interact with the parameters “Confidence Interval” and “hypothesis”

**In this post, I’m going to show another way to hide tables in Power BI, one that prevents users or anyone else from viewing the hidden tables.**

To hide tables in Power BI we can either do it from the Model view or directly from the Report view.

Once done we can see the hidden icon enabled on the table “v_dimDate”

And once we open the Report view we can no longer see the hidden table, so far so good?

Well, that holds until your users get access to the PBIX file and discover the option to unhide tables or to view hidden tables.

As mentioned at the beginning of this post, once we get access to the PBIX file of a report, or even if we have a live connection to a tabular model or a Power BI dataset, it is possible to view the hidden objects.

As we can see the table “v_dimDate” is now visible even though we’ve hidden the table

In order to fully prevent users from seeing hidden tables, we need to install Tabular Editor (version 2 or 3) and enable external tools in Power BI.

Once the model is opened via Tabular Editor, we have access to the Tabular Object Model (TOM) properties and we can modify them. The property that we need to change is “Private”; once we set it to True, the table becomes hidden and can no longer be seen in Power BI, even when we enable “ViewHidden” or “Unhide all”.

The only drawback is that once the Private property is set to True, Power BI will think that this table does not exist, and thus IntelliSense will no longer work. As Power BI no longer recognises this table, it will also highlight the table in red in any formula, but we can ignore this as the formula will still work.

As we can see the “v_dimDate” is no longer visible even after we enable the “View hidden” option and even though Power BI does not recognise this table we can still reference it in any DAX formulas.

If the developers are working directly in Power BI, it is probably a good idea to temporarily set Private to False while developing and set it back to True before publishing or sharing the file.

This option can be very useful when it comes to hiding the tables used to configure Row Level Security or any other internal tables that we don’t want users to see.

Of course, this technique does not replace Object-Level Security and should not be used for such a purpose. As a best practice, I’d always recommend not giving users access to the underlying model of a report and always using a live connection whenever possible.

Finally, to learn more about the tabular model, I’d highly recommend taking the “mastering-tabular” course from SQLBI.

So whenever a user complains about a pivot table connected to an SSAS or Power BI dataset being too slow or showing wrong figures, we may need to retrieve the MDX queries generated by Excel to investigate the issue.

There are several ways to retrieve the MDX queries generated by Excel, such as using Profiler, XEvents, or even installing an Excel add-in.

However, all these options may not be possible in your company if you have a strict policy regarding the tools that can be installed or if you don’t have sufficient permission on the SSAS server.

So if none of the above options is available to you, you can still use this Visual Basic script to get the MDX queries created by Excel.

- Open the Excel tab with the pivot table
- From the Excel tab, press `Alt+F11` to get into the Visual Basic Editor
- Copy/paste the VBA script below and specify the destination path of the output TXT file
- Press run

```
Sub CheckMDX()
    Dim pvtTable As PivotTable
    Dim fso As Object
    Dim Fileout As Object
    Set pvtTable = ActiveSheet.PivotTables(1)
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set Fileout = fso.CreateTextFile("C:\Temp\MDXOutput.txt", True, True)
    Fileout.Write pvtTable.MDX
    Fileout.Close
End Sub
```

I’m sharing this through a Power BI app where we can simply copy/paste the DAX code of the selected function.

The correlation coefficient is a statistical measure of the relationship between two variables; the values range between -1 and 1. A correlation of -1 shows a perfect negative correlation, a correlation of 1 shows a perfect positive correlation, and a correlation of 0 shows no linear relationship between the movements of the two variables.

To go a bit more in detail we can interpret the correlation coefficient as follows:

- -1: Perfect negative correlation
- Between -1 and <=-0.8: Very strong negative correlation
- Between >-0.8 and<=-0.6: Strong negative correlation
- Between >-0.6 and<=-0.4: Moderate negative correlation
- Between >-0.4 and<=-0.2: Weak negative correlation
- Between >-0.2 and<0: Very weak negative correlation
- 0: No correlation
- Between 0 and<0.2: Very weak positive correlation
- Between >=0.2 and <0.4: Weak positive correlation
- Between >=0.4 and <0.6: Moderate positive correlation
- Between >=0.6 and <0.8: Strong positive correlation
- Between >=0.8 and <1: Very strong positive correlation
- 1: Perfect positive correlation

One very important thing to remember is that when two variables are correlated, it does not mean that one causes the other. **Correlation does not imply causation**.

Unlike in Excel, there’s no DAX built-in correlation function in Power BI (at the time of writing this post).

In Excel, the built-in function is called Correl, this function requires two arrays as a parameter (X and Y).

The Correl formula used in Excel is as follows:

There are actually several ways of writing the Pearson correlation coefficient formula but to keep consistent with the formula used in Excel I will stick with the above formula which is one of the most common anyway.

Since we saw the formula above we now need to translate it into DAX.

So let’s break down the formula:

- The **Σ** (sigma) symbol denotes a sum of multiple terms (x1 + x2 + x3…), equivalent to the SUM or SUMX DAX functions
- **x̄** (x bar) represents the mean of x
- **ȳ** (y bar) represents the mean of y
- **√** is the square root; its DAX function is SQRT

Now let’s see the DAX code for the Pearson correlation formula:

```
coeff corr =
//x̄
var __muX =calculate(AVERAGE(YourTable[x]))
//ȳ
var __muY=calculate(AVERAGE(YourTable[y]))
//numerator
var __numerator = sumx('YourTable',( [x]-__muX)*([y]-__muY))
//denominator
var __denominator= SQRT(sumx('YourTable',([x]-__muX)^2)*sumx('YourTable',([y]-__muY)^2))
return
divide(__numerator,__denominator)
```
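As a sanity check, the same formula can be computed in plain Python and compared against cases with known answers (the data is made up; y = 2x gives a perfect positive correlation):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: sum((x-x̄)(y-ȳ)) / sqrt(sum((x-x̄)²) * sum((y-ȳ)²))."""
    mu_x = sum(xs) / len(xs)
    mu_y = sum(ys) / len(ys)
    num = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mu_x) ** 2 for x in xs) * sum((y - mu_y) ** 2 for y in ys))
    return num / den

x = [1, 2, 3, 4, 5]
print(pearson(x, [2 * v for v in x]))  # perfectly correlated -> 1.0
```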

Let’s now build a small report that will show the correlation between the head size (x) and the brain weight (y).

The data are as follows:

- Head_Size (variable x)
- Brain_Weight (variable y)

In Power BI when clicking on the Analytics icon we can easily add a trend line to visualize the relationship between two variables on a scatter plot.

However, to show the correlation coefficient on top of the trend line we still need to create a DAX measure that I have called “coeff corr”.

And as the final touch let’s create another measure “coeff correl type” that will return the interpretation of the correlation so we can display it on top of our visual.

```
coeff correl type =
SWITCH(TRUE(),
[coeff corr]=-1 ,"Perfect negative correlation",
[coeff corr]>-1 && [coeff corr]<=-0.8 ,"Very strong negative correlation",
[coeff corr]>-0.8 && [coeff corr]<=-0.6 ,"Strong negative correlation",
[coeff corr]>-0.6 && [coeff corr]<=-0.4 ,"Moderate negative correlation",
[coeff corr]>-0.4 && [coeff corr]<=-0.2 ,"Weak negative correlation",
[coeff corr]>-0.2 && [coeff corr]<0 ,"Very weak negative correlation",
[coeff corr]=0 ,"No correlation",
[coeff corr]>0 && [coeff corr]<0.2 ,"Very weak positive correlation",
[coeff corr]>=0.2 && [coeff corr]<0.4 ,"Weak positive correlation",
[coeff corr]>=0.4 && [coeff corr]<0.6 ,"Moderate positive correlation",
[coeff corr]>=0.6 && [coeff corr]<0.8 ,"Strong positive correlation",
[coeff corr]>=0.8 && [coeff corr]<1 ,"Very strong positive correlation",
[coeff corr]=1 ,"Perfect positive correlation"
)
```
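The same bucketing logic can be sketched in Python (the thresholds are copied from the DAX measure above; the helper name is mine):

```python
def correlation_type(r: float) -> str:
    """Map a correlation coefficient to its textual interpretation."""
    if r in (-1, 1):
        return "Perfect " + ("negative" if r < 0 else "positive") + " correlation"
    if r == 0:
        return "No correlation"
    sign = "negative" if r < 0 else "positive"
    strength = abs(r)
    if strength < 0.2:
        label = "Very weak"
    elif strength < 0.4:
        label = "Weak"
    elif strength < 0.6:
        label = "Moderate"
    elif strength < 0.8:
        label = "Strong"
    else:
        label = "Very strong"
    return f"{label} {sign} correlation"

print(correlation_type(0.5))  # Moderate positive correlation
```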

And this is how things look when we concatenate our “coeff corr” measure with the “coeff correl type” measure and add them on top of our scatter plot.

This is just another post in the series on implementing statistical functions in DAX; you can read other similar posts on my blog, such as A/B testing in Power BI or Poisson distribution in Power BI.

Power BI is still lacking some advanced statistical functions compared to Excel but with DAX we can write almost any existing Excel function!

The Poisson distribution can be applied in many different domains such as finance, biology, healthcare, retail, etc.

- **Check for adequate customer service staffing**: Calculate whether customer service staffing is enough to handle all the calls without making customers wait on hold.
- **Number of arrivals at a restaurant**: Estimate the chances of having more than 100 people visiting a particular restaurant.
- **Number of bicycles sold per week**: If the number of bicycles sold by a bike shop in a week is known, then the seller can easily predict the number of bicycles he might sell next week and thus better manage his stock.

*To illustrate the use of the Poisson distribution in Power BI I will use the bike shop as an example.*

Before jumping to the implementation of Poisson distribution in Power BI let’s have a look at the data.

Each week the seller reorders 30 bicycles to restock his inventory, and he sells on average 32 bicycles each week.

Due to the shortage of stock, he is losing business opportunities every week, but on the other hand, he does not want to overstock: the shop owner wants to maximise his sales while still keeping the optimal stock.

In order to be more competitive, the seller only wants to be able to **fulfil at least 95% of the sales each week.**

At the moment he has been able to reach 95% fulfilment in only 9 weeks out of 18.

In order to determine the most optimal stock to order on a weekly basis, we can use the Poisson distribution.

As in Excel, there’s a built-in function for the Poisson distribution in Power BI: POISSON.DIST

```
POISSON.DIST(x,mean,cumulative)
```

As we can see the function is expecting 3 parameters: the number of occurrences “x”, the “mean” and the logical value that determines the form of the probability distribution.

- **True** for the **cumulative distribution function (CDF)**: probability that the number of random events occurring will be between zero and x inclusive
- **False** for the **probability density function (PDF)**: probability that the number of events occurring will be exactly x

- **λ** is the mean
- **x** is the number of occurrences (x = 0, 1, 2, 3…)
- **e** is Euler’s number (e = 2.71828…)
- **!** is the factorial function
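For reference, these symbols come from the Poisson probability mass function, which gives the probability of observing exactly x events when the mean number of events is λ:

```latex
P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}
```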

**Dax formula:**

(Remember that there’s already a built-in function for the Poisson distribution in Power BI; I include the DAX code for educational purposes only.)

```
poisson non cumulative formula =
var __euler=2.71828
var __lambda=calculate([avg Sales wk],all('Stock Poisson'))
var __x=min(x[Value])
return
divide((__lambda^__x)*(__euler^-__lambda),FACT(__x))
```
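As a sanity check outside of Power BI, the same non-cumulative formula can be sketched in a few lines of Python (the helper name is mine, and `math.exp` is used instead of the hard-coded Euler constant):

```python
import math

def poisson_pmf(x: int, lam: float) -> float:
    """Probability of exactly x events when the mean is lam:
    lam^x * e^(-lam) / x!  -- the same formula as the DAX measure."""
    return (lam ** x) * math.exp(-lam) / math.factorial(x)

# e.g. probability of selling exactly 32 bicycles when the weekly mean is 32
print(poisson_pmf(32, 32))
```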

**Dax formula:**

```
poisson cumulative formula =
var __euler=2.71828
var __lambda=calculate([avg Sales wk],all('Stock Poisson'))
--x[value]=0,1,2,3... (61 in my example)
var __x=GENERATESERIES(min(x[Value]),MAX(x[Value]),1)
return
sumx(__x,divide((__lambda^[Value])*(__euler^-__lambda),FACT([Value])))
```
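The cumulative version is the same running sum that the `SUMX` above computes; here is a minimal Python sketch (helper names are mine):

```python
import math

def poisson_pmf(x: int, lam: float) -> float:
    """Probability of exactly x events when the mean is lam."""
    return (lam ** x) * math.exp(-lam) / math.factorial(x)

def poisson_cdf(x: int, lam: float) -> float:
    """Probability of at most x events: sum of the PMF for k = 0..x."""
    return sum(poisson_pmf(k, lam) for k in range(x + 1))

# e.g. probability of selling 40 bicycles or fewer with a weekly mean of 32
print(poisson_cdf(40, 32))
```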

In order to plot the distribution, we can create a calculated table that contains the Poisson distribution CDF and PDF.

```
x =
var __StartValue = 0
var __EndValue = ROUND([avg Sales wk]+8*[STD Sells wk],0)
var __IncrementValue =1
var __value =GENERATESERIES(__StartValue,__EndValue,__IncrementValue)
var __avgCalls=[avg Sales wk]
return
ADDCOLUMNS(__value,"PoissCum",POISSON.DIST([Value],__avgCalls,TRUE),"PoisNonCum",POISSON.DIST([Value],__avgCalls,FALSE))
```

Now let’s visualize the Poisson non-cumulative distribution; as we can see, the closer to the mean we are, the higher the probability. We can also observe that the probability of selling fewer than 20 bicycles is very low, as is the probability of selling more than 40.

Now if we take a look at the Poisson Cumulative distribution we can see that the probability of selling less than 40 bicycles is around 90% and the probability of selling less than 50 bicycles is around 99.9%.

If you remember the initial problem, we want to find the optimal stock that ensures the shop fulfils 95% of the sales each week. It looks like the stock should be somewhere between 40 and 50.

Let’s first create a dynamic parameter that allows us to play with the % fulfilment to reach.

```
Stock Analysis = GENERATESERIES(0.90,1,0.01)
```

Since we want to always be able to fulfil 95% of the sales each week we need to find the minimum stock to reorder every week to ensure that we reach 95%.

So in order to calculate the optimal stock, we need to use the **inverse distribution function (IDF) of Poisson**. At the time of writing this post there is no built-in inverse Poisson function in DAX, nor in Excel, but it is pretty simple to write our own.

To calculate the inverse function of the Poisson distribution we simply need to find the smallest integer N such that **POISSON.DIST(x,mean,cumulative=TRUE) >= Prob**

```
min lambda =
--Here we retrieve the Prob from the dynamic parameter created above
var __prob=FIRSTNONBLANK('Stock Analysis'[% Fulfilment],1)
return
--Then we find the smallest integer N >= __prob (95% in our case)
CALCULATE(min(x[Value]),round(x[PoissCum],4)>=__prob)
```
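The same smallest-N search can be sketched in Python to double-check the DAX measure (hypothetical helper names; a simple linear scan is fine here since λ is small):

```python
import math

def poisson_cdf(x: int, lam: float) -> float:
    """Probability of at most x events for a Poisson with mean lam."""
    return sum((lam ** k) * math.exp(-lam) / math.factorial(k)
               for k in range(x + 1))

def poisson_inv(prob: float, lam: float) -> int:
    """Smallest integer N such that POISSON.DIST(N, lam, TRUE) >= prob."""
    n = 0
    while poisson_cdf(n, lam) < prob:
        n += 1
    return n

# optimal weekly stock to fulfil 95% of the demand with an average of 32 sales
print(poisson_inv(0.95, 32))
```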

In other words, to calculate the optimal stock, we simply take the minimum value of the Poisson cumulative distribution that is greater than or equal to the desired **probability to reach, which gives us 42**.

To better understand the different probabilities that we can calculate with the Poisson distribution, I wrote three other DAX measures as follows:

```
Exact probability = [Poisson Non Cumulative]
Less or equal probability = [Poisson Cumulative]
at least probability =
var __minX=min(x[Value])
return
1-calculate([Poisson Cumulative],x[Value]=__minX)
```

- The exact probability is given by the Poisson distribution
- E.g. probability of selling exactly 42 bicycles = 1.54%

- The “less or equal” probability is given by the cumulative distribution
- E.g. probability of selling 42 or fewer bicycles = 96.2%

- The “at least” probability is obtained by subtracting the cumulative distribution from 1
- E.g. probability of selling at least 42 bicycles = 3.8%

We already saw above the two distribution functions PDF and CDF now let’s plot the three functions PDF, CDF and IDF.

As we can see, the inverse function is simply the inverted cumulative distribution, and they cross each other at the peak of the non-cumulative distribution, which is of course the mean of the weekly sales.

In this post, we saw that using the built-in DAX function it is quite simple to implement the Poisson Distribution in Power BI.

This post was just another shallow introduction to statistical concepts, like the one I wrote about AB Testing with Power BI. It is, of course, impossible to cover every aspect of the Poisson distribution in a single post, as there are entire books written on this subject with more advanced concepts that are beyond my knowledge.

However, I wanted to show that we can easily implement the Poisson distribution in Power BI and obtain the same result that we would with any other more advanced statistical software.

Of course, we can always take advantage of R or Python scripts in Power BI to get this done, but unless we want to achieve something that is not possible with DAX, like clustering in Power BI using R, I’d highly recommend sticking to DAX as much as possible.

]]>

Skewness is a **measure of symmetry, or more precisely, the lack of symmetry**. A distribution, or data set, is symmetric if it looks the same to the left and right of the centre point. For a unimodal (one mode only) distribution, negative skew commonly indicates that the *tail* is on the left side of the distribution, and positive skew indicates that the tail is on the right (see Figure below for an example).

In most statistics books, we find that, as a general rule of thumb, skewness can be interpreted as follows:

- If the skewness is between -0.5 and 0.5, the data are fairly symmetrical
- If the skewness is between -1 and – 0.5 or between 0.5 and 1, the data are moderately skewed
- If the skewness is less than -1 or greater than 1, the data are highly skewed

The distribution of income usually has a **positive skew**, with a mean greater than the median.

In the USA, more people have an income lower than the average income. This shows that there is an unequal distribution of income.

Here is another example:

If Warren Buffett were sitting with 50 Power BI developers, the average annual income of the group would be greater than 10 million dollars.

Did you know that Power BI developers were making that much money?

Of course, we’re not… the distribution is highly skewed to the right due to one extremely high income; in that case the mean would probably be more than 100 times higher than the median.

Age at retirement usually has a **negative skew**, most people retire in their 60s, very few people work longer, but some people retire in their 50s or even earlier.

Skewness can be used in just about anything in real life where we need to characterize the data or distribution.

- Many statistical models require the data to follow a normal distribution but in reality data rarely follows a perfect normal distribution. Therefore the measure of the Skewness becomes essential to know the shape of the distribution.
- Skewness tells us about the direction of outliers. Positive skewness is a sign of the presence of larger extreme values, and negative skewness indicates the presence of smaller extreme values.
- Skewness can also tell us where most of the values are concentrated.

Skewness is also widely used in finance to estimate the risk of a predictive model.

At the time of writing this post, there’s no DAX function to calculate the skewness; this function exists in Excel (SKEW, and since Excel 2013, SKEW.P).

The formula used by Excel is “Pearson’s moment coefficient of skewness”; there are alternative formulas, but this one is the most commonly used.

**Calculate in DAX the Skewness of the distribution based on a Sample:**

```
Sample Skewness =
-- Number of values in my sample
var __N=calculate(COUNTROWS(height_data),
ALL(height_data[Height]))
-- sample mean
var __Avg=calculate(AVERAGE(height_data[Height]),
ALL(height_data[Height]))
-- sample standard deviation
var __Std=calculate(STDEV.S(height_data[Height]),
ALL(height_data[Height]))
return
DIVIDE(__N,(__N-1)*(__N-2)) *
sumx(height_data,
POWER(DIVIDE(height_data[Height]-__Avg,__Std),3))
```

Sample data refers to data partially extracted from the population.

**Calculate in DAX the Skewness of the distribution based on a Population:**

```
Skewness =
-- Number of values
var __N=calculate(COUNTROWS(height_data),
ALL(height_data[Height]))
-- Mean
var __Avg=calculate(AVERAGE(height_data[Height]),
ALL(height_data[Height]))
-- standard deviation
var __Std=calculate(STDEV.P(height_data[Height]),
ALL(height_data[Height]))
return
DIVIDE(1,__N) *
sumx(height_data,
POWER(divide(height_data[Height]-__Avg,__Std),3))
```

The population refers to the entire set that you are analysing.

The difference between the two resides in the first coefficient factor, “1/N” vs “N/((N-1)*(N-2))”, and in the use of the population vs sample standard deviation; in practice, the larger the sample, the smaller the difference between the two.
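Both variants translate almost line for line outside of DAX; here is a stdlib-only Python sketch (function names are mine) that mirrors the two measures, including the sample vs population standard deviation:

```python
import math

def sample_skewness(values):
    """Excel SKEW: N/((N-1)(N-2)) * sum(((x - mean)/s)^3), s = sample std dev."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / s) ** 3 for v in values)

def population_skewness(values):
    """Excel SKEW.P: 1/N * sum(((x - mean)/sigma)^3), sigma = population std dev."""
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sigma) ** 3 for v in values) / n
```

A single large value such as the 10 in [1, 2, 3, 4, 10] drags both measures well above zero, i.e. a right-skewed distribution.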

One of the most common pictures that we find online or in common statistics books is the image below, which basically says that a positive kurtosis gives a peaked curve while a negative kurtosis gives a flat curve; in short, that kurtosis measures the peakedness of the curve.

The above explanation has been proven incorrect since the publication “Kurtosis as Peakedness, 1905 – 2014. *R.I.P.*” by Dr. Westfall. So the most accurate interpretation of kurtosis is that **it helps to detect existing outliers.**

*“*The logic is simple: Kurtosis is the **average of the standardized data** **raised to the fourth power**. Any standardized values that are less than 1 (i.e., data within one standard deviation of the mean, where the “peak” would be), contribute virtually nothing to kurtosis, since **raising a number that is less than 1 to the fourth power makes it closer to zero**. The only data values (observed or observable) that contribute to kurtosis in any meaningful way are those outside the region of the peak; i.e., the outliers. **Therefore, kurtosis measures outliers only; it measures nothing about the “peak**“.

Similar to Skewness, kurtosis is a statistical measure that is used to describe the distribution and to measure whether there are outliers in a data set.

And like skewness, kurtosis is widely used in financial models; for investors, high kurtosis could mean more extreme returns (positive or negative).

At the time of writing this post, there’s also no existing DAX function to calculate the kurtosis; this function exists in Excel, where it is called KURT.

The formula used by Excel is an adjusted version of Pearson’s kurtosis called **the excess kurtosis** which is Kurtosis -3.

It is very common to use the Excess Kurtosis measure to provide the comparison to the standard normal distribution.

So in this post, I will calculate in DAX the **Excess Kurtosis (Kurtosis − 3)**.

**Calculate in DAX the Excess Kurtosis of the distribution based on a Sample:**

```
Sample Kurtosis =
-- Number of values in my sample
var __N=calculate(COUNTROWS(height_data),
ALL(height_data[Height]))
-- sample mean
var __Avg=calculate(AVERAGE(height_data[Height]),
ALL(height_data[Height]))
-- sample standard deviation
var __Std=calculate(STDEV.S(height_data[Height]),
ALL(height_data[Height]))
return
DIVIDE(__N*(__N+1),(__N-1)*(__N-2)*(__N-3)) *
sumx(height_data,
POWER(divide(height_data[Height]-__Avg,__Std),4))
-DIVIDE(3*(__N-1)^2,(__N-2)*(__N-3)) -- (-3 for excess kurtosis)
```

**Calculate in DAX the Excess Kurtosis of the distribution based on a Population:**

```
Kurtosis =
-- Number of values
var __N=calculate(COUNTROWS(height_data),
ALL(height_data[Height]))
-- mean
var __Avg=calculate(AVERAGE(height_data[Height]),
ALL(height_data[Height]))
-- standard deviation
var __Std=calculate(STDEV.P(height_data[Height]),
ALL(height_data[Height]))
return
DIVIDE(1,__N) *
sumx(height_data,
POWER(divide(height_data[Height]-__Avg,__Std),4))-3 -- (-3 for excess kurtosis)
```
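As with skewness, both excess-kurtosis measures are easy to verify outside of DAX; a stdlib-only Python sketch (function names are mine):

```python
import math

def sample_excess_kurtosis(values):
    """Excel KURT: N(N+1)/((N-1)(N-2)(N-3)) * sum(z^4) - 3(N-1)^2/((N-2)(N-3)),
    where z standardizes with the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    fourth = sum(((v - mean) / s) ** 4 for v in values)
    return (n * (n + 1)) / ((n - 1) * (n - 2) * (n - 3)) * fourth \
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

def population_excess_kurtosis(values):
    """Population version: 1/N * sum(z^4) - 3, z standardized with sigma."""
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sigma) ** 4 for v in values) / n - 3
```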

In this post, we covered the concepts of skewness and kurtosis and why they are important in statistics and data analysis.

At the time of writing this post, there are no existing built-in functions in Power BI to calculate the skewness or kurtosis; however, we saw that it is pretty easy to translate a mathematical formula into a DAX formula.

In one of my previous posts “AB Testing with Power BI” I’ve shown that Power BI has some great built-in functions to calculate values related to statistical distributions and probability but even if Power BI is missing some functions compared to Excel, it turns out that most of them can be easily written in DAX!

]]>Partitions split a table into logical partition objects. Each partition contains a portion of the data; partitions can be processed in **parallel, independently of other partitions**, or excluded from processing operations if they don’t need to be refreshed.

The main reason partitioning improves the refresh performance of a model is parallelism.

Parallelism allows us to process multiple subsets of the same table in parallel and thus process a table faster.

As a rule of thumb, I usually partition tables with more than 15M rows but if the refresh of your model is still fast enough there may be no need for partitioning yet.

*By default, in SSAS the data are stored in segments, where each segment holds 8M rows by default and the first segment can store up to twice the default size (16M rows), so this is where my rule of thumb comes from. You can read more about segments and partitions here: Understanding segmentation and partitioning*

- Partitions should be as equally distributed as possible (time columns are usually good candidates for partitioning)
- Underlying tables should be indexed on the column used to filter the partition
- Avoid over-partitioning your model, as it could lead to the opposite of the intended effect:
- Refresh time will increase as the engine spends more time aggregating the partitions together
- The memory size of your model increases if the partitions are too small (smaller than the default segment size)
- If your model is very large, it’s important to test different numbers of partitions

- Avoid under-partitioning your model
- Give a meaningful name to each partition, I usually name my partitions “table name – period”(Sales CY, Sales Y-1…)
- Refresh only the partitions that need to be refreshed (historical data may not have changed)

Phil Seamark has recently written a post on how to Visualise your Power BI Refresh and of course we can apply the same technique of his excellent post to a Tabular Model. (You can also download the below report from Phil’s blog post).

**The model**

In this example, the model that I’m going to process contains a fact table of about 110M rows and a small dimension table of 2.5k rows.

**Model without Partitions – Full process**

In the first scenario, we notice that there’s no parallelism occurring during the processing of the fact table; that is of course because there are no partitions on this table, and as this table is large enough, we should partition it.

**Model with Partitions – Full process**

In the second scenario, I have partitioned the fact table into 12 partitions (I did not partition the dimension table, as it is way too small to benefit from it) and I ran a full process of the model again.

And this time parallelism kicks in!

As seen above, parallelism helps the model to be processed faster: the duration has been reduced by more than 30%. We can also observe that the total computation time (“Total CPU Time”) has increased, so what does this mean?

It means that processing the model faster required more CPU resources than the sequential process.

Thus the number of cores available or QPU (for Azure AS) will have a significant impact on the performance of your model. You can have a lot of partitions but if your Server or Azure AS Tier does not have enough computation power you will still be limited by the hardware.

Partitioning and parallelism can significantly improve the processing time of your model; however, it is important to always test different numbers of partitions and design your own rules depending on your own criteria (cost, acceptable processing time, near real-time needs, complexity of your model, etc.).

Also, during the example described in this post, nothing else was running on my machine at the time of processing the model, so it is also crucial to pay attention to the other things happening on your server, such as other models being refreshed, heavy running queries, etc.

This approach works well for most simple scenarios where we need to keep control over the number of partitions to refresh on a daily basis or at any other frequency.

Partitions split a table into logical partition objects. Each partition contains a portion of the data, partitions can be processed in parallel independent of other partitions or excluded from processing operations if they don’t need to be refreshed.

**Benefits of partitioning**

- **Reduce the volume of data across different environments**
- Parallel processing
- Incremental loading
- Set different data refresh frequencies; historical data don’t need to be refreshed every day

Parallel processing and incremental loading are common scenarios for using the partitions, however, another scenario where **partitions are also very useful is for reducing the volume of data across each environment**.

In this post, I’m going to show how we can dynamically partition tables in SSAS using simple SQL.

Put another way, limiting the volume of data means:

- Reducing the size of our model
- Reducing the processing time of our model
- Reducing the CPU and memory consumption
- **Reducing the cost** of our on-prem or cloud infrastructure

So regardless of whether our data are on-prem or in the cloud, the storage and computation needed to run our BI system come with a cost.

As it is common to keep 5+ years of historic data in a BI production system this is certainly not a good idea to keep historical data on a Development or a Test server.

Allocating the same computational power and storage capacity for all our servers (Dev, Test, UAT, Prod) is a waste of money *and energy*. So before jumping into the how-to-do part let me answer the why SQL question.

There are many techniques to dynamically partitioning tables in SSAS Tabular data models such as using TMSL scripting, SSIS, Powershell, .NET, and they all come with their pros and cons.

In my opinion, the most robust technique would be to use SSIS or PowerShell automation combined with a release automation tool such as DevOps; however, this would require a lot of effort to implement and maintain. **The technique I’m going to describe has the advantage of being fairly simple, very fast to implement and easy to maintain.**

**Create a Param table across your various environments**

This parameter table is the main component to configure the number of partitions across each environment.

We need to have this table in each environment; it contains different values in each environment, and of course the “NBPartitions” value should be lower in the Dev/Test environments than in the Prod environment.

Once the param table is created and set up, I use a view to keep things simple and to avoid any dependencies.

The view will dynamically generate the datekey to be used in each partition.

```
CREATE OR ALTER VIEW [ssas].[v_CubePartitionsList]
AS
SELECT p.DBName
, p.TableName
, RANK() OVER (PARTITION BY p.dbname, p.tablename ORDER BY left(datekey,6) DESC) AS partitionNumber
, dd.YEAR
, left(MIN(datekey),6) AS FromMonth
, left(MAX(datekey),6) AS ToMonth
, MIN(dd.datekey) AS FromDate
, MAX(dd.datekey) AS Todate
FROM param.ssas.CubePartitionsParam p
INNER JOIN Param.dim.Date dd
ON DateKey BETWEEN CONVERT(INT, CONVERT(VARCHAR(12), DATEADD(MONTH, -p.NbPartitions , GETDATE()), 112)) AND CONVERT(INT, CONVERT(VARCHAR(12), GETDATE(), 112))
WHERE p.PartitionBy = 'Month'
GROUP BY p.dbname
, p.tablename
, dd.Year
, left(datekey,6)
, dd.firstdayofmonth
, dd.lastdayofmonth
UNION ALL
SELECT p.DBName
, p.TableName
, RANK() OVER (PARTITION BY p.dbname, p.tablename ORDER BY dd.Year DESC) AS partitionNumber
, dd.Year
, left(MIN(datekey),6) AS FromMonth
, left(MAX(datekey),6) AS ToMonth
, MIN(dd.datekey) AS FromDate
, MAX(dd.datekey) AS Todate
FROM param.ssas.CubePartitionsParam p
INNER JOIN Param.dim.Date dd
ON Year BETWEEN YEAR(GETDATE())-p.nbpartitions +1 AND YEAR(GETDATE())
WHERE p.PartitionBy = 'Year'
GROUP BY p.dbname
, p.tablename
, dd.YEAR
```

*The script I used to create the dim table used in this view can be found on github here*

Here is a subset of the output generated by the view for the FactSalesQuota and the FactInternetSales tables.

So far we’ve seen how to configure and generate a dynamic view that contains the list of dates to be used for each partition.

Now let’s see how we can dynamically partition tables in SSAS using this view.

**Create the maximum number of partitions you need**

Here the trick is to create beforehand the maximum number of partitions that you will need in your live environment.

So let’s assume that you need 6 partitions for your production model (current year + 5 years of history; 1 year = 1 partition), but you only need 1 year in Dev and 2 years in Test.

Your model will still have 6 partitions across each environment, however, some partitions will be empty (0 rows) in Dev and Test.

By doing so, we can manage everything with the param table on each of our servers, and we don’t need any .NET, TMSL or PowerShell code.

Then we use “**Select ***” in each partition to make maintenance easier: if we need to add a new column to our fact table, we do it only once in the view pointing to the fact table, and all the partitions will reflect the change.

And of course, as a best practice, we should only include in the view the columns that we really need in our model!

```
WITH Partitions_CTE AS (
SELECT fromdate
, todate
FROM [PARAM].[ssas].[v_CubePartitionsList]
WHERE DBName = 'AdventureWorksDW2017'
AND TableName = 'FactInternetSales'
AND partitionNumber = 9
)
SELECT fct.*
FROM AdventureWorksDW2017.dbo.FactInternetSales fct
INNER JOIN Partitions_CTE c
ON fct.OrderDateKey BETWEEN fromdate AND todate
```

The two values fromdate and todate are dynamically generated by the view and will correspond to the first day and last day of September 2020.

This is what the above query will return:

In this scenario, here is the historical data we need for each environment:

- 6 months of data in Dev
- 1 year of data in Test
- 2 years of data in UAT
- 5 years of data in Prod

As we can see below, the only thing needed to dynamically partition tables in SSAS is to change the “nbpartitions” values in the param table; everything else is fully dynamic.

As already mentioned, we need to create the maximum number of partitions in our model, regardless of the environment in which the model will be deployed.

In this scenario we assume that we need 13 partitions divided by month; in production, all the partitions are refreshed.

Now let’s see how we can process only 3 months of data in our Test environment:

We could perfectly generate a TMSL command via SSMS and then run the command on-demand or schedule it to the desired frequency.

However, whenever we want to change the number of partitions to refresh we also need to amend the JSON script to add or remove a partition which can become hard to maintain.

Here is the TMSL command that we can use to process only three partitions:

```
{
"refresh": {
"type": "automatic",
"objects": [
{
"database": "AdventureWorks2",
"table": "Internet Sales",
"partition": "Internet Sales - Current Month"
},
{
"database": "AdventureWorks2",
"table": "Internet Sales",
"partition": "Internet Sales - M-1"
},
{
"database": "AdventureWorks2",
"table": "Internet Sales",
"partition": "Internet Sales - M-2"
}
]
}
}
```
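One way to ease that maintenance burden is to generate the TMSL command instead of editing it by hand; here is a minimal Python sketch (a hypothetical helper, not part of any SSAS tooling) that builds the same command for an arbitrary list of partitions:

```python
import json

def build_refresh_command(database: str, table: str, partitions: list) -> str:
    """Build a TMSL 'refresh' command covering the given partitions,
    so adding or removing a partition doesn't require hand-editing JSON."""
    return json.dumps({
        "refresh": {
            "type": "automatic",
            "objects": [
                {"database": database, "table": table, "partition": p}
                for p in partitions
            ],
        }
    }, indent=2)

print(build_refresh_command(
    "AdventureWorks2",
    "Internet Sales",
    ["Internet Sales - Current Month", "Internet Sales - M-1", "Internet Sales - M-2"],
))
```

The generated string can then be run via SSMS or an agent job like any hand-written TMSL script.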

As we can see above only three partitions have been refreshed and the nine other partitions have never been refreshed.

**Now let’s see what the SQL approach will do:**

As we can see above instead of processing only three partitions we process all the existing partitions.

However, it still produces the same result as we are dynamically filtering the partitions by joining the fact table on the dynamic view thus even if the partitions are processed they are still empty (0 rows).

Dynamically partitioning tables in SSAS Tabular can get a lot more complex than this approach, especially when we need to handle partitions with different refresh frequencies or granularities, but in scenarios where we need to refresh all the partitions, I find this technique very efficient and easy to use.

Here are the advantages/drawbacks of using this approach:

**Advantages**:

– This is straightforward to implement.

– Very easy to maintain through the param table.

– This is also quite flexible as long as we’re basing our partition on a date (day, month, quarter, semester, year…) any changes on the logic to calculate the partition size is done in a single place: the view.

**Drawbacks:**

– If we have a lot of partitions, the SQL queries will still be executed even when they return 0 rows.

But with the right index on the right column, each query shouldn’t run for more than a few seconds per empty partition, so it’s not a big deal.

– This approach works very well for scenarios where we need to refresh all the partitions (or a set of partitions) in each process; however, for more advanced partitioning strategies, such as incremental refresh or refreshing only the partitions that have changed, this approach is not suitable and we would need something more complex.