In the traditional data warehouse approach, it is very common to move data from all the source systems into an ODS database and historize it there.

However, moving and storing data from all the data sources into a single database can quickly become costly and very complex to develop and maintain.

On the other hand, moving all data sources (even those from databases) to a data lake will likely be much cheaper, quicker, and easier to develop.

Once all the enterprise data sources are stored in the data lake, we can curate, transform and move the data where we wish.

To keep things short and simple, we can say that a data lake is a storage place that holds the enterprise data in its native, raw format. As we saw above, this approach differs from the traditional one, where we usually transform the data before ingestion.

In this post, we will store the data as Parquet files, which tend to offer better performance and compression than other common file formats. But depending on your scenario and what you want to do with your data once it is in the data lake, you may need to work with Avro or even CSV formats.

So even though I use Parquet files in this post, the pipeline that we will create works with any of those formats and requires only one small change in the sink dataset configuration.

As always, to set up a Synapse pipeline we first need to define the source dataset, the sink dataset, and the linked services used by the datasets. If you are moving data from an on-premises database, which is what I'm doing in this post, we also need to configure a self-hosted integration runtime.

We first need to set up the two linked services: one for the source dataset and one for the sink dataset.

Go to Linked Services, click New, and select SQL Server. This method works with any SQL database, but I used SQL Server on-premises for this post.

The most important thing is to make our connection string dynamic, so we need to create two parameters: the ServerName and the DBName.

For the authentication, I suggest using a service account and Azure Key Vault to store the password.

There are at least two main reasons to use a dynamic connection string:

- Reusability: we can reuse the same linked service across different databases and servers.
- Deployment: by using a server parameter we can simply point to a different server depending on where we deploy the pipeline, such as Test, UAT, and Prod.

For the target dataset, we need to create an Azure Data Lake Storage Gen2 linked service. It is pretty straightforward to configure since you only need to pass the endpoint URL of your storage account. I assume that a storage account has already been configured with the right permissions granted to the Synapse managed identity; if not, you can follow the instructions here.

Similar to the linked services, we must ensure that the datasets are parameterizable so we can reuse them for every source sharing the same type and also across different environments.

For the source dataset, we make the database name and server name parameterizable, but we leave the table name unset since we're going to use dynamic SQL to incrementally load the multiple source tables.

For the sink dataset, we’re again using parameters.

- The **ContainerName**, if we want to use different containers for different projects
- The **FolderPath** is unique for each table and I recommend using something like "schema/tablename/inputfiles"
- The **FileName** will be the concatenation of the table name + the timestamp of the pipeline execution

It is now time to create our pipeline. As a sneak peek, this is what the final pipeline will look like, and I will break down each component one by one.

In order to make the Synapse pipeline dynamic and reusable across different environments, we have to create a set of parameters.

- ServerName: server source
- DBName: database source
- ContainerName: where the output files are stored
- FolderPath: where the output files of each table are stored
- FileName: Name of each output file

It does not really matter what default values I’m using since these values are going to be overridden later.

The key to incrementally loading our data with Synapse is to have a control table containing all the parameters that configure the delta load of our multiple tables.

Here is a basic script to create the control table. We can add more columns for more complex scenarios, like specific filters to apply to specific tables, or even SQL queries to be run dynamically.

```
create table [dbo].[ControlTable](
  [ID] int identity(1,1)
, [ServerName] varchar(50)
, [DBName] varchar(50)
, [SchemaName] varchar(20)
, [TableName] varchar(50)
, [DateColumn] varchar(50)
, [LoadType] varchar(10)
, [FromDate] datetime2
, [ToProcess] varchar(3)
, [CreatedDate] datetime2
, [LastUpdatedDate] datetime2
)
```

If we want to avoid configuring the control table for each table one by one, we can use a query similar to the one below; again, there are many ways to create and configure the control table.

I highly recommend anticipating any potential future changes to this table, as all the pipeline configuration will rely on the control table.

```
insert into [dbo].[ControlTable]
select 'Your Server' as [ServerName]
, DB_NAME() as [DBName]
, s.name as [SchemaName]
, t.name as [TableName]
, 'LastEditedWhen' as [DateColumn]
, 'delta' as [LoadType]
, '1901-01-01 00:00:00' as [FromDate]
, 'yes' as [ToProcess]
, getdate() as [CreatedDate]
, getdate() as [LastUpdatedDate]
from sys.tables t
inner join sys.schemas s
on t.schema_id = s.schema_id
where s.name = 'Sales'
```

This is an example of the output of the control table used for this post.

As we can see, the DateColumn used for the incremental load is always the same, which is of course good practice, but in reality this is rarely the case.

Here is the lookup activity which returns a JSON string containing all the values of each table that we need to incrementally load.

Note that we use the source dataset that we have previously created and we pass the values of the pipeline parameters to the dataset parameters.

The ForEach activity is used to loop through the list of tables returned by the lookup so we can repeat the delta load for each table according to its parameters.

In order to fetch the values related to each table, we need to pass the expression "@activity('LkpListTables').output.value".

The first part of our Synapse pipeline allowed us to retrieve all the parameters related to each table that we want to incrementally load. The second part of the pipeline runs the delta load for each table.

Since we only want to load the new data, we need to retrieve the DateTime of the last change that occurred on the table so that we can use this value as a filter for the next run.

The date of the last change becomes the “date from” of the next run and so on.

Here is the dynamic SQL command used in the lookup to retrieve the date of the last change.
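As a rough illustration of the idea (in the pipeline itself this is built with the Synapse expression language, e.g. `@concat(...)`), here is a hedged Python sketch of how such a watermark query could be assembled from the control-table fields. The helper name is hypothetical; the field names mirror the control table above.

```python
# Hypothetical helper: build the query run by the "LkpCurrentWaterMark"
# lookup, selecting the max value of the table's date column.
def watermark_query(item: dict) -> str:
    return (
        f"SELECT MAX([{item['DateColumn']}]) AS CurrentWaterMark "
        f"FROM [{item['SchemaName']}].[{item['TableName']}]"
    )

q = watermark_query({"SchemaName": "Sales",
                     "TableName": "Orders",
                     "DateColumn": "LastEditedWhen"})
print(q)
```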

Note that I tend to wait one second before running the copy activity to avoid potentially missing commits that occur within the same second; this may not be needed in your scenario. It is actually very unlikely to happen since I'm using milliseconds for my technical date column, but it is a habit I have, maybe a bad one though…

The copy activity is completely dynamic since it reads the parameter values returned by the lookup "LkpListTables".

The dynamic SQL command is as follows:
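As a hedged sketch of the shape this command takes, here is a Python stand-in that builds the delta-load query from the ForEach item and the current watermark. The helper name and exact predicate are assumptions based on the description in this post (FromDate as the exclusive lower bound, the watermark as the inclusive "<=" upper bound).

```python
# Hypothetical helper: build the source query used by the copy activity.
# Lower bound comes from the control table (FromDate), upper bound from
# the "LkpCurrentWaterMark" lookup (the "<=" filter).
def delta_query(item: dict, current_watermark: str) -> str:
    return (
        f"SELECT * FROM [{item['SchemaName']}].[{item['TableName']}] "
        f"WHERE [{item['DateColumn']}] > '{item['FromDate']}' "
        f"AND [{item['DateColumn']}] <= '{current_watermark}'"
    )

q = delta_query({"SchemaName": "Sales", "TableName": "Orders",
                 "DateColumn": "LastEditedWhen",
                 "FromDate": "1901-01-01 00:00:00"},
                "2022-05-01 10:30:00")
print(q)
```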

All the variables are given by the lookup "LkpListTables" except the "<=" filter, which is given by the lookup "LkpCurrentWaterMark".

Note that I only filter on a time interval in this Synapse pipeline, but we can easily add more filters and make everything dynamic as long as we are using the control table.

The parquet files are going to be copied into the data lake that we have previously configured.

Each Parquet file is created in a specific folder corresponding to the table name, and as we saw above, the file name is the combination of the table name and the timestamp of when the file was created. This should guarantee that each file has a unique name and also help keep track of when the file was generated.

- @pipeline().parameters.ContainerName
- @concat(pipeline().parameters.FolderPath,'/',item().TableName,'/','InputFiles')
- @concat(item().TableName,'_',utcNow())

Finally, the last thing to do before running the Synapse pipeline is to update the watermark value for the next run. In order to update the DateFrom used as a filter for the next run, I use a simple stored procedure and pass as a parameter the date of the latest change previously retrieved by the lookup "LkpCurrentWaterMark".

```
CREATE PROC [dbo].[SynapseUpdateWaterMark]
@DBName varchar(50),
@SchemaName varchar(50),
@TableName varchar(50),
@DateFrom datetime2
AS
BEGIN
SET NOCOUNT ON;
UPDATE t
SET FromDate=@DateFrom
, LastUpdatedDate=GETDATE()
FROM dbo.ControlTable t
WHERE DBName=@DBName
AND SchemaName=@SchemaName
AND TableName=@TableName
RETURN 0
END;
GO
```

Before running the pipeline or adding a trigger to schedule it, you will need to pass the parameters. Note that we can add multiple triggers to the same pipeline and thus pass different parameters for each run.

After the first run the pipeline will create the list of folders corresponding to the list of tables that we incrementally load to the data lake.

In each folder, a subfolder called "InputFiles" is created, and the Parquet files are generated there each time the pipeline runs. Note that depending on the configuration and the size of the data to load, more than one file can be created.

In this post, we created a simple Synapse pipeline to incrementally load multiple tables into Parquet files.

Even if this pipeline is simple we can easily tweak it and make it more complex with some specific rules or specific filters to be applied on some tables.

*It's been a while since I wanted to write this post, and it had stayed in my drafts for some months already. So, as done is better than perfect, I preferred to share it as it is now; hopefully it will still be helpful and clear enough, and I plan to revisit it soon.*


So let’s talk a bit about me and why I think it is now the time for me to get a mentor too.

So far I have been pretty lucky as well as happy with the way my career went.

After uni, I started my career in France as a BI consultant and I then moved to the UK for some years where I got the chance to work with incredible people.

I remember one day I was sitting with my manager, whom I now consider to have been my mentor at that time; that day he was helping me solve some MDX problems.

I had always been impressed by his MDX skills and the way he solved problems. That day I asked him how he had learned all that stuff, and he told me that everything he knew was in one book: an MDX book written by Chris Webb. That is when I started to realize that we are all learning from someone and that we need to read; and of course, I started reading that book.

Three years later I was looking for other opportunities and went to an interview that I completely failed. I was interviewed by the BI director, who was an MVP at the time, and he asked me some very precise technical questions as well as theoretical ones.

As I could not answer more than 50% of the questions, I was a bit frustrated, and I said that I had not expected to be asked such theoretical questions, since we can usually find this stuff online whenever we need it. His answer to my comment was again short and precise: "Here we value knowledge. When you're on a call with a technical customer who's challenging you, you don't have time to google it!"

These words still echo in my mind today…

That was probably the best thing that ever happened in my career; in fact, this failure changed a lot of things for me. I started to deeply learn things by buying and reading books and by purchasing paid courses. It might sound trivial, but on average most people don't do that, which is OK as long as they're clear about their goals. But once you start doing it, you quickly see the results in your day-to-day job.

A month later, for family reasons, I decided to move back home, or at least closer, so this time I was looking for new opportunities in Switzerland, where I landed a job at MSC, where I'm still working 5 years later.

A lot of things happened to me in these 5 years: I rapidly grew as an SSAS and Power BI technical leader, then as a team leader of one of the BI teams, and more recently as leader of the BI architecture team.

There are two reasons why I quickly became a technical leader: I had a mentor (my former manager) who showed me the way and the mindset needed to tackle problems and learn things, and secondly the interviewer who taught me a good lesson the hard way.

And there are also two reasons I quickly became a manager: I went out of my comfort zone, and once again I had a manager who supported me. I made several mistakes, such as saying yes to everyone or overselling things that, once in production, needed a rollback… But luckily my manager never really blamed me for that; instead, she helped me realize why things went wrong so we could prevent them from happening again.

So far I believe that my career is going well, but I still feel like there are a lot more things I should achieve, and I don't necessarily mean that I want to change jobs or get even more responsibility, at least for now.

But more things like improving my leadership skills, time management, and of course my technical skills.

Also, I love blogging but I barely find enough time to blog; sometimes it's just a lack of motivation or no clear idea of what to blog about. This is clearly an area where I'd need someone to push me.

Also, I'm used to public speaking in my company, where I lead the Microsoft community, but I have never spoken at any event so far, not even at an online group. I'm not sure why; maybe because I'd like to get some feedback before my first presentation, or maybe because of impostor syndrome, even though I'm pretty sure there are a lot of interesting things that I could share (even without knowing everything about the topic).

So to the question of why I need a mentor, it's pretty clear to me: you always need someone who has been there before you, someone who has taken more risks than you, someone who has been out of their comfort zone more times than you, someone who has failed and learned more than you, someone who knows how to give you a little nudge, or a big one if you need it!

There are of course great people to learn from in my current company, especially my new director, who is a Microsoft MVP, and I'll also constantly be learning from my colleagues. But once you're in a management position and you're the one providing technical training, or the person to go to for the more complex problems, it becomes harder to grow. Of course, I'm still learning by doing things, from my mistakes, and even from others' mistakes, but in terms of growth I feel like I'm hitting a plateau.

Everything that I achieved so far was largely due to the fact that I had the right people around me, however, most of the goals that I set for this year have not been achieved… So I think it’s time again for me to get out of my comfort zone and get a mentor that knows what it takes to achieve things and see where it leads me.

Of course, the role of the mentor is to mentor you not to do the job for you, doing what it takes to achieve more will be my role.

In this short post, we will see why Power BI Desktop consumes a lot of disk space and how we can safely reduce it.

There are multiple ways to analyze disk space usage; we can use the built-in Windows tool, but I by far prefer more advanced tools like WinDirStat or TreeSize Free.

Whatever tool you’re using to analyze the usage of your disk space don’t forget that some files will not show up unless you run the app as Administrator and of course always be cautious before deleting files from your PC.

As we can see from the scan below, Power BI Desktop is using 13.4 GB of my disk. Actually, 13.4 GB is not that much, and that is because I did some cleansing a few weeks ago; before that clean-up, my Power BI Desktop was using around 150 GB.

Now let's see why Power BI is using all that disk space. As we can see, I did not scan the install folder of Power BI Desktop, which is usually in "program files…", but instead I scanned the directory "C:\Users\yourUser\AppData\Local\Microsoft\Power BI Desktop".

AppData is a hidden folder that contains custom settings and other information needed by applications. So in theory we should not modify or remove the files in that folder as there’s a risk of breaking something or affecting the smooth functioning of the Apps installed on the machine.

But if we know what we're doing, this is pretty safe, so let's see which files we can safely delete and what the impact of removing them will be.

As the name implies, the TempSaves folder keeps copies of your Power BI report when you close it without saving.

So when we work with large files (reports with imported data) and Power BI gets closed accidentally, a copy of the open reports is saved in the TempSaves folder. And this happens quite often actually: Windows updates, Power BI stops responding, we forget to plug in our laptop, etc.

In Power BI we can change the auto-recovery settings to prevent Power BI from always saving copies of our reports, but I'd strongly suggest keeping auto recovery on; it has saved my day many times!

As we see in the scan below, Power BI can save multiple copies of the same report, so when a report file is large the space used can quickly add up and become a disk space issue.

So what can we do about it?

Well, to me the best solution is to run a disk scan from time to time and check the TempSaves folder for files that can be safely removed.

There's also another folder, "AutoRecovery", which stores the last auto-recovered files, but it is usually empty: whenever we open Power BI, a prompt asks whether we want to keep or remove the auto-recovery files, so Power BI does the cleansing for us in that folder.

Alternatively, we could create a PowerShell script to clear all files older than a specific date, but that seems like overkill to me…
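For what it's worth, here is a minimal sketch of such a cleanup, written in Python rather than PowerShell; the TempSaves path in the commented example is the one discussed above and must be adjusted to your own user profile.

```python
import time
from pathlib import Path

def clean_old_files(folder: str, max_age_days: int) -> list:
    """Delete files in folder older than max_age_days; return deleted names."""
    cutoff = time.time() - max_age_days * 86400
    deleted = []
    for f in Path(folder).iterdir():
        if f.is_file() and f.stat().st_mtime < cutoff:
            f.unlink()
            deleted.append(f.name)
    return deleted

# Hypothetical usage -- adjust the path to your own machine:
# clean_old_files(r"C:\Users\xxx\AppData\Local\Microsoft\Power BI Desktop\TempSaves", 30)
```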

As we saw above, the majority of the disk space used by Power BI comes from the copies of reports that we should clean from time to time.

Well, for the cache it is about the same: there are three types of cache that Power BI stores in the AppData folder.

The good thing about the cache files is that you don’t need to access the AppData folder to do the cleansing since you can control the maximum size and the clearing directly from Power BI.

The only disadvantage of clearing the cache directly from Power BI is that we cannot decide to keep the recent cache and get rid of the old cache files only. So if we want to manually clean the files (which I never do) we can still do it.

Here are the folders associated with each cache:

– Data Cache Management: *"C:\Users\xxx\AppData\Local\Microsoft\Power BI Desktop"*

– Q&A Cache: *"C:\Users\xxx\AppData\Local\Microsoft\Power BI Desktop\Lucia Cache"*

– Folded Artifacts Cache: *"C:\Users\xxx\AppData\Local\Microsoft\Power BI Desktop\FoldedArtifactsCache"*

Also, just to avoid any confusion, I want to add that these 3 cache options are not related to the VertiPaq engine cache or the report cache. If you want to find out more about the report cache and the VertiPaq cache, you can check Marco's YouTube video The Tale of Two Caches.

The **Data Cache Management** is used for Power Query (always keep at least 32 MB to make sure that Power BI can preview the first 100 rows).

The **Folded Artifacts Cache** is used to improve the performance of direct queries (if it folds…)

And the **Q&A cache** speaks for itself; so far I have never met anyone using it anyway, except for a demo.

As we saw in this post, Power BI is not supposed to take up much of our disk space, since we can control how much cache we want to allocate and easily clear it from Power BI directly.

However, even though auto recovery is a great feature that I encourage you to always keep on, there's no way to clear the auto-recovered files from Power BI directly, so it has to be done manually from time to time.

Some time ago I was tasked with migrating a multidimensional cube to a tabular model. This cube had around 2 billion rows, and I also had to create a median measure on the new tabular model. I first thought that the median would be much faster to calculate on a tabular model than on a multidimensional one, but it turns out that the median was still extremely slow to compute on tables with large numbers of rows.

At that time I had found a workaround, but recently I decided to revisit it, as it was a bit overkill, and this time I came up with a simple optimized median measure in DAX. This solution has some limitations that I will describe below, but in most cases it performs much faster than the built-in Median function; on a large dataset it is literally 1,000 times faster, and even more.

The **median** is the value separating the higher half from the lower half of a dataset.

It may be thought of as “the middle” value of a sorted list. The main advantage of the median in describing data compared to the mean is that it is not skewed by a small proportion of extreme values, in other words, it is more robust to outliers than the mean.

As an example, Median Income is usually a better way to suggest what a “typical” income is because income distribution can be very skewed.

If 10 BI developers were to sit together the median income and the mean income would likely be very close but if Warren Buffet was to sit with the 10 BI developers the average would be misleading and much higher than the median.

If the median tends to be slow to compute, it is because to determine the median value in a sequence of numbers, the numbers must first be sorted. So it is the sorting operation that makes the median slow. If we were to find the median of a presorted sequence of values, the result would be immediate. The **computational time complexity** of the median is **O(n log n)**, which means it does not scale very well.

As we can see below, I ran a quick simulation for 100 values in R; the median does not follow a linear time complexity O(n). For 1 billion rows, the relative cost of the median would be around 9 billion (n × log n, taking the log base 10).

There are some optimized algorithms to calculate the median, which mainly rely on optimized sorting and selection algorithms such as quicksort, but these algorithms cannot be implemented in DAX as they require loops and recursion.

So how can we calculate the median in a more efficient way in DAX?

The limitations of the built-in DAX function:

- It is very slow to compute the median over a large dataset
- You're likely to run into a "not enough memory" error; on my 64 GB machine I was not able to compute the median on more than 300M rows
- Even if you have the most powerful server, there's still a limitation to calculating the median using the built-in DAX Median function: we cannot compute the median on a table that contains more than 2 billion rows. I initially found this limitation on Chris Webb's blog, and I confirm that as of today the limitation is still there.

The limitations of the optimized median measure in DAX:

With the solution that I propose, there's no limitation on the dataset size; I even ran it on a table with more than 5 billion rows. However, there is an important limitation to keep in mind, and I will dive deep into it in another section below.

This time the limitation is not on the number of rows but on the number of **distinct values of the column for which we want to compute the median**.

In most business scenarios, we are usually interested in the median of variables that have a small number of distinct values. And even if a variable does have a large number of distinct values, I will show a simple trick to reduce it while still keeping the median 99% accurate.

Before going into the code of the optimized median measure in DAX, it is important to understand how the calculation of the median works.

So assuming that the sequence of observations is sorted the Median is as follows:

The formula is pretty simple: we only need to check whether the number of rows "n" is odd or even and apply the above calculation accordingly. So now, if we want to calculate the median in DAX without the Median() function, we only need the data sorted and a way to select the nth row.
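To make the odd/even cases concrete, here is the textbook calculation sketched in Python: sort the values, then take the middle element if n is odd, or the average of the two middle elements if n is even.

```python
def median_sorted(values):
    """Textbook median: sort, then pick the middle of the sorted list."""
    x = sorted(values)
    n = len(x)
    if n % 2 == 1:                      # odd: the (n+1)/2-th value (1-based)
        return x[n // 2]
    # even: average of the n/2-th and (n/2 + 1)-th values (1-based)
    return 0.5 * (x[n // 2 - 1] + x[n // 2])

print(median_sorted([5, 1, 3]))     # 3
print(median_sorted([5, 1, 3, 7]))  # 4.0
```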

As we saw above, the main issue with the median is that we need the sequence of values sorted, and in DAX we have no control over that.

However, we have another huge advantage, which is the VertiPaq engine: under the hood, VertiPaq applies algorithms to compress and store the data in a way that lets it quickly compute the number of rows for a specific value.

So we know how to calculate the median, and we know that VertiPaq is very good at computing aggregations such as counting the number of rows by year or the number of rows by product…

So this is how I suggest writing the optimized median measure in DAX:

```
median optim =
var __nbrows=[nbrows]
var __isod=if(__nbrows/2>int(__nbrows/2),1,0)
var __n=if(__nbrows/2>int(__nbrows/2),int(__nbrows/2)+1,int(__nbrows/2))
var __list= ADDCOLUMNS(
    SUMMARIZE('Table','Table'[median_column]),
    "cml",calculate([nbrows],
        'Table'[median_column] <= EARLIER('Table'[median_column])
    )
)
return
if(__isod=1,
    minx(FILTER(__list,[cml]>=__n),[median_column]),
    0.5*(minx(FILTER(__list,[cml]>=__n),[median_column])+minx(FILTER(__list,[cml]>=__n+1),[median_column]))
)
```

**Let’s now break down this measure in three and see how it works:**

The first part of the measure is just a setup:

- Get “n” the number of rows
- Check whether the number of rows is an odd or even number
- Find the nth row according to whether the number of rows is odd or even

```
var __nbrows=[nbrows]
//Find out if the number of rows is an odd or even number
var __isod=if(__nbrows/2>int(__nbrows/2),1,0)
//If the number of rows "n" is odd we divide it by 2 and add 1 (n+1)/2 else we keep n divided by 2 (n/2)
var __n=if(__nbrows/2>int(__nbrows/2),int(__nbrows/2)+1,int(__nbrows/2))
```

The second part is where the magic happens since we’re leveraging the Vertipaq engine.

- We create a virtual aggregated table that contains two columns
- The first column, "median_column", contains the distinct values of the column for which we want to compute the median
- The second column contains the cumulative count of rows up to each value of the median column

```
var __list= ADDCOLUMNS (
    SUMMARIZE('Table',
        'Table'[median_column]
    ),
    "cml",calculate(
        [nbrows],
        'Table'[median_column] <= EARLIER('Table'[median_column])
    ))
```

Finally, we just iterate through the virtual table and return the first value whose cumulative row count is greater than or equal to the "nth" value. If the number of rows is even, we return the average of the "nth" and "nth+1" values.

```
return
//if n is odd we return min of "cml">=(n+1)/2 else we return (min of "cml">=(n)/2 + min of "cml">=(n+1)/2)/2
if(__isod=1,
minx(FILTER(__list,[cml]>=__n),[median_column]),
0.5*(minx(FILTER(__list,[cml]>=__n),[median_column])+minx(FILTER(__list,[cml]>=__n+1),[median_column]))
)
```
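To make the mechanics concrete, here is a hedged Python sketch of the same algorithm the measure implements: group by distinct value, build cumulative row counts over the sorted distinct values, then pick the first value(s) whose cumulative count reaches the nth row, without sorting all rows individually.

```python
from collections import Counter

def median_via_counts(values):
    counts = Counter(values)               # like SUMMARIZE + [nbrows] per value
    n = len(values)
    is_odd = n % 2 == 1
    nth = n // 2 + 1 if is_odd else n // 2

    cml, cml_counts = 0, []                # cumulative counts, sorted by value
    for v in sorted(counts):
        cml += counts[v]
        cml_counts.append((v, cml))

    def first_at_least(k):                 # MINX(FILTER(__list, [cml] >= k), value)
        return next(v for v, c in cml_counts if c >= k)

    if is_odd:
        return first_at_least(nth)
    return 0.5 * (first_at_least(nth) + first_at_least(nth + 1))

print(median_via_counts([1, 1, 2, 3]))     # 1.5
```

The cost of this approach depends on the number of distinct values rather than the number of rows, which is exactly why the DAX version scales so well on low-cardinality columns.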

As stated earlier, this measure is extremely fast compared to the built-in function; however, it can also be very slow when the cardinality of the column for which we want to compute the median is a bit large.

Now, let’s see in which scenario we should stick to the built-in median and when we should start using the optimized median measure.

To get these figures, I ran a benchmark in DAX Studio with 5 cold cache executions and 5 warm cache executions.

The table that I used for the first benchmark had 20 distinct values repeated X number of times; for example, the table below has 200M rows but only 20 distinct values.

**Cold Cache Execution comparison – Number of rows comparison**

As we can see from the above graph, the optimized median measure is 500 times faster than the built-in function when the table has 250M rows. We can also see that the built-in Median function scales very badly: at only 50M rows, which I consider small, the Median measure already takes 26 seconds to render, which would be an issue in a report.

**Warm Cache Execution – Number of rows comparison**

When the cache is used, the optimized measure is 10,000 times faster! Of course, it makes more sense to compare measures with cold cache executions, but this illustrates that the optimized median measure is better at leveraging the cache than the classic median function.

The big drawback of using the optimized median measure is that it scales very badly as soon as the number of distinct values of the median column increases.

So for the next benchmark between the two measures, I use a dataset of 100M rows; this time the number of rows will not change, only the number of distinct values will.

**Cold Cache Execution – Distinct values of the Median column comparison**

This time we see that the built-in Median function is not impacted at all by the number of distinct values of the median column, which makes perfect sense since it always scans the entire table, so the cardinality does not matter at all, only the number of rows. And it is actually the exact opposite for the optimized median measure…

The short answer is of course it depends, and here is a more detailed answer:

- Very small dataset with fewer than 10M rows: the built-in median should be fine
- For any large or very large dataset with a median column that has a small number of distinct values (fewer than 10k), you can use the optimized median measure without a doubt
- For a dataset with more than 2 billion rows, you have no choice but to use the optimized median measure
- Large dataset and median column with a high number of distinct values: you're stuck! Or we can reduce the number of distinct values and use what I call the approximate median.

*You can find online some interesting research papers on the approximation of the median but my approach is nothing like this.*

As we saw above when we have a large dataset and when the column for which we want to calculate the median has a high number of distinct values none of the measures performs well.

So the solution that I propose is to calculate an approximate median by reducing the number of distinct values. Of course, by reducing the number of distinct values we pay a price in accuracy; however, the median will remain very close to the real median.

First and easiest approach:

If the median column is a decimal number, we simply reduce the number of decimals; in that case we lose very little accuracy. However, if the cardinality is still too large after that, we can use the second approach.

Second approach:

If the median column is a whole number, we divide the median column by 10 or more until we reach a small enough number of distinct values; we then compute the median and multiply it back by the divisor.

**Let’s see how it works:**

We have a dataset of 500M rows, and the column for which we want to compute the median has 27,372 distinct values. We already know that neither of the two measures will work well, since the number of rows is too high for the built-in median and the cardinality of the median column is too high for the optimized median measure.

We have to reduce the number of unique values of the median column, so we divide the median column by 10 and we make sure to round the result to the nearest integer.

```
Median Column /10 = ROUND('Table'[Median Column]/10,0)
```

*(If you need to implement such a solution, I would advise applying this transformation in the ETL part and not in a DAX calculated column.)*

Now we have 5,728 unique values which is good enough for running the optimized median measure.

We can now compute the approximate median, which is based on Median Column /10; the result of the approximate median is 1,191, which we then multiply back by 10, returning 11,910.

As we can see by dividing the column by 10 we’re losing very little accuracy and in most scenarios having an approximate median will be more than enough.

At maximum in this specific scenario, I can miss the real median by 5 units which represents only 0.018%.

If we have a higher cardinality, let’s say 10 million, we will need to divide the median column by 1,000, which means we can miss the real median by 500 units at most; but 500 over 1,000,000 is only 0.05%, so still quite accurate.
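The divide-and-round trick described above can be sketched outside of DAX; here is a minimal Python illustration with randomly generated values (the column values and divisor are made up, the post implements this as a DAX calculated column):

```python
import random
import statistics

random.seed(42)
# Hypothetical high-cardinality median column.
values = [random.randint(0, 100_000) for _ in range(100_001)]

exact = statistics.median(values)

# Reduce cardinality: divide by 10 and round to the nearest integer,
# as in the DAX calculated column above...
reduced = [round(v / 10) for v in values]

# ...then compute the median on the reduced column and multiply it back.
approx = statistics.median(reduced) * 10

# Rounding moves each value by at most 5 units, so the approximate
# median can miss the exact median by at most 5 units as well.
assert abs(approx - exact) <= 5
```

Since rounding is monotonic, the order of the values is preserved, which is why the error of the approximate median is bounded by half the divisor.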

As we saw, the optimized median measure in DAX can be 1,000 times faster than the built-in function, but it can also become very slow when the cardinality of the median column is too large. When that is the case, we can use the approximate median, although we will lose up to 0.05% of accuracy, which should be acceptable.

As for the cardinality of the median column, I found that the ideal threshold is around 10,000. This might vary for you if you have a bigger server; I ran the benchmark on my personal machine, which has 64GB of RAM and 16 logical cores.

Apart from the limitation on the median column cardinality, the optimized median measure can run on very large datasets: I ran it on a dataset with more than 5 billion rows and it still returned the median in less than a second.

Here are some figures; as we can see, for 2.5B rows it does not even take half a second to run.

This was quite a long post for just a simple median measure, but I did not want to throw out the DAX code without providing any context or details. Any measure optimization requires a good understanding of the context and of how the measure is going to be used, which is why I wanted to highlight all the pros and cons of using the optimized median measure.

I believe this measure will work amazingly well for most scenarios, and I’m pretty sure it can still be tweaked a little to make it even faster or more customizable, but at this point I’m really happy with the performance, thanks to the way VertiPaq compresses data and precalculates things so quickly.

The public preview of SQL Server Analysis Services 2022 was released in May 2022, and at the time of writing this post, SSAS 2022 is still in public preview.

One might ask if it is still worth using an on-prem solution (or IaaS), and the answer is always “it depends”, but there are still some valid reasons for companies to stick with an on-prem/IaaS solution.

- Cost-wise (this one really depends!)
- Memory and CPU limits: on your own physical server or virtual machine you can allocate more than 400GB of memory (since Gen2, Power BI manages memory dynamically, but still…)
- Security (even if I tend to believe that the cloud is probably more secure than an on-prem solution…)
- No internal skills to move to the cloud
- Governance, compliance, intellectual property, etc.

These are just some examples and I’m sure the list could go on, but of course, we can also argue against these reasons.

Anyway, the purpose of this post is not to discuss cloud versus not cloud but rather to emphasize that, as of today, there are still a lot of companies, even large ones, that heavily rely on SQL Server Analysis Services. Thankfully Microsoft understood this and came up with this new release, which for me is a game changer for model and report designers.

Before jumping into the new features here are the two Microsoft links where you can find more information about the new release and keep up to date with new feature releases or cumulative updates:

The Power BI blog: What’s new in SQL Server 2022 Analysis Services CTP 2.0

The Microsoft learn doc: What’s new in SQL Server Analysis Services

**So what’s new in SSAS 2022?** According to Microsoft here are the newly available features:

New features and improvements in CTP 2.0

- **The big game changer: Composite Models**
- Improved MDX query performance
- Improved resource governance
- Query interleaving – short query bias with fast cancellation

New features and improvements in RC0

- Horizontal Fusion
- Parallel Execution Plans for DirectQuery

Although there’s no information about it on the Microsoft side, there’s also great news for DAX developers! All, and I repeat, all the new or updated DAX functions are supported in Analysis Services 2022!

Without further ado, let’s review these new features together. Since I want to keep this post short enough, I will only review the **Composite Models** and the **new DAX functions** and will write another post for the performance part. So this post is more intended for PBI and SSAS developers, while the other post will target a broader audience, such as developers but also DBAs or infrastructure administrators.

The composite models feature in SSAS is documented here: Using DirectQuery for Power BI datasets and Analysis Services (preview)

This new feature is already well known by the Power BI and Azure AS community, so there’s no need to repeat the Microsoft documentation. I did some tests and, as far as I can see, all the limitations on PBI or Azure AS also apply to SSAS. Except for a few things that I detail below, I did not notice any difference; the only thing that differs is that **you must configure a data source gateway for any model used in your composite model** as long as they are SSAS models.

The great thing about this game-changing feature is that you can now create a report that combines tables from different models and no longer need to create siloed reports for each department or duplicate data into every single model. This will save tons of time and greatly simplify model creation. However, you should still make sure to follow the composite model guidance, otherwise you may end up with overly complex or slow reports.

**Only Analysis Services 2022 models are supported**

On top of the known limitations listed in the Microsoft documentation, you have to bear in mind that only models deployed on Analysis Services 2022 are supported (Azure AS and PBI are of course supported too). So if you try to directly connect to a model deployed on an older version, such as SSAS 2019, you will get the following error message:

So to fully leverage the composite model on your on-prem/IaaS architecture, you must migrate all your models to SSAS 2022.

**Relationships with SQL Direct query mode do not work**

This one sounds like a bug to me: if you try to create a relationship with a SQL DirectQuery table, Power BI will crash and you will need to kill its process. I tried to reproduce it using an Azure model with a SQL DirectQuery table and it worked just fine.

I tried to connect and link my Analysis Service 2022 model with 4 other models.

- Local model: light blue header bar (since I use a live connection, my local model is not actually on my laptop but on the SSAS server)
- SSAS 2022 Direct Query: Dark blue header bar
- Azure AS: Red header bar
- SQL Import: No colour and sometimes Yellow header bar (looks like a small bug to me)
- SQL Direct Query: Purple header bar

I have to say that being able to easily distinguish each model location by colour is a great feature.

As of now, there’s no information about the new DAX functions supported in SSAS 2022, so I will give a bit more detail and present a couple of new or improved functions.

As you already know, there’s a new Power BI release each month, and in some releases Microsoft introduces new DAX functions or improves existing ones; some of the recently added or improved functions were not available in SSAS 2019.

So the first thing that I wanted to check after I installed Analysis Services 2022 (after the composite model, of course) was whether all the recent new and updated DAX functions were supported in SSAS 2022.

And the short answer is a big YES, but that’s not all! At the time of writing this post, there’s a hidden, not-yet-released function in Power BI called OFFSET, and it turns out that this function is also supported in Analysis Services 2022.

Until Microsoft officially releases this new function and related documentation you can find more about it in Marc’s blog post: How OFFSET in DAX will make your life easier

As Marc suggested this function is going to make writing DAX much easier so having this function also available in Analysis Services 2022 is great news for the DAX developers.

Let’s now deep dive into some of the new and updated DAX functions that I like the most and which will for sure make DAX coding easier.

The full list of new functions documented by Microsoft can be found here: New DAX functions. *In reality, there have been more functions added within the last few years but I’m guessing Microsoft is listing only the most important functions.*

This is for me by far the biggest change; in fact, there are at least two improvements to the CALCULATE function that make DAX much easier to write.

**Multiple filter conditions in Calculate:**

If you were to write this measure in SSAS 2019:

```
multiple filters in calculate =
CALCULATE (
    SUM ( Sales[Net Price] ),
    'Product'[Color] = "Red" || 'Product'[Brand] = "Contoso"
)
```

You will get the following error message:

Surprisingly, I discovered that the following function was working in SSAS 2019

```
CALCULATE (
    SUM ( Sales[Net Price] ),
    'Product'[Color] = "Red", 'Product'[Brand] = "Contoso"
)
```

But this exact same function was not:

```
CALCULATE (
    SUM ( Sales[Net Price] ),
    'Product'[Color] = "Red" && 'Product'[Brand] = "Contoso"
)
```

Anyway, they both now work on Analysis Services 2022.

In addition, another great improvement to the CALCULATE function is the ability to use an aggregation function in Boolean filter expressions.

The following DAX expression did not work on SSAS 2019 but works fine on SSAS 2022:

```
Total sales on the last selected date =
CALCULATE (
    SUM ( Sales[Sales Amount] ),
    'Sales'[OrderDateKey] <= MAX ( 'Sales'[OrderDateKey] )
)
```

I only detailed the updates to the CALCULATE function, but the same updates apply to CALCULATETABLE, and everything seems to work fine.

Another great function now available is NETWORKDAYS, even though I prefer using my own date table and maintaining it with the internal ETL process. This function was only released recently (after the CTP 2.0 release of Analysis Services 2022) so I did not expect to find it, but it is there. (Maybe it was added in the RC0 update…)

The OFFSET function was probably the one that I expected the least to be available, since it’s not even fully released in Power BI yet, but to my great surprise, it is also there.

I will not elaborate much on it since I already mentioned it above, but all I can say is that having this function available means that, at the time of writing this post, Analysis Services 2022 is fully up to date with Power BI and Azure AS in terms of DAX functions.

In this part, I only described and reviewed the new Analysis Services 2022 features related to the composite model and the support of the new DAX functions. In a second post I will review the new features mainly related to performance improvements.

So except for one bug that I found (there may be a few others not yet identified), I can say that for DAX and composite models, SQL Server Analysis Services 2022 is now a complete equivalent to Azure Analysis Services. So if you still have an on-prem/IaaS solution, make sure to check when this release reaches general availability, as it comes with great new features and improvements.

Of course, it still lags behind Power BI Premium in terms of features, but that’s another discussion.

There are three main types of chi-square tests, but in this post we will focus only on the chi-square test for independence.

A **chi-square test for independence** compares two variables (which can hold multiple values) in a contingency table to see if there is a relationship between them. In other words, it tests whether the distributions of categorical variables differ from each other.

Another common type of chi-square test (not covered in this post) is the “**chi-square goodness of fit test**”; it helps determine whether a sample frequency matches a known or assumed population frequency.

- **Social research**: to determine if there is a relationship between a voter’s opinion and their level of income
- **Clinical trial**: to determine if the treated group and control group are associated with each other or not (similar outcome after real treatment and inert treatment)
- **Market research**: to determine if certain types of products sell better in certain geographic locations than others

Before jumping into the implementation of the chi-square test in Power BI, let’s look at the assumptions of the chi-square test.

- Both variables are categorical
- All observations are independent
- The values of each variable are mutually exclusive.
- The sample size should be large enough: in theory, at least 80% of the expected cell counts should be greater than or equal to 5, and no expected count should be equal to or lower than 1. Fisher’s exact test would be more appropriate in such scenarios.
- The study groups must be independent

**Why implement a Chi-square test in Power BI with DAX instead of R/Python?**

- Fully interactive visuals; custom visuals using R or Python are not
- Security policy within your organisation: Limitation to deploy python or R script to the portal
- Force you to better understand the test statistics process
- Easier to maintain: one language, one tool, integration and deployment easier, reusability
- No need to know other languages such as R or Python

To implement the Chi-square test in Power BI I used a dataset that contains data about a clinical trial which involved 105 patients separated into two groups “Treated” and “not treated”.

50 patients received a real treatment (“treated” group) while 55 received a placebo (“not treated” group). After two months we compare and observe the improvement ratio between the two groups.

The model contains the main datasets that I described above and a parameters table CI which I will describe later.

The contingency table can be created with the DAX code as follows:

```
contingency table =
SUMMARIZE (
    'treatment',
    treatment[improvement],
    treatment[treatment],
    "nb rows", COUNTROWS ( treatment )
)
```

By looking at the contingency table above, it seems that the health condition of treated patients has improved after 2 months; now we need to confirm whether this was due to chance or if this positive effect of the treatment is statistically significant.

So here the **null hypothesis** (H₀) states that the distribution of the health condition outcome is independent of the two groups (treated and not treated).

The **alternative hypothesis** (Hₐ) states that there is a difference in the distribution of the health condition outcome between the two groups (treated and not treated).

The observed frequency is basically the number of values in each category or group.

```
Frequency = sum('contingency table'[nb rows])
```

Before computing the Chi-square test in Power BI we first need to calculate the expected frequency.

The expected frequency is the count we would expect in each cell if the two variables were independent; it spreads the row and column totals proportionally over the cells.

To calculate the expected frequency, we therefore assume that the variables are not related to each other.

And the DAX formula is as follows:

```
Expected Frequency =
VAR __rowTotal =
    CALCULATE ( [Frequency], ALLEXCEPT ( 'contingency table', 'contingency table'[improvement] ) )
VAR __colTotal =
    CALCULATE ( [Frequency], ALLEXCEPT ( 'contingency table', 'contingency table'[treatment] ) )
VAR __total =
    CALCULATE ( [Frequency], ALL ( 'contingency table' ) )
RETURN
    DIVIDE ( __rowTotal * __colTotal, __total )
```

*For example, to get the expected frequency of group = not treated and health condition = improved, we compute (55*61)/105 = 31.95.*
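As a sanity check, the same row-total × column-total / grand-total logic can be reproduced in a few lines of Python, using the totals from the post (50 treated, 55 not treated, 61 improved, 105 patients overall); the per-cell observed split is not needed for the expected counts:

```python
# Expected frequencies under independence, mirroring the DAX measure above:
# expected = row_total * col_total / grand_total.
row_totals = {"improved": 61, "not improved": 44}
col_totals = {"treated": 50, "not treated": 55}
grand_total = 105

expected = {
    (r, c): rt * ct / grand_total
    for r, rt in row_totals.items()
    for c, ct in col_totals.items()
}

# Matches the worked example: (55 * 61) / 105 ≈ 31.95
print(round(expected[("improved", "not treated")], 2))  # → 31.95
```

Note that the expected counts always add back up to the grand total, which is a quick way to verify the measure.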

In order to run the Chi-square test in Power BI, we simply need to compare the Expected Frequency with the Observed Frequency.

In this scenario, it seems obvious that there’s a significant difference between the two groups; however, depending on the confidence interval that we want, we may not be 99% sure that the observed change is not due to chance, so we do need a statistical test to confirm it.

Also in more complex scenarios when we have more than two categories, the result is less intuitive.

The formula to compute the Chi-Squared is pretty straightforward since we already have computed the Observed and Expected Frequencies.

The chi-square statistic is χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ, where Oᵢ is the observed value and Eᵢ is the expected value of each cell.

Here is the DAX formula to compute the Chi-square test in Power BI

```
X2 =
SUMX (
    'contingency table',
    DIVIDE ( ( [Frequency] - [Expected Frequency] ) ^ 2, [Expected Frequency] )
)
```

In order to validate or reject the null hypothesis, we now need to compare the Chi Squared statistic to the critical value or calculate the p-value.

An important note here: **whether the observed frequency is smaller or larger than the expected frequency** makes no difference in the X2 result. So the greater the difference between the two groups (< or >), the higher the X2 will be.
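The SUMX measure above is just a sum over the cells of the contingency table; a Python sketch with hypothetical observed and expected counts makes the arithmetic concrete:

```python
# Chi-square statistic: X2 = sum of (observed - expected)^2 / expected
# over every cell of the contingency table. Counts are illustrative.
observed = [10, 20, 30, 40]
expected = [15, 15, 35, 35]

x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# As noted above, the sign of (observed - expected) does not matter,
# since the difference is squared.
print(round(x2, 3))  # → 4.762
```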

The degrees of freedom (or df) are used to determine the critical value or to compute the p-value.

Usually, degrees of freedom relate to the size of the sample, so a higher df means a larger sample and thus reduces the false positive ratio.

For the chi-square test, the degrees of freedom do not rely on the sample size but on the number of groups and categories to analyse, so it tells us **how many values in our grid are actually independent**.

The formula of the DF for the chi-square is as follows: **df = (r-1)(c-1)** where r is the number of rows or groups and c is the number of columns or categories

| Term | Value |
| --- | --- |
| r − 1 | number of rows − 1 |
| c − 1 | number of columns − 1 |
| degrees of freedom | (r − 1)(c − 1) |

*Since there are two columns and two rows in our scenario, the df is equal to 1: (2-1)(2-1)=1.*

The chi-critical value is **the cutoff between retaining or rejecting the null hypothesis**. If the chi-statistic value is greater than the chi-critical value, meaning it is beyond it on the x-axis, the null hypothesis is rejected and the alternate hypothesis is accepted.

Here is the DAX formula to calculate the chi critical value:

```
chi_crit_val =
VAR __df = [deg freedom]
RETURN
    CHISQ.INV.RT ( 1 - FIRSTNONBLANK ( CI[Probability], 1 ), __df )
```

To calculate the critical value, we use the DAX function CHISQ.INV.RT, which requires two parameters: the probability associated with the chi-square distribution and the degrees of freedom.

As for the probability or level of confidence we have in our model a table which allows us to dynamically change the value of the probability.

So with a confidence interval of 95% and with a degree of freedom of 1, the critical value is 3.841.
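There is no CHISQ.INV.RT equivalent in the Python standard library, but for df = 1 the right-tail probability has a closed form, erfc(sqrt(x/2)), because a chi-square with one degree of freedom is a squared standard normal; the critical value can then be recovered by bisection. A stdlib-only sketch:

```python
import math

def chisq_inv_rt_df1(p: float) -> float:
    """Inverse right-tailed chi-square for df = 1 (CHISQ.INV.RT analogue)."""
    lo, hi = 0.0, 100.0
    for _ in range(200):
        mid = (lo + hi) / 2
        # Right-tail probability for df = 1: P(X > x) = erfc(sqrt(x / 2)).
        if math.erfc(math.sqrt(mid / 2)) > p:
            lo = mid  # tail still heavier than p, move right
        else:
            hi = mid
    return (lo + hi) / 2

# 95% confidence, df = 1 → the 3.841 critical value quoted above.
print(round(chisq_inv_rt_df1(0.05), 3))  # → 3.841
```

For higher degrees of freedom you would need the regularized incomplete gamma function (e.g. from SciPy), which is exactly what CHISQ.INV.RT hides from us.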

To find out more about the critical value or how to calculate it by hand, you can refer to my previous post PAIRED T-TEST IN POWER BI USING DAX and read the t-critical value section, which is more detailed than this post. (The only difference is that for a chi-square test we need to use the chi-square distribution table instead of the t distribution table.)

One important thing to note is that we usually run a right-tailed test for the Chi-Square test since we want to test the difference between the two groups.

A Chi statistic value greater than the upper critical value would mean that there is a significant difference between the two groups while a Chi statistic value below the lower critical value (left tail) would mean that the resemblance between the two groups is too good to be true.

A p-value is used in hypothesis testing to help us support or reject the null hypothesis. The p-value is the evidence **against** a null hypothesis. The smaller the p-value, the stronger the evidence that we should reject the null hypothesis.

The built-in DAX function to calculate the right-tailed p-value of the Chi test is CHISQ.DIST.RT. This function requires two parameters the **chi-stat** and the **degrees of freedom**.

And here is the DAX formula:

```
p-value chi =
VAR __df = [deg freedom]
VAR __x2 = [X2]
RETURN
    CHISQ.DIST.RT ( __x2, __df )
```
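For the df = 1 case of this post, the CHISQ.DIST.RT result can be checked against the same stdlib-only closed form (a sketch, valid only for one degree of freedom):

```python
import math

def chisq_p_value_df1(x2: float) -> float:
    # Right-tailed p-value for df = 1: a chi-square with one degree of
    # freedom is a squared standard normal, so P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x2 / 2))

# A chi statistic exactly at the 95% critical value gives p ≈ 0.05.
print(round(chisq_p_value_df1(3.841), 3))  # → 0.05
```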

Note that in this post we only run a right-tailed chi-square test; there is also another DAX function to run a two-tailed test, but it is rarely used for chi-square.

Strictly speaking, since we are only checking if two (or more) variables are independent there is no confidence interval for a chi-square test such as in the t-test.

In the Power BI report, you can interact with the “CI” slicer, which sets the level of confidence required to reject the null hypothesis. And of course, the higher the probability, the higher the critical value will be, and the smaller the p-value will need to be to reject the null hypothesis.

And here is the final result of the chi-square test in Power BI. As we can see, there is a statistically significant difference between the expected and observed frequencies, since the X2 is greater than the critical value and/or the p-value is lower than the alpha of 0.05.

In this post, I covered the implementation of the **chi-square test in Power BI**, but I only focused on the most common chi test, which is Pearson’s chi-squared test; **there are other tests** that apply in more specific scenarios.

I previously wrote other posts about applying statistics in Power BI using DAX only, like AB testing using Power BI. In this DAX statistics post I wanted to show that implementing a chi-square test in Power BI using DAX only is perfectly feasible; it just requires a bit more work than typing two lines of code in R, but with the help of some built-in DAX functions the result is great.

Here is the Chi-square test in Power BI

The Azure REST API limits the number of items it returns per result, so when the results are too large to be returned in one response we need to make multiple calls to the REST API.

So when the results are too large to be returned on one page, the response includes a “nextLink” property: the URL pointing to the next page to call, and this continues until we reach the last page.

Here is an example of the “nextLink” property returned in the response body:

```
{
"value": [
<returned-items>
],
"nextLink": "https://management.azure.com/{operation}?api-version={version}&%24skiptoken={token}"
}
```

Now let’s see how to create a simple pipeline to perform pagination in azure data factory.

For this pipeline, we will retrieve the list of datasets by Factory, this can be useful to retrieve this information when building an enterprise data catalog or data lineage.

At the time of writing this post, the maximum number of datasets that can be returned per request is 50, and there are 67 datasets in my factory, so I need to make two calls to retrieve the whole list.

You can try the Azure REST API and retrieve the URL to call here.

And here is the number of datasets in my ADF:

The good news is that there’s a built-in option that allows us to perform pagination in Azure Data Factory very easily (this technique applies to ADF and Synapse pipelines; for Data Flow it is slightly different).

Here are the steps to follow:

- Create a Rest Linked Service if not already done
- Create the datasets Source and Sink
- Source: Rest Dataset and pass the URL provided by “Rest API try it” as a parameter
- Sink: A JSON dataset since the structured response is in JSON format (you can of course parse the JSON response later to fit your needs)

- Create a Copy activity that uses the two created datasets
- Configure the source copy activity
- Configure the sink copy activity

Here the trick is to define the pagination rule type as “body” and the value as “nextLink”; we don’t need to set a stop condition, as ADF will automatically stop as soon as the AbsoluteUrl is empty. In some other scenarios, we would probably need to set a stop condition.
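Under the hood, the copy activity is performing the classic follow-the-link loop. A minimal Python sketch of that loop (the URLs and the fetcher are hypothetical stand-ins for the authenticated HTTP call):

```python
def collect_all_items(first_url, fetch_page):
    """Follow 'nextLink' until it is absent, accumulating 'value' items."""
    items, url = [], first_url
    while url:
        page = fetch_page(url)            # stand-in for the HTTP GET
        items.extend(page.get("value", []))
        url = page.get("nextLink")        # empty/absent on the last page
    return items

# Fake two-page responses mimicking the body shown earlier in the post.
pages = {
    "https://example/datasets?api-version=v1": {
        "value": ["ds1", "ds2"],
        "nextLink": "https://example/datasets?page=2",
    },
    "https://example/datasets?page=2": {"value": ["ds3"]},
}
print(collect_all_items("https://example/datasets?api-version=v1", pages.get))
# → ['ds1', 'ds2', 'ds3']
```

The pipeline’s “body / nextLink” rule encodes exactly this: keep requesting the URL found in the body until no nextLink comes back.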

Since ADF or Synapse appends the multiple responses into a single file, it would break the JSON format and lead to an invalid JSON file, so the trick here is to change the file pattern to “Set of objects” or “Array of objects”. Power BI can read either of the two patterns thanks to Power Query, which easily flattens the JSON structure into a table.

As we saw, it is quite straightforward to perform pagination in Azure Data Factory or Synapse pipelines using the REST API; however, in this post I described only a specific scenario where I needed to retrieve data from the Azure REST API.

There are many other ways to handle pagination, such as offset, range, etc.; handling pagination also varies between ADF and Data Flow. Here is the full Microsoft documentation about retrieving data from a REST endpoint by using Azure Data Factory or Synapse.

Today, there is already an idea submitted to Microsoft to enable this possibility; however, this idea does not have a lot of votes, so it is not likely to be added anytime soon. So the workaround comes to the rescue!

According to the multiple comments submitted in the “Microsoft Idea”, the main reason is to show the **result of a statistical test** or **summary information of a model**. *(And this is exactly what I’m using it for too*)

Here are some comments submitted to Microsoft:

- “R Integration is fantastic… but seeing the console output in a visual is still needed **to see all the coefficients and model stats etc.**…”
- “Would be great to have this as a simple way to **get results of statistical tests**.”
- “That would be a fantastic possibility, I need to just **show** the **summary of a regression model!**”
- “…I’m working with **forecasts** as well as **predictive models** and need to be able **to print results**…”

The solutions that I will describe in this post are workarounds, so it is not as straightforward as just outputting the result of the statistical test or model summary. It requires some text manipulation but is still pretty quick to implement.

Let’s say we want to run a t-test in Power BI using R and see the result in a visual.

If we were to write the R code in RStudio and run it, the output of the t-test would look like this:

```
# dataset <- unique(dataset)
res<-t.test( dataset$ColumnA,dataset$ColumnB, paired = TRUE,
alternative ="less" ,conf=0.95)
res
```

However, when we run the same code in an R visual in Power BI the output is as follows:

As we can see, we get the error message “*The R code didn’t result in creation of any visuals*”. Unfortunately, at the time of writing this post, it is not possible in Power BI to display the R console output.

And according to the error message we cannot simply print the result; however, **we can “plot” it** as long as an image is created, so let’s do it!

ggplot2 is one of the most famous R libraries for creating graphics.

Now, what we need to do before plotting the t-test output is:

- First, store the result of the statistical test or predictive model in a variable
- From this variable, retrieve each coefficient that we want to show
- Create a new text variable where we concatenate and indent all the information that we want to display
- Plot this variable using the ggplot2 library with the annotate function, and the magic happens

```
# dataset <- unique(dataset)
library(ggplot2)
res<-t.test( dataset$dem_percent_12,dataset$dem_percent_16, paired = TRUE,
alternative ="less" ,conf=0.95)
text = paste("R result:","\n",
"method: ",res$method ,"\n",
"alternative: ",res$alternative ,"\n",
"T: ",res$statistic ,"\n",
"p-value: ", res$p.value,"\n",
"Confidence Interval:","95%","\n",
"CI Low: ",res$conf.int[1],"\n",
"CI Up: ",res$conf.int[2],"\n")
ggplot() +
annotate("text", x = 0.5, y = 0, size = 6, label = text, hjust = 0) +
theme_void()
```

Another solution is to use the gridExtra library (grid.table); this solution is a bit tidier and neater.

I may be old-fashioned, but I actually prefer the simple text output using the ggplot approach above.

Here the approach is very similar, but instead of simply concatenating the t-test result variables into a text variable, we create a data frame with a row for each variable that we want to display.

```
library(gridExtra)
library(grid)
res<-t.test( dataset$dem_percent_12,dataset$dem_percent_16, paired = TRUE,
alternative ="two" ,conf=0.95)
Name <- c("method", "alternative", "T-stat", "P-value")
Value <- c(res$method, res$alternative , res$statistic, res$p.value)
df <- data.frame(Name, Value)
tt <- ttheme_default(colhead=list(fg_params = list(parse=TRUE)))
grid.table(df, theme=tt)
```

And finally, the last solution that I can think of is to use the gtable library, which is very similar to the grid library but has the advantage of being very flexible, more customizable, and even nicer looking; however, it is less straightforward to implement.

```
library(gtable)
library(gridExtra)
library(grid)
res<-t.test( dataset$dem_percent_12,dataset$dem_percent_16, paired = TRUE,
alternative ="two" ,conf=0.95)
Name <- c("method", "alternative", "T-stat", "P-value")
Value <- c(res$method, res$alternative , res$statistic, res$p.value)
df <- data.frame(Name, Value)
g <- tableGrob(df ,rows = NULL)
g <- gtable_add_grob(g,
grobs = rectGrob(gp = gpar(fill = NA, lwd = 2)),
t = 2, b = nrow(g), l = 1, r = ncol(g))
g <- gtable_add_grob(g,
grobs = rectGrob(gp = gpar(fill = NA, lwd = 2)),
t = 1, l = 1, r = ncol(g))
grid.draw(g)
```

Here you can find all the details on how to use the grid or gtable libraries.

And here is how the R visuals render in Power BI so depending on the desired size of the visual you may need to tweak the theme’s parameters.

As seen in this short post, even though it is not possible at first to show the text result of a statistical test or the summary of a statistical model like a regression in an R visual, with this workaround we can quickly retrieve the variables that we want to show and display them.

This may be a bit tedious for a large model with a lot of variables to output, but I’m not sure that running a large predictive model in Power BI would be the right thing to do anyway.

A t-test is a type of inferential statistic that can be used to determine whether the means of two groups of data are significantly different from each other.

In other words, it tells us if the differences in means could have happened by chance.

There are three types of t-test:

- An **independent samples t-test** compares the means of two groups.
- A **paired sample t-test** compares means from the same group at different times.
- A **one sample t-test** tests the mean of a single group against a known mean.

In this post, we will focus only on paired t-tests and I’ll be soon writing another post for the other types of t-test.

**A paired t-test** is used to compare two population means where we have two samples in which observations in one sample **can be paired** with observations in the other sample. We compare the two sample means at different times or under different conditions.

Examples of where we can use paired t-test:

- **Before-and-after**: observations on the same students’ diagnostic test results before and after a particular module or course
- **Medicine**: difference in cholesterol level before and after treatment, or difference in blood pressure before and after treatment
- **Social research**: determine whether there is a significant change in the scores of the same cases on the same variables over time, such as % turnout in presidential elections by state

- Independence of the observations: Measurements for one subject do not affect measurements for any other subject
- Each of the paired measurements must be obtained from the same subject
- The differences between pairs are normally distributed

In this section, I will break down every single step of how to implement a paired t-test in Power BI: t-stat, p-value, standard error, confidence interval and critical value. I will explain the role of each statistical measure, what it is used for, and how to calculate it using DAX only.

The t-stat formula is t = (d̄ − ẟ) / SE, where:

- “d̄” (“d-bar”) is the average difference between the paired data
- “SE” is the standard error of “d-bar” (we’re going to cover it further down)
- “ẟ” (the Greek letter delta) – since we’re using paired data, delta is equal to zero (we’ll cover it in the null hypothesis section)

Why implement the test in DAX only rather than with an R or Python visual?

- Fully interactive visuals; custom visuals using R or Python are not
- Security policy within your organisation: limitations on deploying Python or R scripts to the portal
- It forces you to better understand the test statistics process
- Easier to maintain: one language, one tool, easier integration and deployment, reusability
- Only DAX and some statistics knowledge required

The dataset contains data about the US presidential election at the county level, with the percentage of votes that went to the Republican candidates in 2012 and 2016 (500 rows).

*I downloaded this dataset from Datacamp but it is also publicly available at https://dataverse.harvard.edu/dataverse/. I chose this dataset because I needed a dataset with enough rows to make it easier to visualize the distribution.*

The model contains the main dataset that I described above and two parameter tables that I will describe later.

Here we want to compare and test the two paired samples (2012 vs 2016) and we can make three different hypotheses:

**Two-tailed:** Is there any difference in means between the % of votes given to the Republican candidates between 2012 and 2016?

- H0: μ2012 – μ2016 = 0
- Ha: μ2012 – μ2016 <> 0

**Left-tailed**: Was the % of votes given to the Republican candidates lower in 2012 compared to 2016?

- H0: μ2012 – μ2016 >= 0
- Ha: μ2012 – μ2016 < 0

**Right-tailed**: Was the % of votes given to the Republican candidates greater in 2012 compared to 2016?

- H0: μ2012 – μ2016 <= 0
- Ha: μ2012 – μ2016 > 0

We will cover the null hypothesis “**H0**” and the alternative hypothesis “**Ha**” in more detail further down.

The first step in implementing our paired t-test in Power BI is, of course, to calculate the mean of the differences between the two paired samples.

So after calculating the difference between the two variables “dem_percent_12” and “dem_percent_16”, we can simply calculate the mean using the AVERAGE DAX function.

```
diff 2012 vs 2016 = dem_county_pres[dem_percent_12]-dem_county_pres[dem_percent_16]
mean_diff = AVERAGE(dem_county_pres[diff 2012 vs 2016])
```

The **standard deviation** (sd) is a measure of how spread out values are. A small standard deviation indicates that the values tend to be close to the mean, while a large standard deviation indicates that the values are spread out over a wider range. We will use the **SD** measure to calculate the **Standard Error**.

Luckily there’s a built-in DAX function for the standard deviation; here we’re using the **sample** standard deviation formula since we’re working with a sample instead of the whole population.

The only difference between the **sample sd** and the population **sd** formula is the denominator: “n**-1**” for the **sample sd** instead of “n” for the **sd**. (The larger the sample, the closer the results of the two formulas will be.)
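To see this n-1 vs n difference concretely, here is a minimal Python sketch using only the standard library (the data values are made up for illustration):

```python
import statistics

data = [4.0, 7.0, 9.0, 12.0]  # made-up sample

# Sample standard deviation divides by n - 1 (analogous to DAX STDEV.S)
sd_sample = statistics.stdev(data)

# Population standard deviation divides by n (analogous to DAX STDEV.P)
sd_pop = statistics.pstdev(data)

# With the same data, the n - 1 denominator always yields the larger value
print(sd_sample, sd_pop)
```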

```
sd_diff = STDEV.S(dem_county_pres[diff 2012 vs 2016])
```

To put it simply, the **Standard Error** (**SE**, or **SEM** in our case) is the estimated standard deviation of the sample mean.

Its formula is the standard deviation (calculated above) divided by the square root of the sample size.

The difference between the **SD **and the **SEM **is that the standard deviation measures the dispersion from the individual values to the mean, while the Standard Error of the mean measures how far the sample mean of the data is likely to be from the true population mean.

```
SEM =
var __sd=[sd_diff]
var __n=[Size]
return
divide(__sd,sqrt(__n))
```

The *t*-statistic (also called t-value or t-score) is used in a *t*-test to determine whether to support or reject the null hypothesis.

The larger the t-value is, the more likely the difference in means between the two samples will be statistically significant.

In order to support or reject the null hypothesis, we need to compare the t-stats result with the t-critical value given by the t-distribution table.

```
t_stat =
var __meandiff= [mean_diff]
var __sddiff= [sd_diff]
var __n=[Size]
var __parammudiff=0
var __se=[SEM]
return
divide(__meandiff-__parammudiff,__se)
```
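To make the chain of measures above easier to follow (difference, mean, sample sd, SEM, t-stat), here is the same computation sketched in plain Python; the paired values are made up and only stand in for the 2012/2016 columns:

```python
import math
import statistics

# Hypothetical paired observations standing in for the two vote-share columns
sample_a = [72.0, 68.5, 80.1, 65.3, 70.2]
sample_b = [70.1, 66.0, 78.3, 64.0, 68.9]

# Per-pair differences (the "diff 2012 vs 2016" column)
diffs = [a - b for a, b in zip(sample_a, sample_b)]
n = len(diffs)

d_bar = statistics.mean(diffs)       # mean_diff
sd = statistics.stdev(diffs)         # sd_diff (sample sd, n - 1 denominator)
sem = sd / math.sqrt(n)              # SEM
t_stat = (d_bar - 0) / sem           # delta is 0 under the null hypothesis
```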

The t-critical value is **the cutoff between retaining or rejecting the null hypothesis**. If the t-statistic value is greater than the t-critical, meaning that it is beyond it on the x-axis, then the null hypothesis is rejected and the alternate hypothesis is accepted.

**How to calculate the t-critical value?**

Without a computer, calculating the critical value requires the use of a t-distribution table.

- **Step 1:** Calculate the **degrees of freedom** (**df**) –> **sample size - 1** (15 - 1 = 14 for the example above)
- **Step 2:** Choose the alpha level. The alpha level is the threshold value used to judge whether a test statistic is statistically significant; we often use 0.05 (95% confidence) but it can vary according to the domain area (we used 0.05 in the above example)
- **Step 3:** Choose either the one-tailed or two-tailed distribution
  - One-tailed:
    - **Left-tailed**: the difference in means between the paired samples is strictly lower than 0
    - **Right-tailed**: the difference in means between the paired samples is strictly greater than 0 (example above)
  - Two-tailed: the difference in means between the paired samples is not equal to 0 (greater or lower but not equal)
- **Step 4:** Look up the intersection of the **df**, the **alpha** level and the **one-tailed/two-tailed** choice in the grid

Luckily, we don’t need to import the t-distribution table into Power BI and do the lookup ourselves, since we can use the built-in DAX functions T.INV for a one-tailed t-test and T.INV.2T for a two-tailed t-test. The two parameters that we need to pass to these functions are the probability and the degrees of freedom.

I manually entered some parameters into a table called “**CI**” to dynamically run different paired t-tests in Power BI. The Critical Value parameter is not needed since we can compute it using the two functions mentioned above, but I like to keep it to quickly refer to if needed.

The above table is linked to the table “**Hypothesis**” which contains the hypothesis tail that we want to use for the test.

Here is the formula to calculate the critical value using DAX and dynamically interact with the parameters (probability and one-tailed/two-tailed).

```
t_val =
var __df=[degree of freeedom]
return
switch(TRUE(),
FIRSTNONBLANK(Hypothesis[Tail],1)="Left-Tailed",abs(T.INV(FIRSTNONBLANK(CI[Probability],1),__df)),
FIRSTNONBLANK(Hypothesis[Tail],1)="Right-Tailed",abs(T.INV(FIRSTNONBLANK(CI[Probability],1),__df)),
T.INV.2T(1-FIRSTNONBLANK(CI[Probability],1),__df)
)
```
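If you want to sanity-check T.INV / T.INV.2T outside Power BI, `scipy.stats.t.ppf` gives the same lookups (a sketch assuming scipy is installed; df = 14 and alpha = 0.05 follow the worked example above):

```python
from scipy.stats import t

df = 14        # degrees of freedom: sample size - 1 (15 pairs in the example)
alpha = 0.05

# One-tailed critical value, i.e. abs(T.INV(alpha, df)) in DAX -> ~1.761
crit_one = abs(t.ppf(alpha, df))

# Two-tailed critical value, i.e. T.INV.2T(alpha, df) in DAX -> ~2.145
crit_two = t.ppf(1 - alpha / 2, df)
```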

**Left critical region**

Now that we have calculated the t-critical value, we just need to combine it with the mean of the differences of our two samples (“d-bar”) to get the critical regions.

For a left-tailed test, we will only be looking at the left critical region, so to reject the null hypothesis the t-stat must lie to the left of the critical value; in other words, it should be lower than the left critical value.

For a two-tailed test, the T-stats should either be lower than the Left-critical value or greater than the right-critical value.

The formula for the left critical value is: mean_diff – critical value

```
left cr =
var __crit = [t_val]
var __mudiff=[mean_diff]
return
switch(TRUE(),
FIRSTNONBLANK(Hypothesis[Tail],1)="Left-Tailed",__mudiff-__crit,
FIRSTNONBLANK(Hypothesis[Tail],1)="Right-Tailed",BLANK(),
__mudiff-__crit
)
```

**Right critical region**

As for the right-tailed test, the T-stat must be greater than the right-critical value to reject the null hypothesis.

The formula for the right-critical value is: mean_diff + critical value

```
right cr =
var __crit = [t_val]
var __mudiff=[mean_diff]
return
switch(TRUE(),
FIRSTNONBLANK(Hypothesis[Tail],1)="Left-Tailed",BLANK(),
FIRSTNONBLANK(Hypothesis[Tail],1)="Right-Tailed",__mudiff+__crit,
__mudiff+__crit
)
```

A p-value is used in hypothesis testing to help us support or reject the null hypothesis. The p-value is the evidence **against** a null hypothesis. The smaller the p-value, the stronger the evidence that we should reject the null hypothesis.

As we already know from the critical value section, the **critical value** is** a point beyond which we can reject the null hypothesis**. **P-value** on the other hand is defined as the **probability that an observed difference could have occurred just by random chance**. The benefit of using a **p-value** is that we can test the estimated probability at any desired level of significance by comparing this probability with the significance level “**Alpha**” without needing to recalculate the critical value each time.

To sum it up they both do the same thing: helping us to support or reject the null hypothesis in a test. They are two different approaches to the same result.

I personally tend to always use the p-value since I find it easier to calculate and interpret. (E.g. with a p-value of 0.06 we may fail to reject the null hypothesis; however, we can still note moderate evidence against it.)

The built-in DAX functions to calculate the p-value are T.DIST for the left-tailed test, T.DIST.RT for the right-tailed test and T.DIST.2T for the two-tailed test. These functions require two parameters: the **t-stat** and the **degrees of freedom**.

And here is the formula to dynamically interact with the different parameters.

```
p-value =
var __df=[degree of freeedom]
var __t_stat=[t_stat]
return
SWITCH(TRUE(),
FIRSTNONBLANK(Hypothesis[Tail],1)="Left-Tailed",T.DIST(__t_stat,__df,TRUE()),
FIRSTNONBLANK(Hypothesis[Tail],1)="Right-Tailed",T.DIST.RT(__t_stat,__df),
T.DIST.2T(abs(__t_stat),__df)
)
```
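The three DAX functions map one-to-one onto the t-distribution’s CDF and survival function; here is a small Python cross-check with scipy (the t-stat here is a hypothetical value, not the one from the report):

```python
from scipy.stats import t

df = 14
t_stat = 2.5   # hypothetical t-statistic

p_left = t.cdf(t_stat, df)            # T.DIST(t_stat, df, TRUE())
p_right = t.sf(t_stat, df)            # T.DIST.RT(t_stat, df)
p_two = 2 * t.sf(abs(t_stat), df)     # T.DIST.2T(abs(t_stat), df)
```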

A Confidence Interval or **CI** is a **range of values** we are fairly sure our **true value** lies in.

In other words, the CI can answer the question of whether the result of our test is due to chance or not, within a certain degree of confidence.

The confidence level should be chosen before examining the data, a 95% confidence level is usually used. However, confidence levels of 90% and 99% are also often used depending on the domain area.

Note that a **one-tailed confidence** **interval **always extends from **minus infinity** to some value above the observed effect, or from some value below the observed effect to **plus infinity**.

**Lower**

Here is the DAX formula to calculate the CI Lower limit and dynamically interact with the different parameters.

```
lower =
var __meandiff=[mean_diff]
var __t=[t_val]
var __sd=[sd_diff]
var __n=[size]
return
switch(TRUE(),
FIRSTNONBLANK(Hypothesis[Tail],1)="Left-Tailed","-inf",
FIRSTNONBLANK(Hypothesis[Tail],1)="Right-Tailed",__meandiff-__t*divide(__sd,sqrt(__n)),
__meandiff-__t*divide(__sd,sqrt(__n))
)
```

**Upper**

And here is the DAX formula for the upper limit. For a right-tailed test, the upper limit extends to plus infinity.

```
upper =
var __meandiff=[mean_diff]
var __t=[t_val]
var __sd=[sd_diff]
var __n=[Size]
return
switch(TRUE(),
FIRSTNONBLANK(Hypothesis[Tail],1)="Left-Tailed",__meandiff+__t*divide(__sd,sqrt(__n)),
FIRSTNONBLANK(Hypothesis[Tail],1)="Right-Tailed","+inf",
__meandiff+__t*divide(__sd,sqrt(__n))
)
```
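Both limits come down to mean_diff ± t_crit × sd / √n. As a cross-check, here is a Python sketch of the two-tailed interval (the summary statistics below are made up; scipy is assumed for the critical value):

```python
import math
from scipy import stats

# Hypothetical summary statistics for the paired differences
mean_diff = 1.76
sd_diff = 0.498
n = 5
conf = 0.95

# Two-tailed critical value for the chosen confidence level
t_crit = stats.t.ppf(1 - (1 - conf) / 2, n - 1)
half_width = t_crit * sd_diff / math.sqrt(n)

lower = mean_diff - half_width
upper = mean_diff + half_width
```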

Before visualising the outcome of our paired t-test in Power BI let’s define the null and alternative hypotheses.

The null hypothesis **H0** assumes that any difference between the two paired samples is due to chance.

- For a **two-tailed** test, the null hypothesis assumes the difference in means **is equal to 0**
- For a **left-tailed** test, the null hypothesis assumes the difference in means is **not lower than 0**
- For a **right-tailed** test, the null hypothesis assumes the difference in means is **not greater than 0**

As for the **alternate hypothesis Ha**, it is simply the direct opposite of the null hypothesis.

Here is the DAX measure to display the text result of our paired t-test:

```
Result =
var __lessAlt="The true difference in means is less than 0"
var __greaterAlt="The true difference in means is greater than 0"
var __twosidedAlt="The true difference in means is not equal to 0"
var __lessNull="The true difference in means is not less than 0"
var __greaterNull="The true difference in means is not greater than 0"
var __twosidedNull="There's no true difference in the means"
var __alpha=FIRSTNONBLANK(CI[Alpha],1)
var __Pval=[p-value]
return
switch(TRUE(),
FIRSTNONBLANK(Hypothesis[Tail],1)="Left-Tailed",
if(__Pval<__alpha,"We reject the null hypothesis and we accept the alternative hypothesis: " & __lessAlt, "We fail to reject the null hypothesis: " & __lessNull),
FIRSTNONBLANK(Hypothesis[Tail],1)="Right-Tailed",
if(__Pval<__alpha,"We reject the null hypothesis and we accept the alternative hypothesis: " & __greaterAlt, "We fail to reject the null hypothesis: " & __greaterNull),
if(__Pval<__alpha,"We reject the null hypothesis and we accept the alternative hypothesis: " & __twosidedAlt, "We fail to reject the null hypothesis: " & __twosidedNull)
)
```

Here we’re using a confidence level of 95% (or alpha 0.05).

As we can see, the t-stat “30.30” is much greater than the critical value “8.79”; or, if we use the p-value approach, we can see that the p-value is extremely low and far below the alpha significance level, so we reject the null hypothesis in favour of the alternative hypothesis.

This time we’re running the paired t-test in Power BI with a 99% confidence level and a right-tailed test.

In other words, we want to observe if the difference in the means is greater than 0.

Since we’re using a right-tailed test, we’re using only the right critical value, so the t-stat must be greater than the right critical value, which is the case, so we reject the null hypothesis. (The p-value is also below the alpha significance level.)

As for the Confidence Interval, we can say that we’re 99% confident that the true difference in means lies between 6.30 and +infinity.

Time to run a left-tailed test… Can we reject the null hypothesis?

Of course not! The** t-stat** “30.30” is far **greater **than the **left critical value** “5.18” and the **p-value is extremely large** “1” so without any doubt we **fail to reject** the null hypothesis.

To make sure that I correctly implemented the paired t-test in Power BI I added and displayed the result of the R “t.test” function and all results were accurate thanks to the great built-in DAX functions supported by Power BI. I hope that the PBI team will add even more statistical functions in the future.
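If you’d rather cross-check with Python than with R, `scipy.stats.ttest_rel` is the analogue of R’s paired `t.test` (the paired samples below are made-up stand-ins for the two vote-share columns):

```python
from scipy import stats

# Hypothetical paired samples standing in for the 2012/2016 vote shares
sample_2012 = [72.0, 68.5, 80.1, 65.3, 70.2]
sample_2016 = [70.1, 66.0, 78.3, 64.0, 68.9]

# Two-sided paired t-test; compare t_stat and p_value with the DAX measures
t_stat, p_value = stats.ttest_rel(sample_2012, sample_2016)
```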

It seems that R visuals sometimes do not render when using Publish to web, but if you click on “Focus mode” and then “Back to report” they eventually appear after a few seconds. Another reason to not use R and stick to DAX!

This post only covered the implementation of the **paired t-test in Power BI**, but as we saw there are three types of t-test, so I should soon post the implementation of the other t-tests.

Also, there are a few things that I did not cover on purpose, since I did not want this post to become too statistics-heavy but rather focus on the DAX implementation side. However, one thing that I need to mention is that I used **Welch’s t-test** by default (R uses it by default as well); there’s another t-test called the **Student’s t-test**, which assumes equal variances between the two samples.

I previously wrote a post about A/B testing in Power BI using DAX only, so in this other DAX statistics post I wanted to show that implementing a paired t-test in Power BI using DAX only is perfectly feasible. It just requires a bit more work than typing two lines of code in R, but with the help of some built-in DAX functions the result is great.

If you’d like to implement your own paired t-test in Power BI and test your result I’d recommend using the following t-test calculator: http://www.sthda.com/english/rsthda/paired-t-test.php

Here is the “**Paired t-test in Power BI**” report published to the web, where we can interact with the parameters “Confidence Interval” and “Hypothesis”.

**In this post, I’m going to show another way to hide tables in Power BI, one which prevents users, or anyone else, from viewing the hidden tables.**

To hide tables in Power BI we can either do it from the Model view or directly from the Report view.

Once done we can see the hidden icon enabled on the table “v_dimDate”

And once we open the Report view we can no longer see the hidden table, so far so good?

Well until your users get access to the PBIX file and discover the option to unhide tables or to view hidden tables.

As mentioned at the beginning of this post once we get access to the PBIX file of a report or even if we have a live connection to a tabular model or a PBI dataset it is possible to view the hidden objects.

As we can see the table “v_dimDate” is now visible even though we’ve hidden the table

In order to fully prevent users from seeing hidden tables, we need to install tabular editor (version 2 or 3) and enable the external tools in Power BI.

Once the model is opened via tabular editor we have access to the Tabular Object Model (TOM) properties and we can modify them. The property that we need to change is “Private” once we set private to True the table becomes hidden and can no longer be seen in Power BI even when we enable “ViewHidden” or “Unhide all”.

The only drawback is that once the Private property is set to True, Power BI will think that this table does not exist and thus IntelliSense will no longer work for it. As Power BI no longer recognises this table, it will also highlight the table in red in any formula, but we can ignore this as the formulas will still work.

As we can see the “v_dimDate” is no longer visible even after we enable the “View hidden” option and even though Power BI does not recognise this table we can still reference it in any DAX formulas.

If the developers are working directly in Power BI, it is probably a good idea to temporarily set Private to False while developing and set it back to True before publishing or sharing the file.

This option can be very useful when it comes to hiding the tables used to configure Row Level Security or any other internal tables that we don’t want users to see.

Of course, this technique does not replace the Object Level Security and should not be used for such a purpose and as a best practice, I’d always recommend not to give access to the underlying model of a report to the users and always use the Live connection whenever it’s possible.

Finally, to learn more about the Tabular Model, I’d highly recommend taking the “Mastering Tabular” course from SQLBI.

]]>