In this post, I describe how to write an optimized median measure in Dax which under specific criteria can be 1,000 times faster than the built-in median function.
Some time ago I was tasked to migrate a multidimensional cube to a tabular model, this cube had around 2 billion rows and I also had to create a median measure on the new tabular model. I first thought that the median would be much faster to calculate on a tabular model than on a multidimensional model but it turns out that the median was still extremely slow to compute on tables with large numbers of rows.
At that time I had found a workaround but recently I decided to revisit this workaround which was a bit overkill and this time I came up with a simple Optimized median measure in Dax. This solution has some limitations that I will describe below but in most cases, it performs much faster than the built-in Median function on a large dataset it is literally 1,000 times faster and even more.
Table of Contents
What is Median
The median is the value separating the higher half from the lower half of a dataset.
It may be thought of as “the middle” value of a sorted list. The main advantage of the median in describing data compared to the mean is that it is not skewed by a small proportion of extreme values, in other words, it is more robust to outliers than the mean.
As an example, Median Income is usually a better way to suggest what a “typical” income is because income distribution can be very skewed.
If 10 BI developers were to sit together the median income and the mean income would likely be very close but if Warren Buffet was to sit with the 10 BI developers the average would be misleading and much higher than the median.
Why is it slow to compute the Median?
If the median tends to be slow to compute it is because to determine the median value in a sequence of numbers, the numbers must first be sorted. So this is the sorting operation that makes the median slow. If we were to find the median on a presorted sequence of values the result would be immediate. The time computational complexity of the median is O(n Log n) which means it does not scale very well.
As we can see below I ran a quick simulation for 100 values in R, the median does not follow a linear time complexity “O(n)”, for 1 billion rows the time complexity of the median would be 9 billion.
There are some optimized algorithms that exist to calculate the median which mainly relies on optimized sorting algorithms such as the quicksort but all these algorithms can not be developed in DAX as they required loop and recursive functions.
So how can we calculate the median in a more efficient way in DAX?
Median Current limitations
The limitation with the built-in Dax function
- It is very slow to compute the median over a large dataset
- You’re likely to run into an error “not enough memory” on my machine 64GB I had not been able to compute the median on more than 300M rows
- Even if you have the most powerful server there’s still a limitation to calculating the median using the built-in DAX function median. We cannot compute the median on a table that contains more than 2 billion rows. I initially found this limitation on Chris Webb’s blog and I confirm that as of today the limitation is still there.
The limitation with the Optimized median measure in Dax
With the solution that I propose, there’s no limitation on the dataset size I even ran it on a table with more than 5 billion rows. However, there is an important limitation to keep in mind and I will deep dive into it in another section below.
This time the limitation is not on the number of rows but on the number of distinct values of the column for which we want to compute the median.
In most business scenarios, we are usually interested to know the median of variables that have a small number of distinct values. And even if it does have a large number I will show a simple trick to reduce the number of distinct values while still keeping the median 99% accurate.
Before going to the code of the Optimized median measure in Dax it is important to understand how the calculation of the median works.
So assuming that the sequence of observations is sorted the Median is as follows:
The formula is pretty simple we only need to check whether the number of rows “n” is odd or even and apply the above calculation accordingly. So now if we want to calculate the median in DAX without the Median() function we only need to sort the data and find a way to select the nth row.
Optimized median measure in Dax
As we saw above the main issue with the median is that we need to have the sequence of values sorted and in DAX we have no control over that.
However, we have another huge advantage which is basically the Vertipaq engine, under the hood Vertipaq has applied some algorithms to compress and store the data in a way that it can quickly compute the number of rows for a specific value.
So since we know how to calculate the median and we know that Vertipaq is very powerful at computing aggregation calculations such as counting the number of rows by Year or the number of rows by product…
The DAX formula
So this is how I suggest writing the Optimized median measure in Dax:
Let’s now break down this measure in three and see how it works:
The first part of the measure is just a setup:
- Get “n” the number of rows
- Check whether the number of rows is an odd or even number
- Find the nth value accordingly to the number of rows (odd or even)
The second part is where the magic happens since we’re leveraging the Vertipaq engine.
- We create a virtual aggregated table that contains two columns
- First column “median_column” contains the distinct values of the column for which we want to compute the median
- The second column contains the cumulative sum of the number of rows of the median column
Finally, we just iterate through the virtual table and we return the first value that has its cumulative sum of the number of rows greater or equal to the “nth” value. If the number of rows is even we calculate the value of (“nth” + “nth+1”) divided by two.
When to use the Optimized median measure in Dax
As stated earlier in this post I mentioned that this measure is extremely fast compared to the built-in function, however, it can also be very slow when the cardinality of the column for which we want to compute the median is a bit large.
Now, let’s see in which scenario we should stick to the built-in median and when we should start using the optimized median measure.
Built-in Median VS Optimized Median based on the number of rows
To get these figures I ran a benchmark in DAX studio with 5 cold cache execution and 5 Warm cache execution. The data that I used
The table that I used for the first benchmark had 20 distinct values repeated X number of times, for example, the below table has 200M rows but only 20 distinct values.
Cold Cache Execution comparison – Number of rows comparison
As we can see from the above graph the optimized median measure is 500 times faster than the built-in function when the table has 250M rows. We can also see that the Median built-in function scales very badly. At only 50M rows which I consider small, the Median measure takes already 26 seconds to render which would be an issue in a report.
Warm Cache Execution – Number of rows comparison
When the cache is used the optimized measure is 10,000 times faster! Of course, it makes more sense to compare measures with cold cache executions but it was to illustrate that the optimized median measure is better at levering the cache than the classic median function.
Built-in Median VS Optimized Median based on the median column distinct values
The big drawback of using the optimized median measure is that it scales very badly as soon as the number of distinct values of the median column increases.
So for the next benchmark between the two measures, I use a dataset of 100M rows and this time the number of rows will not change, only the number of distinct values will change.
Cold Cache Execution – Distinct values of the Median column comparison
This time we see that the built-in median function is not impacted at all by the number of distinct values of the median column which makes perfect sense since it is always scanning the entire table so the cardinality does not matter at all only the number of rows matters. And this actually the exact opposite for the optimized median measure…
Now, what measure should we use?
The short answer is of course it depends, and here is a more detailed answer:
- Very small dataset with less than 10M rows: the median should be fine
- For any large or very large dataset with a median column that has a small number of distinct values (less than 10k), you can use the optimized median measure without a doubt
- Dataset with more than 2 billion rows you have no choice but to use the optimized median measure
- Large dataset and median column with a high number of distinct values, you’re stuck! Or we can reduce the number of distinct values and use what I call the approximate median.
Approximate the median
You can find online some interesting research papers on the approximation of the median but my approach is nothing like this.
As we saw above when we have a large dataset and when the column for which we want to calculate the median has a high number of distinct values none of the measures performs well.
So the solution that I propose is to calculate an approximate median by reducing the number of distinct values. Of course, by reducing the number of distinct values we will pay a price we will lose some accuracy, the median will remain very close to what the real median is.
How do we reduce the number of distinct values of the median column?
First and easiest approach:
If the median column is a decimal number, we will simply reduce the number of decimals in that case, we will lose a very little accuracy, however, if after that the cardinality is still too large we can use the second approach.
If the median column is a whole number we will divide by 10 or more the median column until we reach a small enough number of distinct values, we will then compute the median and multiply it back with the denominator.
Let’s see how it works:
We have a dataset of 500M rows and the column for which we want to compute the median has 27,372 distinct values, we already know that none of the two measures will work well since the number of rows is too high for the bulti-in median and the cardinality of the median column is too high for the optimized median measure.
We have to reduce the number of unique values of the median column, so we divide the median column by 10 and we make sure to round the result to the nearest integer.
(If you need to implement such a solution I would advise applying this transformation in the ETL part and not in a DAX calculated column.)
Now we have 5,728 unique values which is good enough for running the optimized median measure.
We can now compute the approximate median which is based on the Median Column /10, the result of the approximate median is 1,191 which we then multiply back by 10 which returns 11,910.
As we can see by dividing the column by 10 we’re losing very little accuracy and in most scenarios having an approximate median will be more than enough.
At maximum in this specific scenario, I can miss the real median by 5 units which represents only 0.018%.
If we have a higher cardinality let’s say 10 million we will need to divide the median column by 1,000 which means that we can miss the real median by 500 units at most, but 500 over 1,000,000 is only 0.05% so still quite accurate.
As we saw the Optimized median measure in Dax can be 1,000 times faster than the built-in function but can also becomes very slow when the cardinality of the median column is too large. When it is the case we can use the approximate median but we will lose up to 0.05% of accuracy which should be acceptable
As for the cardinality of the median column I found that the ideal threshold is at around 10,000, this might vary for you if you have a bigger server, I ran the benchmark on ma personal machine which has 64GB or ram and 16 logical cores.
Except the limitation on the median column cardinality the optimized median measure can run on very large dataset I ran it on dataset with more than 5 billion of rows and it was still returning the median in less than a second.
Here are some figures, as we can see for 2.5B rows it does not event take half a second to run.
This was quite a long post for just a simple median measure, but I did not want to throw the Dax code without providing any context and details. Any measure optimization requires a good understanding of the context and how the measure is going to be used by the users. So this is why I wanted to highlight all the pros and cons of using the optimized median measure.
For most scenarios this measure will work amazingly I believe and I’m pretty sure it can still be tweaked a little to make it even faster or more customizable but at this point I’m really happy with the performance thanks to the way Vertipaq compress and quickly precalculate things.