Dynamically partitioning tables in SSAS Tabular model using SQL
In this post, I describe how we can dynamically partition tables in SSAS Tabular using SQL.
This approach works well for most simple scenarios where we need to keep control over the number of partitions to refresh, whether daily or at any other frequency.
What is a partition?
Partitions split a table into logical partition objects, each containing a portion of the data. Partitions can be processed in parallel, independently of one another, or excluded from processing operations if they don't need to be refreshed.
Benefits of partitioning
Here are some of the benefits of partitioning; you can find more details in my post Performance benefits of partitioning in Analysis Services Tabular.
- Reduce the volume of data across different environments
- Parallel processing
- Incremental loading
- Set different data refresh frequencies; historical data doesn't need to be refreshed every day
Parallel processing and incremental loading are common scenarios for using partitions. Another scenario where partitions are also very useful is reducing the volume of data across each environment.
In this post, I'm going to show how we can dynamically partition tables in SSAS using simple SQL.
Why should we limit the volume of data?
Put another way, limiting the volume of data means:
- Reducing the size of our model
- Reducing the processing time of our model
- Reducing the CPU and memory consumption
- Reducing the cost of our On-prem or cloud infrastructure
Regardless of whether our data is on-prem or in the cloud, the storage and computation needed to run our BI system come with a cost.
While it is common to keep 5+ years of historical data in a BI production system, it is certainly not a good idea to keep all that history on a Development or Test server.
Allocating the same computational power and storage capacity to all our servers (Dev, Test, UAT, Prod) is a waste of money and energy. So before jumping into the how-to part, let me answer the "why SQL?" question.
Why use SQL over other techniques?
There are many techniques for dynamically partitioning tables in SSAS Tabular data models, such as TMSL scripting, SSIS, PowerShell, and .NET, and they all come with their pros and cons.
In my opinion, the most robust technique would be to use SSIS or PowerShell automation combined with a release automation tool such as DevOps; however, this would require a lot of effort to implement and maintain.
The technique I'm going to describe has the advantage of being fairly simple, very fast to implement, and easy to maintain.
How can we keep control over the volume of data across our different servers?
Create a Param table across your various environments
This parameter table is the main component used to configure the number of partitions in each environment.
We need this table in every environment, with different values in each; naturally, the "NbPartitions" value should be lower in the Dev/Test environments than in the Prod environment.
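The exact schema of the param table isn't shown in this post, but based on the columns referenced by the view further down, a minimal sketch could look like this (column names come from the view; the data types and sample row are assumptions):

```sql
-- Sketch of the parameter table; columns are inferred from the view
-- (DBName, TableName, PartitionBy, NbPartitions), types are assumed.
CREATE TABLE param.ssas.CubePartitionsParam
(
    DBName       VARCHAR(128) NOT NULL,
    TableName    VARCHAR(128) NOT NULL,
    PartitionBy  VARCHAR(10)  NOT NULL,  -- 'Month' or 'Year'
    NbPartitions INT          NOT NULL   -- number of partitions to populate
);

-- Example values for one environment: 13 monthly partitions
INSERT INTO param.ssas.CubePartitionsParam
VALUES ('AdventureWorksDW2017', 'FactInternetSales', 'Month', 13);
```

Each environment then holds its own copy of this table with its own "NbPartitions" values.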
Generate dynamic filters for each partition
Once the param table is created and populated, I use a view to keep things simple and to avoid any dependencies.
The view dynamically generates the date ranges to be used by each partition.
CREATE OR ALTER VIEW [ssas].[v_CubePartitionsList]
AS
SELECT p.DBName
     , p.TableName
     , RANK() OVER (PARTITION BY p.DBName, p.TableName ORDER BY LEFT(dd.DateKey, 6) DESC) AS PartitionNumber
     , dd.[Year]
     , LEFT(MIN(dd.DateKey), 6) AS FromMonth
     , LEFT(MAX(dd.DateKey), 6) AS ToMonth
     , MIN(dd.DateKey) AS FromDate
     , MAX(dd.DateKey) AS ToDate
FROM Param.ssas.CubePartitionsParam p
INNER JOIN Param.dim.Date dd
    ON dd.DateKey BETWEEN CONVERT(INT, CONVERT(VARCHAR(12), DATEADD(MONTH, -p.NbPartitions, GETDATE()), 112))
                      AND CONVERT(INT, CONVERT(VARCHAR(12), GETDATE(), 112))
WHERE p.PartitionBy = 'Month'
GROUP BY p.DBName
       , p.TableName
       , dd.[Year]
       , LEFT(dd.DateKey, 6)
       , dd.FirstDayOfMonth
       , dd.LastDayOfMonth
UNION ALL
SELECT p.DBName
     , p.TableName
     , RANK() OVER (PARTITION BY p.DBName, p.TableName ORDER BY dd.[Year] DESC) AS PartitionNumber
     , dd.[Year]
     , LEFT(MIN(dd.DateKey), 6) AS FromMonth
     , LEFT(MAX(dd.DateKey), 6) AS ToMonth
     , MIN(dd.DateKey) AS FromDate
     , MAX(dd.DateKey) AS ToDate
FROM Param.ssas.CubePartitionsParam p
INNER JOIN Param.dim.Date dd
    ON dd.[Year] BETWEEN YEAR(GETDATE()) - p.NbPartitions + 1 AND YEAR(GETDATE())
WHERE p.PartitionBy = 'Year'
GROUP BY p.DBName
       , p.TableName
       , dd.[Year]
The script I used to create the Date dimension table used in this view can be found on GitHub here
Here is a subset of the output generated by the view for the FactSalesQuota and the FactInternetSales tables.
So far, we've seen how to configure and generate a dynamic view that contains the list of dates to be used by each partition.
Now let's see how we can dynamically partition tables in SSAS using this view.
Create the maximum number of partitions you need
The trick here is to create, beforehand, the maximum number of partitions that you will ever need in your live environment.
So let’s assume that you need 6 partitions for your production model (current year + 5 years of history; 1 year = 1 partition), but you only need 1 year in Dev and 2 years in Test.
Your model will still have 6 partitions across each environment, however, some partitions will be empty (0 rows) in Dev and Test.
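To make the mechanism concrete: with NbPartitions = 2 in an environment, the view only emits rows for partition numbers 1 and 2, so the query behind any higher-numbered partition joins against an empty set. A sketch reusing the query pattern shown later in the post (partition number 3 is an illustrative value):

```sql
-- With NbPartitions = 2, [ssas].[v_CubePartitionsList] returns no row for
-- partition number 3 or above, so this query returns 0 rows and the SSAS
-- partition it feeds stays empty after processing.
SELECT fct.*
FROM AdventureWorksDW2017.dbo.FactInternetSales fct
INNER JOIN [PARAM].[ssas].[v_CubePartitionsList] c
    ON fct.OrderDateKey BETWEEN c.FromDate AND c.ToDate
WHERE c.DBName = 'AdventureWorksDW2017'
  AND c.TableName = 'FactInternetSales'
  AND c.PartitionNumber = 3;  -- beyond NbPartitions, so no matching row
```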
By doing so, we can manage everything with the param table on each of our servers, and we don't need any .NET, TMSL, or PowerShell code.
We also use "SELECT *" in each partition to make maintenance easier: if we need to add a new column to our fact table, we do it only once in the view sitting on top of the fact table, and all the partitions will reflect the change.
And of course, as a best practice, we should only include in the view the columns that we really need in our model!
WITH Partitions_CTE AS
(
    SELECT FromDate
         , ToDate
    FROM [PARAM].[ssas].[v_CubePartitionsList]
    WHERE DBName = 'AdventureWorksDW2017'
      AND TableName = 'FactInternetSales'
      AND PartitionNumber = 9
)
SELECT fct.*
FROM AdventureWorksDW2017.dbo.FactInternetSales fct
INNER JOIN Partitions_CTE c
    ON fct.OrderDateKey BETWEEN c.FromDate AND c.ToDate
The two values FromDate and ToDate are dynamically generated by the view and, in this example, correspond to the first and last day of September 2020.
This is what the above query will return:
Real case scenario
In this scenario, here is the historical data we need in each environment:
- 6 months of data in Dev
- 1 year of data in Test
- 2 years of data in UAT
- 5 years of data in Prod
As we can see below, the only thing needed to dynamically partition tables in SSAS is to change the "NbPartitions" values in the param table; everything else is fully dynamic.
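As an illustration, and assuming the param table described earlier, switching an environment to a different history depth could be as simple as this (the specific values are examples):

```sql
-- Each environment keeps its own NbPartitions value; for instance Prod keeps
-- 5 yearly partitions while UAT keeps only 2. The model itself never changes:
-- the extra partitions simply end up empty.
UPDATE param.ssas.CubePartitionsParam
SET NbPartitions = 2                      -- e.g. 2 years of data in UAT
WHERE DBName = 'AdventureWorksDW2017'
  AND TableName = 'FactInternetSales'
  AND PartitionBy = 'Year';
```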
Create all the partitions beforehand
As already mentioned, we need to create the maximum number of partitions in our model, regardless of which environment the model will be deployed to.
In this scenario, we assume that we need 13 monthly partitions; in production, all the partitions are refreshed.
Now let's see how we can process only 3 months of data in our Test environment:
TMSL vs SQL
We could of course generate a TMSL command via SSMS and then run it on demand or schedule it at the desired frequency.
However, whenever we want to change the number of partitions to refresh, we also need to amend the JSON script to add or remove a partition, which can become hard to maintain.
Here is the TMSL command we can use to process only three partitions:
{
  "refresh": {
    "type": "automatic",
    "objects": [
      {
        "database": "AdventureWorks2",
        "table": "Internet Sales",
        "partition": "Internet Sales - Current Month"
      },
      {
        "database": "AdventureWorks2",
        "table": "Internet Sales",
        "partition": "Internet Sales - M-1"
      },
      {
        "database": "AdventureWorks2",
        "table": "Internet Sales",
        "partition": "Internet Sales - M-2"
      }
    ]
  }
}
As we can see above, only three partitions have been refreshed, while the nine other partitions were not refreshed at all.
Now let’s see what the SQL approach will do:
As we can see above, instead of processing only three partitions, we process all the existing partitions.
However, this still produces the same result: since we dynamically filter each partition by joining the fact table to the dynamic view, the partitions beyond the configured range remain empty (0 rows) even though they are processed.
Conclusion
Dynamically partitioning tables in SSAS Tabular can be a lot more complex than this approach, especially when we need to handle partitions with different refresh frequencies or granularities. But in scenarios where we need to refresh all the partitions, I have found this technique very efficient and easy to use.
Here are the advantages/drawbacks of using this approach:
Advantages:
– This is straightforward to implement.
– Very easy to maintain through the param table.
– This is also quite flexible: as long as we base our partitions on a date (day, month, quarter, semester, year…), any change to the logic that calculates the partition size is done in a single place: the view.
Drawbacks:
– If we have a lot of partitions, the SQL query will still be executed even when it returns 0 rows.
But with the right index on the right column, each query shouldn't run for more than a few seconds per empty partition, so it's not a big deal.
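For example, since the FactInternetSales partition queries shown earlier filter on OrderDateKey with a BETWEEN range, an index on that column lets SQL Server resolve an empty partition almost instantly (the index name is illustrative):

```sql
USE AdventureWorksDW2017;
GO
-- Hypothetical supporting index: the partition queries range-filter on
-- OrderDateKey, so a nonclustered index on it makes empty ranges return fast.
CREATE NONCLUSTERED INDEX IX_FactInternetSales_OrderDateKey
ON dbo.FactInternetSales (OrderDateKey);
```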
– This approach works very well when we need to refresh all the partitions (or a fixed set of partitions) in each process. However, for more advanced partitioning strategies, such as incremental refresh or refreshing only the partitions that have changed, this approach is not suitable and a more complex solution would be needed.