PowerBI & Big Data – Using pre-calculated Aggregations of Semi- and Non-Additive Measures

Calculating and visualizing semi- and non-additive measures like distinct count in Power BI is usually not a big deal. However, things can become challenging if your data volume grows and exceeds the limits of Power BI!

In one of my recent projects we wanted to visualize data from the customer's analytical platform, which is based on Azure Databricks, in Power BI. The connection between those two tools works pretty flawlessly, which I also described in my previous post, but the challenge was the use-case and the calculations. We wanted to display the distinct customers across various aggregation levels over a fact table with more than a billion rows. We came up with different potential solutions, each having its pros and cons:

  1. load all data into Power BI (import mode) and do the aggregations there
  2. use Power BI with direct query and let the back-end do the heavy lifting
  3. load only necessary pre-aggregated data into Power BI (import mode)

Please keep in mind that we are dealing with a distinct count measure here. Semi- and non-additive measures like this cannot easily be aggregated from lower levels to higher levels without having all the detail data available!

Option 1 has the obvious drawback that the data model would be huge in size as we were dealing with billions of transactions. This would have exceeded our current size limits for Power BI data models.

Option 2 would usually work fine, but again, for the amount of data we were dealing with the back-end was just not able to provide the sub-second latency that was required.

So we went for Option 3 and did the various aggregations on the different levels in Azure Databricks and loaded only the final results into Power BI. First we wanted to use Power BI Aggregations and Composite Models. Unfortunately, this did not work out for us as we were not in control of which aggregation table (we had multiple for the different aggregation levels) was used by the engine, which potentially resulted in wrong results when additional aggregation was done in Power BI. Also, when slicing by arbitrary aggregation levels, Power BI was querying the details in direct query mode, causing very poor query performance.

After some further thinking we came up with a new solution which was also based on pre-calculated aggregations, but not realized with built-in aggregation tables – instead we used one combined table for all aggregations and some very straightforward DAX to select the row we wanted! In the end the whole solution consisted of one SQL view using a COUNT(DISTINCT xxx) aggregation together with GROUP BY GROUPING SETS (supported in all major SQL engines, e.g. T-SQL, Databricks, …) and a very simple DAX measure!

Here is a little example that illustrates the approach. Assume you want to calculate the distinct customers that bought certain products in a subcategory/category by year. The first step is to create a view that provides this information:

SELECT 
	od.[CalendarYear] AS [Year],
	dp.[ProductSubcategoryKey] AS [ProductSubcategoryKey],
	dp.[ProductCategoryKey] AS [ProductCategoryKey],
	COUNT(DISTINCT CustomerKey) AS [DC_Customers]
FROM [dbo].[FactInternetSales] fis
INNER JOIN [dbo].[vDimProductHierarchy] dp
	ON fis.[ProductKey] = dp.[ProductKey]
INNER JOIN [dbo].[DimDate] od
	ON fis.[OrderDateKey] = od.[DateKey]
GROUP BY 
GROUPING SETS (
	(),
	(od.[CalendarYear]),
	(od.[CalendarYear], dp.[ProductSubcategoryKey], dp.[ProductCategoryKey]),
	(od.[CalendarYear], dp.[ProductCategoryKey]),
	(dp.[ProductSubcategoryKey], dp.[ProductCategoryKey]),
	(dp.[ProductCategoryKey])
)

Please note that when we have a natural relationship between hierarchy levels (i.e. only 1:n relationships), we need to specify not only the current level but also all upper levels in each grouping set to allow a proper drill-down later on, e.g. ProductCategory (1 -> n) ProductSubcategory.

This calculates all the different aggregation levels we need. A NULL in a column means that this column was not filtered/grouped by when the aggregation for that row was calculated: some rows contain the aggregates grouped by Year only, others only the aggregates by ProductCategoryKey, and others the aggregates by Year AND ProductCategoryKey.
Depending on your final report layout, you may not need all of them and you should consider removing those that are not needed!
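
If you want to make the aggregation level of each row explicit instead of inferring it from the NULL columns, you could extend the query with a GROUPING_ID() column. This is just a sketch on top of the query shown above; the column name [AggregationLevel] is arbitrary, and GROUPING_ID() is the T-SQL flavor (other engines like Databricks offer a similar grouping_id() function):

SELECT 
	od.[CalendarYear] AS [Year],
	dp.[ProductSubcategoryKey] AS [ProductSubcategoryKey],
	dp.[ProductCategoryKey] AS [ProductCategoryKey],
	COUNT(DISTINCT CustomerKey) AS [DC_Customers],
	-- identifies which grouping set produced the row
	GROUPING_ID(od.[CalendarYear], dp.[ProductSubcategoryKey], dp.[ProductCategoryKey]) AS [AggregationLevel]
FROM [dbo].[FactInternetSales] fis
INNER JOIN [dbo].[vDimProductHierarchy] dp
	ON fis.[ProductKey] = dp.[ProductKey]
INNER JOIN [dbo].[DimDate] od
	ON fis.[OrderDateKey] = od.[DateKey]
GROUP BY 
GROUPING SETS (
	(),
	(od.[CalendarYear]),
	(od.[CalendarYear], dp.[ProductSubcategoryKey], dp.[ProductCategoryKey]),
	(od.[CalendarYear], dp.[ProductCategoryKey]),
	(dp.[ProductSubcategoryKey], dp.[ProductCategoryKey]),
	(dp.[ProductCategoryKey])
)

Each distinct value of [AggregationLevel] then corresponds to exactly one grouping set, which makes it easy to drop the levels you do not need before loading the data into Power BI.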

This table is then loaded into Power BI. You can either use a custom SQL query like the one above directly in Power BI or create a view in the back-end system, which would be my preferred solution. Alternatively, you can also create all these grouping sets using Power Query/M. The incredible Imke Feldmann (t, b) came up with a solution that allows you to specify the grouping sets in a similar way as in SQL and do all this magic within Power BI directly! I hope she will blog about it pretty soon!
(The sample workbook at the end of this post also contains a little preview of this M-magic.)

Now that we have all the data we need in Power BI, we need to display the right values for the selections in the report, which of course can be dynamic. That's a bit tricky, but once you understand the concept it is pretty straightforward. First of all, the table containing the aggregations must not be related to any other table, as we apply the filters on the fly within our DAX measure. The table itself can also be hidden.

And this is the final DAX for our measure:

DC Customers = 
// get the single selected value for each aggregation level (BLANK() if none or multiple items are selected)
VAR _sel_SubcategoryKey = SELECTEDVALUE(DimProduct[ProductSubcategoryKey])
VAR _sel_CategoryKey = SELECTEDVALUE(DimProduct[ProductCategoryKey])
VAR _sel_Year = SELECTEDVALUE(DimDate[CalendarYear])
// apply those selections as filters to the unrelated aggregation table
VAR _tbl_Agg = CALCULATETABLE(
    'CustomAggregations',
    TREATAS({_sel_SubcategoryKey}, CustomAggregations[ProductSubcategoryKey]),
    TREATAS({_sel_CategoryKey}, CustomAggregations[ProductCategoryKey]),
    TREATAS({_sel_Year}, CustomAggregations[Year])
)
VAR _AggCount = COUNTROWS(_tbl_Agg)
RETURN
    // exactly one row should remain - return its value; otherwise indicate the problem
    IF(_AggCount = 1, MAXX(_tbl_Agg, CustomAggregations[DC_Customers]), _AggCount * -1)

The first part is to get the currently selected value of each lookup/dimension table the user filters on in the report. These are the _sel_XXX variables. SELECTEDVALUE() returns the selected value if exactly one item is in the current filter context and BLANK()/NULL otherwise. We then use TREATAS() to apply those filters (either a single item or NULL) to our aggregations table. This should usually return a table with a single row, so we can use MAXX() to get our actual value from that one row. I also added a check in case multiple rows are returned, which can potentially happen if you use multi-select in your filters; instead of showing wrong values I'd rather indicate that there is something wrong with the calculation.
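
If you want to double-check which row the measure actually picks for a given selection, you could run a query like the following, e.g. in DAX Studio. This is just a sketch – 2003 is an example year, and the literal values mimic what the _sel_XXX variables would contain when only a year is selected:

EVALUATE
CALCULATETABLE(
    'CustomAggregations',
    TREATAS({2003}, CustomAggregations[Year]),
    TREATAS({BLANK()}, CustomAggregations[ProductSubcategoryKey]),
    TREATAS({BLANK()}, CustomAggregations[ProductCategoryKey])
)

This should return exactly one row: the aggregate grouped by Year only, i.e. the row where both product keys are NULL/blank.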

The measure can then be sliced and diced by our pre-defined aggregation levels as if it were a regular measure, but instead of having to process those expensive calculations on the fly we use the pre-calculated aggregates!

One thing to be aware of is that the measure will produce wrong results if multiple items are selected for any of the aggregation levels, so it is highly recommended to set all slicers/filters to single-select only or to ensure that the filtered aggregation levels are also used in the chart. In that case only the grand total will show wrong values or BLANK.
This could also be fixed in the DAX measure by checking how many values are actually selected for each level and throwing an error in case a level is filtered and more than one value is selected.
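
A sketch of what such a guarded measure could look like – the measure name and the error message are of course arbitrary, and [DC Customers] refers to the measure defined above:

DC Customers (guarded) = 
// number of values currently selected per aggregation level
VAR _cntSub = COUNTROWS(VALUES(DimProduct[ProductSubcategoryKey]))
VAR _cntCat = COUNTROWS(VALUES(DimProduct[ProductCategoryKey]))
VAR _cntYear = COUNTROWS(VALUES(DimDate[CalendarYear]))
RETURN
    IF(
        (ISFILTERED(DimProduct[ProductSubcategoryKey]) && _cntSub > 1)
            || (ISFILTERED(DimProduct[ProductCategoryKey]) && _cntCat > 1)
            || (ISFILTERED(DimDate[CalendarYear]) && _cntYear > 1),
        ERROR("Please select only a single item per aggregation level!"),
        [DC Customers]
    )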

I did some further thinking and this approach could probably also be used to mimic the custom roll-ups and unary operators we know from Analysis Services Multidimensional cubes. If I find some proper examples and this turns out to be feasible, I will write another blog post about it!

Download: Custom_Aggregations_NonAdditive_Measure.pbix

Data Virtualization in Microsoft Power BI – Part 2

In my previous post I showed how you can use Microsoft Power BI to create a Data Virtualization layer on top of multiple relational data sources, querying them all at the same time through one common model. As I already mentioned in that post, and as Adam Saxton (b, t) also pointed out in the comments, this approach can cause serious performance problems at the data source and also on the Power BI side. So in this post we will take a closer look at what actually happens in the background and which queries are executed when you join different data sources on the fly.

We will use the same model as in the previous post (you can download it from there or at the end of this post) and run some basic queries against it so we can get a better understanding of the internals.
Here is our relationship diagram again as a reference. Please remember that each table comes from a different SQL server:

Relationships

In our test we will simply count the number of products for each Product Subcategory:

NumberOfProducts_by_SubCategory

Even though this query only touches two different data sources, it is a good way to analyze the queries sent to them. To track these queries I used the built-in Performance Analyzer of Power BI Desktop, which can be enabled on the “View” tab. It gives you detailed information about the performance of the report, including the actual SQL queries (under “Direct query”) which were executed against the data sources. The plain-text queries can also be copied using the “Copy queries” link at the bottom. In our case 3 SQL queries were executed against 2 different SQL databases:

Query 1:

SELECT TOP (1000001) 
     [t2].[ProductSubcategoryKey],
     [t2].[ProductSubcategoryName]
FROM 
     (
         (
             select 
                 [_].[ProductSubcategoryKey] as [ProductSubcategoryKey],
                 [_].[ProductSubcategoryAlternateKey] as [ProductSubcategoryAlternateKey],
                 [_].[EnglishProductSubcategoryName] as [ProductSubcategoryName],
                 [_].[SpanishProductSubcategoryName] as [SpanishProductSubcategoryName],
                 [_].[FrenchProductSubcategoryName] as [FrenchProductSubcategoryName],
                 [_].[ProductCategoryKey] as [ProductCategoryKey]
             from [dbo].[DimProductSubcategory] as [_]
         )
     ) AS [t2]
GROUP BY 
     [t2].[ProductSubcategoryKey],
     [t2].[ProductSubcategoryName] 

Result:
Results_Query1

The query basically selects two columns from the DimProductSubcategory table:

  1. ProductSubcategoryKey – which is used in the join with DimProduct
  2. ProductSubcategoryName – which is the final name to be displayed in the visual

The inner sub-select (line 7-14) represents the original Power Query query. It selects all columns from the DimProductSubcategory table and renames [EnglishProductSubcategoryName] to [ProductSubcategoryName] (line 10). Any other Power Query steps that are supported in direct query mode, like aggregations, groupings, filters, etc., would also show up here.

Query 2 (shortened):

SELECT TOP (1000001) 
     [semijoin1].[c67],
     SUM([a0]) AS [a0]
FROM 
     (
         (
             SELECT 
                 [t1].[ProductSubcategoryKey] AS [c29],
                 COUNT_BIG(*) AS [a0]
             FROM 
             (
                 (
                     select 
                         [$Table].[ProductKey] as [ProductKey],
                         [$Table].[ProductAlternateKey] as [ProductAlternateKey],
                         …
                     from [dbo].[DimProduct] as [$Table]
                 )
             ) AS [t1]
             GROUP BY [t1].[ProductSubcategoryKey]
     ) AS [basetable0]
inner join 
     (
         (SELECT N'Mountain Bikes' AS [c67],1 AS [c29] )  UNION ALL 
         (SELECT N'Road Bikes' AS [c67],2 AS [c29] )  UNION ALL 
         (SELECT N'Touring Bikes' AS [c67],3 AS [c29] )  UNION ALL 
         (SELECT N'Handlebars' AS [c67],4 AS [c29] )  UNION ALL 
         (SELECT N'Bottom Brackets' AS [c67],5 AS [c29] )  UNION ALL 
         …
         (SELECT null AS [c67],null AS [c29] ) 
     ) AS [semijoin1] 
     on [semijoin1].[c29] = [basetable0].[c29]
)
GROUP BY [semijoin1].[c67] 

(The query was shortened at line 16 and line 29 as the removed columns/rows are not relevant for the purpose of this example.)

Similar to Query 1 above, the innermost sub-select (line 13-17) in the FROM clause returns the result of the Power Query query for DimProduct, whereas the outer sub-select (line 7-20) groups that result by the common join key [ProductSubcategoryKey].
This result is then joined with a static table which is made up of hard-coded SELECTs and UNION ALLs (line 24-30). If you take a closer look, you will realize that this table actually represents the original result of Query 1! Additionally, it also includes a special NULL-item (line 30) that is used to handle non-matching entries.
The last step is to group the joined tables to obtain the final results.

Query 3 (shortened):

SELECT 
     COUNT_BIG(*) AS [a0]
FROM 
     (
         (
             select 
                 [$Table].[ProductKey] as [ProductKey],
                 [$Table].[ProductAlternateKey] as [ProductAlternateKey],
                 …
             from [dbo].[DimProduct] as [$Table]
         )
     ) AS [t1] 

(The query was shortened at line 9 as the removed columns/rows are not relevant for the purpose of this example.)

The last query is necessary to display the correct grand total across all products and product sub-categories.

As you can see, most of the “magic” happens in Query 2. The virtual join, or virtualization, is done by hard-coding the results of the remote table/data source directly into the SQL query of the current table/data source. This works fine as long as the result of the remote query is small enough, both in terms of number of rows and columns, but the more limiting factor is the number of rows. Roughly speaking, if more than a thousand items are joined this way, the queries tend to get slow. In reality this will very much depend on your data, so I would highly recommend testing this with your own data!

I ran a simple test and created a join on the SalesOrderNumber column, which has about 27,000 distinct items. The query never returned any results, and after having a look at the Performance Analyzer I realized that the query corresponding to Query 2 above was never executed. I do not know yet whether this is because of the large number of items and the very long SQL query that is generated (27,000 times SELECT + UNION ALL!!!) or simply a bug.

At this point you may ask yourself if it makes sense to use Power BI for data virtualization or to use another tool that was explicitly designed for this scenario (just google “data virtualization”). These other tools may perform better even on higher data volumes, but they will also reach their limits if the joins get too big and, what is even more important, they are usually quite expensive.

So I think that Power BI is still a viable solution for data virtualization if you keep the following things in mind:
– keep the number of items in the join columns to a minimum
– use Power Query to pre-aggregate the data if possible (see the sketch after this list)
– don’t expect too much in terms of performance
– only use it when you know what you are doing 🙂
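
For illustration, here is a minimal Power Query sketch of such a pre-aggregation. Server, database and column names are just placeholders, and for a SQL source a step like Table.Group can usually be folded back into a GROUP BY on the data source:

let
    // hypothetical source - replace server and database with your own
    Source = Sql.Database("myserver", "AdventureWorksDW2017"),
    Sales = Source{[Schema = "dbo", Item = "FactInternetSales"]}[Data],
    // reduce the number of rows (and thereby join items) before the sources are combined
    PreAggregated = Table.Group(
        Sales,
        {"ProductKey"},
        {{"SalesAmount", each List.Sum([SalesAmount]), type number}}
    )
in
    PreAggregated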

Downloads:

PowerBI_DataVirtualization_Part2.pbix
SQL_Query1.sql
SQL_Query2.sql
SQL_Query3.sql