Data Virtualization in Microsoft Power BI – Part 2

In my previous post I showed how you can use Microsoft Power BI to create a Data Virtualization layer on top of multiple relational data sources, querying them all at the same time through one common model. As I already mentioned in that post, and as Adam Saxton (b, t) also pointed out in the comments, this approach can cause serious performance problems, both at the data source and on the Power BI side. So in this post we will take a closer look at what actually happens in the background and which queries are executed when you join different data sources on-the-fly.

We will use the same model as in the previous post (you can download it from there or at the end of this post) and run some basic queries against it to get a better understanding of the internals.
Here is our relationship diagram again as a reference. Please remember that each table comes from a different SQL Server:

Relationships

In our test we will simply count the number of products for each Product Subcategory:

NumberOfProducts_by_SubCategory

Even though this query only touches two of the data sources, it is a good way to analyze the queries that are sent to them. To track these queries I used the built-in Performance Analyzer of Power BI Desktop, which can be enabled on the “View” tab. It gives you detailed information about the performance of the report, including the actual SQL queries (under “Direct query”) that were executed on the data sources. The plain-text queries can also be copied using the “Copy queries” link at the bottom. In our case, 3 SQL queries were executed against 2 different SQL databases:

Query 1:

SELECT TOP (1000001) 
     [t2].[ProductSubcategoryKey],
     [t2].[ProductSubcategoryName]
FROM 
     (
         (
             select 
                 [_].[ProductSubcategoryKey] as [ProductSubcategoryKey],
                 [_].[ProductSubcategoryAlternateKey] as [ProductSubcategoryAlternateKey],
                 [_].[EnglishProductSubcategoryName] as [ProductSubcategoryName],
                 [_].[SpanishProductSubcategoryName] as [SpanishProductSubcategoryName],
                 [_].[FrenchProductSubcategoryName] as [FrenchProductSubcategoryName],
                 [_].[ProductCategoryKey] as [ProductCategoryKey]
             from [dbo].[DimProductSubcategory] as [_]
         )
     ) AS [t2]
GROUP BY 
     [t2].[ProductSubcategoryKey],
     [t2].[ProductSubcategoryName] 

Result:
Results_Query1

The query basically selects two columns from the DimProductSubcategory table:

  1. ProductSubcategoryKey – which is used in the join with DimProduct
  2. ProductSubcategoryName – which is the final name to be displayed in the visual

The inner sub-select represents the original Power Query query. It selects all columns from the DimProductSubcategory table and renames [EnglishProductSubcategoryName] to [ProductSubcategoryName]. Any other Power Query steps that are supported in Direct Query, like aggregations, groupings, filters, etc., would also show up here.
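
For reference, the Power Query query behind this table looks roughly like the following M sketch (the server name is an assumption based on my setup; the rename step is what produces the “as [ProductSubcategoryName]” alias above):

let
    // Direct Query source – in my setup this table lives in the Azure SQL database
    Source = Sql.Database("myserver.database.windows.net", "AdventureWorksDW"),
    DimProductSubcategory = Source{[Schema="dbo",Item="DimProductSubcategory"]}[Data],
    // this rename folds into the generated SQL as a column alias
    RenamedColumns = Table.RenameColumns(DimProductSubcategory,
        {{"EnglishProductSubcategoryName", "ProductSubcategoryName"}})
in
    RenamedColumns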

Query 2 (shortened):

SELECT TOP (1000001) 
     [semijoin1].[c67],
     SUM([a0]) AS [a0]
FROM 
     (
         (
             SELECT 
                 [t1].[ProductSubcategoryKey] AS [c29],
                 COUNT_BIG(*) AS [a0]
             FROM 
             (
                 (
                     select 
                         [$Table].[ProductKey] as [ProductKey],
                         [$Table].[ProductAlternateKey] as [ProductAlternateKey],
                         …
                     from [dbo].[DimProduct] as [$Table]
                 )
             ) AS [t1]
             GROUP BY [t1].[ProductSubcategoryKey]
     ) AS [basetable0]
inner join 
     (
         (SELECT N'Mountain Bikes' AS [c67],1 AS [c29] )  UNION ALL 
         (SELECT N'Road Bikes' AS [c67],2 AS [c29] )  UNION ALL 
         (SELECT N'Touring Bikes' AS [c67],3 AS [c29] )  UNION ALL 
         (SELECT N'Handlebars' AS [c67],4 AS [c29] )  UNION ALL 
         (SELECT N'Bottom Brackets' AS [c67],5 AS [c29] )  UNION ALL 
         …
         (SELECT null AS [c67],null AS [c29] ) 
     ) AS [semijoin1] 
     on [semijoin1].[c29] = [basetable0].[c29]
)
GROUP BY [semijoin1].[c67] 

(The query was shortened at the “…” markers as the removed columns/rows are not relevant for the purpose of this example.)

Similar to Query 1 above, the innermost sub-select in the FROM clause returns the results of the Power Query query for DimProduct, whereas the sub-select around it groups the result by the common join key [ProductSubcategoryKey].
This result is then joined with a static table that is made up of hard-coded SELECTs and UNION ALLs. If you take a closer look, you will realize that this static table actually represents the result of Query 1! Additionally, it includes a special NULL item that is used to handle non-matching entries.
The last step is to group the joined tables to obtain the final results.

Query 3 (shortened):

SELECT 
     COUNT_BIG(*) AS [a0]
FROM 
     (
         (
             select 
                 [$Table].[ProductKey] as [ProductKey],
                 [$Table].[ProductAlternateKey] as [ProductAlternateKey],
                 …
             from [dbo].[DimProduct] as [$Table]
         )
     ) AS [t1] 

(The query was shortened at the “…” marker as the removed columns are not relevant for the purpose of this example.)

The last query is necessary to display the correct grand total across all products and product sub-categories.

As you can see, most of the “magic” happens in Query 2. The virtual join, or virtualization, is done by hard-coding the results of the remote table/data source directly into the SQL query of the current table/data source. This works fine as long as the results of the remote query are small enough, both in terms of number of rows and columns, but the number of rows is the more limiting factor. Roughly speaking, if more than a thousand items are joined this way, the queries tend to get slow. In reality this will very much depend on your data, so I would highly recommend testing this with your own data!
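
A quick way to assess a candidate join column up front is to count its distinct values in the Power Query editor. A minimal M sketch (the server name is an assumption based on my setup):

let
    Source = Sql.Database("myserver1", "AdventureWorksDW2012"),
    FactResellerSales = Source{[Schema="dbo",Item="FactResellerSales"]}[Data],
    // number of distinct values Power BI would have to hard-code into the semijoin
    DistinctItems = Table.RowCount(Table.Distinct(
        Table.SelectColumns(FactResellerSales, {"SalesOrderNumber"})))
in
    DistinctItems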

I ran a simple test and created a join on SalesOrderNumber, which has about 27,000 distinct items. The query never returned any results, and after having a look at the Performance Analyzer I realized that the query corresponding to Query 2 above was never executed. I do not know yet whether this is because of the large number of items and the very long SQL query that is generated (27,000 SELECT + UNION ALL statements!) or a bug.

At this point you may ask yourself whether it makes sense to use Power BI for data virtualization or to use another tool that was explicitly designed for this scenario (just google “data virtualization”). These other tools may perform better even on higher data volumes, but they will also reach their limits if the joins get too big and, what is even more important, they are usually quite expensive.

So I think that Power BI is still a viable solution for data virtualization if you keep the following things in mind:
– keep the items in the join columns at a minimum
– use Power Query to pre-aggregate the data if possible (see the sketch after this list)
– don’t expect too much in terms of performance
– only use it when you know what you are doing 🙂
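
To illustrate the second point: pre-aggregating a table in Power Query can drastically reduce the number of rows that end up hard-coded in the semijoin. A minimal M sketch (the server name and the chosen aggregation are assumptions; Table.Group folds back to a GROUP BY on the source in Direct Query mode):

let
    Source = Sql.Database("myserver2", "AdventureWorksDW2014"),
    DimProduct = Source{[Schema="dbo",Item="DimProduct"]}[Data],
    // one row per subcategory instead of one row per product
    PreAggregated = Table.Group(DimProduct, {"ProductSubcategoryKey"},
        {{"NumberOfProducts", each Table.RowCount(_), Int64.Type}})
in
    PreAggregated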

Downloads:

PowerBI_DataVirtualization_Part2.pbix
SQL_Query1.sql
SQL_Query2.sql
SQL_Query3.sql

Data Virtualization in Microsoft Power BI

Data Virtualization is actually a very new topic to me as I have barely seen it implemented in the real world or at any of my customers. But it becomes more and more interesting when working with big data, where you cannot simply load all data into a single in-memory data model but still need to query across different data sources. So I decided to investigate how this could be done with my favorite reporting tool, Power BI, which I know is capable of connecting to different data sources out of the box and also provides the rich set of visualizations that I need.

But let’s start slowly.

What is Data Virtualization?

According to Wikipedia, “Data virtualization is any approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted at source, or where it is physically located,[1] and can provide a single customer view (or single view of any other entity) of the overall data.”

So basically: combining data from multiple sources and multiple formats into a common semantic layer which can be queried on-the-fly without the need for any ETL/ELT.

Sounds awesome – right?

The problem is that in reality things are not as simple as they may sound, especially when it comes to joining across the different sources. While data virtualization usually works fine for small amounts of data that can be easily processed, it can be quite challenging on large amounts of data, which is exactly where data virtualization would make sense to avoid lengthy and costly ETL/ELT.

What does Power BI have to do with this?

At first sight – nothing. But let’s examine what we currently have in Power BI:

  • a semantic layer and data modelling capabilities
  • access to various data sources via Direct Query (remember, we do not want to load any data!)
  • ability to combine data from those sources

The last part is the most important one here, and you may wonder what I am talking about. And you are right: by default, a Direct Query model is only linked to one data source at a time, but you can add other data sources manually in the Power Query editor!
This is where it gets interesting, and this is what this blog post is about.

Test-Case

To verify the statements from above I built a little test case that involves two local databases and an Azure SQL database, over which I want to create a semantic layer using Power BI. All three databases are actually the same AdventureWorksDW database, but for the purpose of this demonstration this is OK as it is just a technical feasibility study.

I started by creating a new Power BI file and connected it to my first database in Direct Query mode.

Then I selected a single table; in my case I chose “FactResellerSales”.

The next step is to add a new table using the “Edit Queries” button on the ribbon:

EditQueries_AddNewDatasources

You will see the table you just selected as a Power Query query, which you can simply copy and rename. I renamed it to “DimProduct” as I want to load the DimProduct table from my second local database, which can be accomplished by simply changing the connection to the SQL database in the first step “Source” of the query (I use my second local database, AdventureWorksDW2014, instead of AdventureWorksDW2012):
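
After the change, the copied query looks roughly like the following M sketch (the server name is an assumption; only the Source step differs from the query it was copied from):

let
    // Source now points to the second local server/database
    Source = Sql.Database("myserver2", "AdventureWorksDW2014"),
    dbo_DimProduct = Source{[Schema="dbo",Item="DimProduct"]}[Data]
in
    dbo_DimProduct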

ChangeLocalDatabaseAndTable

Several things to point out here:
  1. once you change the name of the server or the database, you may get prompted for credentials
  2. when you click on “Table” in the Data column, Power BI asks if you want to replace the next step – simply press [Yes]
  3. at the “Source” step, Power BI will complain that the results of the current step are not valid in Direct Query mode – this is fine, as the final result will be valid after the next step (“Navigation”)

I repeated the same procedure and also added the table “DimProductSubcategory” from my Azure SQL database.
So right now we have three Direct Query tables pointing to three different tables in three different databases.

Now we need to connect our tables in the Relationship view – similar to what you would do with any other tables in a regular Direct Query setup:

Setup_Relationships

Once the relationships are created, we can finally create our reports.

Are you excited? Well, I definitely was when I tried this setup the first time!

InteractiveReport

The visuals behave as if they were created on top of an Import Mode dataset or a Direct Query dataset that only connects to a single database. I think that is pretty awesome and again shows what Power BI is capable of!

We just used Power BI to create a semantic layer across different databases and tables, which are now all joined and queried on-the-fly, always showing the most recent data!

Conclusion

As you have seen, Power BI allows us to combine multiple SQL databases in Direct Query mode and query them together as if they were one single data source. So if you have a requirement where your data is distributed across databases and is either too big to be loaded into memory or needs to be live, you can give this approach a try. I have only tested it with regular Microsoft SQL databases, but I assume that it works in a similar way with any other data source that supports Direct Query (e.g. SAP, Oracle, Spark, …) too. You can also do some basic transformations before joining the data in Direct Query mode, which can be crucial when combining different data sources that might have slightly different formats. Again, I have not tested this thoroughly, but at least everything that can be query folded should be supported as a transformation.
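
For instance, a simple filter step folds back to the source as a WHERE clause and is therefore applied before the virtual join happens. A small M sketch (the server name and filter condition are assumptions):

let
    Source = Sql.Database("myserver2", "AdventureWorksDW2014"),
    DimProduct = Source{[Schema="dbo",Item="DimProduct"]}[Data],
    // folds to: WHERE [Status] = 'Current'
    CurrentProducts = Table.SelectRows(DimProduct, each [Status] = "Current")
in
    CurrentProducts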

In a follow-up post I will explain the technical details and what actually happens in the background when you use a setup like this, so stay tuned!