Using Power BI Field Parameters to translate Data and Values

When building an enterprise reporting solution with Power BI, a question that always comes up is how to handle translations. Large enterprises operate in various countries where people also speak different languages. So a report should be available in all frequently used languages. Ideally, you just create a report once and then a user can decide (or it is decided for him) in which language the report is displayed.

Power BI only partially supports this scenario and the closest we could get *before field parameters* were introduced is already very well described by Chris Webb’s blog post on Implementing Data (As Well As Metadata) Translations In Power BI – a must-read if you need to deal with translations in Power BI. Another good read on the topic is the blog post Multilingual Reports in Power BI from PBI Guy.

As you will quickly realize, the translation of metadata is already pretty easy as it is baked into the engine. Unfortunately this is not the case when you need to translate actual data values (e.g. product names, …). In the multidimensional version of Analysis Services this just worked like a charm as it was also a native feature but this feature never made it to Analysis Services Tabular Models, Azure Analysis Services or Power BI.

The current approaches when it comes to data and value translations are more workarounds than actual solutions. They probably work fine for small data models and very specific use-cases but usually fall short in performance, usability or maintainability when implemented on larger-scale enterprise models.

The recently introduced Field Parameters in Power BI give us a bit more flexibility here and another potential solution to implement data and value translations in Power BI.

Here is what we want to achieve:

  • create a single report only
  • support for multiple languages – metadata and column data
  • only minor changes to the existing data model

How can Field Parameters help here?

Field Parameters allow you to select the columns you want to display in your report/visual on-the-fly. Based on the selection, the reporting engine decides which physical column(s) it needs to use in the query it generates and sends to the data model.
So we can create a Field Parameter for the different columns that hold the translated data values and then easily switch the language by changing the selection of our Field Parameter. This is how our Field Parameter would be defined:

Translated ProductName = {
    ("product name", NAMEOF('DimProduct'[EnglishProductName]), 0, "en-US"),
    ("nom du produit", NAMEOF('DimProduct'[FrenchProductName]), 1, "fr-FR"),
    ("nombre de producto", NAMEOF('DimProduct'[SpanishProductName]), 2, "es-SP")
}

I did this for all the fields for which translated values are actually provided. Usually this is just a very small subset of all the available columns!

Translated MonthOfYear = {
    ("MonthName", NAMEOF('DimDate'[EnglishMonthName]), 0, "en-US"),
    ("mois de l'année", NAMEOF('DimDate'[FrenchMonthName]), 1, "fr-FR"),
    ("mes del año", NAMEOF('DimDate'[SpanishMonthName]), 2, "es-SP")
}

Translated DayOfWeek = {
    ("Day Of Week", NAMEOF('DimDate'[EnglishDayNameOfWeek]), 0, "en-US"),
    ("jour de la semaine", NAMEOF('DimDate'[FrenchDayNameOfWeek]), 1, "fr-FR"),
    ("día de la semana", NAMEOF('DimDate'[SpanishDayNameOfWeek]), 2, "es-SP")
}

As you can see, Field Parameters allow you to translate the metadata (first value) and also to define the column to use for the data values (second value, using NAMEOF() function).

To change all field parameters at once I introduced an additional 4th column that holds the culture/language of the current row which is then linked to another static DAX table that is defined as follows:

Language = DATATABLE("Culture", STRING, {{"en-US"}, {"fr-FR"}, {"es-SP"}})

Then relationships are set up between these tables:
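Roughly, the relationships look like this – a sketch only, assuming the additional fourth column of each field parameter is exposed as [Culture] (the actual column name depends on how you named it in your model):

'Translated ProductName'[Culture]   (many)  ->  'Language'[Culture]  (one)
'Translated MonthOfYear'[Culture]   (many)  ->  'Language'[Culture]  (one)
'Translated DayOfWeek'[Culture]     (many)  ->  'Language'[Culture]  (one)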

In your report you can now simply use the column from the field parameters and add a slicer for the Language table to control which language is displayed. Note: this must be a single-select slicer as otherwise Power BI will build a hierarchy of the different languages!

Here is the final result:

(please use Full Screen mode from bottom right corner)

As you can see, we just created a single report that supports multiple languages for both metadata and data values, allows you to easily switch between them, and provides similar performance as if you had built the report for a single language only!

There are still some open questions when it comes to translating all the labels used on the whole report which is already partially covered in the other blog posts referenced above but this approach brings us another step further to a fully translatable report.

Another nice feature of this approach is that you can also put security on top of the Language/Culture table so a user only sees exactly one row – the one with the language/culture of his choice or country. So a user would not even need to select the language but it would be selected for him automatically!
Ideally you could even use the USERCULTURE() DAX function but unfortunately this is currently not supported in the PBI service. There is already an open idea for which you can vote if this is important to you.
Update: the USERCULTURE() DAX function is now finally generally available also in the service: https://powerbi.microsoft.com/en-us/blog/userculture-dax-function-now-supported-in-power-bi-premium/
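Now that USERCULTURE() works in the service, a minimal sketch of such a row-level security filter – a DAX filter expression defined on the Language table of a security role – could simply be:

[Culture] = USERCULTURE()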

The .pbix file can be downloaded here: PBI_Translations.pbix

Automating the Extraction of BIM metadata from PBIX Files using CI/CD pipelines

The latest updates can always be found in the

PowerBI.CICD repository

In the past I have been working on a lot of different Power BI projects and it has always been (and still is) a pain when it comes to the deployment of changes across multiple tiers (e.g. Dev/Test/Prod). The main problem here is that a file generated in Power BI desktop (.pbix) is basically a binary file and the metadata of the actual data model (BIM) cannot be easily extracted or derived. This causes a lot of problems upstream when you want to automate the deployment using CI/CD pipelines. Here are some common approaches to tackle these issues:

  • Use of Power BI deployment pipelines
    The most native solution, however quite inflexible when it comes to custom and conditional deployments to multiple stages
  • Creation of a Power BI Template (.pbit) in addition to your .pbix file and check-in of both
    This works because the .pbit file basically contains the BIM file but its creation is also a manual step
  • Extraction of the BIM file while PBI desktop is still running (e.g. using Tabular Editor)
    With the support of external tools this is quite easy, but is still a manual step and requires a 3rd party tool
  • Development outside of PBI desktop (e.g. using Tabular Editor)
    Probably the best solution but unfortunately not really suited for business users, and it covers only the data model, not the Power Queries

As you can see, there are indeed some options, but none of them is really ideal, especially not for a regular business user (not talking about IT pros). So I made up my mind and came up with the following list of things that I would want to see for proper CI/CD with Power BI files:

  • Users should be able to work with their tool of choice (usually PBI desktop, optional with Tabular Editor or any other 3rd party tool)
  • Automatically extracting the metadata whenever the data model changes
  • Persisting the metadata (BIM) in git to allow easy tracking of changes
  • Using the persisted BIM file for further automation (CD)

Solution

The core idea of the solution is to use CI/CD pipelines that automatically extract the metadata of a .pbix file as soon as it is pushed to the Git repository. To do this, the .pbix file is automatically uploaded to a Power BI Premium workspace using the Power BI REST API, the free version of Tabular Editor 2 then extracts the BIM file via the XMLA endpoint, and the result is pushed back to the repository.

I packaged this logic into ready-to-use YAML pipelines for Github Actions and Azure DevOps Pipelines, the two most common choices to use with Power BI. You can just copy the YAML files from the PowerBI.CICD repository to your own repo. Then simply provide the necessary information to authenticate against the Power BI service and that’s it. As soon as everything is set up correctly, the pipeline will automatically create a .database.json for every PBIX file that you upload (assuming it contains a data model) and track it in your git repository!
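Just to illustrate the mechanism, here is a hypothetical, minimal trigger of such a GitHub Actions workflow – the complete, ready-to-use pipelines (upload via REST API, BIM extraction with Tabular Editor 2, commit back to the repository) are the YAML files in the PowerBI.CICD repository:

# run the pipeline whenever a .pbix file is pushed
on:
  push:
    paths:
      - '**/*.pbix'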

All further details can be found directly in the repository which is also updated frequently!

Doing relative-time Slicers properly in Power BI

A very common requirement for a Power BI report that I stumble across at almost all of my customers is to automatically show data for the current day/month/year when a report is opened. At first sight this seems like a very trivial problem but once you dig into the problem, you will realize that all of the common solutions out there have some disadvantages and only solve the problem partially.

So here is what we want to achieve:

  • Show the Current Month (or Day, or Year) by default
  • Works [in combination] with all other columns in the date table.
  • A single, easy to use slicer/filter to control the time selection and change from Current Month/Day/Year to any other value
  • Works with built-in time intelligence functions
  • Works with existing DAX measures
  • Works with any datamodel/report

Solutions like Relative Time Filter/Slicer, DAX or relative flags in the date table address only some points of the above list but definitely not all of them which is why I thought we need a better solution to this:

(please use full-screen mode)

We actually created a new table in our data model that is linked to the original date table. The reason why we cannot use the same table here is that the new table does not have unique date values as all dates/rows referring to our current calculations are duplicated. It has to be a many-to-one relationship with cross-filter direction set to both (even though we will only use the new table ‘Calendar_with_current’ to filter the existing table ‘Calendar’):

And that’s it basically. You can now exchange the original Calendar table with the new one to get the new “Current” values in your report. If you have time intelligence functions in place, you further need to extend them and add ALL('Calendar_with_current') as a filter to make them work also with the new table. The old table can also be hidden now if you do not want to confuse the end users. To make a seamless switch you can further rename the tables.

I added an additional column to the table called Type that allows you to select which values you want to show – the original values (e.g. “September”), the values with “Current X” (e.g. “Current Month”), or both. Please see the second page/tab of the embedded report above.

So this raises the question how this new table can be created? To simplify this I have created a Power Query function that takes 3 parameters:

  • The current date table
  • A list of definitions of your current-values
  • The name of the unique date-column in your current date table (parameter 1)

The first and the third parameter should be clear, but what are the “CurrentDefinitions”?

It is basically a table which defines the relative time calculations that you want to extend your existing date table with. Here is an example:

The column Column refers to the column in which you want to create the relative date definition. The column NewValue specifies the value that you want to set for rows that match the third column Filter. The column Filter either takes a static filter expression like [RelativeMonth] = 1 (as in lines 5-8) but can also use existing M-functions and reference the existing Date-column using the placeholder <<DateColumn>> as you can see in lines 1-4.

The table can be maintained using “Enter Data” and can contain any number of rows/definitions!
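To give you an idea, here is a minimal sketch in M of what such a definitions table and the final function call could look like – the column names (Column, NewValue, Filter) come from the description above, while the concrete rows and table/column names are just assumptions for illustration:

let
    // illustrative CurrentDefinitions table – adjust columns and filters to your own date table
    CurrentDefinitions = #table(
        {"Column", "NewValue", "Filter"},
        {
            {"MonthName",    "Current Month", "Date.IsInCurrentMonth(<<DateColumn>>)"},
            {"CalendarYear", "Current Year",  "Date.IsInCurrentYear(<<DateColumn>>)"},
            {"DayName",      "Today",         "[RelativeDay] = 0"}    // static filter variant
        }
    ),
    // parameters: (1) existing date table, (2) definitions, (3) name of the unique date column
    Calendar_with_current = fn_DateTableWithCurrentCalculations(Calendar, CurrentDefinitions, "Date")
in
    Calendar_with_current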

For most of my scenarios this works pretty well and addresses all major problems highlighted above.

The latest Power Query function can be downloaded from my github repository: fn_DateTableWithCurrentCalculations.pq
Power BI desktop files can be downloaded from here: PBIX PBIT

Custom functions and complex return types in Power Query

When working with Power Query, you have probably already realized that every expression you write returns a value of a specific type. Usually this will be a primitive type like text, number, or date. (A full list of types available in Power Query can be found here: https://docs.microsoft.com/en-us/powerquery-m/m-spec-types). If for some reason the type of an expression cannot be defined, the special type *any* will be used. For sure you already encountered this when using Table.AddColumn which, by default, results in the new column being of type *any*.

To avoid this, you can use the optional fourth parameter and specify the resulting type of the expression. This can be very handy and saves you the Change Type step that usually comes afterwards.
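For example, a minimal sketch (Source and the column names are just placeholders):

Table.AddColumn(Source, "DoubleValue", each [Value] * 2, type number)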

This fourth parameter not only works for primitive types but also for complex types. If you do not specify it, the column type is again *any* even though the actual values are records:

Once you click the Expand-Button of the new MyRecord column in the table header, you will realize a short delay until the available fields are displayed. This indicates that PQ first has to evaluate the expression before it can provide you the list of fields within the record. For complex scenarios, this can take a long time and can also be avoided by explicitly specifying the type in the fourth parameter as shown below:
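A minimal sketch of what that could look like (the field names are just placeholders):

Table.AddColumn(
    Source,
    "MyRecord",
    each [TextValue = "Hello", NumValue = 123],
    type [TextValue = text, NumValue = number]
)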

As you can see, PQ can now immediately display the available columns without having to evaluate the function!

This also works the very same way if you call a custom function as the expression of your Table.AddColumn. But there is a caveat: if you have a function that returns a complex type, let’s say a record, you will usually want to specify the type as part of the function or within the definition of the function and not re-type it again each time you call the function.

Fortunately, there is a solution to this problem: the function Type.FunctionReturn. In combination with Value.Type you can derive the return type from the function dynamically!

So, if you have the following function:

let
    myTextFunction = () as text => "Called fn_myTextFunction!"
in
    myTextFunction

You can derive the resulting type of the function by using a combination of Type.FunctionReturn and Value.Type as shown below:
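For illustration, a minimal sketch – assuming the query holding the function above is called fn_myTextFunction:

Table.AddColumn(
    Source,
    "MyText",
    each fn_myTextFunction(),
    Type.FunctionReturn(Value.Type(fn_myTextFunction))
)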

Ok, so this is already pretty cool – but what happens if your function returns a complex type like a record or a table?

You will realize that you can simply replace “as text” with “as record” and the function would now return a record – at least logically, you also need to change the actual expression:

let
    myFunction = () as record => [MyText="Called fn_MyTextFunction!"]
in
    myFunction

and then call it as before:

You will realize that now again it takes some time until the available fields are displayed, indicating that the function must be evaluated first. Another indicator for this is the warning at the bottom and the link to “Load more”. If you think about it, this makes sense – Power Query knows that the function now returns a record, but does not know which fields the record contains and thus has to evaluate it. So how can we combine custom functions that return complex types with the ability to specify the resulting type as part of the function?

The first thing that would come to your mind is to simply strong-type the return type of the function specifying each field individually, but this will result in an error:
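The attempt would look something like the following – intentionally shown although it is not valid M, just to illustrate what fails:

// NOT valid M – "as" only accepts (nullable) primitive types, so this raises an error:
myFunction = () as [MyText = text] => [MyText = "Called fn_MyTextFunction!"]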

Currently it is not supported to specify a complex type as the return type of a function – it only works with primitive types. But as you can guess, I did not start this blog post for no reason. There is a way to achieve this, even though it may not be as nice as it could and should be.

The solution here is the Type.ForFunction function which allows you to create a more specific definition of your function including the return type. This definition/type can then be applied to your original function using Value.ReplaceType:

let
    resultType = type [TextValue=text, NumValue=number],
    myFunction = (myText as text, myNum as number) => [TextValue=myText, NumValue=myNum],

    parameters = Type.FunctionParameters(Value.Type(myFunction)),
    requiredParameters = Type.FunctionRequiredParameters(Value.Type(myFunction)),

    newFunctionType = Type.ForFunction([ReturnType = resultType, Parameters = parameters], requiredParameters),

    newFunction = Value.ReplaceType(myFunction, newFunctionType)
in
    newFunction

You basically first define the final return type of the function and the function itself (lines 2 and 3). The other lines (5 to 10) take care of applying the return type to the function which can then be used in combination with the approach above to dynamically derive the return type when calling the function (using Type.FunctionReturn and Value.Type). This now allows you to specify everything that is related to the function in one place!
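Putting it all together, a minimal sketch of how such a strongly-typed function could then be consumed – assuming it is stored in a query called fn_MyFunction:

Table.AddColumn(
    Source,
    "Result",
    each fn_MyFunction([SomeText], [SomeNumber]),
    Type.FunctionReturn(Value.Type(fn_MyFunction))
)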

This is especially handy if you have a function that returns a record or a table which is re-used multiple times and the fields/columns may change over time. Using this approach allows you to only change the function and everything else is derived automatically.

Sample workbook for download: PQ Custom Function return types.pbix

Reading Delta Lake Tables natively in PowerBI

I also contributed the connector described in this post to the official delta.io Connectors page and repo (link). You will find the most recent updates in my personal repo which are then merged to the official repo once it has been tested thoroughly!

Working with analytical data platforms and big data on a daily basis, I was quite happy when Microsoft finally announced a connector for Parquet files back in November 2020. The Parquet file format is developed by the Apache foundation as an open-source project and has become a fundamental part of most data lake systems nowadays.

“Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.”

However, Parquet is just a file format and does not really support you when it comes to data management. Common data manipulation operations (DML) like updates and deletes still need to be handled manually by the data pipeline. This was one of the reasons why Delta Lake (delta.io) was developed besides a lot of other features like ACID transactions, proper meta data handling and a lot more. If you are interested in the details, please follow the link above.

So what is a Delta Lake table and how is it related to Parquet? Basically a Delta Lake table is a folder in your Data Lake (or wherever you store your data) and consists of two parts:

  1. Delta log files (in the sub-folder _delta_log)
  2. Data files (Parquet files in the root folder or sub-folders if partitioning is used)

The Delta log persists all transactions that modified the data or meta data in the table. For example, if you execute an INSERT statement, a new transaction is created in the Delta log and a new file is added to the data files which is referenced by the Delta log. If a DELETE statement is executed, a particular set of data files is (logically) removed from the Delta log but the data files still reside in the folder for a certain time. So we cannot just simply read all Parquet files in the root folder but need to process the Delta log first so we know which Parquet files are valid for the latest state of the table.

These logs are usually stored as JSON files (actually JSONL files to be more precise). After 10 transactions, a so-called checkpoint-file is created which is in Parquet format and stores all transactions up to that point in time. The relevant logs for the final table are then the combination of the last checkpoint-file and the JSON files that were created afterwards. If you are interested in all the details on how the Delta Log works, here is the full Delta Log protocol specification.
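Roughly, the content of a Delta table folder therefore looks something like this (names are illustrative):

MyDeltaTable.delta/
    _delta_log/
        00000000000000000010.checkpoint.parquet    -- all transactions up to version 10
        00000000000000000011.json                  -- transactions after the last checkpoint
        00000000000000000012.json
        _last_checkpoint                           -- pointer to the latest checkpoint file
    part-00000-....snappy.parquet                  -- data files referenced by the Delta log
    part-00001-....snappy.parquet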

From those logs we get the information which Parquet files in the main folder must be processed to obtain the final table. The content of those Parquet files can then simply be combined and loaded into PowerBI.

I encapsulated all this logic into a custom Power Query function which takes the folder listing of the Delta table folder as input and returns the content of the Delta table. The folder listing can either come from an Azure Data Lake Store, a local folder, or an Azure Blob Storage. The mandatory fields/columns are [Content], [Name] and [Folder Path]. There is also an optional parameter which allows you to specify further options for reading the Delta table like the Version if you want to use time-travel. However, this is still experimental and if you want to get the latest state of the table, you can simply omit it.

The most current M-code for the function can be found in my Github repository for PowerBI: fn_ReadDeltaTable.pq and will also be constantly updated there if I find any improvement.
The repository also contains a PowerBI desktop file (.pbix) where you can see the single steps that make up the final function.

Once you have added the function to your PowerBI / Power Query environment you can call it like this:

= fn_ReadDeltaTable(
    AzureStorage.DataLake(
        "https://myadls.dfs.core.windows.net/public/data/MyDeltaTable.delta", 
        [HierarchicalNavigation = false]), 
    [Version = 12])

I would further recommend nesting your queries and separating the access to the storage (e.g. Azure Data Lake Store) from the reading of the table (execution of the function). If you are reading from an ADLS, it is mandatory to also specify [HierarchicalNavigation = false]!
If you are reading from a blob storage, the standard folder listing is slightly different and needs to be changed.

Right now the connector/function is still experimental and performance is not yet optimal. But I hope to get this fixed in the near future to have a native way to read and finally visualize Delta lake tables in PowerBI.

After some thorough testing the connector/function finally reached a state where it can be used without any major blocking issues, however there are still some known limitations:

  • Partitioned tables
    • currently columns used for partitioning will always have the value NULL FIXED!
    • values for partitioning columns are not stored as part of the parquet file but need to be derived from the folder path FIXED!
  • Performance
    • is currently not great but this is mainly related to the Parquet connector as it seems
    • very much depends on your data – please test on your own!
  • Time Travel
    • currently only supports “VERSION AS OF”
    • need to add “TIMESTAMP AS OF”
  • Predicate Pushdown / Partition Elimination
    • currently not supported – it always reads the whole table FIXED!

Any feedback is welcome!

Special thanks also goes to Imke Feldmann (@TheBIccountant, blog) and Chris Webb (@cwebb_bi, blog) who helped me writing and tuning the PQ function!

Downloads: fn_ReadDeltaTable.pq (M-code)

PowerBI & Big Data – Using pre-calculated Aggregations of Semi- and Non-Additive Measures

Calculating and visualizing semi- and non-additive measures like distinct count in Power BI is usually not a big deal. However, things can become challenging if your data volume grows and exceeds the limits of Power BI!

In one of my recent projects we wanted to visualize data from the customer’s analytical platform based on Azure Databricks in Power BI. The connection between those two tools works pretty flawlessly, which I also described in my previous post, but the challenge was the use-case and the calculations. We wanted to display the distinct customers across various aggregation levels of a fact table with over a billion rows. We came up with different potential solutions, all having their pros and cons:

  1. load all data into Power BI (import mode) and do the aggregations there
  2. use Power BI with direct query and let the back-end do the heavy lifting
  3. load only necessary pre-aggregated data into Power BI (import mode)

Please keep in mind that we are dealing with a distinct count measure here. Semi- and non-additive measures like this cannot easily be aggregated from lower levels to higher levels without having all the detail data available!

Option 1. has the obvious drawback that the data model would be huge in size as we were dealing with billions of transactions. This would have exceeded our current size limits for Power BI data models.

Option 2. would usually work fine, but again, for the amount of data we were dealing with the back-end was just not able to provide the sub-second latency that was required.

So we went for Option 3. and did the various aggregations on the different levels in Azure Databricks and loaded only the final results to Power BI. First we wanted to use Power BI Aggregations and Composite Models. Unfortunately, this did not work out for us as we were not in control of which aggregation table (we had multiple for the different aggregation levels) was used by the engine, which potentially resulted in wrong results when additional aggregation was done in Power BI. Also, when slicing by arbitrary aggregation levels, Power BI was querying the details in direct query mode, causing very poor query performance.

After some further thinking we came up with a new solution which was also based on pre-calculated aggregations but not realized using built-in aggregation tables but having a combined table for all aggregations and some very straight-forward DAX to select the row we wanted! In the end the whole solution consisted of one SQL view using COUNT(DISTINCT xxx) aggregation and GROUP BY GROUPING SETS (T-SQL, Databricks, … supported in all major SQL engines) and a very simple DAX measure!

Here is a little example that illustrates the approach. Assume you want to calculate the distinct customers that bought certain products in a subcategory/category by year. The first step is to create a view that provides this information:

SELECT 
	od.[CalendarYear] AS [Year],
	dp.[ProductSubcategoryKey] AS [ProductSubcategoryKey],
	dp.[ProductCategoryKey] AS [ProductCategoryKey],
	COUNT(DISTINCT CustomerKey) AS [DC_Customers]
FROM [dbo].[FactInternetSales] fis
INNER JOIN [dbo].[vDimProductHierarchy] dp
	ON fis.[ProductKey] = dp.[ProductKey]
INNER JOIN dbo.[DimDate] od
	ON fis.[OrderDatekey] = od.[DateKey]
GROUP BY 
GROUPING SETS (
	(),
	(od.[CalendarYear]),
	(od.[CalendarYear], dp.[ProductSubcategoryKey], dp.[ProductCategoryKey]),
	(od.[CalendarYear], dp.[ProductCategoryKey]),
	(dp.[ProductSubcategoryKey], dp.[ProductCategoryKey]),
	(dp.[ProductCategoryKey])
)

Please note that when we have a natural relationship between hierarchy levels (= only 1:n relationships) we need to specify the current level and also all upper levels to allow a proper drill-down later on! E.g. ProductCategory (1 -> n) ProductSubcategory

This calculates all the different aggregation levels we need. Columns with NULL mean they were not filtered/grouped by when calculating the aggregation.
Rows 80-84 contain the aggregations grouped by Year only whereas rows 77-79 contain only aggregates by ProductCategoryKey. The rows 75-76 were aggregated by Year AND ProductCategoryKey.
Depending on your final report layout, you may not need all of them and you should consider removing those that are not needed!

This table is then loaded into Power BI. You can either use a custom SQL query like above in Power BI directly or create a view in the back-end system which would be my preferred solution. Alternatively you can also create all these grouping sets using Power Query/M. The incredible Imke Feldmann (t, b) came up with a solution that allows you to specify the grouping sets in a similar way as in SQL and do all this magic within Power BI directly! I hope she will blog about it pretty soon!
(The sample workbook at the end of this post also contains a little preview of this M-magic.)

Now that we have all the data we need in Power BI, we need to display the right values for the selections in the report which of course can be dynamic. That’s a bit tricky but once you understand the concept, it is pretty straight forward. First of all, the table containing the aggregations must not be related to any other table as we build them on the fly within our DAX measure. The table itself can also be hidden.

And this is the final DAX for our measure:

DC Customers = 
VAR _sel_SubcategoryKey = SELECTEDVALUE(DimProduct[ProductSubcategoryKey])
VAR _sel_CategoryKey = SELECTEDVALUE(DimProduct[ProductCategoryKey])
VAR _sel_Year = SELECTEDVALUE(DimDate[CalendarYear])
VAR _tbl_Agg = CALCULATETABLE(
    'CustomAggregations',
    TREATAS({_sel_SubcategoryKey}, CustomAggregations[ProductSubcategoryKey]),
    TREATAS({_sel_CategoryKey}, CustomAggregations[ProductCategoryKey]),
    TREATAS({_sel_Year}, CustomAggregations[Year])
)
VAR _AggCount = COUNTROWS(_tbl_Agg)
RETURN
    IF(_AggCount = 1, MAXX(_tbl_Agg, [DC_Customers]), _AggCount * -1)

The first part is to get all the selected values of the lookup/dimension tables the user selects on the report. These are all the _sel_XXX variables. SELECTEDVALUE() returns the selected value if only one item is in the current filter context and BLANK()/NULL otherwise. We then use TREATAS() to apply those filters (either a single item or NULL) to our aggregations table. This should usually only return a table with a single row so we can use MAXX() to get our actual value from that one row. I also added a check in case multiple rows are returned which can potentially happen if you use multi-selects in your filters and instead of showing wrong values I’d rather indicate that there is something wrong with the calculation.

The measure can then be sliced and diced by our pre-defined aggregation levels as if it would be a regular measure but instead of having to process those expensive calculations on the fly we use the pre-calculated aggregates!

One thing to be aware of is that it will produce wrong results if multiple items for any of the aggregation levels are selected, so it is highly recommended to set all slicers/filters to single select only or to ensure that the filtered aggregation levels are also used in the chart. In that case, only the grand total will show wrong values or NULL.
This could also be fixed in the DAX measure by checking how many rows are actually selected for each level and throw an error in case it is used in a filter and the count of values is > 1.
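A minimal sketch of such a guard could look like this – wrapping the measure from above; the exact columns to check depend on which aggregation levels you expose:

DC Customers (checked) = 
IF(
    (ISFILTERED(DimProduct[ProductSubcategoryKey]) && NOT HASONEVALUE(DimProduct[ProductSubcategoryKey]))
        || (ISFILTERED(DimProduct[ProductCategoryKey]) && NOT HASONEVALUE(DimProduct[ProductCategoryKey]))
        || (ISFILTERED(DimDate[CalendarYear]) && NOT HASONEVALUE(DimDate[CalendarYear])),
    ERROR("Please use single-select for all aggregation levels!"),
    [DC Customers]
)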

I did some further thinking and this approach could probably also be used to mimic custom roll-ups and unary operators we know from Analysis Services Multidimensional cubes. If I find some proper examples and this turns out to be feasible I will write another blog post about it!

Download: Custom_Aggregations_NonAdditive_Measure.pbix

DatabricksPS and Azure AD Authentication

Available via PowerShell Gallery: DatabricksPS

Databricks recently announced that it is now also supporting Azure Active Directory Authentication for the REST API which is now in public preview. This may not sound super exciting but is actually a very important feature when it comes to Continuous Integration/Continuous Delivery pipelines in Azure DevOps or any other CI/CD tool. Previously, whenever you wanted to deploy content to a new Databricks workspace, you first needed to manually create a user-bound API access token. As you can imagine, manual steps are also bad for otherwise automated processes like a CI/CD pipeline. With Databricks REST API finally supporting Azure Active Directory Authentication of regular users and service principals, this last manual step is finally also gone!

As I had this issue at many of my customers where we had already fully automated the deployment of our data platform based on Azure and Databricks, I also wanted to use this new feature there. The deployment of regular Databricks objects (clusters, notebooks, jobs, …) was already implemented in the CI/CD pipeline using my PowerShell module DatabricksPS and of course I did not want to rewrite any of those steps. So, I simply extended the module’s authentication methods to also support Azure Active Directory Authentication. The only thing that actually changed was the call to Set-DatabricksEnvironment which now supports additional parameter sets and parameters:

The first thing you will realize is that it is now necessary to specify the Databricks Workspace explicitly either using SubscriptionID/ResourceGroupName/WorkspaceName to uniquely identify the Databricks workspace within Azure or using the OrganizationID that you see displayed in the URL of your Databricks Workspace. For the actual authentication the parameters -ClientID, -TenantID, -Credential and the switch -ServicePrincipal are used.

Regardless of whether you use regular username/password authentication with an AAD user or an AAD service principal, the first thing you need to do in both cases is to create an AAD Application as described in the official docs from Databricks:
Using Azure Active Directory Authentication Library
Using a service principal

Once you have ensured all prerequisites exist, you can use the samples below to authenticate with your AAD username/password with DatabricksPS:

$username = 'myuser@mydomain.com'
$password = 'Pass@word1!'
$securePassword = ConvertTo-SecureString $password -AsPlainText -Force
$credential = New-Object System.Management.Automation.PSCredential($username, $securePassword)

$apiUrl = "https://westeurope.azuredatabricks.net"
$tenantId = "15970f38-6789-6789-6789-6e44bf2f5d11"
$clientId = "d73905f5-aaaa-bbbb-cccc-ecff76ba959c"
$subscriptionId = "69389949-1234-1234-1234-e499fac64209"
$resourceGroupName = "myResourceGroup"
$workspaceName = "myDatabricksWorkspace"

# Setup connection to Databricks using AAD Authentication
Set-DatabricksEnvironment -ApiRootUrl $apiUrl -Credential $credential `
  -ClientID $clientId -TenantID $tenantId `
  -SubscriptionID $subscriptionId -ResourceGroupName $resourceGroupName `
  -WorkspaceName $workspaceName

# Stop all existing clusters
Get-DatabricksCluster | Stop-DatabricksCluster

Here is another sample using a regular service principal authentication and the OrganizationID with DatabricksPS:

$clientId = '12345678-6789-6789-6789-6e44bf2f5d11' # = Application ID
$clientSecret = 'tN4Lrez.=5.Il]IAgRx6w6kJ@6C.ap7Y'
$secureClientSecret = ConvertTo-SecureString $clientSecret -AsPlainText -Force
$credential = New-Object System.Management.Automation.PSCredential($clientId, $secureClientSecret)

$apiUrl = "https://westeurope.azuredatabricks.net"
$tenantId = "15970f38-6789-6789-6789-6e44bf2f5d11"
$orgId = "1234535501392586"

# Setup connection to Databricks using AAD Authentication
Set-DatabricksEnvironment -ApiRootUrl $apiUrl -Credential $credential `
  -ClientID $clientId -TenantID $tenantId `
  -OrgID $orgId -ServicePrincipal

# Export all notebooks of the Databricks Workspace to a local folder
Export-DatabricksEnvironment -LocalPath "C:\db_export" `
 -Artifacts "Workspace" -WorkspaceRootPath "/" 

As you can see, once the environment is set up using the new authentication methods, the rest of the script stays the same and there is not much more you need to do to fully automate your CI/CD pipeline with DatabricksPS!

I have not yet fully tested all cmdlets of the module so if you experience any issues, please contact me or open a ticket in the GIT repository.

How-To: Migrating Databricks workspaces

Foreword:
The approach described in this blog post only uses the Databricks REST API and therefore should work with both, Azure Databricks and also Databricks on AWS!

I recently had to migrate an existing Databricks workspace to a new Azure subscription, causing as little interruption as possible and not losing any valuable content. So I thought a simple Move of the Azure resource would be the easiest thing to do in this case. Unfortunately it turns out that moving an Azure Databricks Service (=workspace) is not supported:

Resource move is not supported for resource types ‘Microsoft.Databricks/workspaces’. (Code: ResourceMoveNotSupported)

I do not know what the problem is/was here and I did not have time to investigate, but instead needed to come up with a proper solution in time. So I had a look at what needs to be done for a manual export. Basically there are 5 types of content within a Databricks workspace:

  • Workspace items (notebooks and folders)
  • Clusters
  • Jobs
  • Secrets
  • Security (users and groups)

For all of them an appropriate REST API is provided by Databricks to manage them and also to export and import them. This was fantastic news for me as I knew I could use my existing PowerShell module DatabricksPS to do all the stuff without having to re-invent the wheel again.
So I basically extended the module and added new Import and Export functions which automatically process all the different content types:

  • Export-DatabricksEnvironment
  • Import-DatabricksEnvironment

They can be further parameterized to only import/export certain artifacts and how to deal with updates to already existing items. The actual output of the export looks like this and of course you can also modify it manually to your needs – all files are in JSON except for the notebooks which are exported as .DBC file by default:
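To give you an idea, the local output roughly follows the five content types listed above (the exact folder names may differ slightly):

MyExport/
    Workspace/    -- notebooks and folders, exported as .DBC by default
    Clusters/     -- cluster definitions as JSON
    Jobs/         -- job definitions as JSON
    Secrets/      -- secret scopes as JSON
    Security/     -- users and groups as JSON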

A very simple sample script doing an export and an import into a different environment could look like this:

Set-DatabricksEnvironment -AccessToken $accessTokenExport -ApiRootUrl "https://westeurope.azuredatabricks.net"
Export-DatabricksEnvironment -LocalPath 'D:\Desktop\MyExport' -CleanLocalPath

Set-DatabricksEnvironment -AccessToken $accessTokenImport -ApiRootUrl "https://westeurope.azuredatabricks.net"
Import-DatabricksEnvironment -LocalPath 'D:\Desktop\MyExport' 

Having those scripts made the whole migration a very easy task.
In addition, these new cmdlets can also be used in your Continuous Integration/Continuous Delivery (CI/CD) pipelines in Azure DevOps or any other CI/CD tool!

So just download the latest version from the PowerShell gallery and give it a try!

Data Virtualization in Microsoft Power BI – Part 2

In my previous post I showed how you can use Microsoft Power BI to create a Data Virtualization layer on top of multiple relational data sources, querying them all at the same time through one common model. As I already mentioned in that post, and as Adam Saxton (b, t) also pointed out in the comments, this approach can cause serious performance problems at the data source and also on the Power BI side. So in this post we will have a closer look at what actually happens in the background and which queries are executed when you join different data sources on-the-fly.

We will use the same model as in the previous post (you can download it from there or at the end of this post) and run some basic queries against it so we can get a better understanding of the internals.
Here is our relationship diagram again as a reference. Please remember that each table comes from a different SQL server:

Relationships

In our test we will simply count the number of products for each Product Subcategory:

NumberOfProducts_by_SubCategory

Even though this query only touches two different data sources, it is a good way to analyze the queries sent to the data sources. To track these queries I used the built-in Performance Analyzer of Power BI desktop which can be enabled on the “View”-tab. It gives you detailed information about the performance of the report including the actual SQL queries (under “Direct query”) which were executed on the data sources. The plain text queries can also be copied using the “Copy queries” link at the bottom. In our case 3 SQL queries were executed against 2 different SQL databases:

Query 1:

SELECT TOP (1000001) 
     [t2].[ProductSubcategoryKey],
     [t2].[ProductSubcategoryName]
FROM 
     (
         (
             select 
                 [_].[ProductSubcategoryKey] as [ProductSubcategoryKey],
                 [_].[ProductSubcategoryAlternateKey] as [ProductSubcategoryAlternateKey],
                 [_].[EnglishProductSubcategoryName] as [ProductSubcategoryName],
                 [_].[SpanishProductSubcategoryName] as [SpanishProductSubcategoryName],
                 [_].[FrenchProductSubcategoryName] as [FrenchProductSubcategoryName],
                 [_].[ProductCategoryKey] as [ProductCategoryKey]
             from [dbo].[DimProductSubcategory] as [_]
         )
     ) AS [t2]
GROUP BY 
     [t2].[ProductSubcategoryKey],
     [t2].[ProductSubcategoryName] 

Result:
Results_Query1

The query basically selects two columns from the DimProductSubcategory table:

  1. ProductSubcategoryKey – which is used in the join with DimProduct
  2. ProductSubcategoryName – which is the final name to be displayed in the visual

The inner sub-select (line 7-14) represents the original Power Query query. It selects all columns from the DimProductSubcategory table and renames [EnglishProductSubcategoryName] to [ProductSubcategoryName] (line 10). Any other Power Query steps that are supported in direct query like aggregations, groupings, filters, etc. would also show up here.

Query 2 (shortened):

SELECT TOP (1000001) 
     [semijoin1].[c67],
     SUM([a0]) AS [a0]
FROM 
     (
         (
             SELECT 
                 [t1].[ProductSubcategoryKey] AS [c29],
                 COUNT_BIG(*) AS [a0]
             FROM 
             (
                 (
                     select 
                         [$Table].[ProductKey] as [ProductKey],
                         [$Table].[ProductAlternateKey] as [ProductAlternateKey],
                         …
                     from [dbo].[DimProduct] as [$Table]
                 )
             ) AS [t1]
             GROUP BY [t1].[ProductSubcategoryKey]
     ) AS [basetable0]
inner join 
     (
         (SELECT N'Mountain Bikes' AS [c67],1 AS [c29] )  UNION ALL 
         (SELECT N'Road Bikes' AS [c67],2 AS [c29] )  UNION ALL 
         (SELECT N'Touring Bikes' AS [c67],3 AS [c29] )  UNION ALL 
         (SELECT N'Handlebars' AS [c67],4 AS [c29] )  UNION ALL 
         (SELECT N'Bottom Brackets' AS [c67],5 AS [c29] )  UNION ALL 
         …
         (SELECT null AS [c67],null AS [c29] ) 
     ) AS [semijoin1] 
     on [semijoin1].[c29] = [basetable0].[c29]
)
GROUP BY [semijoin1].[c67] 

(The query was shortened at line 16 and line 29 as the removed columns/rows are not relevant for the purpose of this example.)

Similar to Query 1 above, the innermost sub-select (line 13-17) in the FROM clause returns the results of the Power Query query for DimProduct whereas the outer sub-select (line 7-20) groups the result by the common join-key [ProductSubcategoryKey].
This result is then joined with a static table which is made up of hard-coded SELECTs and UNION ALLs (line 24-30). If you take a closer look, you will realize that this table actually represents the original result of Query 1! Additionally, it also includes a special NULL-item (line 30) that is used to handle non-matching entries.
The last step is to group the joined tables to obtain the final results.

Query 3 (shortened):

SELECT 
     COUNT_BIG(*) AS [a0]
FROM 
     (
         (
             select 
                 [$Table].[ProductKey] as [ProductKey],
                 [$Table].[ProductAlternateKey] as [ProductAlternateKey],
                 …
             from [dbo].[DimProduct] as [$Table]
         )
     ) AS [t1] 

(The query was shortened at line 9 as the removed columns/rows are not relevant for the purpose of this example.)

The last query is necessary to display the correct grand total across all products and product sub-categories.

As you can see, most of the “magic” happens in Query 2. The virtual join or virtualization is done by hard-coding the results of the remote table/data source directly into the SQL query of the current table/data source. This works fine as long as the results of the remote query are small enough – both in terms of number of rows and columns – but the more limiting factor is the number of rows. Roughly speaking, if you have more than a thousand items that are joined this way, the queries tend to get slow. In reality this will very much depend on your data, so I would highly recommend testing this with your own data!

I ran a simple test and created a join on the SalesOrderNumber which has about 27,000 distinct items. The query never returned any results and after having a look at the Performance Analyzer I realized that the query similar to Query 2 above was never executed. I do not know yet whether this is because of the large number of items and the very long SQL query that is generated (27,000 times SELECT + UNION ALL !!!) or a bug.

At this point you may ask yourself if it makes sense to use Power BI for data virtualization or use another tool that was explicitly designed for this scenario (just google for “data virtualization”). These other tools may perform better even on higher-volume data but they will also reach their limits if the joins get too big and, what is even more important, they are usually quite expensive.

So I think that Power BI is still a viable solution for data virtualization if you keep the following things in mind:
– keep the items in the join columns at a minimum
– use Power Query to pre-aggregate the data if possible
– don’t expect too much in terms of performance
– only use it when you know what you are doing 🙂

Downloads:

PowerBI_DataVirtualization_Part2.pbix
SQL_Query1.sql
SQL_Query2.sql
SQL_Query3.sql

Power BI – Dynamic TopN + Others with Drill-Down

A very common requirement in reporting is to show the Top N items (products, regions, customers, …) and this can also be achieved in Power BI quite easily.

But lets start from the beginning and show how this requirement usually evolves and how to solve the different stages.

The easiest thing to do is to simply resize the visual (e.g. table visual) to only show 5 rows and sort them descending by your measure:

This is very straight forward and I do not think it needs any further explanation.

The requirement that usually comes up next is that the customer wants to control how many Top items to show. So we implement a slicer and make the whole calculation dynamic as described here:
SQL BI – Use of RANKX in a Power BI measure
FourMoo – Dynamic TopN made easy with What-If Parameter

Again, this works pretty well and is explained in detail in the blog posts.

Once you have implemented this change, the business users usually complain that the Total is wrong. This depends on how you implemented the TopN measure and what the users actually expect. I have seen two scenarios that cause confusion:
1) The Total is the SUM of the TopN items only – not reflecting the actual Grand Total
2) The Total is NOT the SUM of the TopN items only – people complaining that Power BI does not sum up correctly

As I said, this pretty much depends on the business requirements and after discussing it at length with the users, the solution is usually to simply add an “Others” row that sums up all values which are not part of the TopN items. For regular business users this requirement sounds really trivial because in Excel they could just add a new row and subtract the values of the TopN items from the Grand Total.

However, they usually will not understand the complexity behind this requirement in Power BI. In Power BI we cannot simply add a new “Others” row on the fly. It has to be part of the data model and, as the TopN calculation is already dynamic, the calculation for “Others” also has to be dynamic. As you probably expected, this has also been covered already:
Oraylis – Show TopN and rest in Power BI
Power BI community – Dynamic Top N and Others category

These work fine even if I do not like the DAX as it is unnecessarily complex (from my point of view), but the general approach is the same as the one that I will show in this blog post and follows these steps:
1) create a new table in the data model (either with Power Query or DAX) that contains all our items that we want to use in our TopN calculation and an additional row for “Others”
2) link the new table also to the fact table, similar to the original table that contains your items
3) write a measure that calculates the rank for each item, filters the TopN items and assigns the rest to the “Others” item
4) use the new measure in combination with the new table/column in your visual

Step 1 – Create table with “Others” row

I used a DAX calculated table that does a UNION() of the existing rows for the TopN calculation and a static row for “Others”. I used ROW() first so I can specify the new column names directly. I further use ALLNOBLANKROW() to get rid of any blank rows.

Subcategory_wOthers = UNION(
ROW("SubcategoryKey_wOthers", -99, "SubcategoryName_wOthers", "Others"), 
ALLNOBLANKROW('ProductSubcategory'[ProductSubcategoryKey], 'ProductSubcategory'[SubcategoryName])
) 

Step 2 – Create Relationship

The new table is linked to the same table to which the original table was linked. This can be the fact-table directly or an intermediate table that then filters the facts in a second step (as shown below)

Step 3 – Create DAX measure

That’s actually the tricky part about this solution, but I think the code is still very easy to read and understand:

Top Measure ProductSubCategory =  
/* get the items for which we want to calculate TopN + Others */ 
VAR Items = SELECTCOLUMNS(ALL(Subcategory_wOthers), "RankItem", Subcategory_wOthers[SubcategoryName_wOthers]) 
/* add a measure that we use for ranking */ 
VAR ItemsWithValue = ADDCOLUMNS(Items, "RankMeasure", CALCULATE([Selected Measure], ALL(ProductSubcategory))) 
/* add a column with the rank of the measure within the items */ 
VAR ItemsWithRank = ADDCOLUMNS(ItemsWithValue, "Rank", RANKX(ItemsWithValue, [RankMeasure], [RankMeasure], DESC, Dense)) 
/* calculate whether the item is a Top-item or belongs to Others */ 
VAR ItemsWithTop = ADDCOLUMNS(ItemsWithRank, "TopOrOthers", IF([Rank] <= [Selected TopN], [RankItem], "Others")) 
/* select the final items for which the value is calculated */ 
VAR ItemsFinal = SELECTCOLUMNS( /* we only select a single column to be used with TREATAS() in the final filter */
     FILTER(
         ItemsWithTop, 
         CONTAINSROW(VALUES(Subcategory_wOthers[SubcategoryName_wOthers]), [TopOrOthers]) /* need to obey current filters on _wOthers table. e.g. after Drill-Down */
         && CONTAINSROW(VALUES(ProductSubcategory[SubcategoryName]), [RankItem])), /* need to obey current filters on base table */ 
     "TopN_Others", [RankItem]) 
RETURN      
CALCULATE(
    [Selected Measure], 
    TREATAS(ItemsFinal, Subcategory_wOthers[SubcategoryName_wOthers])
)
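For completeness, the two helper measures used above ([Selected Measure] and [Selected TopN]) could be defined along these lines – the actual definitions depend on your model, e.g. the name of your What-If parameter table:

Selected Measure = SUM('FactInternetSales'[SalesAmount])

Selected TopN = SELECTEDVALUE('TopN'[TopN Value], 5)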

Step 4 – Build Visual

One of the benefits of this approach is that it also allows you to use the “Others” value in slicers, for cross-filtering/-highlight and even in drill-downs. To do so we need to configure our visual with two levels. The first one is the column that contains the “Others” item and the second level is the original column that contains the items. The DAX measure will take care of the rest.

And that’s it! You can now use the column that contains the artificial “Others” in combination with the new measure wherever you like. In a slicer, in a chart or in a table/matrix!

The final PBIX workbook can also be downloaded: TopN_Others.pbix