PowerShell module for Databricks on Azure and AWS:Over the last year I worked a lot with Databricks on Azure and I have to say that I was (and still am) very impressed how well it works and how it integrates with other services of the Microsoft Azure Data Platform like Data Lake Store, Data Factory, etc. Some of the projects I worked on also included CI/CD like pipelines using Azure DevOps where Databricks did not really shine so bright in the beginning. There are no native tasks for it or anything. But this is OK as for those scenarios, where you need to automate/script something, Databricks offers Continue reading PowerShell module for Databricks on Azure and AWS→
Storing Images in a PowerBI/Analysis Services Data Models:As some of you probably remember, when PowerPivot was still only available in Excel and Power Query did not yet exist, it was possible to load images from a database (binary column) directly into the data model and display them in PowerView. Unfortunately, this feature did not work anymore in PowerBI Desktop and the only way to display images in a visual was to provide the URL of the image which is public accessible. The visual would then grab the image on-the-fly from the URL and render it. This of course has various drawbacks: The image needs to be available Continue reading Storing Images in a PowerBI/Analysis Services Data Models→
Deploying an Azure Data Factory project as ARM Template:In my last post I wrote about how to Debug Custom .Net Activities in Azure Data Factory locally. This fixes one of the biggest issues in Azure Data Factory at the moment for developers. The next bigger problem that you will run into is when it comes to deploying your Azure Data Factory project. At the moment, you can only do it manually from Visual Studio which, for bigger projects, can take quite some time. So I extended and advanced the code from my CustomActivityDebugger. Well, actually I rewrote some major parts of it and moved it into a new Continue reading Deploying an Azure Data Factory project as ARM Template→
Error-handling in Power Query:Data is the daily bread-and-butter for any analyst. In order to provide good results you also need good data. Sometimes this data is very well prepared beforehand and you can use it as it is but it is also very common that you need to prepare and transform the data on your own. To do this Microsoft has introduced Power Query (on tool of the Power BI suite). Power Query can be used to extract, transform and load data into Excel and/or Power Pivot directly. When using any data you usually know what the data looks like and what to Continue reading Error-handling in Power Query→
Foreword: The approach described in this blog post only uses the Databricks REST API and therefore should work with both, Azure Databricks and also Databricks on AWS!
It recently had to migrate an existing Databricks workspace to a new Azure subscription causing as little interruption as possible and not loosing any valuable content. So I thought a simple Move of the Azure resource would be the easiest thing to do in this case. Unfortunately it turns out that moving an Azure Databricks Service (=workspace) is not supported:
Resource move is not supported for resource types ‘Microsoft.Databricks/workspaces’. (Code: ResourceMoveNotSupported)
I do not know what is/was the problem here but I did not have time to investigate but instead needed to come up with a proper solution in time. So I had a look what needs to be done for a manual export. Basically there are 5 types of content within a Databricks workspace:
Workspace items (notebooks and folders)
Security (users and groups)
For all of them an appropriate REST API is provided by Databricks to manage and also exports and imports. This was fantastic news for me as I knew I could use my existing PowerShell module DatabricksPS to do all the stuff without having to re-invent the wheel again. So I basically extended the module and added new Import and Export functions which automatically process all the different content types:
They can be further parameterized to only import/export certain artifacts and how to deal with updates to already existing items. The actual output of the export looks like this and of course you can also modify it manually to your needs – all files are in JSON except for the notebooks which are exported as .DBC file by default:
A very simple sample code doing and export and an import into a different environment could look like this:
Having those scripts made the whole migration a very easy task. In addition, these new cmdlets can also be used in your Continuous Integration/Continuous Delivery (CI/CD) pipelines in Azure DevOps or any other CI/CD tool!
So just download the latest version from the PowerShell gallery and give it a try!
In my previous post I showed how you can use Microsoft Power BI to create a Data Virtualization layer on top of multiple relational data sources querying them all at the same time through one common model. As I already mentioned in the post and what was also pointed out by Adam Saxton (b, t) in the comments is the fact, that this approach can cause serious performance problems at the data source and also on the Power BI side. So in this post we will have a closer look on what actually happens in the background and which queries are executed when you join different data sources on-the-fly.
We will use the same model as in the previous post (you can download it from there or at the end of this post) and run some basic queries against it so we can get a better understanding of the internals. Here is our relationship diagram again as a reference. Please remember that each table comes from a different SQL server:
In our test we will simply count the number of products for each Product Subcategory:
Even though this query only touches two different data sources, it is a good way to analyze the queries sent to the data sources. To track these queries I used the built-in Performance Analyzer of Power BI desktop which can be enabled on the “View”-tab. It gives you detailed information about the performance of the report including the actual SQL queries (under “Direct query”) which were executed on the data sources. The plain text queries can also be copied using the “Copy queries” link at the bottom. In our case 3 SQL queries were executed against 2 different SQL databases:
The query basically selects two columns from the DimProductSubcategory table:
ProductSubcategoryKey – which is used in the join with DimProduct
ProductSubcategoryName – which is the final name to be displayed in the visual
The inner sub-select (line 7-14) represents the original Power Query query. It selects all columns from the DimProductSubcategory table and renames [EnglishProductSubcagetoryName] to [ProductSubcategoryName] (line 10). Any other Power Query steps that are supported in direct query like aggregations, groupings, filters, etc. would also show up here.
(SELECTN'Mountain Bikes'AS[c67],1AS[c29])UNION ALL
(SELECTN'Road Bikes'AS[c67],2AS[c29])UNION ALL
(SELECTN'Touring Bikes'AS[c67],3AS[c29])UNION ALL
(SELECTN'Bottom Brackets'AS[c67],5AS[c29])UNION ALL
(The query was shortened at line 16 and line 29 as the removed columns/rows are not relevant for the purpose of this example.)
Similar to Query 1 above, the innermost sub-select (line 13-17) in the FROM clause returns the results of the Power Query query for DimProduct whereas the outer sub-select (line 7-20) groups the result by the common join-key [ProductSubcategoryKey]. This result is then joined with a static table which is made up from hard-coded SELECTs and UNION ALLs (line 24-30). If you take a closer look, you will realize that this table actually represents the original result of Query 1! Additionally it also includes a special NULL-item (line 30) that is used to handle non-matching entries. The last step is to group the joined tables to obtain the final results.
(The query was shortened at line 9 as the removed columns/rows are not relevant for the purpose of this example.)
The last query is necessary to display the correct grand total across all products and product sub-categories.
As you can see, most of the “magic” happens in Query 2. The virtual join or virtualization is done by hard-coding the results of the remote table/data source directly into the SQL query of the current table/data source. This works fine as long as the results of the remote query are small enough – both, in terms of numbers of rows and columns – but the more limiting factor is the number of rows. Roughly speaking, if you have more than thousand items that are joined this way, the queries tend to get slow. In reality this will very much depend on your data so I would highly recommend to test this with your own data!
I ran a simple test and created a join on the SalesOrderNumber which has about 27,000 distinct items. The query never returned any results and after having a look at the Performance Analyzer I realized, that the query similar to Query 2 above was never executed. I do not know yet whether this is because of the large number of items and the very long SQL query that is generated (27,000 times SELECT + UNION ALL !!!) or a bug.
At this point you may ask yourself if it makes sense to use Power BI for data virtualization or use another tool that was explicitly designed for this scenario. (Just google for “data virtualization”). These other tools may perform better even on higher volume data but they will also reach their limits if the joins get too big and, what is even more important, the are usually quite expensive.
So I think that Power BI is still a viable solution for data virtualization if you keep the following things in mind: – keep the items in the join columns at a minimum – use Power Query to pre-aggregate the data if possible – don’t expect too much in terms of performance – only use it when you know what you are doing 🙂
Data Virtualization is actually a very new topic to me as I have barely seen it implemented in the real world or at any of my customers. But it becomes more and more interesting when working with big data where you cannot simply load all data into a single in-memory data model but still need to query across different data sources. So I decided to investigate how this could be done with my favorite reporting tool Power BI which I know is capable to connect to different data sources out of the box and also provides a rich set of visualizations that I need.
But let’s start slowly.
What is Data Virtualization?
According to Wikipedia, “Data virtualization is any approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted at source, or where it is physically located, and can provide a single customer view (or single view of any other entity) of the overall data.”
So basically, combining data from multiple sources and multiple formats into a common semantic layer which can be queried on-the-fly without the need of any ETL/ELT.
Sounds awesome – right?
The problem is that in reality the things are not as simple as they may sound, especially when it comes to joining across the different sources. While data virtualization usually works fine for small amounts of data that can be easily processed, it can be quite challenging on large amounts of data which is where data virtualization would actually make sense to avoid lengthy and costly ETL/ELT.
What does Power BI have to do with this?
At first sight – nothing. But lets examine what we currently have in Power BI:
a semantic layer and data modelling capabilities
access to various data sources via Direct Query (remember, we do not want to load any data!)
ability to combine data from those sources
The last part is the most important one here and you may wonder what I am talking about. And you are right, by default a DQ model is only linked to one data source at a time but you can add other data sources manually in the Power Query editor! This is where it get’s interesting and what this blog post is about.
To verify the statements from above I built a little test case that involves 2 local database and an Azure SQL database over which I want to create a semantic layer using Power BI. All three databases are actually the same AdventureWorksDW databases but for the purpose of this demonstration this is OK as it is just a technical feasibility study.
I started by creating a new Power BI file and connected it to my first data base in Direct Query mode.
Then I selected a single table, in my case I have chosen “FactResellerSales”.
The next step is to add a new table using the “Edit Queries” button on the ribbon:
You will see the one table you have just selected before as a Power Query query which you can simply copy and rename. I renamed it to “DimProduct” as I want to load the DimProduct table from my second local database which can be accomplished by simply changing the connection to the SQL database in the first step “Source” of the query (I use my second local database AdventureWorksDW2014 – instead of AdventureWorksDW2012):
Several things to point out here:
once you change the name of the server or the database, you may get prompted for credentials
when you click on Table in the Data column, Power BI asks you if you want to replace the next step – simply press [Yes]
at the “Source” step, Power BI will complain that the results of the current steps are not valid in Direct Query mode – this is fine as the final result will be in the next step (“Navigation”)
I repeated the same procedure again and also added the table “DimProductSubcategory” from my Azure SQL database. So right now we have 3 Direct Query tables pointing to 3 different databases and 3 different tables.
Now we need to connect our tables in the Relationship-view – similar as you would to with any other tables in a regular Direct Query setup:
Once the relationships are created, we can finally create our reports.
Are you excited? Well, I definitely was when I tried this setup the first time!
The visuals behave as if they were created on top of an Import Mode dataset or a Direct Query dataset that only connects to a single database. I think that is pretty awesome and again shows what Power BI is capable of!
We just used Power BI to create a semantic layer across different databases and tables which are now all joined and queried on-the-fly always showing the most recent data!
As you have seen, you Power BI allows us to combine multiple SQL databases in Direct Query mode and query them together as if they would be one single data source. So if you have a requirement where your data is distributed across databases and you it is too big to be loaded into memory or you need live data, you can give this approach a try. I have only tested it with regular Microsoft SQL databases but I assume that this works in a similar way with any other data source that supports Direct Query (e.g. SAP, Oracle, Spark, …) too. You can also do some basic transformations before joining the data in Direct Query mode which can also be very crucial when combining different data sources that might have slightly different formats. Again, I have not tested this thoroughly but at least everything that can be query folded should be supported as a transformation.
In a follow-up post I will explain the technical details and what actually happens in the background when you use a setup like this so stay tuned!
Again, this works pretty well and is explained in detail in the blog posts.
Once you have implemented this change the business users usually complain that Total is wrong. This depends on how you implemented the TopN measure and what the users actually expect. I have seen two scenarios that cause confusion: 1) The Total is the SUM of the TopN items only – not reflecting the actual Grand Total 2) The Total is NOT the SUM of the TopN items only – people complaining that Power BI does not sum up correctly
As I said, this pretty much depends on the business requirements and after discussing that in length with the users, the solution is usually to simply add an “Others” row that sums up all values which are not part of the TopN items. For regular business users this requirement sounds really trivial because in Excel the could just add a new row and subtract the values of the TopN items from the Grand Total.
These work fine even if I do not like the DAX as it is unnecessarily complex (from my point of view) but the general approach is the same as the one that will I show in this blog post and follows these steps: 1) create a new table in the data model (either with Power Query or DAX) that contains all our items that we want to use in our TopN calculation and an additional row for “Others” 2) link the new table also to the fact table, similar to the original table that contains your items 3) write a measure that calculates the rank for each item, filters the TopN items and assigns the rest to the “Others” item 4) use the new measure in combination with the new table/column in your visual
Step 1 – Create table with “Others” row
I used a DAX calculated table that does a UNION() of the existing rows for the TopN calculation and a static row for “Others”. I used ROW() first so I can specify the new column names directly. I further use ALLNOBLANKROW() to remove to get rid of any blank rows.
The new table is linked to the same table to which the original table was linked to. This can be the fact-table directly or an intermediate table that then filters the facts in a second step (as shown below)
Step 3 – Create DAX measure
That’s actually the tricky part about this solution, but I think the code is still very easy to read and understand:
Top Measure ProductSubCategory=
/* get the items for which we want to calculate TopN + Others */
One of the benefits of this approach is that it also allows you to use the “Others” value in slicers, for cross-filtering/-highlight and even in drill-downs. To do so we need to configure our visual with two levels. The first one is the column that contains the “Others” item and the second level is the original column that contains the items. The DAX measure will take care of the rest.
And that’s it! You can now use the column that contains the artificial “Others” in combination with the new measure wherever you like. In a slicer, in a chart or in a table/matrix!
I frequently work on projects where we have multiple tiers on which our solution is deployed to using continuous integration / continuous deployment (CI / CD) pipelines in Azure DevOps. Once everything is deployed, you also need to monitor these different environments and check the status of the data or ETL pipelines. My tool of choice is usually Power BI desktop as it allows me to connect to e.g. SQL databases very easily. However, I always ended up creating a multiple Power BI files – one for each environment.
Having multiple files results in a lot of overhead when it comes to maintenance and also managing these files. Fortunately, I came across this little trick when I was investigating in composite models and aggregations that I am going to explain in this blog post.
To be honest, I barely used Power BI Direct Query in past and so maybe this feature has been there for quite some time without me realizing it but It may also be that it was introduced just recently with composite models.
So the “feature” is, that you can also use Query Parameters to parameterize your Direct Query queries. This is pretty awesome if you think of it for a second:
easy switching between databases
use one file for all environments
only maintain a single file
no need to import/load any data
The configuration within Power Query is also quite easy – simply replacing the hard coded values with the ones from the parameters:
And that’s it already! you can now easily switch between different databases by just using Power BI parameters and the Direct Query connection will change automatically to the new server/database.
Of course, all the target servers/databases have to have the same schema otherwise, you will get an error.
Caveats: Even though this looks quite trivial, there are some caveats which makes me believe this is not fully supported yet. You may noticed above already that in Power Query, when going to the step that actually queries the database, it complains about that this step would cause the whole table to be converted to Import Mode. However, you can just ignore it and go on with the next step to remain in Direct Query Mode.
It seems like Power BI keeps track from where a table was originally imported. So if you want to add a new table, make sure to copy an existing Direct Query table and change it accordingly instead of going to “New Source > …” !
Also, you need to make sure that you have entered the credentials for the different source databases at least once – otherwise Power BI will ask you when you query the database the first time. This is also the reason why this does not work so well in the Power BI service as changing the parameters there is not as simple as it is in Power BI desktop.
As I said, I do not know if this is a new feature (or a feature at all), but it is definitely helpful for certain scenarios.
Over the last year I worked a lot with Databricks on Azure and I have to say that I was (and still am) very impressed how well it works and how it integrates with other services of the Microsoft Azure Data Platform like Data Lake Store, Data Factory, etc.
Some of the projects I worked on also included CI/CD like pipelines using Azure DevOps where Databricks did not really shine so bright in the beginning. There are no native tasks for it or anything. But this is OK as for those scenarios, where you need to automate/script something, Databricks offers a REST API (Azure, AWS).
As most of our deployments use PowerShell I wrote some cmdlets to easily work with the Databricks API in my scripts. These included managing clusters (create, start, stop, …), deploying content/notebooks, adding secrets, executing jobs/notebooks, etc. After some time I ended up having 20+ single scripts which was not really maintainable any more. So I packed them into a PowerShell module and also published it to the PowerShell Gallery (https://www.powershellgallery.com/packages/DatabricksPS) for everyone to use!
The module works for Databricks on Azure and also if you run Databricks on AWS – fortunately the API endpoints are almost identical.
The usage is quite simple as for any other PowerShell module:
Install it using Install-Module cmdlet
Setup the Databricks environment using API key and endpoint URL
All the cmdlets are documented and contain links to official documentation of the Rest API call used by the cmdlet. Some API endpoints support different variations of parameters – this was implemented using different parameter sets in PowerShell. There are still some ongoing tests (especially on AWS) and improvements but I general all cmdlets work as expected. I hope this helps anyone else who also has to deal with the Databricks APIs frequently or has to integrate it in a CI/CD pipeline.
Azure Data Factory v2 is Microsoft Azure’s Platform as a Service (PaaS) solution to schedule and orchestrate data processing jobs in the cloud. As the name implies, this is already the second version of this kind of service and a lot has changed since its predecessor. One of these things is how datasets and pipelines are parameterized and how these parameters are passed between the different objects. The basic concepts behind this process are well explained by the MSDN documentation – for example Create a trigger that runs a pipeline on a schedule. In this example an trigger is created that runs a pipeline every 15 minute and passes the property “scheduledTime” of the trigger to the pipeline. This is the JSON expression that is used:
@trigger() basically references the object that is returned by the trigger and it seems that this object has a property called “scheduledTime”. So far so good, this is documented and fulfills the basic needs. Some of these properties are also documented here: System variables supported by Azure Data Factory but unfortunately not all of them.
So sometimes this trigger objects can be much more complex and also contain additional information that may not be documented. This makes it pretty hard for the developer to actually know which properties exist and how they could be used. A good example are Event-Based Triggers which were just recently introduced where the documentation only mentions the properties “fileName” and “folderPath” but it contains much more (details see further down). For simplicity I will stick to scheduled triggers at this point but the very same concept applies to all kinds of triggers and actually also to all other internal objects like @pipeline(), @dataset() or @activity() as well!
So how can you investigate those internal objects like @trigger() and see what they actually look like? Well, the answer is quite simple – just pass the object itself without any property to the pipeline. The target parameter of the pipeline can either be of type String or Object. This allows you to see the whole object on the Monitoring-page once the pipeline is triggered:
For the Scheduled-trigger, the object looks like this:
And as you can guess, you can pass any of these properties to the pipeline using the syntax “@trigger().<property_name>” or even the whole object! The syntax can of course also be combined with all the built-in expressions.
This should hopefully make it easier for you to build and debug more complex Azure Data Factory v2 pipelines!
Below you can find an example of the object that a Event-Based Trigger creates:
I just had the request to expose the UniqueNames of an Analysis Services Multidimensional cube in PowerBI. You may ask why I would want to do this and the answer is actually pretty simple: In SSAS MD the caption of elements/members even within the same attribute is not necessarily unique. This is because of the Key/Name concept where the elements/members are grouped by the Key but for the end-user a proper Name is displayed. So if you happen to have duplicate Names in your cube, import the values into PowerBI you will end up with less rows (and wrong values!) compared to the original SSAS MD cube because PowerBI (and also Analysis Services Tabular) does not have a Key/Name concept and therefore the grouping and what is display is always the same.
Having worked quite a lot with SSAS MD in the past I knew that every attribute member contains various internal properties, one of them being the UniqueName, which, as the name implies, is the unique identifier for each member regardless of the caption displayed for that member. And that’s exactly what I needed in this scenario. So the question is how to get this information in PowerBI as this is nothing that should usually be exposed to an end-user.
Anyway, I started to dig into this topic and made some this. Basically this is what I want to achieve:
For my sample I used to Adventure Works MD cube, opened it in PowerBI using Import-Mode and just selected the [Product].[Subcategory] hierarchy:
The UI is quite limited here and you can only select hierarchies and measures. However, getting the UniqueName of a given hierarchy can be achieved quite easily in a subsequent step by adding a new custom column:
And that’s already all you need to do. The column [Product.Subcategory] contains various information, one of them being the UniqueName of the product subcategory which can be accessed by the Cube.AttributeMemberId function.
My next step was to try to get some other properties in a similar way using the Cube.AttributeMemberProperty function. According to the documentation it is quite similar to Cube.AttributeMemberId but takes an additional parameter where you can define which property you want to retrieve. As the [Product].[Subcategory] hierarchy has a property called “Category” I tried this:
This caused a huge error in PowerBI desktop and so I tried different styles to define the property:
None of these worked though, neither for the regular properties nor for the internal ones. The main problem seems to be that the MDX query executed does not query any other properties except for the UniqueName not even if you specify them manually in your PowerQuery script. This means that so far there is no way to access member properties from within PowerBI. There is already a user voice where you can vote for this: https://ideas.powerbi.com/forums/265200-power-bi-ideas/suggestions/12443955-member-properties-ssas
Download:PowerBI_UniqueNames.pbix This PowerBI Desktop model contains all samples from above including the my failed tries for the properties!
As some of you probably remember, when PowerPivot was still only available in Excel and Power Query did not yet exist, it was possible to load images from a database (binary column) directly into the data model and display them in PowerView. Unfortunately, this feature did not work anymore in PowerBI Desktop and the only way to display images in a visual was to provide the URL of the image which is public accessible. The visual would then grab the image on-the-fly from the URL and render it. This of course has various drawbacks:
The image needs to be available via a public URL (e.g. upload it first to an Azure Blob Store)
The image cannot be displayed when you are offline
The link may break in the future or point to a different image as initially when the model was built
Until today I was sure that we have to live with this limitation but then I came across this blog post from Jason Thomas aka SqlJason. He shows a workaround to store images directly in the PowerBI data model and display them in the report as if they were regular images loaded from an URL. This is pretty awesome and I have to dedicate at least 99.9% of this blog post to Jason and his solution!
However, with this blog post I would like to take Jasons’ approach a step further. He creates the Base64 string externally and hardcodes it in the model using DAX. This has some advantages (static image, no external dependency anymore, …) but also a lot of disadvantages (externally create the Base64 string, manually copy&paste the Base64 string for each image, hard to maintain, cannot dynamically add images …). For scenarios where you have a local folder with images, a set of [private] URLs pointing to images or images stored in a SQL table (as binary) which you want to load into your PowerBI data model, this whole process should be automated and ideally done within PowerBI.
Fortunately, this turns out to be quite simple! Power Query provides a native function to convert any binary to a Base64 encoded string: Binary.ToText() . The important part to point out here is to use the second parameter which allows you to set the encoding of the resulting text. It supports two values: BinaryEncoding.Base64 (default) and BinaryEncoding.Hex. Once we have the Base64 string, we simply need to prefix it with the following meta data: “data:image/jpeg;base64, “
To make it easy, I wrote to two custom PowerQuery functions which convert and URL or a binary image to the appropriate string which can be used by PowerBI:
If your images reside in a local folder, you can simply load them using the “Folder” data source. This will give you a list of all images and and their binary content as separate column. Next add a new Custom Column where you call the above function to convert the binary to a prefixed Base64 string which can then be displayed in PowerBI (or Analysis Services) as a regular image. Just make sure to also set the Data Category of the column to “Image URL”:
And that’s it, now your visual will display the image stored in the data model without having to access any external resources!
Caution: As Jason also mentions at the end of his blog post, there is an internal limitation about the size of a text column. So this may cause issues when you try to load high-resolution images! In this case, simply lower the size/quality of the images before you load them. UPDATE May 2019: Chris Webb provides much more information and a solution(!) to this issue in his blog post: https://blog.crossjoin.co.uk/2019/05/19/storing-large-images-in-power-bi-datasets
This PowerBI Desktop model contains all samples from above including the PowerQuery functions!
As you probably know from my last blog post, I am currently upgrading the PowerBI reporting platform of one of my customer from a PowerBI backend (dataset hosted in PowerBI service) to an Azure Analysis Services backend. The upgrade/import of the dataset into Azure Analysis Services itself worked pretty flawless and after switching the connection of the reports everything worked as expected and everyone was happy. However, things got a bit tricky when it came to automatically refreshing the Azure Analysis Services database which was based on an Azure Data Lake Store. For the original PowerBI dataset, this was pretty straight forward as a scheduled refresh from an Azure Data Lake store data source works out of the box. For Azure Analysis Services this is a bit different.
When you build and deploy your data model from Visual Studio, your are prompted for the credentials to access ADLS which are then stored in the data source object of AAS. As you probably know, AAS uses OAuth authentication to access data from ADLS. And this also causes a lot of problems. OAuth is based on tokens and those tokens are only valid for a limited time, by default this is 2 hours. This basically means, that you can process your database for the next 2 hours and it will fail later on with an error message saying that the token expired. (The above applies to all OAuth sources!)
This problem is usually solved by using an Azure Service Principal instead of a regular user account where the token does not expire. Unfortunately, this is not supported at the moment for ADLS data sources and you have to work around this issue.
IMPORTANT NOTE: NONE OF THE FOLLOWING IS OFFICIALLY SUPPORTED BY MICROSOFT !!!
So the current situation that we need to solve is as follows:
we can only use regular user accounts to connect AAS to ADLS as service principals are not supported yet
the token expires after 2 hours
the database has to be processed on a regular basis (daily, hourly, …) without any manual interaction
manually updating the token is (of course) not an option
Back to our example – as we were already using Azure Automation for some other tasks, we decided to also use it here. Also, PowerShell integrates very well with other Azure components and was the language of choice for us. To accomplish our goal we had to implement 3 steps:
acquire a new OAuth token
update the ADLS data source with the new token
run our processing script
I could copy the code for the first step more or less from one of my older blog post (here) where I used PowerShell to acquire an OAuth token to trigger a refresh in PowerBI.
The second step is to update ADLS data source of our Azure Analysis Services model. To get started, the easiest thing to do is to simply open the AAS database in SQL Server Management Studio and let it script the existing datasource for you:
The resulting JSON will look similar to this one:
The important part for us is the “credential” field. It contains all the information necessary to authenticate against our ADLS store. However, most of this information is sensitive so only asterisks are displayed in the script. The rest of the JSON (except for the “credential” field) is currently hardcoded in the PowerShell cmdlet so if you want to use it, you need to change this manually!
The PowerShell cmdlet then combines the hardcoded part with an updated “credential”-field which is obtained by invoking a REST request to retrieve a new OAuth token. The returned object is modified a bit in order to match the required JSON for the datasource.
Once we have our final JSON created, we can send it to our Azure Analysis Services instance by calling the Invoke-ASCmd cmdlet from the SqlServer module.
Again, please see the original blog post mentioned above for the details of this approach.
After we have updated our datasource, we can simply call our regular processing commands which will then be executed using the newly updated credentials.
The script I wrote allows you to specify which objects to process in different ways:
whole database (by leaving AASTableName and AASPartitionName empty)
a single or multiple table and all its partitions (by leaving only AASPartitionName empty)
or multiple partitions of a single table (by specifying exactly one AASTableName and multiple AASPartitionNames
If multiple tables or partitions are specified, the elements are separated by commas (“,”)
So to make the Runbook work in your environment, follow all the initial steps as described in the original blog post from Microsoft. In addition, you also need to create an Application (Type = “Native”) in your Azure Active Directory to obtain the OAuth token programmatically. This application needs the “Sign in and read user profile” permission from the API “Windows Azure Active Directory (Microsoft.Azure.ActiveDirectory)”:
Also remember the ApplicationID, it will be used as a parameter for the final PowerShell Runbook (=parameter “ClientID”!
When it comes to writing the PowerShell code, simply use the code from the download at the end of this blog post.
For the actual credential that you are using, make sure that it has the following permissions:
to update the AAS datasource (can be set in the AAS model or for the whole server)
has access to the required ADLS files/folders which are processed (can be set e.g. via ADLS Data Explorer)
(if you previously used your own account to do all the AAS and ADLS development, this should work just fine)
In general, a similar approach should work for all kinds of datasources that require OAuth authentication but so far I have only tested it with Azure Data Lake Store!