Foreword: The approach described in this blog post only uses the Databricks REST API and therefore should work with both, Azure Databricks and also Databricks on AWS!
It recently had to migrate an existing Databricks workspace to a new Azure subscription causing as little interruption as possible and not loosing any valuable content. So I thought a simple Move of the Azure resource would be the easiest thing to do in this case. Unfortunately it turns out that moving an Azure Databricks Service (=workspace) is not supported:
Resource move is not supported for resource types ‘Microsoft.Databricks/workspaces’. (Code: ResourceMoveNotSupported)
I do not know what is/was the problem here but I did not have time to investigate but instead needed to come up with a proper solution in time. So I had a look what needs to be done for a manual export. Basically there are 5 types of content within a Databricks workspace:
Workspace items (notebooks and folders)
Security (users and groups)
For all of them an appropriate REST API is provided by Databricks to manage and also exports and imports. This was fantastic news for me as I knew I could use my existing PowerShell module DatabricksPS to do all the stuff without having to re-invent the wheel again. So I basically extended the module and added new Import and Export functions which automatically process all the different content types:
They can be further parameterized to only import/export certain artifacts and how to deal with updates to already existing items. The actual output of the export looks like this and of course you can also modify it manually to your needs – all files are in JSON except for the notebooks which are exported as .DBC file by default:
A very simple sample code doing and export and an import into a different environment could look like this:
Having those scripts made the whole migration a very easy task. In addition, these new cmdlets can also be used in your Continuous Integration/Continuous Delivery (CI/CD) pipelines in Azure DevOps or any other CI/CD tool!
So just download the latest version from the PowerShell gallery and give it a try!
Over the last year I worked a lot with Databricks on Azure and I have to say that I was (and still am) very impressed how well it works and how it integrates with other services of the Microsoft Azure Data Platform like Data Lake Store, Data Factory, etc.
Some of the projects I worked on also included CI/CD like pipelines using Azure DevOps where Databricks did not really shine so bright in the beginning. There are no native tasks for it or anything. But this is OK as for those scenarios, where you need to automate/script something, Databricks offers a REST API (Azure, AWS).
As most of our deployments use PowerShell I wrote some cmdlets to easily work with the Databricks API in my scripts. These included managing clusters (create, start, stop, …), deploying content/notebooks, adding secrets, executing jobs/notebooks, etc. After some time I ended up having 20+ single scripts which was not really maintainable any more. So I packed them into a PowerShell module and also published it to the PowerShell Gallery (https://www.powershellgallery.com/packages/DatabricksPS) for everyone to use!
The module works for Databricks on Azure and also if you run Databricks on AWS – fortunately the API endpoints are almost identical.
The usage is quite simple as for any other PowerShell module:
Install it using Install-Module cmdlet
Setup the Databricks environment using API key and endpoint URL
All the cmdlets are documented and contain links to official documentation of the Rest API call used by the cmdlet. Some API endpoints support different variations of parameters – this was implemented using different parameter sets in PowerShell. There are still some ongoing tests (especially on AWS) and improvements but I general all cmdlets work as expected. I hope this helps anyone else who also has to deal with the Databricks APIs frequently or has to integrate it in a CI/CD pipeline.
As you probably know from my last blog post, I am currently upgrading the PowerBI reporting platform of one of my customer from a PowerBI backend (dataset hosted in PowerBI service) to an Azure Analysis Services backend. The upgrade/import of the dataset into Azure Analysis Services itself worked pretty flawless and after switching the connection of the reports everything worked as expected and everyone was happy. However, things got a bit tricky when it came to automatically refreshing the Azure Analysis Services database which was based on an Azure Data Lake Store. For the original PowerBI dataset, this was pretty straight forward as a scheduled refresh from an Azure Data Lake store data source works out of the box. For Azure Analysis Services this is a bit different.
When you build and deploy your data model from Visual Studio, your are prompted for the credentials to access ADLS which are then stored in the data source object of AAS. As you probably know, AAS uses OAuth authentication to access data from ADLS. And this also causes a lot of problems. OAuth is based on tokens and those tokens are only valid for a limited time, by default this is 2 hours. This basically means, that you can process your database for the next 2 hours and it will fail later on with an error message saying that the token expired. (The above applies to all OAuth sources!)
This problem is usually solved by using an Azure Service Principal instead of a regular user account where the token does not expire. Unfortunately, this is not supported at the moment for ADLS data sources and you have to work around this issue.
IMPORTANT NOTE: NONE OF THE FOLLOWING IS OFFICIALLY SUPPORTED BY MICROSOFT !!!
So the current situation that we need to solve is as follows:
we can only use regular user accounts to connect AAS to ADLS as service principals are not supported yet
the token expires after 2 hours
the database has to be processed on a regular basis (daily, hourly, …) without any manual interaction
manually updating the token is (of course) not an option
Back to our example – as we were already using Azure Automation for some other tasks, we decided to also use it here. Also, PowerShell integrates very well with other Azure components and was the language of choice for us. To accomplish our goal we had to implement 3 steps:
acquire a new OAuth token
update the ADLS data source with the new token
run our processing script
I could copy the code for the first step more or less from one of my older blog post (here) where I used PowerShell to acquire an OAuth token to trigger a refresh in PowerBI.
The second step is to update ADLS data source of our Azure Analysis Services model. To get started, the easiest thing to do is to simply open the AAS database in SQL Server Management Studio and let it script the existing datasource for you:
The resulting JSON will look similar to this one:
The important part for us is the “credential” field. It contains all the information necessary to authenticate against our ADLS store. However, most of this information is sensitive so only asterisks are displayed in the script. The rest of the JSON (except for the “credential” field) is currently hardcoded in the PowerShell cmdlet so if you want to use it, you need to change this manually!
The PowerShell cmdlet then combines the hardcoded part with an updated “credential”-field which is obtained by invoking a REST request to retrieve a new OAuth token. The returned object is modified a bit in order to match the required JSON for the datasource.
Once we have our final JSON created, we can send it to our Azure Analysis Services instance by calling the Invoke-ASCmd cmdlet from the SqlServer module.
Again, please see the original blog post mentioned above for the details of this approach.
After we have updated our datasource, we can simply call our regular processing commands which will then be executed using the newly updated credentials.
The script I wrote allows you to specify which objects to process in different ways:
whole database (by leaving AASTableName and AASPartitionName empty)
a single or multiple table and all its partitions (by leaving only AASPartitionName empty)
or multiple partitions of a single table (by specifying exactly one AASTableName and multiple AASPartitionNames
If multiple tables or partitions are specified, the elements are separated by commas (“,”)
So to make the Runbook work in your environment, follow all the initial steps as described in the original blog post from Microsoft. In addition, you also need to create an Application (Type = “Native”) in your Azure Active Directory to obtain the OAuth token programmatically. This application needs the “Sign in and read user profile” permission from the API “Windows Azure Active Directory (Microsoft.Azure.ActiveDirectory)”:
Also remember the ApplicationID, it will be used as a parameter for the final PowerShell Runbook (=parameter “ClientID”!
When it comes to writing the PowerShell code, simply use the code from the download at the end of this blog post.
For the actual credential that you are using, make sure that it has the following permissions:
to update the AAS datasource (can be set in the AAS model or for the whole server)
has access to the required ADLS files/folders which are processed (can be set e.g. via ADLS Data Explorer)
(if you previously used your own account to do all the AAS and ADLS development, this should work just fine)
In general, a similar approach should work for all kinds of datasources that require OAuth authentication but so far I have only tested it with Azure Data Lake Store!
So besides the scheduled and manual refreshes from within the PowerBI service directly, we now have a third option to trigger refreshes but this time also from an external caller! This itself is already pretty awesome and some people already did some cool stuff leveraging the new API functions:
The basic idea is to use object from pre-built Azure Management DLLs to generate the OAuth Access token that is necessary to use the API. This works very well locally but cannot be used in the cloud – e.g. in combination with Azure Automation Runbooks or Azure Functions where you cannot install or reference any custom DLLs.
In this blog post I will show you how you can accomplish exactly this – create an Azure Automation Runbook to refresh your PowerBI dataset! But first of all there are some things that you need to keep in mind:
There are no service accounts in PowerBI so we will always use a “real” user
you need to supply the credentials of a “real” user
The user needs to have appropriate access to the dataset in order to refresh it
the dataset refresh must succeed if you do it manually in PowerBI
you are still limited to 8 refreshes/day through the API
OK, so lets get started. First of all we need an Azure Application which has permissions in PowerBI. The easiest way to do this is to use the navigate to https://dev.powerbi.com/apps, log in with your account and simply follow the steps on the screen. The only import thing is to select the App Type “Native app”. At the end, you will receive a ClientID and a ClientSecret – Please remember the ClientID for later use!
Now lets take a look at the PowerShell code. Instead of using any pre-built DLLs I removed all unnecessary code and do all the communication using Invoke-RestMethod. This is a very low-level function and is part of the standard PowerShell modules so there is no need to install anything! The tricky part is to acquire an Authentication Token using username/password as it is nowhere documented (at least I could not find it) what the REST call has to look like. So I used Fiddler to track the REST calls that the pre-built DLLs use and rebuilt them using Invoke-RestMethod. This is what I came up with:
$clientId is the ClientID of the Azure AD Application $pbiUsername is the email address of the PowerBI user. $pbiPassword is the password of the PowerBI user. The $authRepsonse then contains our Authentication token which we can use to make our subsequent calls:
CredentialName – the name of the Azure Automation credential that you created and which stores the PowerBI username and password
ClientID – the ID of your Azure Active Directory Application which you created in the first step
PBIDatasetName – the name of the PowerBI dataset that you want to refresh
PBIGroupName – (optional) the name of the group/workspace in which the PowerBI dataset from 3) resides
When everything is working as expected, you can create custom schedules or even create webhooks to trigger the script and refresh you PowerBI dataset! As you probably know, this is really powerful as you can now make the refresh of the PowerBI dataset part of your daily ETL job!
Recently I had to setup an Analysis Services cube and expose it to external users. This is usually done by using Internet Information Server (IIS) and creating a new WebSite which hosts msmdpump.dll. This DLL more or less wraps XMLA commands inside HTTP thus allowing external users to access the cube via HTTP. Besides Windows Authentication this setup also allows Basic Authentication and so external users can simply connect by specifying Username and Password in e.g. Excel when connecting to the cube:
They provide very useful information and you should be familiar with the general setup before proceeding here or using the final PowerShell script.
The PowerShell script basically performs the following steps:
Create a local folder as base for your WebSite in IIS
Copy SSAS ISAPI files (incl. msmdpump.dll) to the folder
Create and Configure an IIS AppPool
Create and Configure a IIS WebSite
Add and enable an ISAPI entry for msmdpump.dll
Configure Default Document
Update connection information to SSAS server
I tested it successfully with a clean installation of IIS 8.0 (using applicationhost.config.clean.install). In case you already have other WebSites running you may still consider doing the steps manually or adopting the script if necessary. The script is written not to overwrite any existing Folders, WebSites, etc. but you never know.
It is a very common scenario to create a SSAS Tabular Model out of an Power Pivot Model contained in an Excel workbook. Microsoft even created an wizard (or actually a separate Visual Studio project) that supports you doing this. Even further, this process is also a major part of Microsoft’s strategy to cover Personal BI, Team BI and Corporate BI within one technology being xVelocity. This all works just fine but there may also be scenarios where you need to do it the other way round – converting a Tabular model to Power Pivot. Several use-cases come into my mind but I am sure that the most important one is to making data available offline for e.g. sales people to take it with them on their every day work. And in this blog post I will show how this can be done!
But before taking a closer look into how this can be accomplished, lets first see how the import from Power Pivot to SSAS Tabular works. To do this start SQL Server Profiler and connect to your tabular instance. Then create a new Tabular project in Visual Studio based on an existing Power Pivot workbook. At this point you will notice a lot of events happening on our SSAS Tabular server. The most important event for us is “Command End” with the EventSubclass “9 – restore”:
SSAS actually restores a backup from a “Model.abf” backup file which is located in our project directory that we just created:
So far so good – but where does this file come from?
Well, the origin of the file has to be our Excel workbook that we imported. Knowing that all new office formats ending with “x” (.xlsx, .docx, …) are basically ZIP files, we can inspect our original Excel workbook by simply rename it to “.zip”. This allows us to browse the Excel file structure:
We will find a folder called “xl” which contains a sub-folder called “model”. This folder contains one item called “item.data”. If you take a closer look at the file size you may realize that both, the “Model.abf” file that we restored and the “item.data” file from our Excel workbook have the exact same size:
A Coincidence? Not really!
What happens behind the scenes when you import a Power Pivot model into SSAS Tabular is that this “item.data” file gets copied into your project directory and is renamed to “Model.abf” and then restored to the SSAS Tabular workspace instance by using an standard database restore.
Having this information probably makes you think: If it works in one direction, why wouldn’t it also work the other way round? And it does!
So here are the steps that you need to do in order to restore your SSAS Tabular backup into an Excel Power Pivot workbook:
Create a backup of your SSAS Tabular database and rename it to “item.data”
Create an empty Excel workbook and add a simple linked table to the Excel data model (which is actually Power Pivot). This is necessary to tell Excel that the workbook contains a Power Pivot model which has to be loaded once the file is opened.
Close the Excel workbook and rename it from “MyFile.xlsx” to “MyFile.xlsx.zip”
Open the .zip-file in Windows Explorer and locate the “\xl\model\”-folder
Replace the “item.data” file with the file that you created in step 1.
Rename the .zip-file back to “MyFile.xlsx”
Open the Excel Workbook
Voilá! You can now work with the data model as with any other Power Pivot model!
I tested this with a SSAS Tabular backup from SQL Server 2012 SP1 being restored to the streamed version of Excel from Office 365 with the latest version of Power Pivot. I assume that it also works with older versions but have not tested all combinations yet.
There are also some features that will not work, for example roles. If your Tabular database contains roles you will not be able to use this approach. Excel will complain that the Power Pivot model is broken. However, other Tabular features like partitions actually work with the little limitation that you cannot change them later on in the Power Pivot model or process them separately: Another thing to note here is that only up to 3 partitions are allowed, otherwise you will get the same error as for roles. I think this is related to the limitation of 3 partitions for SQL Server Analysis Services Standard Edition as Chris Webb described here.
Besides these obvious features there are also some other cool things that you can do in Tabular which are not possible in Power Pivot. Most (or actually all) of them are accessible only by using BIDS Helper – a great THANK YOU to the developers of BIDS Helper at this point! BIDS Helper enables you to add classical multidimensional features also to Tabular models which is not possible using standard Visual Studio only. Those include:
Translations (metadata only)
I tested it for DisplayFolders and Actions and both are working also in Power Pivot after the backup was restored and I further assume that all the other things will also work just fine. Simply keep in mind that Power Pivot is basically a fully featured Analysis Services instance running within Excel!
For my (and your) convenience I also created a little PowerShell script that does all the work:
The last thing to mention here is that I don’t know if this is officially supported in any way by Microsoft – actually I am pretty sure it is not – so watch out what you are doing and don’t complain if something is not working as expected.