How-To: Migrating Databricks workspaces

Foreword:
The approach described in this blog post uses only the Databricks REST API and should therefore work with both Azure Databricks and Databricks on AWS!

I recently had to migrate an existing Databricks workspace to a new Azure subscription, causing as little interruption as possible and not losing any valuable content. I thought a simple Move of the Azure resource would be the easiest way to do this. Unfortunately, it turns out that moving an Azure Databricks Service (=workspace) is not supported:

Resource move is not supported for resource types ‘Microsoft.Databricks/workspaces’. (Code: ResourceMoveNotSupported)

I do not know what the underlying problem is, and I did not have time to investigate; I needed to come up with a proper solution in time. So I looked at what a manual export would require. Basically, there are 5 types of content within a Databricks workspace:

  • Workspace items (notebooks and folders)
  • Clusters
  • Jobs
  • Secrets
  • Security (users and groups)

For all of them, Databricks provides an appropriate REST API to manage, export, and import them. This was fantastic news for me, as I knew I could use my existing PowerShell module DatabricksPS to do all of this without having to reinvent the wheel.
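To make the mapping between content types and REST calls concrete, here is a minimal Python sketch of the raw requests involved. It is not the DatabricksPS implementation. The endpoint paths are the REST API 2.0 "list" endpoints as I recall them from the documentation and should be checked against the current docs; the function name, the root URL and the token in the usage note are placeholders I made up for illustration.

```python
import urllib.parse
import urllib.request

# Assumed mapping of the five content types to REST API 2.0 "list" endpoints.
# Verify the paths against the current Databricks documentation.
LIST_ENDPOINTS = {
    "workspace": "/api/2.0/workspace/list",      # needs a "path" query parameter
    "clusters": "/api/2.0/clusters/list",
    "jobs": "/api/2.0/jobs/list",
    "secrets": "/api/2.0/secrets/scopes/list",
    "groups": "/api/2.0/groups/list",
}


def build_list_request(api_root_url, access_token, content_type, params=None):
    """Build (but do not send) a GET request that lists one content type.

    Hypothetical helper: api_root_url is the workspace root URL and
    access_token is a personal access token.
    """
    url = api_root_url.rstrip("/") + LIST_ENDPOINTS[content_type]
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(
        url,
        headers={"Authorization": "Bearer " + access_token},
        method="GET",
    )
```

A request built this way, for example `build_list_request("https://westeurope.azuredatabricks.net", token, "workspace", {"path": "/"})`, could then be sent with `urllib.request.urlopen`. An export along these lines would walk each content type in turn and write the results to disk.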
So I basically extended the module and added new Import and Export functions which automatically process all the different content types:

  • Export-DatabricksEnvironment
  • Import-DatabricksEnvironment

They can be further parameterized to import/export only certain artifacts and to control how updates to already existing items are handled. The actual output of the export looks like this, and of course you can also modify it manually to suit your needs. All files are in JSON except for the notebooks, which are exported as .DBC files by default:

A very simple code sample doing an export and an import into a different environment could look like this:

Having those scripts made the whole migration a very easy task.
In addition, these new cmdlets can also be used in your Continuous Integration/Continuous Delivery (CI/CD) pipelines in Azure DevOps or any other CI/CD tool!

So just download the latest version from the PowerShell Gallery and give it a try!

PowerShell module for Databricks on Azure and AWS

Over the last year I worked a lot with Databricks on Azure, and I have to say that I was (and still am) very impressed by how well it works and how it integrates with other services of the Microsoft Azure Data Platform like Data Lake Store, Data Factory, etc.

Some of the projects I worked on also included CI/CD-like pipelines using Azure DevOps, where Databricks did not really shine in the beginning. There are no native tasks for it. But this is OK, because for scenarios where you need to automate or script something, Databricks offers a REST API (Azure, AWS).

As most of our deployments use PowerShell, I wrote some cmdlets to easily work with the Databricks API in my scripts. These included managing clusters (create, start, stop, …), deploying content/notebooks, adding secrets, executing jobs/notebooks, etc. After some time I ended up with 20+ individual scripts, which was not really maintainable any more. So I packed them into a PowerShell module and published it to the PowerShell Gallery (https://www.powershellgallery.com/packages/DatabricksPS) for everyone to use!

The module works for Databricks on Azure and also if you run Databricks on AWS – fortunately the API endpoints are almost identical.
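As a rough illustration of why one module can serve both clouds, the Python sketch below builds the same "start cluster" call against an Azure and an AWS workspace; only the root URL changes. This is an assumption-laden sketch, not code from the module: the `/api/2.0/clusters/start` path and the `cluster_id` body field are taken from the REST API 2.0 documentation as I recall it, and both root URLs, the token and the function name are placeholders.

```python
import json
import urllib.request

# Placeholder root URLs; substitute the URL of your own workspace.
AZURE_ROOT = "https://westeurope.azuredatabricks.net"
AWS_ROOT = "https://example.cloud.databricks.com"


def build_start_cluster_request(api_root_url, access_token, cluster_id):
    """Build (but do not send) a POST request that starts a cluster.

    Hypothetical helper; the endpoint path and body shape are assumptions
    to verify against the current Databricks documentation.
    """
    url = api_root_url.rstrip("/") + "/api/2.0/clusters/start"
    body = json.dumps({"cluster_id": cluster_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": "Bearer " + access_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Calling the helper with `AZURE_ROOT` or `AWS_ROOT` yields requests that differ only in the host, which is what lets a thin wrapper stay cloud-agnostic.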
Usage is as simple as for any other PowerShell module:

  1. Install it using the Install-Module cmdlet
  2. Set up the Databricks environment using API key and endpoint URL
  3. Run the actual cmdlets (e.g. to start a cluster)


Here is the same code for you to copy and paste:

At the moment, the module supports the following APIs:

These APIs are not yet implemented but will be added in the near future:

All the cmdlets are documented and contain links to the official documentation of the REST API call used by the cmdlet. Some API endpoints support different variations of parameters; this was implemented using different parameter sets in PowerShell. There are still some ongoing tests (especially on AWS) and improvements, but in general all cmdlets work as expected. I hope this helps anyone else who also has to deal with the Databricks APIs frequently or has to integrate them in a CI/CD pipeline.

The whole source code is also available from my Git repository (https://github.com/gbrueckl/Databricks.API.PowerShell). If you want to provide any feedback, please use the Git repository to do so.