A very common requirement for a Power BI report that I stumble across at almost all of my customers is to automatically show data for the current day/month/year when a report is opened. At first sight this seems like a very trivial problem but once you dig into the problem, you will realize that all of the common solutions out there have some disadvantages and only solve the problem partially.
So here is what we want to achieve:
Show the Current Month (or Day, or Year) by default
Works [in combination] with all other columns in the date table.
A single, easy to use slicer/filter to control the time selection and change from Current Month/Day/Year to any other value
Works with built-in time intelligence functions
Works with existing DAX measures
Works with any datamodel/report
Solutions like Relative Time Filter/Slicer, DAX or relative flags in the date table address only some points of the above list but definitely not all of them which is why I thought we need a better solution to this:
(please use full-screen mode)
We actually created a new table in our data model that is linked to the original date table. The reason why we cannot use the same table here is that the new table does not have unique date values as all dates/rows referring to our current calculations are duplicated. It has to be a many-to-one relationship with cross-filter direction set to both (even though we will only use the new table ‘Calendar_with_current’ to filter the existing table ‘Calendar’):
And that’s it basically. You can now exchange the original Calendar table with the new one to get the new “Current” values in your report. If you have time intelligence functions in place, you further need to extend them and add ALL('Calendar_with_current ') as a filter to make them work also with the new table. The old table can also be hidden now if you do not want to confuse the end users. To make a seamless switch you can further rename the tables.
I added an additional column to the table called Type that allows you to select which values you want to show – the original values (e.g. “September”), the values with “Current X” (e.g. “Current Month”), or both.Please see the second page/tab of the embedded report above.
So this raises the question how this new table can be created? To simplify this I have created a Power Query function that takes 3 parameters:
The current date table
A list of definitions of your current-values
The name of the unique date-column in your current date table (parameter 1)
The first and the third parameter should be clear, but what are the “CurrentDefinitions”?
It is basically a table which defines the relative time calculations that you want to extend your existing date table with. Here is an example:
The column Column refers to the column in which you want to create the relative date definition. The column NewValue specifies the value that you want to set for rows that match the third column Filter. The column Filter either takes a static filter expression like [RelativeMonth] = 1 (as in lines 5-8) but can also use existing M-functions and reference the existing Date-column using the placeholder <<DateColumn>> as you can see in lines 1-4.
The table can be maintained using “Enter Data” and can contain any number of rows/definitions!
For most of my scenarios this works pretty well and addresses all major problems highlighted above.
When working with Power Query, you have probably already realized that every expression you write returns a value of a specific type. Usually this will be a primitive type like text, number, or date. (A full list of types available in Power Query can be found here: https://docs.microsoft.com/en-us/powerquery-m/m-spec-types). If for some reason the type of an expression cannot be defined, the special type *any* will be used. For sure you already encountered this when using Table.AddColumn which, by default, results in the new column being of type *any*.
To avoid this, you can use the optional fourth parameter and specify the resulting type of the expression. This can be very handy and saves you the Change Type step that usually comes afterwards.
This fourth parameter not only works for primitive types but also for complex types. If you do not specify it, the column type is again *any* even though the actual values are records:
Once you click the Expand-Button of the new MyRecord column in the table header, you will realize a short delay until the available fields are displayed. This indicates that PQ first has to evaluate the expression before it can provide you the list of fields within the record. For complex scenarios, this can take a long time and can also be avoided by explicitly specifying the type in the fourth parameter as shown below:
As you can see, PQ can now immediately display the available columns without having to evaluate the function!
This also works the very same way if you call a custom function as expression of your Table.AddColumn. But there is the caveat: If you have a function that returns a complex type, let’s say a record, you will usually want to specify the type as part of the function or within the definition of the function and not re-type it again each time you call the function.
Fortunately, there is a solution to this problem: the function Type.FunctionReturn. In combination with Value.Type you can derive the return type from the function dynamically!
You will realize that now again it takes some time until the available fields are displayed indicating the function must be evaluated first. Another indicator for this to happen is the warning at the bottom and the link to “Load more”. If you think of it, this makes sense – Power Query knows that the function now returns a record, but does not know which fields the record contains and thus has to evaluate it. So how can we combine custom function that return complex types and the ability to specify the resulting type as part of the function?
The first thing that would come to your mind is to simply strong-type the return type of the function specifying each field individually, but this will result in an error:
Currently it is not supported to specify a complex type as the return type of a function – it only works with primitive types. But as you can guess, I did not start this blog post for no reason. There is a way to achieve this, even though it may not be as nice as it could and should be.
The solution here is the Type.ForFunction function which allows you to create a more specific definition of your function including the return type. This definition/type can then be applied to your original function using Value.ReplaceType:
You basically first define the final return type of the function and the function itself (lines 2 and 3). The other lines (5 to 10) take care of applying the return type to the function which can then be used in combination with the approach above to dynamically derive the return type when calling the function (using ype.FunctionReturn and Value.Type). This now allows you to specify everything that is related to the function in one place!
This is especially handy if you have a function that returns a record or a table which is re-used multiple times and the fields/columns may change over time. Using this approach allows you to only change the function and everything else is derived automatically.
Working with analytical data platforms and big data on a daily basis, I was quite happy when Microsoft finally announced a connector for Parquet files back in November 2020. The Parquet file format is developed by the Apache foundation as an open-source project and has become a fundamental part of most data lake systems nowadays.
“Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.”
However, Parquet is just a file format and does not really support you when it comes to data management. Common data manipulation operations (DML) like updates and deletes still need to be handled manually by the data pipeline. This was one of the reasons why Delta Lake (delta.io) was developed besides a lot of other features like ACID transactions, proper meta data handling and a lot more. If you are interested in the details, please follow the link above.
So what is a Delta Lake table and how is it related to Parquet? Basically a Delta Lake table is a folder in your Data Lake (or wherever you store your data) and consists of two parts:
Delta log files (in the sub-folder _delta_log)
Data files (Parquet files in the root folder or sub-folders if partitioning is used)
The Delta log persists all transactions that modified the data or meta data in the table. For example, if you execute an INSERT statement, a new transaction is created in the Delta log and a new file is added to the data files which is referenced by the Delta log. If a DELETE statement is executed, a particular set of data files is (logically) removed from the Delta log but the data file still resides in the folder for a certain time. So we cannot just simply read all Parquet files in the root folder but need to process the Delta log first so we know which Parquet files are valid for the latest state of the table.
These logs are usually stored as JSON files (actually JSONL files to be more precise). After 10 transactions, a so-called checkpoint-file is created which is in Parquet format and stores all transactions up to that point in time. The relevant logs for the final table are then the combination of the last checkpoint-file and the JSON files that were created afterwards. If you are interested in all the details on how the Delta Log works, here is the full Delta Log protocol specification.
From those logs we get the information which Parquet files in the main folder must be processed to obtain the final table. The content of those Parquet files can then simply be combined and loaded into PowerBI.
I encapsulated all this logic into a custom Power Query function which takes the folder listing of the Delta table folder as input and returns the content of the Delta table. The folder listing can either come from an Azure Data Lake Store, a local folder, or an Azure Blob Storage. The mandatory fields/columns are [Content], [Name] and [Folder Path]. There is also an optional parameter which allows you the specify further options for reading the Delta table like the Version if you want to use time-travel. However, this is still experimental and if you want to get the latest state of the table, you can simply omit it.
The most current M-code for the function can be found in my Github repository for PowerBI: fn_ReadDeltaTable.pq and will also be constantly updated there if I find any improvement. The repository also contains an PowerBI desktop file (.pbix) where you can see the single steps that make up for the final function.
Once you have added the function to your PowerBI / Power Query environment you can call it like this:
I would further recommend to nest your queries and separate the access to the storage (e.g. Azure Data Lake Store) and the reading of the table (execution of the function). If you are reading for an ADLS, it is mandatory to also specify [HierarchicalNavigation = false] ! If you are reading from a blob storage, the standard folder listing is slightly different and needs to be changed.
Right now the connector/function is still experimental and performance is not yet optimal. But I hope to get this fixed in the near future to have a native way to read and finally visualize Delta lake tables in PowerBI.
After some thorough testing the connector/function finally reached a state where it can be used without any major blocking issues, however there are still some known limitations:
currently columns used for partitioning will always have the value NULLFIXED!
values for partitioning columns are not stored as part of the parquet file but need to be derived from the folder pathFIXED!
is currently not great but this is mainly related to the Parquet connector as it seems
very much depends on your data – please test on your own!
currently only supports “VERSION AS OF”
need to add “TIMESTAMP AS OF”
Predicate Pushdown / Partition Elimination
currently not supported – it always reads the whole tableFIXED!
I work a lot with Azure Databricks and a topic that always comes up is reporting on top of the data that is processed with Databricks. Even though notebooks offer some great ways to visualize data for analysts and power users, it is usually not the kind of report the top-management would expect. For those scenarios, you still need to use a proper reporting tool, which usually is Power BI when you are already using Azure and other Microsoft tools.
So, I am very happy that there is finally an official connector in PowerBI to access data from Azure Databricks! Previously you had to use the generic Spark connector (docs) which was rather difficult to configure and did only support authentication using a Databricks Personal Access Token.
With the new connector you can simply click on “Get Data” and then either search for “Azure Databricks” or go the “Azure” and scroll down until you see the new connector:
The next dialog that pops up will ask you for the hostname and HTTP path – this is very similar to the Spark connector. You find all the necessary information via the Databricks Web UI. As this connection is always bound to an existing cluster you need to go the clusters details page and check the Advanced Tab “JDBC/ODBC” as described here: (NOTE: you can simply copy the Server Hostname and the HTTP Path from the cluster page)
The last part is then the authentication. As mentioned earlier the new connector now also supports Azure Active Directory authentication which allows you to use the same user that you use to connect to the Databricks Web UI! Personal Access Tokens are also still supported and there is also Basic authentication using username/password.
Once you are connected, you can choose the tables that you want to import/connect and start building your report!
Here is also a quick overview which features are supported by the Spark and the Azure Databricks connector as there are some minor but important differences:
Power BI Desktop
Power BI Service
Direct Query (Desktop)
Direct Query (Service)
Manual Refresh (Service)
Scheduled Refresh (Service)
Azure Active Directory (AAD) Authentication
Personal Access Token Authentication
Performacne Improvements with Spark 3.x
Supports On-Premises data gateway
Features supported by Spark and Databricks Connector for PowerBI
*) Updated 2020-10-06: the new Databricks Connector for PowerBI now supports all features also in the PowerBI service!
Update 2020-10-06: So from the current point of view the new Databricks Connector is a superset of old Spark Connector with additional options for authentication and better performance with the latest Spark versions. So it is highly recommended to use the new Databricks Connector unless you have very specific reasons to use the Spark connector! Actually the only reason why I would still use the Spark connector is the support for the On-Premises data gateway in case your Spark or Databricks cluster is hosted in a private VNet.
So currently the generic Spark connector still looks superior simply for the support in the Power BI Service. However, I am quite sure that it will be fully supported also by the Power BI Service in the near future. I will update this post accordingly! On the other hand, Azure Active Directory authentication is a huge plus for the native Azure Databricks connector as you do not have to mess around with Databricks Personal Access Tokens (PAT) anymore!
Another thing that I have not yet tested but would be very interesting is whether Pass-Through security works with this new connector. So you log in with your AAD credentials in Power BI, they get passed on to Databricks and from there to the Data Lake Store. For Databricks Table Access Control I assume this will just work as it does for PAT as it is not related to AAD authentication.
Data Virtualization is actually a very new topic to me as I have barely seen it implemented in the real world or at any of my customers. But it becomes more and more interesting when working with big data where you cannot simply load all data into a single in-memory data model but still need to query across different data sources. So I decided to investigate how this could be done with my favorite reporting tool Power BI which I know is capable to connect to different data sources out of the box and also provides a rich set of visualizations that I need.
But let’s start slowly.
What is Data Virtualization?
According to Wikipedia, “Data virtualization is any approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted at source, or where it is physically located, and can provide a single customer view (or single view of any other entity) of the overall data.”
So basically, combining data from multiple sources and multiple formats into a common semantic layer which can be queried on-the-fly without the need of any ETL/ELT.
Sounds awesome – right?
The problem is that in reality the things are not as simple as they may sound, especially when it comes to joining across the different sources. While data virtualization usually works fine for small amounts of data that can be easily processed, it can be quite challenging on large amounts of data which is where data virtualization would actually make sense to avoid lengthy and costly ETL/ELT.
What does Power BI have to do with this?
At first sight – nothing. But lets examine what we currently have in Power BI:
a semantic layer and data modelling capabilities
access to various data sources via Direct Query (remember, we do not want to load any data!)
ability to combine data from those sources
The last part is the most important one here and you may wonder what I am talking about. And you are right, by default a DQ model is only linked to one data source at a time but you can add other data sources manually in the Power Query editor! This is where it get’s interesting and what this blog post is about.
To verify the statements from above I built a little test case that involves 2 local database and an Azure SQL database over which I want to create a semantic layer using Power BI. All three databases are actually the same AdventureWorksDW databases but for the purpose of this demonstration this is OK as it is just a technical feasibility study.
I started by creating a new Power BI file and connected it to my first data base in Direct Query mode.
Then I selected a single table, in my case I have chosen “FactResellerSales”.
The next step is to add a new table using the “Edit Queries” button on the ribbon:
You will see the one table you have just selected before as a Power Query query which you can simply copy and rename. I renamed it to “DimProduct” as I want to load the DimProduct table from my second local database which can be accomplished by simply changing the connection to the SQL database in the first step “Source” of the query (I use my second local database AdventureWorksDW2014 – instead of AdventureWorksDW2012):
Several things to point out here:
once you change the name of the server or the database, you may get prompted for credentials
when you click on Table in the Data column, Power BI asks you if you want to replace the next step – simply press [Yes]
at the “Source” step, Power BI will complain that the results of the current steps are not valid in Direct Query mode – this is fine as the final result will be in the next step (“Navigation”)
I repeated the same procedure again and also added the table “DimProductSubcategory” from my Azure SQL database. So right now we have 3 Direct Query tables pointing to 3 different databases and 3 different tables.
Now we need to connect our tables in the Relationship-view – similar as you would to with any other tables in a regular Direct Query setup:
Once the relationships are created, we can finally create our reports.
Are you excited? Well, I definitely was when I tried this setup the first time!
The visuals behave as if they were created on top of an Import Mode dataset or a Direct Query dataset that only connects to a single database. I think that is pretty awesome and again shows what Power BI is capable of!
We just used Power BI to create a semantic layer across different databases and tables which are now all joined and queried on-the-fly always showing the most recent data!
As you have seen, you Power BI allows us to combine multiple SQL databases in Direct Query mode and query them together as if they would be one single data source. So if you have a requirement where your data is distributed across databases and you it is too big to be loaded into memory or you need live data, you can give this approach a try. I have only tested it with regular Microsoft SQL databases but I assume that this works in a similar way with any other data source that supports Direct Query (e.g. SAP, Oracle, Spark, …) too. You can also do some basic transformations before joining the data in Direct Query mode which can also be very crucial when combining different data sources that might have slightly different formats. Again, I have not tested this thoroughly but at least everything that can be query folded should be supported as a transformation.
In a follow-up post I will explain the technical details and what actually happens in the background when you use a setup like this so stay tuned!
I frequently work on projects where we have multiple tiers on which our solution is deployed to using continuous integration / continuous deployment (CI / CD) pipelines in Azure DevOps. Once everything is deployed, you also need to monitor these different environments and check the status of the data or ETL pipelines. My tool of choice is usually Power BI desktop as it allows me to connect to e.g. SQL databases very easily. However, I always ended up creating a multiple Power BI files – one for each environment.
Having multiple files results in a lot of overhead when it comes to maintenance and also managing these files. Fortunately, I came across this little trick when I was investigating in composite models and aggregations that I am going to explain in this blog post.
To be honest, I barely used Power BI Direct Query in past and so maybe this feature has been there for quite some time without me realizing it but It may also be that it was introduced just recently with composite models.
So the “feature” is, that you can also use Query Parameters to parameterize your Direct Query queries. This is pretty awesome if you think of it for a second:
easy switching between databases
use one file for all environments
only maintain a single file
no need to import/load any data
The configuration within Power Query is also quite easy – simply replacing the hard coded values with the ones from the parameters:
And that’s it already! you can now easily switch between different databases by just using Power BI parameters and the Direct Query connection will change automatically to the new server/database.
Of course, all the target servers/databases have to have the same schema otherwise, you will get an error.
Caveats: Even though this looks quite trivial, there are some caveats which makes me believe this is not fully supported yet. You may noticed above already that in Power Query, when going to the step that actually queries the database, it complains about that this step would cause the whole table to be converted to Import Mode. However, you can just ignore it and go on with the next step to remain in Direct Query Mode.
It seems like Power BI keeps track from where a table was originally imported. So if you want to add a new table, make sure to copy an existing Direct Query table and change it accordingly instead of going to “New Source > …” !
Also, you need to make sure that you have entered the credentials for the different source databases at least once – otherwise Power BI will ask you when you query the database the first time. This is also the reason why this does not work so well in the Power BI service as changing the parameters there is not as simple as it is in Power BI desktop.
As I said, I do not know if this is a new feature (or a feature at all), but it is definitely helpful for certain scenarios.
I just had the request to expose the UniqueNames of an Analysis Services Multidimensional cube in PowerBI. You may ask why I would want to do this and the answer is actually pretty simple: In SSAS MD the caption of elements/members even within the same attribute is not necessarily unique. This is because of the Key/Name concept where the elements/members are grouped by the Key but for the end-user a proper Name is displayed. So if you happen to have duplicate Names in your cube, import the values into PowerBI you will end up with less rows (and wrong values!) compared to the original SSAS MD cube because PowerBI (and also Analysis Services Tabular) does not have a Key/Name concept and therefore the grouping and what is display is always the same.
Having worked quite a lot with SSAS MD in the past I knew that every attribute member contains various internal properties, one of them being the UniqueName, which, as the name implies, is the unique identifier for each member regardless of the caption displayed for that member. And that’s exactly what I needed in this scenario. So the question is how to get this information in PowerBI as this is nothing that should usually be exposed to an end-user.
Anyway, I started to dig into this topic and made some this. Basically this is what I want to achieve:
For my sample I used to Adventure Works MD cube, opened it in PowerBI using Import-Mode and just selected the [Product].[Subcategory] hierarchy:
The UI is quite limited here and you can only select hierarchies and measures. However, getting the UniqueName of a given hierarchy can be achieved quite easily in a subsequent step by adding a new custom column:
And that’s already all you need to do. The column [Product.Subcategory] contains various information, one of them being the UniqueName of the product subcategory which can be accessed by the Cube.AttributeMemberId function.
My next step was to try to get some other properties in a similar way using the Cube.AttributeMemberProperty function. According to the documentation it is quite similar to Cube.AttributeMemberId but takes an additional parameter where you can define which property you want to retrieve. As the [Product].[Subcategory] hierarchy has a property called “Category” I tried this:
This caused a huge error in PowerBI desktop and so I tried different styles to define the property:
None of these worked though, neither for the regular properties nor for the internal ones. The main problem seems to be that the MDX query executed does not query any other properties except for the UniqueName not even if you specify them manually in your PowerQuery script. This means that so far there is no way to access member properties from within PowerBI. There is already a user voice where you can vote for this: https://ideas.powerbi.com/forums/265200-power-bi-ideas/suggestions/12443955-member-properties-ssas
Download:PowerBI_UniqueNames.pbix This PowerBI Desktop model contains all samples from above including the my failed tries for the properties!
As some of you probably remember, when PowerPivot was still only available in Excel and Power Query did not yet exist, it was possible to load images from a database (binary column) directly into the data model and display them in PowerView. Unfortunately, this feature did not work anymore in PowerBI Desktop and the only way to display images in a visual was to provide the URL of the image which is public accessible. The visual would then grab the image on-the-fly from the URL and render it. This of course has various drawbacks:
The image needs to be available via a public URL (e.g. upload it first to an Azure Blob Store)
The image cannot be displayed when you are offline
The link may break in the future or point to a different image as initially when the model was built
Until today I was sure that we have to live with this limitation but then I came across this blog post from Jason Thomas aka SqlJason. He shows a workaround to store images directly in the PowerBI data model and display them in the report as if they were regular images loaded from an URL. This is pretty awesome and I have to dedicate at least 99.9% of this blog post to Jason and his solution!
However, with this blog post I would like to take Jasons’ approach a step further. He creates the Base64 string externally and hardcodes it in the model using DAX. This has some advantages (static image, no external dependency anymore, …) but also a lot of disadvantages (externally create the Base64 string, manually copy&paste the Base64 string for each image, hard to maintain, cannot dynamically add images …). For scenarios where you have a local folder with images, a set of [private] URLs pointing to images or images stored in a SQL table (as binary) which you want to load into your PowerBI data model, this whole process should be automated and ideally done within PowerBI.
Fortunately, this turns out to be quite simple! Power Query provides a native function to convert any binary to a Base64 encoded string: Binary.ToText() . The important part to point out here is to use the second parameter which allows you to set the encoding of the resulting text. It supports two values: BinaryEncoding.Base64 (default) and BinaryEncoding.Hex. Once we have the Base64 string, we simply need to prefix it with the following meta data: “data:image/jpeg;base64, “
To make it easy, I wrote to two custom PowerQuery functions which convert and URL or a binary image to the appropriate string which can be used by PowerBI:
If your images reside in a local folder, you can simply load them using the “Folder” data source. This will give you a list of all images and and their binary content as separate column. Next add a new Custom Column where you call the above function to convert the binary to a prefixed Base64 string which can then be displayed in PowerBI (or Analysis Services) as a regular image. Just make sure to also set the Data Category of the column to “Image URL”:
And that’s it, now your visual will display the image stored in the data model without having to access any external resources!
Caution: As Jason also mentions at the end of his blog post, there is an internal limitation about the size of a text column. So this may cause issues when you try to load high-resolution images! In this case, simply lower the size/quality of the images before you load them. UPDATE May 2019: Chris Webb provides much more information and a solution(!) to this issue in his blog post: https://blog.crossjoin.co.uk/2019/05/19/storing-large-images-in-power-bi-datasets
Download:StoreImageInPbiModel.pbix This PowerBI Desktop model contains all samples from above including the PowerQuery functions!
One of the most requested features when it comes to Azure ML is and has always been the integration into PowerBI. By now we are still lacking a native connector in PowerBI which would allow us to query a published Azure ML web service directly and score our datasets. Reason enough for me to dig into this issue and create some Power Query M scripts to do this. But lets first start off with the basics of Azure ML Web Services.
Every Azure ML project can be published as a Web Service with just a single click. Once its published, it can be used like any other Web Service. Usually we would send a record or a whole dataset to the Web Service, the Azure ML models does some scoring (or any other operation within Azure ML) and then sends the scored result back to the client. This is straight forward and Microsoft even supplies samples for the most common programming languages. The Web Service relies on a standardized REST API which can basically be called by any client. Yes, in our case this client will be PowerBI using Power Query. Rui Quintino has already written an article on AzureML Web Service Scoring with Excel and Power Query and also Chris Webb wrote a more generic one on POST Request in Power Query in general Web Service and POST requests in Power Query. Even Microsoft recently published an article how you can use the R Integration of Power Query to call a Azure ML Web Service here.
Having tried these solutions, I have to admit that they have some major issues: 1) very static / hard coded 2) complex to write 3) operate on row-by-row basis and might run into the API Call Limits as discussed here. 4) need a local R installation
As Azure ML usually deal with tables, which are basically Power Query DataSets, a requirement would be to directly use a Power Query DataSet. The DataSet has to be converted dynamically into the required JSON structure to be POSTed to Azure ML. The returned result, usually a table again, should be converted back to a Power Query DataSet. And that’s what I did, I wrote a function that does all this for you. All information that you have to supply can be found in the configuration of your Azure ML Web Service: – Request URI of your Web Service – API Key – the [Table to Score]
the [Table to Score] can be any Power Query table but of course has to have the very same structure (including column names and data types) as expected by the Web Service Input. Then you can simply call my function:
The whole process involves a lot of JSON conversions and is kind of complex but as I encapsulated everything into M functions it should be quite easy to use by simply calling the CallAzureMLService-function.
However, here is a little description of the used functions and the actual code:
ToAzureMLJson – converts any object that is passed in as an argument to a JSON element. If you pass in a table, it is converted to a JSON-array. Dates and Numbers are formatted correctly, etc. so the result can the be passed directly to Azure ML.
AzureMLJsonToTable – converts the returned JSON back to a Power Query Table. It obeys column names and also data types as defined in the Azure ML Web Service output. If the output changes (e.g. new columns are added) this will be taken care of dynamically!
Known Issues: As the [Table to Score] will probably come from a SQL DB or somewhere else, you may run into issues with Privacy Levels/Settings and the Formula Firewall. In this case make sure to enable Fast Combine for your workbook as described here.
The maximum timeout of a Request/Response call to an Azure ML Web Service is 100 seconds. If your call exceeds this limit, you might get an error message returned.I ran a test and tried to score 60k rows (with 2 numeric columns) at once and it worked just fine, but I would assume that you can run into some Azure ML limits here very easily with bigger data sets. As far as I know, these 100 seconds are for the Azure ML itself only. If it takes several minutes to upload your dataset in the POST request, than this is not part of this 100 seconds. If you are still hitting this issue, you could further try to split your table into different batches, score them separately and combine the results again afterwards.
So these are the steps that you need to do in order to use your Azure ML Web Service together with PowerBI: 1) Create an Azure ML Experiment (or use an existing) 2) Publish the Experiment as a Web Service 3) note the URL and the API Key of your Web Service 4) run PowerBI and load the data that you want to score 5) make sure that the dataset created in 4) has the exact same structure as expected by Azure ML (column names, data types, …) 6) call the function “CallAzureMLWebService” with the parameters from 3) and 5) 7) wait for the Web Service to return the result set 8) load the final table into PowerBI (or do some further transformations before)
And that’s it!
Download: You can find a PowerBI workbook which contains all the functions and code here: CallAzureMLWebService.pbix I used a simple Web Service which takes 2 numeric columns (“Number1” and “Number2”) and returns the [Number1] * [Number2] and [Number1] / [Number2]
PS: you will not be able to run the sample as it is as I changed the API Key and also the URL of my original Azure ML Web Service
One of the coolest features of Power BI is that I integrates very well with other tools and also offers a lot of interfaces which can be used to extend this capabilities even further. One of those is the R Integration which allows you to run R code from within Power BI. R scripts can either be used as a data source or for visualizing your data. In this post I will focus on the data source component and show how you can use a locally stored R script and execute it directly in Power BI. Compared to the native approach where you need to embed the R code in the Power BI file, this has several advantages:
Develop R script in familiar external tool like RStudio
Integration with Source Control
Leverage Power BI for publishing and visualizing results
Out of the box Power BI only supplies one function to call R scripts as a data source which is R.Execute(text). Usually, when you use the wizard, it simply passes your R script as a hardcoded value to this function. Knowing the power of Power BI and its scripting language M for data integration made me think – “Hey, as R scripts are just text files and Power BI can read text files, I could also dynamically read any R script and execute it!”
Well, turns out to be true! So I created a little M function where I pass in the file-path of an existing R script and which returns a table of data frames which are created during the execution of the script. Those can then be used like any other data sets/tables within Power BI:
And here is the corresponding M code for the Power Query function: (Thanks also to Imke Feldmann for simplifying my original code to the readable one below)
LoadLocalRScript = (file_path as text) as table =>
First we read the R script like any other regular CSV file but we use line-feed (“#(lf)”) as delimiter. So we get a table with one column and one row for each line of our original R script. Then we use Text.Combine() on our column to transform the single lines back into one long text resembling our original R script. This text can the be passed to the R.Execute() function to return the list of R data frames created during the execution of the script.
And that’s it! Any further steps are similar to using any regular R script which is embedded in Power BI so it is up to you on how you proceed from here. Just one thing you need to keep in mind is that changing the local R script might break the Power BI load if you changed or deleted any data frames which are referenced in Power BI later on.
One issues that I came across during my tests is that this approach does not work with scheduled refreshes in the Power BI Web Service via the Personal Gateway. The first reason for this is that it is currently not possible to use scheduled refresh if custom functions are involved. Even if you can work around this issue pretty easily by using the code from above directly in Power Query I still ran into issues with different privacy levels for the location of the R script and the R.Execute() function. But I will investigate into those issues and update this blog post accordingly (see UPDATE below). For the future I hope that is fixed by Microsoft and Power BI allows you to execute remote scripts natively – but until then, this approach worked quite well for me.
In case you are interested in more details on this approach, I am speaking at TugaIT in Lisbon, Portugal this Friday (20th of May 2016) about “Power BI for the Data Scientist” where I will cover this and lots of other interesting topics about the daily work of a data scientist and how PowerBI can used to ease them.