Gerhard Brueckl on BI & Data – Page 7 – Microsoft Data Platform and beyond …

PowerShell module for Databricks on Azure and AWS : Avaiilable via PowerShell Gallery: DatabricksPS Over the last year I worked a lot with Databricks on Azure and I have to say that I was (and still am) very impressed how well it works and how it integrates with other services of the Microsoft Azure Data Platform like Data Lake Store, Data Factory, etc. Some of the projects I worked on also included CI/CD like pipelines using Azure DevOps where Databricks did not really shine so bright in the beginning. There are no native tasks for it or anything. But this is OK as for those scenarios, where you need Continue reading →

Storing Images in a PowerBI/Analysis Services Data Models : As some of you probably remember, when PowerPivot was still only available in Excel and Power Query did not yet exist, it was possible to load images from a database (binary column) directly into the data model and display them in PowerView. Unfortunately, this feature did not work anymore in PowerBI Desktop and the only way to display images in a visual was to provide the URL of the image which is public accessible. The visual would then grab the image on-the-fly from the URL and render it. This of course has various drawbacks: The image needs to be available Continue reading →

Deploying an Azure Data Factory project as ARM Template : In my last post I wrote about how to Debug Custom .Net Activities in Azure Data Factory locally. This fixes one of the biggest issues in Azure Data Factory at the moment for developers. The next bigger problem that you will run into is when it comes to deploying your Azure Data Factory project. At the moment, you can only do it manually from Visual Studio which, for bigger projects, can take quite some time. So I extended and advanced the code from my CustomActivityDebugger. Well, actually I rewrote some major parts of it and moved it into a new Continue reading →

Error-handling in Power Query : Data is the daily bread-and-butter for any analyst. In order to provide good results you also need good data. Sometimes this data is very well prepared beforehand and you can use it as it is but it is also very common that you need to prepare and transform the data on your own. To do this Microsoft has introduced Power Query (on tool of the Power BI suite). Power Query can be used to extract, transform and load data into Excel and/or Power Pivot directly. When using any data you usually know what the data looks like and what to Continue reading →

Scheduled Data Refresh of OData Sources in Power BI

Posted on 2014-03-10 by Gerhard Brueckl — 5 Comments ↓

With the recent public release of Power BI it is finally possible to refresh Power Pivot workbooks online.Once a workbook is uploaded to SharePoint online and “Enabled” for Power BI you can schedule an automatic data refresh for the Power Pivot model. Though, at the moment only a very limit number of data sources are supported:

Windows Azure SQL Database data
Open Data protocol (OData) feeds
On premises data sources that are enabled for access in Power BI for Office 365

Especially for public available data OData feeds are very popular, for example from Wikipedia. Those public data feeds usually do not require any authentication so one would expect these data sources to work flawless with a scheduled data refresh of Power BI. Well, you are wrong here!
When you create a simple Power Pivot model with one OData source to e.g. Wikipedia, publish it and setup scheduled data refresh you will receive the following error:

The error message itself does not reveal any insights on the actual error. If you are in the (more or less) lucky situation that you are also owner of that site you can go to the Power BI Admin Center to get some further details on the error:

By downloading the CSV or hovering over a single Message you can see the details:

"Failed to find a match for the data source (connection string: data source=http://publicdata.clouddatahub.net/Web/Tables/fa9af6681cd64206b3aafe6d12408117/V1/Data;include atom elements=Auto;include expanded entities=False;integrated security=SSPI;persist security info=false;time out=600;schema sample size=25;retry count=5;retry sleep=100;keep alive=False;max received message size=4398046511104;base url=http://publicdata.clouddatahub.net/Web/Tables/fa9af6681cd64206b3aafe6d12408117/V1/Data) for the user 'Gerhard.Brueckl@XYZ.onmicrosoft.com'. The user is unauthorized or, the corresponding data source is not created. Check the user's permission to the data source or create a data source for the connection string. Tracing ID: 23f244ad-7921-48f7-b13a-ef68e8cf5503"

It says that for my user no corresponding data source exists or I do not have permissions to access it.

In our case the reason is that our user is unauthorized – the actually data source that was used must not necessarily exist in the user’s data sources (Those can be found via “My Power BI”) for scheduled data refresh to work.

So you will ask yourself what could go possibly wrong here as you are just accessing a public OData feed?!?
The reason is that in the connection string the “Integrated Security”-property is set to SSPI by default. In your local Excel/Power Pivot model this works just fine as the SSPI context can be resolved and sent to the OData feed. Sure the OData feed actually ignores this information as it is public but from an authentication point of view everything works correctly!

The problem with Scheduled Data Refresh is that SSPI simply does not work as it cannot be resolved and Power BI cannot use any service user for your request if SSPI is defined. The first thing that comes to your mind would be to simply set the authentication method to “Anonymous” which would be perfectly fine for a public data feed. However, Power Pivot does not support Anonymous authentication for OData sources:

As SSPI does not work we need to use “Basic” here and provide “User ID” and “Password”. Which UserID/Password you may ask? – and the simple answer is: “It does not matter!”
You can provide any values here, in my case I used “random” for both, User ID and Password! Another thing you need to ensure is that “Persist Securtiy Info” is set to True so your “Password” is stored in the final connection string making Power BI think that authentication is defined correctly within the connection string and Power BI does not have to do anything.

Once you changes those settings your data refresh will work like a charm:

On last thing you may realize is the “Running Time”. It takes significantly longer when doing the refresh in Power BI opposed to doing the refresh locally. Just keep that in mind, especially for bigger data feeds.

In the future I hope that Microsoft will introduce some kind of “Anonymous” authentication within the dropdown of Power Pivot or simply check at some point if the OData feed requires any authentication at all hence overwriting the authentication mode specified in the connection string when refreshing.

Applied Basket Analysis in Power Pivot using DAX

Posted on 2014-02-05 by Gerhard Brueckl — 25 Comments ↓

Basket Analysis is a very common analysis especially for online shops. Most online shops make use of it to make you buy products that “Others also bought …”. Whenever you view a specific product “Product A” from the online shop, basket analysis allows the shop to show you further products that other customers bought together with “Product A”. Its basically like taking a look into other customers shopping baskets. In this blog post I will show how this can be done using Power Pivot. Alberto Ferrari already blogged about it here some time ago and showed a solution for Power Pivot v1. There is also dedicated chapter in the whitepaper The Many-to-Many Revolution 2.0 which deals with Basket Analysis, already in Power Pivot v2. Power Pivot v2 already made the formula much more readable and also much faster in terms of performance. Though, there are still some things that I would like to add.

Lets take a look at the initial data model first:

First of all we do not want to modify this model but just extend it so that all previously created measures, calculations and, most important, the reports still work. So the only thing we do is to add our Product-tables again but with a different name. Note that I also added the Subcategory and Category tables in order to allow Basket Analysis also by the Product-Category hierarchy. As we further do not want to break anything we only use an inactive relationship to our ‘Internet Sales’ fact table.

After adding the tables the model looks as follows:

The next and actually last thing to do is to add the following calculated measure:

Sold in same Order :=

CALCULATE (

COUNTROWS ( ‘Internet Sales’ ),

CALCULATETABLE (

SUMMARIZE (

‘Internet Sales’,

‘Internet Sales’[Sales Order Number]

),

ALL ( ‘Product’ ),

USERELATIONSHIP ( ‘Internet Sales’[ProductKey], ‘Filtered Product’[ProductKey] )

)

)

(formatted using DAX Formatter)

The inner CALCULATETABLE returns a list/table of all [Sales Order Numbers] where a ‘Filtered Product’ was sold and uses this table to extend the filter on the ‘Internet Sales’ table. It is also important to use ALL(‘Product’) here otherwise we would have two filters on the same column ([ProductKey]) which would always result in an empty table. Doing a COUNTROWS finally returns all for all baskets where the filtered product was sold.
We could also change ‘Internet Sales'[Sales Order Number] to ‘Internet Sales'[CustomerKey] in order to analyze what other customers bought also in different baskets (This was done for Example 3). The whole SUMMARIZE-function could also be replaced by VALUES(‘Internet Sales'[Sales Order Number]). I used SUMMARIZE here as I had better experiences with it in terms of performance in the past, though, this may depend on your data. The calculation itself also works with all kind of columns and hierarchies, regardless whether its from table ‘Product’, ‘Filtered Product’, or any other table!

So what can we finally do with this neat formula?

1) Classic Basket Analysis – “Others also bought …”:

As we can see Hydration Packs are more often sold together with Mountain Bikes opposed to Road Bikes and Touring Bikes. We could also use a slicer on ‘Filtered Product Subcategory’=”Accessories” in order to see how often Accessories are sold together with other products. You may analyze by Color and Product Category:

As we can see people that buy black bikes are more likely to buy red helmets than blue helmets.

2) Basket Analysis Matrix:
What may also be important for us is which products are sold together the most often? This can be achieved by pulling ‘Product’ on rows and ‘Filtered Product’ on columns. By further applying conditional formatting we can identify correlations pretty easy:

Water Bottles are very often sold together with Bottle Cages – well, not really a surprise. Again, you can also use all kind of hierarchies here for your analysis.
This is what the whole matrix looks like:

The big blank section in the middle are our Bikes. This tells us that there is no customer that bought two bikes in the same order/basket.

For this analysis I used an extended version of the calculation above to filter out values where ‘Product’ = ‘Filtered Product’ as of course every product is sold within its own basket:

BasketMatrix :=

IF (

MIN ( ‘Product’[ProductKey] )

<> MIN ( ‘Filtered Product’[ProductKey] ),

[Sold in same Order]

)

3) Find Customers that have not bough a common product yet
As we now know from the above analysis which products are very often bought together we can also analyze which customers do not fall in this pattern – e.g. customers who have bough a Water Bottle but have not bought a Bottle Cage yet. Again we can extend our base-calculation to achieve this:

Not Sold to same Customer :=

IF (

NOT ( ISBLANK ( [Sum SA] ) ) && NOT ( [Sold to same Customer] ),

“Not Sold Yet”

)

The first part checks if the selected ‘Product’ was sold to the customer at all and the second part checks if the ‘Filtered Product’ was not sold to the customer yet. In that case we return “Not Sold Yet”, and otherwise BLANK() which is the default if the third parameter is omitted. That’s the result:

Aaron Phillips has bought a Water Bottle but no Mountain Bottle Cage nor a Road Bottle Cage – maybe we should send him some advertisement material on Bottle Cages? Smile

As you can see there are a lot of analyses possible on top of that little measure that we created originally. All work with any kind of grouping or hierarchy that you may have and no change to your data model is necessary, just a little extension.

And that’s it – Basket Analysis made easy using Power Pivot and DAX!

Downloads:

Sample Excel Workbook with all Examples: BasketAnalysis.xlsx
Sample PowerBI Desktop file with all Examples (DAX only, no Visuals): BasketAnalysis.pbix

Optimizing Columnar Storage for Measures

Posted on 2014-02-03 by Gerhard Brueckl — No Comments ↓

First of all I have to thank Marco Russo for his blog post on Optimizing High Cardinality Columns in Vertipaq and also his great session at SQL PASS Rally Nordic in Stockholm last year which taught me a lot about columnar storage in general. I highly recommend everyone to read the two mentioned resources before continuing here. Most of the ideas presented in this post are based on these concepts and require at least basic knowledge in columnar storage.

When writing one of my recent posts I ended up with a Power Pivot model with roughly 40M rows. It contained internet logs from Wikipedia, how often someone clicked on a given page per month and how many bytes got downloaded. As you can imagine those values can vary very heavily, especially the amount of bytes downloaded. So in the Power Pivot model we end up having a lot of distinct values in our column that we use for our measure. As you know from Marcos posts, the allocated memory and therefore also the performance of columnar storage systems is directly related to the number of distinct values in a column – the more the worse. Marco already described an approach to split up a single column with a lot of distinct values into several columns with less distinct values to optimize storage. These concepts can also be used on columns that contain measures or numeric values in general. Splitting numeric values is quite easy, assuming your values range from 1 to 1,000,000 you can split this column into two by dividing the value by 1000 and using MOD 1000 for the second column. Instead of one column with the value 123,456 you end up with two columns with the values 123 and 456. In terms of storage this means that instead of 1,000,000 distinct values we only need to store 2 x 1,000 distinct values. Nothing new so far.

The trick is to combine those columns again at query time to get the original results as before the split. For some aggregations like SUM this is pretty straight forward, others are a bit more tricky. Though, in general the formulas are not really very complex and can be adopted very easily to handle any number of columns:

Aggregation	DAX Formula
Value_SUM1	=SUMX(‘1M_Rows_splitted’, [Value_1000] * 1000 + [Value_1])
Value_SUM2	=SUM ( ‘1M_Rows_splitted'[Value_1000] ) * 1000 + SUM ( ‘1M_Rows_splitted'[Value_1] )
Value_MAX	=MAXX(‘1M_Rows_splitted’, [Value_1000] * 1000 + [Value_1])
Value_MIN	=MINX(‘1M_Rows_splitted’, [Value_1000] * 1000 + [Value_1])
Value_COUNT	=COUNTX(‘1M_Rows_splitted’, [Value_1000] * 1000 + [Value_1])
Value_DISTINCTCOUNT	=COUNTROWS ( SUMMARIZE ( ‘1M_Rows_splitted’, ‘1M_Rows_splitted'[Value_1000], ‘1M_Rows_splitted'[Value_1]))

As you can see you can still mimic most kind of aggregation even if the [Value]-column is split up.

Though, don’t exaggerate splitting your columns – too many may be a bit inconvenient to handle and may neglect the effect resulting in worse performance. Marco already showed that you can get a reduction of up to 90% in size, during my simple tests I came up with about the same numbers. Though, it very much depends on the number of distinct values that you actually have in your column!

I would not recommend to always use this approach for all your measure column – no, definitely not! First check how many distinct values your data/measures contain and decide afterwards. For 1 million distinct values it is probably worth it, for 10,000 you may reconsider using this approach. Most important here is to test this pattern with your own data, data model and queries! Test it in terms of size and of course also in terms of performance. It may be faster to split up columns but it may also be slower and it may be also different for each query that you execute against the tabular model / Power Pivot. Again, test with your own data, data model and queries to get representative results!

Here is a little test that you may run on your own to test this behavior. Simple create the following Power Query using M, load the result into Power Pivot and save the workbook. It basically creates a table with 1 million distinct values (0 to 999,999) and splits this column up into two. You can just copy the workbook, remove the last step “Remove Columns” and save it again to get the “original” workbook and Power Pivot model.

let

    List1 = List.Numbers(0, 1000000),
    TableFromList = Table.FromList(List1, Splitter.SplitByNothing(), null, null, ExtraValues.Error),

    RenamedColumns = Table.RenameColumns(TableFromList,{{"Column1", "Value"}}),

    ChangedType1 = Table.TransformColumnTypes(RenamedColumns,{{"Value", type number}}),

    InsertedCustom = Table.AddColumn(ChangedType1, "Value_1000", each Number.RoundDown([Value] / 1000)),

    InsertedCustom1 = Table.AddColumn(InsertedCustom, "Value_1", each Number.Mod([Value], 1000)),

    ChangedType = Table.TransformColumnTypes(InsertedCustom1,{{"Value_1000", type number}, {"Value_1", type number}}),

    RemovedColumns = Table.RemoveColumns(ChangedType,{"Value"})

in

    RemovedColumns

This is what I ended up with when you compare those two workbooks:

	# distinct values	Final Size
single column	1,000,000	14.4 MB
two columns	2 x 1,000	1.4 MB

We also get a reduction in size of 90%! Though, this is a special scenario …
In the real world, taking my previous scenario with the Wikipedia data as an example, I ended up with a reduction to 50%, still very good though. But as you can see the reduction factor varies very much.

Downloads:

Sample workbook with Power Query: 1M_Rows_splitted.xlsx

Upcoming Conferences and Events in Q1 2013

Posted on 2014-01-13 by Gerhard Brueckl — No Comments ↓

After PASS SQL Rally Nordic in Stockholm last year I am very happy to announce that I am going to speak at two more events in the first quarter of 2013.

I will do a session at the “Deutsche SQL Server Konferenz 2014” (=”German SQL Server Conference 2014″) on “Big Data Scenario mit Power BI vs. SAP HANA“. It is basically an advanced version of my blog post on SAP HANAs Big Data Scenario with Power BI with much more insights details on both technologies. The conference itself is a 3 day conference from 10th to 12th of February where day 1 is reserved for pre-conference sessions. It also features a lot of international speakers and of course also a good amount of English sessions (mine will be in German though). Also my colleague Marcel Franke will do a session about PDW and R which you also do not want to miss if you are into Big Data and predictive analytics!
Check out the agenda and make sure you register in time!

Later on the 6th of March I will speak at the SQLSaturday #280 in Vienna on “Scaling Analysis Services in the Cloud“. (This is not a typo, its really on 6th of March which is actually a Thursday!) The session focuses on how to get the best performance out of multidimensional Analysis Services solutions when they are moved to the Windows Azure cloud. In the end I will come up with some best practices and guide lines for this and similar scenarios.
Just make sure that you register beforehand to enjoy this full day of free sessions and trainings!

Hope to see you there!

Reporting Services MDX Field List and Using Measures on rows

Posted on 2014-01-03 by Gerhard Brueckl — 4 Comments ↓

When creating a Reporting Services report on top of an Analysis Services cube using the wizard it automatically creates a Field for each column in your MDX query. Those fields can then be used in your report. For reports based on a relational source the definition of these fields is quite simple, it is the same as the the column name of the originating query. For MDX queries this is very different. If you ever checked the definition of an automatically generated MDX field you will see a lengthy XML snippet instead:

The XMLs may look like these:

<?xml version="1.0" encoding="utf-8"?>
 <Field xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
        xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
        xsi:type="Level" 
        UniqueName="[Product].[Subcategory].[Subcategory]" />

<?xml version="1.0" encoding="utf-8"?>
 <Field xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
        xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
        xsi:type="Measure" 
        UniqueName="[Measures].[Internet Sales Amount]" />

As you can see those two are quite different in terms of xsi:type and UniqueName. The xsi:type “Level” refers to a dimension level whereas “Measure” refers to a measure. Depending on the type of field, different properties are available within the report:

For example the property BackgroundColor is only populated for fields of type “Measure” whereas the property UniqueName is only populated for fields of type “Level”. Measure properties are tied to the CELL PROPERTIES in your MDX query and Level properties are tied to DIMENSION PROPERTIES:

SELECT 
 NON EMPTY { [Measures].[Internet Sales Amount] } ON COLUMNS, 
 NON EMPTY { ([Product].[Subcategory].[Subcategory].ALLMEMBERS ) } 
 DIMENSION PROPERTIES MEMBER_CAPTION, MEMBER_UNIQUE_NAME ON ROWS 
 FROM [Adventure Works] 
 CELL PROPERTIES VALUE, BACK_COLOR, FORE_COLOR, FORMATTED_VALUE, FORMAT_STRING, FONT_NAME, FONT_SIZE, FONT_FLAGS

If we remove MEMBER_UNIQUE_NAME from the DIMENSION PROPERTIES we would no longer be able to use Fields!Subcategory.UniqueName in our SSRS expressions, or to be more precise it would simply always return NULL (or NOTHING in terms of Visual Basic). The same of course is also true for the CELL PROPERTIES.

So far this is nothing really new but there are some more things about the fields of MDX queries. There is a third xsi:type called “MemberProperty” which allows you to query member properties without having to define separate measures within your query:

<?xml version="1.0" encoding="utf-8"?>
 <Field xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
        xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
        xsi:type="MemberProperty" 
        LevelUniqueName="[Product].[Subcategory].[Subcategory]" 
        PropertyName="Category" />

Once we add a member property to our MDX query SSRS also automatically creates this field for us. NOTE, this is only possible by manually modifying the MDX!

SELECT 
 NON EMPTY { [Measures].[Internet Sales Amount] } ON COLUMNS, 
 NON EMPTY { ([Product].[Subcategory].[Subcategory].ALLMEMBERS ) } 
 DIMENSION PROPERTIES MEMBER_CAPTION, MEMBER_UNIQUE_NAME, 
     [Product].[Subcategory].[Subcategory].[Category] ON ROWS 
 FROM [Adventure Works] 
 CELL PROPERTIES VALUE, BACK_COLOR, FORE_COLOR, FORMATTED_VALUE, FORMAT_STRING, FONT_NAME, FONT_SIZE, FONT_FLAGS

To get the [Category] which is associated to a given [Subcategory] you would usually need to create a separate measure like this:

WITH 
 MEMBER [Measures].[Category] AS (
 [Product].[Subcategory].Properties( "Category" )
 )
 SELECT 
 …

This has the very bad drawback that using the WITH MEMBER clause disables the formula engine cache for the whole query what may result in worse query performance. So you may consider using DIMENSION PROPERTIES instead of a custom Measure next time.

There is another very nice “feature” that is also related to the field list. If you ever had the requirement to create a parameter to allow the user to select which measure he wants to see in the report you probably came across this blog post by Chris Webb or this blog post by Rob Kerr. As you know by then, SSRS requires you to put the Measures-dimension on columns, otherwise the query is not valid. This is because the number of Measures is not considered to be dynamic (opposed to e.g. Customers) which allows SSRS to create a static field list. This makes sense as SSRS was originally designed for relational reporting and a table always has a fixed number of columns which are similar to fields in the final SSRS dataset. Using Measures on columns is the way how SSRS enforces this.

As we are all smart SSRS and MDX developers and we know what we are doing we can trick SSRS here. All we need to do is to write a custom MDX query using the expression builder – do not use or even open the Query Designer at this point otherwise your query may get overwritten!

SSRS also automatically creates the fields for us, but this time the fields are not defined correctly. It creates one field with a very cryptic name and the following XML definition:

<?xml version="1.0" encoding="utf-8"?>
 <Field xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
        xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
        xsi:type="Measure" 
        UniqueName="[Measures].[MeasuresLevel]" />

As you can see SSRS thinks that this field is of type “Measure” but it actually is a “Level”. After changing this little thing we can access all field properties that are unique to Level-fields like <Field>.UniqueName

So this is the final MDX query and the associated XML field definition:

SELECT 
 {} ON 0,
 [Measures].members 
 DIMENSION PROPERTIES MEMBER_CAPTION, MEMBER_UNIQUE_NAME ON 1
 FROM [Adventure Works]

Ensure that you have defined the necessary DIMENSION PROPERTIES here otherwise they will not be available/populated within the report!

<?xml version="1.0" encoding="utf-8"?>
 <Field xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
        xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
        xsi:type="Level" 
        UniqueName="[Measures].[MeasuresLevel]" />

In order to make use of this approach in a report parameter we further need to create calculated fields for our parameter label and parameter value:

The definition of the parameter is straight forward then:

You can not only use this approach to populate a parameter but you can also use it to crossjoin Measures on rows with any other hierarchy. This way you can avoid writing complex MDX just to work around this nasty SSRS limitation.

Downloads:

Sample SSRS report: CustomFieldList.rdl

Error-handling in Power Query

Posted on 2013-12-20 by Gerhard Brueckl — 24 Comments ↓

Data is the daily bread-and-butter for any analyst. In order to provide good results you also need good data. Sometimes this data is very well prepared beforehand and you can use it as it is but it is also very common that you need to prepare and transform the data on your own. To do this Microsoft has introduced Power Query (on tool of the Power BI suite). Power Query can be used to extract, transform and load data into Excel and/or Power Pivot directly.

When using any data you usually know what the data looks like and what to expect from certain columns – e.g. a value is delivered as a number, a text contains exactly 4 characters, etc.
Though, sometimes this does not apply for all rows of that dataset and your transformation logics may cause errors because of that. In order to avoid this and still have a clean data load you need to handle those errors. Power Query offers several options to this which I will elaborate in this post.

This is the sample data I will use for the following samples:

A	B	C
1	4	AXI23
2	5	TZ560
NA	6	UP945

we will perform simple transformations and type casts on this table to generate some errors:

Error-handling on row-level

This is the easiest way of handling errors. Whenever a transformation causes an error, we can simply remove the whole row from the result set:

This will generate the following result table:

A	B	C
1	4	AX312
2	5	TZ560

As you can see, the third row was simply removed. This is the easiest way on how to remove errors from your result set but of course this may not be what you want as those removed rows may contain other important information in other columns! Assume you want to calculate the SUM over all values in column B. As we removed the third row we also removed a value from column B and the SUM is not the same as without the error-handling (9 vs. 15)!

Error-handling on cell-level

As we now know that column A may result in an error, we can handle this error during our transformation steps. As “NA” could not be converted to a number we see Error as result for that column. Clicking on it gives use some more details on the error itself. To handle this error we need to create a new calculated column where we first check if the value can be converted to a number and otherwise return a default numeric value:

The M-function that we use is “try <expressions to try> otherwise <default if error>” which is very similar to a try-catch block in C#. If the expression causes an error, the default will be used instead. Details on the try-expression can be found in the Microsoft Power Query for Excel Formula Language Specification (PDF) which can be found here and is a must read for everyone interested in the M language.

We could further replace our column A by the new column A_cleaned to hide this error handling transformation.

A	B	C	A_cleaned
1	4	AXI23	1
2	5	TZ560	2
NA	6	UP945	0

Error-handling on cell-level with error details

There may also be cases where it is OK to have one of this errors but you need/want to display the cause of the error so that a power user may correct the source data beforehand. Again we can use the try-function, but this time without the otherwise-clause. This will return a record-object for each row:

After expanding the A_Try column and also the A_Try.Error column we will get all available information on the error:

A	B	C	A_Try.HasError	A_Try.Value	A_Try.Error.Reason	A_Try.Error.Message	A_Try.Error.Detail
1	4	AXI23	FALSE	1
2	5	TZ560	FALSE	2
	6	UP945	TRUE		DataFormat.Error	Could not convert to Number.	NA

As you can see we get quite a lot of columns here. We could e.g. use A_Try.HasError to filter out error rows (similar to error-handling on row-level) or we could use it in a calculated column to mimic error-handling on cell-level. What you want to do with all the information is up to you, but in case you don’t need it you should remove all unnecessary columns.

Downloads:

Power Query Error Handling Workbook: Power Query Error Handling.xlsx

Call for Speakers: SQLSaturday#280 Vienna

Posted on 2013-11-22 by Gerhard Brueckl — No Comments ↓

The SQL PASS Austria chapter is organizing a SQL Saturday in Vienna. It will be held on the March 6th 2014. (For those of you how are checking their calendars right now – yes it is actually a Thursday!) So register and be part of the first SQLSaturday that is held on a Thursday Smile !!!

Also the Call for Speakers is opened until January 5th, so make sure to submit your sessions by then.

Hope to see you there!

Using Power Query to analyze SSAS Disk Usage

Posted on 2013-11-15 by Gerhard Brueckl — 3 Comments ↓

Some time ago Bob Duffy blogged about on how to use Power Pivot to analyze the disk usage of multidimensional Analysis Services models (here). He uses a an VBA macro to pull meta data like filename, path, extension, etc. from the file system or to be more specific from the data directory of Analysis Services. Analysis Services stores all its data in different files with specific extensions so it is possible to link those files to multidimensional objects in terms of attributes, facts, aggregations, etc. Based on this data we can analyze how our data is distributed. Do we have too big dimensions? Which attribute uses the most space? Do our facts consume most of the space (very likely)? If yes, how much of it is real data and how big are my aggregations – if they are processed at all?!? – These are very common and also important things to know for an Analysis Services developer.

So Bob Duffy’s solution can be really useful. The only thing I did not like about it was the fact that it uses a VBA macro to get the data. This made me think and I came up with the idea of using Power Query to get this data. Btw, make sure to check out the latest release, there have been a lot of improvements recently!

With Power Query you have to option to load multiple files from a folder and also from its sub folders. When we do this on our Analysis Services data directory, we get a list of ALL files together with their full path, filename, extension and most important in this case their size which can be found by expanding the Attributes-record:

The final Power Query does also a lot of other things to prepare the data so it can be later joined to our FileExtensions-table that holds detailed information for each file extension. This table currently looks like below but can be extended by any other columns that may be necessary and/or useful for you:

FileType	FileType_Description	ObjectType	ObjectTypeSort	ObjectTypeDetails
ahstore	Attribute Hash Store	Dimensions	20	Attribute
asstore	Attribute String Store	Dimensions	20	Attribute
astore	Attribute Store	Dimensions	20	Attribute
bsstore	BLOB String Store	Dimensions	20	BLOB
bstore	BLOB Store	Dimensions	20	BLOB
dstore	Hierarchy Decoding Store	Dimensions	20	Hierarchy
khstore	Key Hash Store	Dimensions	20	Key
ksstore	Key String Store	Dimensions	20	Key
kstore	Key Store	Dimensions	20	Key
lstore	Structure Store	Dimensions	20	Others
ostore	Order Store	Dimensions	20	Others
sstore	Set Store	Dimensions	20	Others
ustore	ustore	Dimensions	20	Others
xml	XML	Configuration	999	Configuration
fact.data	Basedata	Facts	10	Basedata
fact.data.hdr	Basedata Header	Facts	10	Basedata
fact.map	Basedata Index	Facts	10	Basedata
fact.map.hdr	Basedata Index Header	Facts	10	Basedata
rigid.data	Rigid Aggregation Data	Facts	10	Aggregations
rigid.data.hdr	Rigid Aggregation Data Header	Facts	10	Aggregations
rigid.map	Rigid Aggregation Index	Facts	10	Aggregations
rigid.map.hdr	Rigid Aggregation Index Header	Facts	10	Aggregations
flex.data	Flexible Aggregation Data	Facts	10	Aggregations
flex.data.hdr	Flexible Aggregation Data Header	Facts	10	Aggregations
flex.map	Flexible Aggregation Index	Facts	10	Aggregations
flex.map.hdr	Flexible Aggregation Index Header	Facts	10	Aggregations
string.data	String Data (Distinct Count?)	Facts	10	Basedata
cnt.bin	Binary	Configuration	999	Binaries
mrg.ccmap	mrg.ccmap	DataMining	999	DataMining
mrg.ccstat	mrg.ccstat	DataMining	999	DataMining
nb.ccmap	nb.ccmap	DataMining	999	DataMining
nb.ccstat	nb.ccstat	DataMining	999	DataMining
dt	dt	DataMining	999	DataMining
dtavl	dtavl	DataMining	999	DataMining
dtstr	dtstr	DataMining	999	DataMining
dmdimhstore	dmdimhstore	DataMining	999	DataMining
dmdimstore	dmdimstore	DataMining	999	DataMining
bin	Binary	Configuration	999	Binaries
OTHERS	Others	Others	99999	Others

As you can see the extension may contain 1, 2 or 3 parts. The more parts the more specific this file extension is. If you checked the result of the Power Query it also contains 3 columns, FileExtension1, FileExtension2 and FileExtension3. To join the two tables we first need to load both tables into Power Pivot. The next step is to create a proper column on which we can base our relationship. If the 3-part extension is found in the file extensions table, we use it, otherwise we check the 2-part extension and afterwards the 1-part extension and in case nothing matches we use “OTHERS”:

=SWITCH(TRUE(),
 CONTAINS(FileTypes, FileTypes[FileType], [FileExtension3]), [FileExtension3],
 CONTAINS(FileTypes, FileTypes[FileType], [FileExtension2]), [FileExtension2],
 CONTAINS(FileTypes, FileTypes[FileType], [FileExtension1]), [FileExtension1],
 "OTHERS")

Then we can create a relationship between or PQ table and our file extension table. I also created some other calculated columns, hierarchies and measures for usability. And this is the final outcome:

You can very easily see, how big your facts are, the distribution between base-data and Aggregations, the Dimensions sizes and you can drill down to each individual file! You can of course also create a Power View report if you want to. All visualizations are up to you, this is just a very simple example of a report.

Enjoy playing around with it!

Downloads:
(please note that I added a filter on the Database name as a last step of the Power Query to only show Adventure Works databases! In order to get all databases you need to remove this filter!)

SSAS Disk Analysis Workbook: SSAS_DiskAnalysis.xlsx

SAP HANA’s Big Data Scenario with Power BI

Posted on 2013-11-11 by Gerhard Brueckl — 2 Comments ↓

While browsing the web for any BI related topics I recently came across this blog post about SAP HANA and how it can be used to analyze Big Data. Its actually pretty cool, SAP together with Amazon Web Services (AWS) offer a free try out of their tools for 4 hours which you can use to rebuild a predefined demo. The demo itself is very well explained and document with videos and scripts and gives some good insights on how to deal with Big Data in SAP HANA. Basically it is divided into 3 steps:
1) Load Wikipedia data (Pagehits, etc.) from Hadoop/Hive (~2GB of flatfiles)
2) Create a data mart with SAP HANA
3) Analyze results with SAP Lumira

Having done a lot recently with Power BI and its tools I asked myself if this would also be possible with Power BI? So I first did the demo on SAP HANA and afterwards I was quite sure that I could do the same also with Power BI.

And this was the initiation of this blog post where we will do the same demo but instead of SAP HANA we will use only tools of the Power BI suite. Basically we only use 3 of our Power tools:
1) Power Query to load the data
2) Power Pivot to build the “data mart”
3) Power View to analyze the data

First of all we need to load the data. The demo uses data from Wikipedia where for each year, month, day and hour the number of pagehits and bytesdownloaded are monitored per page: http://dumps.wikimedia.org/other/pagecounts-raw/

Once we have downloaded the files we can start loading them using Power Query’s “Load from Folder” source. It is basically textdata which needs to be split into several columns first. The delimiter used is 0x0001 which in Power Query’s M-language needs to be resolved to “#(0001)”:

= Table.SplitColumn(ImportedText ,"Column1",
     Splitter.SplitTextByDelimiter("#(0001)"),
     {"Column1.1", "Column1.2", "Column1.3", "Column1.4", 
      "Column1.5", "Column1.6", "Column1.7", "Column1.8"})

The columns are defined as PROJECTCODE, PAGENAME, YEAR, MONTH, DAY, HOUR, PAGEHITCOUNTFORHOUR and BYTESDOWNLOADEDFORHOUR where PROJECTCODE can be further split into language code and the real project code. There is some more logic which I will not explain in detail like error handling, data conversions, etc., which are similar to what is done in the SAP HANA demo.

This is the final M-script I came up with:

let
     Source = Folder.Files("E:\SAP\Big Data"),
     FilteredRows = Table.SelectRows(Source, each ([Extension] = "")),
     CombinedBinaries = Binary.Combine(FilteredRows[Content]),
     ImportedText = Table.FromColumns({Lines.FromBinary(CombinedBinaries)}),
     SplitColumnDelimiter = Table.SplitColumn(ImportedText ,"Column1",Splitter.SplitTextByDelimiter("#(0001)"),{"Column1.1", "Column1.2", "Column1.3", "Column1.4", "Column1.5", "Column1.6", "Column1.7", "Column1.8"}),
     ChangedType = Table.TransformColumnTypes(SplitColumnDelimiter,{{"Column1.1", type text}, {"Column1.2", type text}, {"Column1.3", type number}, {"Column1.4", type number}, {"Column1.5", type number}, {"Column1.6", type number}, {"Column1.7", type number}, {"Column1.8", type number}}),
     SplitColumnDelimiter1 = Table.SplitColumn(ChangedType,"Column1.1",Splitter.SplitTextByDelimiter("."),{"Column1.1.1", "Column1.1.2"}),
     ChangedType1 = Table.TransformColumnTypes(SplitColumnDelimiter1,{{"Column1.1.1", type text}, {"Column1.1.2", type text}}),
     ReplacedValue = Table.ReplaceValue(ChangedType1,null,"wp",Replacer.ReplaceValue,{"Column1.1.2"}),
     InsertedCustom = Table.AddColumn(ReplacedValue, "PageHitsPerHour", each try Number.From([Column1.7]) otherwise 0),
     InsertedCustom1 = Table.AddColumn(InsertedCustom, "BytesDownloadedPerHour", each try Number.From([Column1.8]) otherwise 0),
     RemovedColumns = Table.RemoveColumns(InsertedCustom1,{"Column1.7", "Column1.8"}),
     RenamedColumns = Table.RenameColumns(RemovedColumns,{{"Column1.3", "Year"}, {"Column1.4", "Month"}, {"Column1.5", "Day"}, {"Column1.6", "Hour"}, {"Column1.2", "PageName"}, {"Column1.1.1", "LanguageCode"}, {"Column1.1.2", "ProjectCode"}}),
     InsertedCustom2 = Table.AddColumn(RenamedColumns, "Date", each #date([Year],[Month],[Day])),
     ChangedType2 = Table.TransformColumnTypes(InsertedCustom2,{{"PageHitsPerHour", type number}, {"BytesDownloadedPerHour", type number}, {"Date", type date}}),
     FilteredRows1 = Table.SelectRows(ChangedType2, each [PageHitsPerHour] < 53000000)
 in
     FilteredRows1

If you also did the demo on SAP HANA you found some quality issues in the data why you needed to remove rows with more than 53,000,000 PageHitsPerHour as this 5 rows mess up the whole analysis. The above M-script already handles this (last statement).

There are some more important things to mention here. First of all we are dealing with about 37M rows, so loading to datasheet will not work. Instead we need to load the data directly into Power Pivot. I had several memory issues when using 32bit Excel but after switching to a 64bit Excel everything went just fine. The import itself takes about 15 minutes which I think is OK for roughly 2GB of data and 37M rows.

Once the data is loaded into Power Pivot we need to load some additional data which are basically our lookup/dimension tables. They can also be imported from Wikipedia, in this case directly from the web:
http://wikipedia-proj-lang-codes.s3.amazonaws.com/UniqueProjectCodes.csv
http://wikipedia-proj-lang-codes.s3.amazonaws.com/UniqueLanguageCodes.csv

Both are very simple tables with only two columns. Again we can use Power Query to import them and add them to our Power Pivot data model.

The last thing to add is a time-dimension.SAP HANA has its predefined time-dimension. In the case of Power BI I simply used a linked table in Excel that holds all necessary days or actually all days of 2013. Adding all those tables and linking them we end up with this pretty simple Power Pivot model:

This is very similar to what you create during the demo as an analytical view in SAP HANA so both approaches are very similar here.

Now that the model is set up we can do our analysis using Power View and classic Excel Pivot Tables:

All together it took me about 2 hours to build the whole solution of which it took about 1h to download and process the data. The final workbook containing all the data has ~500MB which is mainly because of the PageName column which contains a high number of unique values which cannot be compressed very well. After removing the PageName column and some further tuning of the datamodel for compression I could bring the size down to 148 MB. This is quite OK – in numbers this is a compression of 12 from originally 1.74 GB

I hope you understand that I could not attach the whole workbook, instead I created a smaller workbook with only 50k rows by adding a TOP filter in the Power Query. This workbook can be downloaded below.

As I showed in this post, Power BI is capable of handling this kind of data very well. Both, in terms of data volume and also in terms of the type of data (unstructured/semi-structured data). Once the data is loaded into Power Pivot it can be analyzed just like any other data source using all Excel reporting capabilities.

Downloads:
Sample Excel Sheet: Wikidata_small.xlsx
Demo Script SAP HANA: Demo Script SAP HANA.docx
Power Query M-Script: Power Query M-Script.docx

Trigger Cube-Processing from Excel using Cube Actions

Posted on 2013-10-03 by Gerhard Brueckl — 12 Comments ↓

Recently I faced the requirement to give specific users of my Analysis Services Multidimensional cube the opportunity to process the cube whenever they want. There are several ways to achieve this:
– start a SQL Agent that processes the cube
– create some kind of WebService that does the processing
– create some custom .NET application that does the processing
– run some script (PowerShell, Command Line, ascmd.exe, …)

NOTE:
The post describes a solution which requires SSAS Server Admin rights. If you are interested in a solution which triggers an asynchronous processing without requiring SSAS Server Admin Rights please scroll down to the very bottom and download VS_Solution.zip
It basically runs the same script but in a separate thread in the context of the service account so the calling user does not need to have any specific rights on the server.

From a usability point-of-view none of the above is really convenient as all options involve some kind of external process or application. In this post I will show how to process a cube using cube actions which can be triggered from Excel Pivot Tables natively. So this is what we want to achieve:

This requires several steps:
1) create custom .Net assembly
2) add that custom .Net assembly to your Analysis Services database
3) add an action to your cube that calls the assembly

Creating a Custom .Net assembly:

First of all we need to create a new “Visual C#” project of type “Class Library”. To work with Analysis Services we need to add two references:

“Microsoft.AnalysisServices” refers to Microsoft.AnalysisServices.dll which can be found in the shared features folder of your installation (default is c:Program FilesMicrosoft SQL Server110SDKAssemblies)

“msmgdsrv” refers to msmgdsrv.dll which can found be in the OLAPbin-folder of your SSAS instance (default is c:Program FilesMicrosoft SQL ServerMSAS11.MSSQLSERVEROLAPbin)

Once these references are added, we can start to create our processing code:

using System;
using AdomdServer = Microsoft.AnalysisServices.AdomdServer;
using Microsoft.AnalysisServices;
using System.Data;
namespace ASSP_Processing
{
    public class Processing
    {
        public const string LoggingPrefix = “ASSP.ProcessObject: “;
        public enum ResultHandling
        {
            Datatable = 0,
            Exception = 1,
            NULL = 2
        }
        public static DataTable ProcessObject(string cubeName, ResultHandling resultHandling)
        {
            DataTable ret = new DataTable();
            ret.Columns.Add(“Result”);
            Server server = null;
            try
            {
                server = new Server();
                //connect to the current session…
                server.Connect(“*”); 
                AdomdServer.Context.CheckCancelled(); //could be a bit long running, so allow user to cancel
                Database db = server.Databases.GetByName(AdomdServer.Context.CurrentDatabaseName);
                string objectToProcessName = “#N/A”;
                ProcessableMajorObject objectToProcess = db;
                AdomdServer.Context.TraceEvent(0, 0, LoggingPrefix + “Database <“ + db.Name + “> found!”);
                objectToProcessName = “DB[“ + db.Name + “]”;
                if (!string.IsNullOrEmpty(cubeName))
                {
                    Cube cube = db.Cubes.GetByName(cubeName);
                    if (cube != null)
                    {
                        objectToProcess = cube;
                        AdomdServer.Context.TraceEvent(0, 0, LoggingPrefix + “Cube <“ + cubeName + “> found!”);
                        objectToProcessName = objectToProcessName + ” > Cube[“ + cubeName + “]”;
                    }
                }
                if (objectToProcess != null)
                {
                    AdomdServer.Context.TraceEvent(0, 0, LoggingPrefix + “Processing Object “ + objectToProcessName + ” …”);
                    objectToProcess.Process(ProcessType.ProcessFull);
                    ret.Rows.Add(new object[] { “S U C C E S S:    Object “ + objectToProcessName + ” successfully processed!” });
                    AdomdServer.Context.TraceEvent(0, 0, LoggingPrefix + “Finished Processing Object “ + objectToProcessName + “!”);
                }
            }
            finally
            {
                try
                {
                    if (server != null)
                        server.Disconnect();
                }
                catch { }
            }
            // if processing was successful a row has been added beforehand
            if (ret.Rows.Count == 0)
            {
                ret.Rows.Add(new object[] { “F A I L U R E:    Error while processing an object!” });
            }
            switch (resultHandling)
            {
                case ResultHandling.Datatable:
                    return ret;
                case ResultHandling.Exception:
                    throw new Exception(Environment.NewLine + Environment.NewLine + Environment.NewLine + ret.Rows[0][0].ToString());
                case ResultHandling.NULL:
                    return null;
            }
            return null;
        }
        public static DataTable ProcessObject(string cubeName, int resultHandling)
        {
            return ProcessObject(cubeName, (ResultHandling)resultHandling);
        }
        public static DataTable ProcessObject(string cubeName)
        {
            return ProcessObject(cubeName, 1);
        }
    }
}

There are two things here that are worth to point out. First of all we need to somehow establish a connection to the server that hosts the SSAS database in order to process an object. The first thing that comes into mind would be to create a new connection to the server and run some processing XMLA. Unfortunately this does not work here because as this would result in a deadlock. The assembly will be called as a Rowset-Action which in terms of locking is similar to a query. So if we run separate processing command within our Rowset-Action this processing command can never be committed as there is always a query running on the database. The query (=Rowset-Action) will wait until the function is finished and the function waits until the processing is committed resulting in a classical deadlock!

To avoid this we need to connect to the current users sessions using “*” as our connection string:

//connect to the current session…
server.Connect(“*”);

The second thing to point out is the return value of our function. In order to be used in a Rowset-action, the function has to return a DataTable object. The results would then be displayed similar to a drill through and a separate worksheet would pop up in Excel showing the result:

From my point-of-view this can be a bit annoying as after each processing this new worksheet pops up and you loose the focus of the pivot table. So I investigated a bit and found another way to display the output of the processing.

When a drill through / action throws an exception, this will also be displayed in Excel without actually displaying the result of the drill through / action in a separate worksheet. By default it may look like this message:

Well, not really much information here right?

To display more information, e.g. that the processing was successful (or not) we can throw our own exception in the code after processing is done:

I added a separate parameter to control this behavior, in the code above the default would be option 3 – custom Exception.

Once this project is built the DLL can be added to the SSAS database. It is necessary to grant the assembly “Unrestricted” permission set:

The last step would be to create the appropriate cube actions that call the assembly:

It is necessary that the action is of Type=Rowset, otherwise it would not be displayed in Excel! The Action Target is up the you, in this example it will be displayed on every cell. The Action expression is the actual call to our assembly passing the cube name as a parameter. The second parameter controls how the result is displayed and could also be omitted. In this example I added 3 separate actions, one for each result-type (as described above).

And that’s it! This simple steps allow you and your users to process any cube (or any other database object) from within your favorite client application, assuming it supports cube actions.

This is the final result where the three cube actions can be triggered from an Excel pivot table:

The attached zip-file includes everything that is necessary to recreate this example:
– SSAS project
– ASSP_Processing project

CubeProcessingAction.zip

Solution for Asynchronous Processing without Admin-Rights:
VS_Solution.zip

Share this:

Share this:

Share this:

Share this:

Share this:

Error-handling on row-level

Error-handling on cell-level

Error-handling on cell-level with error details

Share this:

Share this:

Share this:

Share this:

Share this: