## Restoring a SSAS Tabular Model to Power Pivot

It is a very common scenario to create a SSAS Tabular Model out of an Power Pivot Model contained in an Excel workbook. Microsoft even created an wizard (or actually a separate Visual Studio project) that supports you doing this. Even further, this process is also a major part of Microsoft’s strategy to cover Personal BI, Team BI and Corporate BI within one technology being xVelocity. This all works just fine but there may also be scenarios where you need to do it the other way round – converting a Tabular model to Power Pivot. Several use-cases come into my mind but I am sure that the most important one is to making data available offline for e.g. sales people to take it with them on their every day work. And in this blog post I will show how this can be done!

But before taking a closer look into how this can be accomplished, lets first see how the import from Power Pivot to SSAS Tabular works. To do this start SQL Server Profiler and connect to your tabular instance. Then create a new Tabular project in Visual Studio based on an existing Power Pivot workbook. At this point you will notice a lot of events happening on our SSAS Tabular server. The most important event for us is “Command End” with the EventSubclass “9 – restore”:

SSAS actually restores a backup from a “Model.abf” backup file which is located in our project directory that we just created:

So far so good – but where does this file come from?

Well, the origin of the file has to be our Excel workbook that we imported. Knowing that all new office formats ending with “x” (.xlsx, .docx, …) are basically ZIP files, we can inspect our original Excel workbook by simply rename it to “.zip”. This allows us to browse the Excel file structure:

We will find a folder called “xl” which contains a sub-folder called “model”. This folder contains one item called “item.data”. If you take a closer look at the file size you may realize that both, the “Model.abf” file that we restored and the “item.data” file from our Excel workbook have the exact same size:

A Coincidence? Not really!

What happens behind the scenes when you import a Power Pivot model into SSAS Tabular is that this “item.data” file gets copied into your project directory and is renamed to “Model.abf” and then restored to the SSAS Tabular workspace instance by using an standard database restore.

Having this information probably makes you think: If it works in one direction, why wouldn’t it also work the other way round? And it does!

So here are the steps that you need to do in order to restore your SSAS Tabular backup into an Excel Power Pivot workbook:

1. Create a backup of your SSAS Tabular database and rename it to “item.data”
2. Create an empty Excel workbook and add a simple linked table to the Excel data model (which is actually Power Pivot).
This is necessary to tell Excel that the workbook contains a Power Pivot model which has to be loaded once the file is opened.
3. Close the Excel workbook and rename it from “MyFile.xlsx” to “MyFile.xlsx.zip”
4. Open the .zip-file in Windows Explorer and locate the “\xl\model\”-folder
5. Replace the “item.data” file with the file that you created in step 1.
6. Rename the .zip-file back to “MyFile.xlsx”
7. Open the Excel Workbook
8. Voilá! You can now work with the data model as with any other Power Pivot model!

I tested this with a SSAS Tabular backup from SQL Server 2012 SP1 being restored to the streamed version of Excel from Office 365 with the latest version of Power Pivot. I assume that it also works with older versions but have not tested all combinations yet.

There are also some features that will not work, for example roles. If your Tabular database contains roles you will not be able to use this approach. Excel will complain that the Power Pivot model is broken. However, other Tabular features like partitions actually work with the little limitation that you cannot change them later on in the Power Pivot model or process them separately:

Another thing to note here is that only up to 3 partitions are allowed, otherwise you will get the same error as for roles. I think this is related to the limitation of 3 partitions for SQL Server Analysis Services Standard Edition as Chris Webb described here.

Besides these obvious features there are also some other cool things that you can do in Tabular which are not possible in Power Pivot. Most (or actually all) of them are accessible only by using BIDS Helper – a great THANK YOU to the developers of BIDS Helper at this point!
BIDS Helper enables you to add classical multidimensional features also to Tabular models which is not possible using standard Visual Studio only. Those include:

• DisplayFolders
• Actions

I tested it for DisplayFolders and Actions and both are working also in Power Pivot after the backup was restored and I further assume that all the other things will also work just fine.
Simply keep in mind that Power Pivot is basically a fully featured Analysis Services instance running within Excel!

For my (and your) convenience I also created a little PowerShell script that does all the work:

1. # Load the assembly with the ZipFile class
3. # Load the assembly to access Analysis Services
5. # Also install "Analysis Services PowerShell" according to http://technet.microsoft.com/en-us/library/hh213141.aspx
6.
7. # INPUT-Variables, change these to match your environment
8. $rootFolder = "D:\Test_PowerShell\" 9.$emptyExcelFile = $rootFolder + "EmptyExcel.xlsx" 10.$ssasServerName = "localhost\TAB2012"
11. $ssasDatabaseName = "AdventureWorks" 12. 13. # internal variables 14.$newExcelFile = $rootFolder +$ssasDatabaseName + ".xlsx"
15. $newExcelFileZip =$newExcelFile + ".zip"
16. $unzipFolder =$rootFolder + "TEMP_ExcelUnzipped"
17. $backupFile =$rootFolder + $ssasDatabaseName + ".abf" 18.$itemDestination = $unzipFolder + "\xl\model\item.data" 19. 20. # Copy the empty Excel file and rename it to ".zip" 21. Copy-Item -Path$emptyExcelFile -Destination $newExcelFileZip 22. 23. # Unzip the file using the ZipFile class 24. [System.IO.Compression.ZipFile]::ExtractToDirectory($newExcelFileZip, $unzipFolder) 25. 26. # Create a backup of the SSAS Tabular database 27. Backup-ASDatabase -Server$ssasServerName -Name $ssasDatabaseName -BackupFile$backupFile -AllowOverwrite -ApplyCompression
28.
29. # Copy the backup-file to our extracted Excel folder structure
30. Copy-Item -Path $backupFile -Destination$itemDestination -Force
31.
32. # Check if the target file exists and delete it
33. if (Test-Path -Path $newExcelFile) { Remove-Item -Path$newExcelFile }
34.
35. # Zip the folder-structure again using the ZipFile class and rename it to ".xlsx"
36. [System.IO.Compression.ZipFile]::CreateFromDirectory($unzipFolder,$newExcelFile)
37.
38. # Cleanup the unecessary files
39. Remove-Item -Path $unzipFolder -Recurse 40. Remove-Item -Path$backupFile
41. Remove-Item -Path $newExcelFileZip The last thing to mention here is that I don’t know if this is officially supported in any way by Microsoft – actually I am pretty sure it is not – so watch out what you are doing and don’t complain if something is not working as expected. ## Analysis Services Security: Multiple Roles in Tabular vs. Multidimensional – Part 2 In one of my recent posts I highlighted how multiple security roles are handled in tabular opposed to multidimensional Analysis Services models. In this post I will go into more detail and focus on how the security is evaluated for both models. I would like to continue using the same example as in the last post with the following roles: “Bikes” – is restricted to [ProductCategory] = “Bikes” “Brakes” – is restricted to [ProductSubCategory] = “Brakes” “DE” – is restricted to [Country] = “DE” I used the following MDX query to test both, the tabular and the multidimensional model: 1. SELECT 2. NON EMPTY {[Geography].[Country Region Name].[Country Region Name].MEMBERS} ON 0, 3. NON EMPTY {[Product].[Product Category Name].[Product Category Name].MEMBERS} ON 1 4. FROM [Model] 5. WHERE ( 6. [Measures].[Reseller Total Sales] 7. ) The last thing I mentioned was how the combination of roles “Bikes” and “DE” could be expressed in SQL: 1. SELECT 2. [ProductCategory], 3. [Country], 4. SUM([Reseller Sales]) 5. FROM <table> 6. WHERE [ProductCategory] = ‘Bikes’ 7. OR [Country] = ‘Germany’ 8. GROUP BY 9. [Product Category], 10. [Country] When running the previous MDX query on our tabular model using roles “Bikes” and “DE” we will see a very similar query being executed against our xVelocity Storage Engine: (Note: There are also some SE engine queries executed to get the elements for rows and columns but the query below is the main query) 1. SELECT 2. [Geography_1fa13899-0770-4069-b7cb-5ddf22473324].[EnglishCountryRegionName], 3. [Product_11920c93-05ae-4f1c-980e-466dfbcfca2a].[CalculatedColumn1 1], 4. SUM([Reseller Sales_fc635e72-28dc-4156-80d5-43b805f8df1c].[SalesAmount]) 5. FROM [Reseller Sales_fc635e72-28dc-4156-80d5-43b805f8df1c] 6. LEFT OUTER JOIN [Reseller_d52b9c6f-8d2d-4e23-ae4c-2fc57c1d968a] 7. ON [Reseller Sales_fc635e72-28dc-4156-80d5-43b805f8df1c].[ResellerKey] 8. =[Reseller_d52b9c6f-8d2d-4e23-ae4c-2fc57c1d968a].[ResellerKey] 9. LEFT OUTER JOIN [Geography_1fa13899-0770-4069-b7cb-5ddf22473324] 10. ON [Reseller_d52b9c6f-8d2d-4e23-ae4c-2fc57c1d968a].[GeographyKey] 11. =[Geography_1fa13899-0770-4069-b7cb-5ddf22473324].[GeographyKey] 12. LEFT OUTER JOIN [Product_11920c93-05ae-4f1c-980e-466dfbcfca2a] 13. ON [Reseller Sales_fc635e72-28dc-4156-80d5-43b805f8df1c].[ProductKey] 14. =[Product_11920c93-05ae-4f1c-980e-466dfbcfca2a].[ProductKey] 15. WHERE 16. (COALESCE((PFDATAID( [Product_11920c93-05ae-4f1c-980e-466dfbcfca2a].[CalculatedColumn1 1] ) = 6)) 17. OR 18. COALESCE((PFDATAID( [Geography_1fa13899-0770-4069-b7cb-5ddf22473324].[CountryRegionCode] ) = 5))); Well, not really readable so lets make it a bit nicer by removing those ugly GUIDs, etc: 1. SELECT 2. [Geography].[EnglishCountryRegionName], 3. [Product].[ProductCategory], 4. SUM([Reseller Sales].[SalesAmount]) 5. FROM [Reseller Sales] 6. LEFT OUTER JOIN [Reseller] 7. ON [Reseller Sales].[ResellerKey] = [Reseller].[ResellerKey] 8. LEFT OUTER JOIN [Geography] 9. ON [Reseller].[GeographyKey] = [Geography].[GeographyKey] 10. LEFT OUTER JOIN [Product] 11. ON [Reseller Sales].[ProductKey] = [Product].[ProductKey] 12. WHERE [Product].[ProductCategory] = "Bikes" 13. OR [Geography].[CountryRegionCode] = "Germany"; This looks very similar to our SQL query. The special thing about it is the WHERE clause which combines the restrictions of both roles using OR which is then propagated also to our [Reseller Sales] fact table and that’s the reason why we see what we want and expect to see – all sales that were either made with “Bikes” OR made in “Germany”: Another very important thing to note and remember here is that the security restrictions get propagated into and are evaluated within the query. This is done for each and every query (!) which is usually not a problem but may become crucial if you use dynamic security. To test this with dynamic security I introduced a new role called “CustData” which is going to replace our “Bikes” for this test. It is restricted on table ‘Product’ as: 1. =([Product Category Name] = IF( CUSTOMDATA() = "1", "Bikes", "Clothing")) So instead of using the connectionstring “…;Roles=Bikes,DE; …” I now used “…;Roles=CustData,DE;CustomData=1; …” which results in the exact same results of course. However, the query now changed to the following (already beautified) xVelocity query: 1. SELECT 2. [Geography].[EnglishCountryRegionName], 3. [Product].[ProductCategory], 4. SUM([Reseller Sales].[SalesAmount]) 5. FROM [Reseller Sales] 6. LEFT OUTER JOIN [Reseller] 7. ON [Reseller Sales].[ResellerKey] = [Reseller].[ResellerKey] 8. LEFT OUTER JOIN [Geography] 9. ON [Reseller].[GeographyKey] = [Geography].[GeographyKey] 10. LEFT OUTER JOIN [Product] 11. ON [Reseller Sales].[ProductKey] = [Product].[ProductKey] 12. WHERE [Product].$ROWFILTER IN '0x000000…000000000000000000000fffff00000000000ffffffff'));
13. OR [Geography].[CountryRegionCode] = "Germany";

Instead of using a direct filter on [ProductCategory] we now see a filter on $ROWFILTER ?!? ### ASSUMPTIONS START ### I have to admit that I am currently not entirely sure what this means but I assume the following: Preceding the main query another xVelocity query is executed which is important for us: 1. SELECT 2. [Product].[RowNumber], 3. [Product].[ProductCategory], 4. COUNT() 5. FROM [Product]; This query fetches each [RowNumber] and its associated [ProductCategory]. Internally the [RowNumber] column is created for every table. This is related to the columnar storage that xVelocity uses. Elaborating this in detail would go far beyond the context of this blog post. For more details on the RowNumber-column please refer too http://msdn.microsoft.com/en-us/library/hh622558(v=office.15).aspx which describes the Excel data model which is actually Power Pivot and therefore also applies to Tabular. (In general this link contains a lot of in-depth information on the tabular data model and the columnar storage concepts!) I further assume that our security-expression is then evaluated against this temporary table to create an bitmap index of which rows match the security-expression and which don’t. This result is then applied to our main query which using the WHERE clause [Product].$ROWFILTER IN ‘0x0000….’
For all subsequent queries the above query on [RowNumber] and [ProductCategory] is not executed again so I assume that the bitmap index gets cached internally by Analysis Services. I further assume that if the bitmap index gets cached it is also shared between users belonging to the same role which would be similar to multidimensional models.
### ASSUMPTIONS END ###

So the important take-away for tabular is that the security gets propagated into the query and down to the fact table. Combining multiple roles on this level using OR delivers the expected results.

For multidimensional models this is slightly different. You can define security on either the Database Dimension (which gets inherited down to all Cube Dimension) or you can define security on the Cube Dimension directly. Defining security on the Database Dimension already makes it very clear that the security restrictions are independent of any fact table/measure group. A Database Dimension may be used in several cubes so the engine cannot know in advance which measure group to use. Security for multidimensional models is defined on the multidimensional space defined by that dimension. If one role is not restricted on a dimension at all, the user will always see the whole dimension and its associated values, even if a second role would restrict that dimension. And this causes unexpected results as the user may see the whole cube.
In terms of SQL the query could be expressed as:

1. SELECT
2.     [ProductCategory],
3.     [Country],
4.     SUM([Reseller Sales])
5. FROM <table>
6. WHERE ( [ProductCategory] = 'Bikes' OR 1 = 1)
7.     OR ( 1 = 1  OR [Country] = 'Germany')
8. GROUP BY
9.     [Product Category],
10.     [Country]

The left side of the inner OR statements represents the role “Bikes” whereas the right part represents the “DE” role. It should be very obvious that due to the combination of both roles you finally see everything without any restriction!

Another important thing to point out is that security for multidimensional models is only evaluated once and not for every query. So even if you have complex dynamic security its evaluation only hits you once for the first user that connects to a role and is cached for all queries of that user and also shared with other users belonging to that role.

I hope this gives some more insights on how tabular and multidimensional handle multiple roles and their fundamental differences!

## Analysis Services Security: Multiple Roles in Tabular vs. Multidimensional

In terms of security tabular and multidimensional models of SQL Server Analysis Services are very similar. Both use Roles to handle specific security settings. Those Roles are then assigned to users and groups.  This is very trivial if a user only belongs to one Role – the user is allowed to see what is specified in the Role. But it can become very tricky if a user belongs to more than one role.

For both, tabular and multidimensional security settings of multiple roles are additive. This means that you will not see less if you are assigned multiple roles but only more. Lets assume we have the following Roles:
“Bikes” – is restricted to [ProductCategory] = “Bikes”
“Brakes” – is restricted to [ProductSubCategory] = “Brakes”
“DE” – is restricted to [Country] = “DE”

A user belonging to Roles “Bikes” and “Brakes” will see all products that belong to “Bikes” and all products that belong to “Brakes”. This is OK and returns the expected results as  both roles secure the same dimension/table. This applies to tabular and also multidimensional.

However, if roles secure different dimensions/tables it gets a bit more tricky. A user may belong to Roles “Bikes” and “DE”. For multidimensional this is a real problem as it finally result in the user seeing the whole cube! This is “by design” and can be explained as follows:
Role “Bikes” does not have any restriction on [Country] so all countries are visible, Role “DE” has no restriction on [ProductCategory] so all product categories are visible. As Roles are additive the user is allowed to see all countries and also all product categories, hence the whole cube:

Basically you would expect to see “Bikes”-sales for all countries and “Germany”-sales for all product categories but you end up seeing much more than this. If you have every faced this problem in real life you know that this is probably not the intended behavior your customers want to see. Usually Active Directory Groups are used and assigned to SSAS roles, so this can happen quite easily without anyone even noticing (except the user who is happy to see more )!
Chris Webb wrote an excellent blog post on how to deal with those kinds of security requirements in multidimensional models here.

For tabular this behavior is somehow similar. A user belonging to both roles is also allowed to see all countries and all product categories – this is because security settings are additive, same as for multidimensional. Even though this is true in terms of the structure (rows, columns) of the query we still get a different result in terms of values!
Here is the same query on a tabular model with the same security settings:

This is exactly what we and our customers would expect to see – all sales for “Germany” and also every sale related to “Bikes”! In tabular models security applied to a given table cascades down to all related tables – in this case to our fact table. If a table is indirectly secured by different Roles which put security on different tables those restrictions are combined using OR. In terms of SQL this could be expressed as:

1. SELECT
2.     [ProductCategory],
3.     [Country],
4.     SUM([Reseller Sales])
5. FROM <table>
6. WHERE [ProductCategory] = 'Bikes'
7.     OR [Country] = 'Germany'
8. GROUP BY
9.     [Product Category],
10.     [Country]

Further pivoting the result would show the same as the MDX query.

Thinking back to some of my multidimensional cubes where I had to deal with multiple Roles this “slight difference” would have been really handy and would have saved me a lot of time that I put into custom security solutions using bridge tables, assemblies, etc.

In my next post I will go into much more detail on how the tabular security model works so stay tuned!

UPDATE: Part 2 can be found here

## Applied Basket Analysis in Power Pivot using DAX

Lets take a look at the initial data model first:

First of all we do not want to modify this model but just extend it so that all previously created measures, calculations and, most important, the reports still work. So the only thing we do is to add our Product-tables again but with a different name. Note that I also added the Subcategory and Category tables in order to allow Basket Analysis also by the Product-Category hierarchy. As we further do not want to break anything we only use an inactive relationship to our ‘Internet Sales’ fact table.

After adding the tables the model looks as follows:

The next and actually last thing to do is to add the following calculated measure:

Sold in same Order :=
CALCULATE (
COUNTROWS ( ‘Internet Sales’ ),
CALCULATETABLE (
SUMMARIZE (
‘Internet Sales’,
‘Internet Sales’[Sales Order Number]
),
ALL ( ‘Product’ ),
USERELATIONSHIP ( ‘Internet Sales’[ProductKey], ‘Filtered Product’[ProductKey] )
)
)

(formatted using DAX Formatter)

The inner CALCULATETABLE returns a list/table of all [Sales Order Numbers] where a ‘Filtered Product’ was sold and uses this table to extend the filter on the ‘Internet Sales’ table. It is also important to use ALL(‘Product’) here otherwise we would have two filters on the same column ([ProductKey]) which would always result in an empty table. Doing a COUNTROWS finally returns all for all baskets where the filtered product was sold.
We could also change ‘Internet Sales'[Sales Order Number] to ‘Internet Sales'[CustomerKey] in order to analyze what other customers bought also in different baskets (This was done for Example 3). The whole SUMMARIZE-function could also be replaced by VALUES(‘Internet Sales'[Sales Order Number]). I used SUMMARIZE here as I had better experiences with it in terms of performance in the past, though, this may depend on your data. The calculation itself also works with all kind of columns and hierarchies, regardless whether its from table ‘Product’, ‘Filtered Product’, or any other table!

So what can we finally do with this neat formula?

1) Classic Basket Analysis – “Others also bought …”:

As we can see Hydration Packs are more often sold together with Mountain Bikes opposed to Road Bikes and Touring Bikes. We could also use a slicer on ‘Filtered Product Subcategory’=”Accessories” in order to see how often Accessories are sold together with other products. You may analyze by Color and Product Category:

As we can see people that buy black bikes are more likely to buy red helmets than blue helmets.

What may also be important for us is which products are sold together the most often? This can be achieved by pulling ‘Product’ on rows and ‘Filtered Product’ on columns. By further applying conditional formatting we can identify correlations pretty easy:

Water Bottles are very often sold together with Bottle Cages – well, not really a surprise. Again, you can also use all kind of hierarchies here for your analysis.
This is what the whole matrix looks like:

The big blank section in the middle are our Bikes. This tells us that there is no customer that bought two bikes in the same order/basket.

For this analysis I used an extended version of the calculation above to filter out values where ‘Product’ = ‘Filtered Product’ as of course every product is sold within its own basket:

IF (
MIN ( ‘Product’[ProductKey] )
<> MIN ( ‘Filtered Product’[ProductKey] ),
[Sold in same Order]
)

3) Find Customers that have not bough a common product yet
As we now know from the above analysis which products are very often bought together we can also analyze which customers do not fall in this pattern – e.g. customers who have bough a Water Bottle but have not bought a Bottle Cage yet. Again we can extend our base-calculation to achieve this:

Not Sold to same Customer :=
IF (
NOT ( ISBLANK ( [Sum SA] ) ) && NOT ( [Sold to same Customer] ),
“Not Sold Yet”
)

The first part checks if the selected ‘Product’ was sold to the customer at all and the second part checks if the ‘Filtered Product’ was not sold to the customer yet. In that case we return “Not Sold Yet”, and otherwise  BLANK() which is the default if the third parameter is omitted. That’s the result:

Aaron Phillips has bought a Water Bottle but no Mountain Bottle Cage nor a Road Bottle Cage – maybe we should send him some advertisement material on Bottle Cages?

As you can see there are a lot of analyses possible on top of that little measure that we created originally. All work with any kind of grouping or hierarchy that you may have and no change to your data model is necessary, just a little extension.

And that’s it – Basket Analysis made easy using Power Pivot and DAX!

Sample Workbook with all Examples: BasketAnalysis.xlsx

## Optimizing Columnar Storage for Measures

First of all I have to thank Marco Russo for his blog post on Optimizing High Cardinality Columns in Vertipaq and also his great session at SQL PASS Rally Nordic in Stockholm last year which taught me a lot about columnar storage in general. I highly recommend everyone to read the two mentioned resources before continuing here. Most of the ideas presented in this post are based on these concepts and require at least basic knowledge in columnar storage.

When writing one of my recent posts I ended up with a Power Pivot model with roughly 40M rows. It contained internet logs from Wikipedia, how often someone clicked on a given page per month and how many bytes got downloaded. As you can imagine those values can vary very heavily, especially the amount of bytes downloaded. So in the Power Pivot model we end up having a lot of distinct values in our column that we use for our measure. As you know from Marcos posts, the allocated memory and therefore also the  performance of columnar storage systems is directly related to the number of distinct values in a column – the more the worse. Marco already described an approach to split up a single column with a lot of distinct values into several columns with less distinct values to optimize storage. These concepts can also be used on columns that contain measures or numeric values in general. Splitting numeric values is quite easy, assuming your values range from 1 to 1,000,000 you can split this column into two by dividing the value by 1000 and using MOD 1000 for the second column. Instead of one column with the value 123,456 you end up with two columns with the values 123 and 456. In terms of storage this means that instead of 1,000,000 distinct values we only need to store 2 x 1,000 distinct values. Nothing new so far.

The trick is to combine those columns again at query time to get the original results as before the split. For some aggregations like SUM this is pretty straight forward, others are a bit more tricky. Though, in general the formulas are not really very complex and can be adopted very easily to handle any number of columns:

 Aggregation DAX Formula Value_SUM1 =SUMX(‘1M_Rows_splitted’, [Value_1000] * 1000 + [Value_1]) Value_SUM2 =SUM ( ‘1M_Rows_splitted'[Value_1000] ) * 1000    + SUM ( ‘1M_Rows_splitted'[Value_1] ) Value_MAX =MAXX(‘1M_Rows_splitted’, [Value_1000] * 1000 + [Value_1]) Value_MIN =MINX(‘1M_Rows_splitted’, [Value_1000] * 1000 + [Value_1]) Value_COUNT =COUNTX(‘1M_Rows_splitted’, [Value_1000] * 1000 + [Value_1]) Value_DISTINCTCOUNT =COUNTROWS (    SUMMARIZE (        ‘1M_Rows_splitted’,        ‘1M_Rows_splitted'[Value_1000],        ‘1M_Rows_splitted'[Value_1]))

As you can see you can still mimic most kind of aggregation even if the [Value]-column is split up.

Though, don’t exaggerate splitting your columns – too many may be a bit inconvenient to handle and may neglect the effect resulting in worse performance. Marco already showed that you can get a reduction of up to 90% in size, during my simple tests I came up with about the same numbers. Though, it very much depends on the number of distinct values that you actually have in your column!

I would not recommend to always use this approach for all your measure column – no, definitely not! First check how many distinct values your data/measures contain and decide afterwards. For 1 million distinct values it is probably worth it, for 10,000 you may reconsider using this approach. Most important here is to test this pattern with your own data, data model and queries! Test it in terms of size and of course also in terms of performance. It may be faster to split up columns but it may also be slower and it may be also different for each query that you execute against the tabular model / Power Pivot. Again, test with your own data, data model and queries to get representative results!

Here is a little test that you may run on your own to test this behavior. Simple create the following Power Query using M, load the result into Power Pivot and save the workbook. It basically creates a table with 1 million distinct values (0 to 999,999) and splits this column up into two. You can just copy the workbook, remove the last step “Remove Columns” and save it again to get the “original” workbook and Power Pivot model.

let
List1 = List.Numbers(0, 1000000),

TableFromList = Table.FromList(List1, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
RenamedColumns = Table.RenameColumns(TableFromList,{{"Column1", "Value"}}),
ChangedType1 = Table.TransformColumnTypes(RenamedColumns,{{"Value", type number}}),
InsertedCustom = Table.AddColumn(ChangedType1, "Value_1000", each Number.RoundDown([Value] / 1000)),
InsertedCustom1 = Table.AddColumn(InsertedCustom, "Value_1", each Number.Mod([Value], 1000)),
ChangedType = Table.TransformColumnTypes(InsertedCustom1,{{"Value_1000", type number}, {"Value_1", type number}}),
RemovedColumns = Table.RemoveColumns(ChangedType,{"Value"})
in
RemovedColumns

This is what I ended up with when you compare those two workbooks:

 # distinct values Final Size single column 1,000,000 14.4 MB two columns 2 x 1,000 1.4 MB

We also get a reduction in size of 90%! Though, this is a special scenario …
In the real world, taking my previous scenario with the Wikipedia data as an example, I ended up with a reduction to 50%, still very good though. But as you can see the reduction factor varies very much.

Sample workbook with Power Query: 1M_Rows_splitted.xlsx

## DAX vs. MDX: DataMembers in Parent-Child Hierarchies

Recently when browsing the MSDN PowerPivot Forums I came across this thread where the question was ask on how to show the value which is directly linked to an element in a parent-child hierarchy instead of the aggregation of all "children". In this post I am going to address this problem finally showing a proper solution.

First of all I would like to start with some background to so called "datamembers". The term "datamember" originates from parent-child hierarchies in multidimensional models. It is a dummy-member that is created automatically below each hierarchy-node to hold values that are linked to a node-member directly. This is of course only possible for parent-child hierarchies.

Take the following MDX-Query for example:

SELECT
[Measures].[Sales Amount Quota] ON 0,
Descendants(
[Employee].[Employees].&[290],
1,
SELF_AND_BEFORE) ON 1
WHERE [Date].[Calendar].[Calendar Year].&[2006]

As we can see, the value of "Amy E. Alberts" is not the Sum of its children but has a difference of \$108,000.00 This difference is because of the "datamember" that is also a child of every parent-child node which may be hidden. For multidimensional models his behavior can be changed by setting the MembersWithData-Property to "NonLeafDataVisible":

Executing the same query again we now get this results:

Now we see "Amy E. Alberts" twice – the node and the datamember.

Doing a similar query on a tabular model we will get this result by default:

As you can see, we still have the datamember but without any caption. Though, this can of course be changed when you flatten out the hierarchy using DAX by simply replacing BLANK() by "<my datamember caption>" in your calculated columns.

Anyway, this is not the topic of this post but just some background information. Now we will focus on how to query data that is directly associated to a node or its corresponding datamember in MDX and DAX.

### MDX:

For MDX this is quite easy as the datamember can be referenced by simply using ".datamember" of a given hierarchy node:

WITH

MEMBER [Measures].[DataMember] AS (
[Measures].[Sales Amount Quota],
[Employee].[Employees].currentmember.datamember
), FORMAT_STRING = 'Currency'

SELECT
{
[Measures].[Sales Amount Quota],
[Measures].[DataMember]
} ON 0,
Descendants(
[Employee].[Employees].&[290],
1,
SELF_AND_BEFORE) ON 1
WHERE [Date].[Calendar].[Calendar Year].&[2006]

The DataMember-Measure only shows values that are directly linked to the current node of the parent-child hierarchy. This is pretty straight forward and very easy to accomplish.

### DAX:

In this DAX this problem is a bit more tricky as we do not have that built-in ".datamember" function. Also navigation within hierarchies and parent-child hierarchies itself are not really supported in DAX. Though, there are several blogs out there that describe how to handle parent-child hierarchies by Alberto Ferrari (here) and Kasper de Jonge (here). Based on these solutions we can create our datamember-calculation.

First (and only) thing we need is to add a calculated column that stores the path depth of the current row. This can be achieved by using this formula:

=PATHLENGTH(
PATH(
Employee[EmployeeKey],
Employee[ParentEmployeeKey]
)
)

Using our previous example and selecting "Amy E. Alberts" together with our new column we get this:

As we can see there are two different Levels below "Amy E. Alberts" where Level=3 holds our datamember-value and Level=4 holds all the other values (=real child members). Lets add a calculated measure to help you make the final calculation more understandable:

MinLevel:=MIN(Employee[Level])

So, for our [DataMember]-calculation we simply have to extend the context to only include rows where [Level] = [MinLevel]

DataMember:=CALCULATE(
SUM('Sales Quota'[Sales Amount Quota]),
FILTER('Employee',
'Employee'[Level] = [MinLevel]))

Well, obviously our calculation is not returning the expected results. This is because of a very common pitfall which I am also very prone to. Whenever a calculated measure is used within an other calculated measure, a CALCULATE() gets wrapped around it thereby taking over the current context. In our case this means that ‘Employee'[Level] will always be the same as [MinLevel] as [MinLevel] gets evaluated for each row and doing MIN() on only one row of course always returns the value of the current row which we compared it to!

To solve this issue we have to place the original [MinLevel]-calculation directly into our [DataMember]-calculation:

DataMember:=CALCULATE(
SUM('Sales Quota'[Sales Amount Quota]),
FILTER('Employee',
'Employee'[Level] = MIN(Employee[Level])))

This change finally makes the calculation return the correct results also when browsing using the hierarchy:

Even though DAX is not very good in handling hierarchies, the [DataMember]-calculation can be accomplished quite easily. This calculation also works for hierarchies of any depth without needing to adopt it. In terms of performance it may be further improved to the following formula – though the impact will be minor:

DataMember_v2:=CALCULATE(
SUM('Sales Quota'[Sales Amount Quota]),
FILTER(VALUES('Employee'[Level]),
'Employee'[Level] = MIN(Employee[Level])))

## Universal Quantiles Calculation for PowerPivot using DAX

In my last post I showed a new approach on how to calculate the median in PowerPivot using DAX. In the comments the question was raised whether it is possible to make that calculation universal for all kind of Quantiles like Median, Quartiles, Percentiles, etc. and that’s what this post is about.

Lets analyze the final Median calculation from the last post:

Median SA Months:=CALCULATE([SumSA],
TOPN(
2-MOD([Cnt_Months], 2),
TOPN(
([Cnt_Months] + 1) / 2,
VALUES(‘Date’[Month]),
[SumSA]),
[SumSA],
1))
/
(2-MOD([Cnt_Months], 2))

The Median defines the value in the middle of an ordered set. To get the first half (+1 to handle even and odd sets) of the whole we are using TOPN function:

TOPN(
([Cnt_Months] + 1) / 2,
VALUES(‘Date’[Month]),
[SumSA]),

The important part here is the “divide by 2” to split the set in the middle to start our Median calculation. Instead of dividing by 2 we could also multiply by 0.5 where 0.5 would be the separator for our quantile (in this case the Median). This expression can be made dynamic. For the first Quartile we would use 0.25 for the second Quartile (=Median) we would use 0.5 and for the third Quartile we would use 0.75.

I created a little linked table to demonstrate the dynamic approach:

 Quantile SortOrder Percentage Median 1 0.5000 Quartile 1 401 0.2500 Quartile 2 402 0.5000 Quartile 3 403 0.7500 Sextile 1 601 0.1667 Sextile 2 602 0.3333 Sextile 3 603 0.5000 Sextile 4 604 0.6667 Sextile 5 605 0.8333

We also need to add a calculated measure that returns the percentage-value of the currently selected Quantile:

SelectedQuantile:=IF(
HASONEVALUE(Quantiles[Percentage]),
VALUES(Quantiles[Percentage]))

Now we can change our old Median-calculation to be dynamic by using the measure defined above:

Quantile SA Months:=CALCULATE([SumSA],
TOPN(
2 – MOD([Cnt_Months], 2),
TOPN(
ROUNDUP(([Cnt_Months] + 1) * [SelectedQuantile], 0),
VALUES(‘Date’[Month]),
[SumSA]),
[SumSA],
1))
/
(2 – MOD([Cnt_Months], 2))

We also need to explicitly add ROUNDUP() as “([Cnt_Months] + 1) * [SelectedQuantile]” may return any decimal places whereas the previous divide by 2 could only result in a x.5 which was rounded up automatically. And well, that’s all we need to change in order to make the calculation universal for all different Quantiles! The rest is the same logic that I already described for Median calculation.

## Calculating Median in PowerPivot using DAX

I just came across this blog post by Bill Anton where he discusses several approaches to calculated the Median of a given set in T-SQL, MDX and DAX. In the end of his post when it comes to the DAX calculation, he references several post by Marco, Alberto and Javier (post1, post2) that already address that kind of calculation in DAX. But he also claims that non of the solutions is “elegant”. Well, reason enough for me to try it on my own and here is what I came up with. Its up to you to decide whether this solution is more elegant than the others or not 🙂

In general, the median calculation always varies depending on the number of items and whether this number is even or odd.
For an even population the median is the mean of the values in the middle:
the median of {3, 5, 7, 9} is is (5 + 7)/2 = 6
For an odd population, the median is the value in the middle:
the median of {3, 5, 9} is 5

In both cases the values have to be ordered before the calculation. Note that it does not make a difference whether the values are sorted in ascending or descending order.

In this example, our set contains 12 items (=months) so we have to find the 2 items in the middle of the ordered set – December and February – and calculate the mean.

So, how can we address this problem using DAX? Most of the posts I mentioned above use some kind of combination of ranking – RANKX() – and filtering – FILTER(). For my approach I will use none of these but use TOPN instead (yes, I really like that function as you probably know if you followed my blog for some time).

In this special case, TOPN() can do both, ranking and filtering for us. But first of all we need to know how many items exist in our set:

Cnt_Months:=DISTINCTCOUNT(‘Date’[Month])

This value will be subsequently used in our next calculations.

To find the value(s) in the middle I use TOPN() twice, first to get the first half of the items (similar to TopCount) and then a second time to get the last values that we need for our median calculation (similar to BottomCount):

As the median calculation is different for even and odd sets, this also has to be considered in our calculation. For both calculations MOD()-function is used to distinguish both cases:

Items_TopCount:=IF(MOD([Cnt_Months],2) = 0,
([Cnt_Months] / 2) + 1,
([Cnt_Months] + 1) / 2)

For an even number of items (e.g. 12) we simply divide the count of items by 2 and add 1 which gives us a (12 / 2) + 1 = 7 for our sample.
For an odd number of items (e.g. 5) we first add 1 to our count of items and then divide by 2 which gives us (5 + 1) / 2 = 3

Items_BottomCount:=IF(MOD([Cnt_Months],2) = 0, 2, 1)

For an even number of items we have to consider the last 2 values whereas for an odd number of items we only have to consider the last value.

These calculations are then used in our median calculation:

Median SA Months:=CALCULATE([SumSA],
TOPN(
[Items_BottomCount],
TOPN(
[Items_TopCount],
VALUES(‘Date’[Month]),
[SumSA]),
[SumSA] * -1))
/
[Items_BottomCount]

As DAX has no built-in BOTTOMN()-function, we need to “abuse” the TOPN() function and multiply the OrderBy-value by “–1” to get the BOTTOMN() functionality. As you can see most of the logic is already handled by our [Items_TopCount] and [Items_BottomCount] measures and this pattern can be reused very easily.

Of course all these calculations can also be combined and the use of IF() can be avoided:

Median SA Months v2:=CALCULATE([SumSA],
TOPN(
2 – MOD([Cnt_Months], 2),
TOPN(
([Cnt_Months] + 1) / 2,
VALUES(‘Date’[Month]),
[SumSA]),
[SumSA] * -1))
/
(2 – MOD([Cnt_Months], 2))

Note: for an even population ([Cnt_Months] + 1) / 2 returns X.5 which is automatically rounded up when it is used in a function that expects a whole number. In our example this is what happens: (12 + 1) / 2 = 6.5 –> 7

These are the final results:

We could also use AVERAGEX() to calculate our median but I think that it is some kind of overhead to use AVERAGEX() just to divide by “1” or “2” depending on the number of items that our TOPN-functions return:

Median SA Months AvgX:=AVERAGEX(
TOPN(
2-MOD([Cnt_Months], 2),
TOPN(
([Cnt_Months] +1) / 2,
VALUES(‘Date’[Month]),
[SumSA]),
[SumSA] * -1),
[SumSA])

As you can see there are various approaches to calculate the median, its up to you which on you like most. I have not tested any of them in terms of performance over bigger sets – this may be topic for an upcoming post.

## Another Post about Calculating New and Returning Customers – Part 2

In my previous post I showed a new approach on how to calculate new (and returning) customers in PowerPivot/tabular using DAX. We ended up with a solution where we added the customers first order date as a calculated column to our customer-table. This column was then linked to our date-table with an inactive relationship. The final calculation used USERELATIONSHIP() to make use of this relationship as follows:

New Customers:=CALCULATE(
COUNTROWS(Customer),
USERELATIONSHIP(Customer[FirstOrderDate], ‘Date’[Date]))

This calculation performs really well as it does not have to touch the fact-table to get the count of new customers. And this is also the issue with the calculation as other filters are not reflected in the calculation:

Take row 2 as an example: we have 8 “Total Customers” of which 12 are “New Customers”. Obviously an error in the calculation. The PivotTable is filtered to Category=”Road Bikes” and we have 8 customers on the 2nd of February that bought a road bike. The “New Customers” calculation on the other hand is not related to the Subcategory and shows 12 as in total there were 12 new customers for all products.

To get our calculation working also with other filters we have to somehow relate it to our fact-table. So far we calculated the customers first order date only in the customer table. The customers first order may be related to several fact-rows, e.g. one row for each product the customer bought. Our “New Customers” calculation should only include customers that are active considering also all other filters.

To identify a customers first order in our fact-table we can again use a calculated column and also re-use our previous calculated column in our Customer-table that holds the customers first order date:

=NOT(
ISBLANK(
LOOKUPVALUE(
Customer[CustomerKey],
Customer[CustomerKey],
[CustomerKey],
Customer[FirstOrderDate],
[Order Date]
)))

This returns True for all fact-rows associated with a customers first order and False for all other rows.

The final “New Customers v2” calculation is quite simple then – in addition to the active filters we add a further filter to only select rows that are associated to a customers first order:

New Customers v2:=CALCULATE(
[Total Customers],
‘Internet Sales’[IsCustomersFirstOrder] = TRUE())

And this are the results:

As you can see there are still differences between “New Customers OLD” and “New Customers v2”. But is this really a problem with the new calculation? Lets analyze the issue taking customer “Desiree Dominguez” where we encounter the first difference as an example:

“Desiree Dominguez” had her first order on the 22th of June in 2006. So she is actually no “new customer” in 2008. The reason why the old calculation counts her as “new customer” is that it was the first time that she bought a product of subcategory “Road Bikes”. Whether this is correct or not is up to your business definition of a “new customer”. According to my experience it is more likely that “Desiree Dominguez” is not counted as a new customer in 2008 and so the “New Customer v2” actually returns the more accurate results.

An other option for this calculation is to rank the [Order Date] or [Sales Order Number] for each customer within the fact-table using the calculation below:

=RANKX(
FILTER(
ALL(‘Internet Sales’),
[CustomerKey] = EARLIER([CustomerKey])),
[Order Date],
[Order Date],
1,
DENSE
)

[Order Date] could be replaced by [Sales Order Number]. This makes sense if a customer can have multiple orders per day and you also want to distinguish further by [Sales Order Number]. The new field would also allow new analysis. For example the increase/decrease in sales from the second order compared to the first order and so on.

The “New Customer” calculation in this case would still be similar. We just have to filter on the new calculated column instead:

New Customers v3:=CALCULATE(
[Total Customers],
‘Internet Sales’[CustomersOrderNr] = 1)

The multidimensional model:

The whole logic of extending the fact-table to identify rows that can be associated with a customers first order can also be used in a multidimensional model. Once we prepared the fact-table accordingly the calculations are quite easy. The biggest issues here does not reside in the multidimensional model itself but in the ETL/relational layer as this kind of operation can be quite complex – or better say time-consuming in terms of ETL time.

At this point I will not focus on the necessary ETL steps but start with an already prepared fact-table and highlight the extensions that have to be made in the multidimensional model. The fact-table already got extended by a new column called [IsCustomersFirstOrder] similar to the one we created in tabular using a DAX calculated column. It has a value of 1 for rows associated with a customers first order and 0 for all other rows.

The next thing we have to do is to create a new table in our DSV to base our new dimension on. For simplicity I used this named query:

This table is then joined to the new fact-table:

The new dimension is quite simple as it only contains one attribute:

You may hide the whole dimension in the end as it may only be used to calculate our “new customers” and nowhere else and may only confuse the end-user.

Once we have added the dimension also to our cube we can create a new calculated measure to calculate our “new customers” as follows:

CREATE MEMBER CURRENTCUBE.[Measures].[New Customers] AS (
[Measures].[Customer Count],
[Is Customers First Order].[Is Customers First Order].&[1]
), ASSOCIATED_MEASURE_GROUP = ‘Internet Customers’
, FORMAT_STRING = ‘#,##0’;

The calculation is based on the existing [Customer Count]-measure which uses DistinctCount-aggregation. Similar to DAX with just extend the calculation by further limiting the cube-space where “Is customers First Order” = 1.

This approach also allows you to create aggregations if necessary to further improve performance. So this is probably also the best way in terms of query-performance to calculate the count of new customers in a multidimensional model.

## Another Post about Calculating New and Returning Customers

I know, this topic has already been addressed by quite a lot of people. Chris Webb blogged about it here(PowerPivot/DAX) and here(SSAS/MDX), Javier Guillén here, Alberto Ferrari mentions it in his video here and also PowerPivotPro blogged about it here. Still I think that there are some more things to say about it. In this post I will review the whole problem and come up with a new approach on how to solve this issue for both, tabular and multidimensional models with the best possible performance I could think of (hope I am not exaggerating here  🙂 )

OK, lets face the problem of calculating new customers first and define what a new customer for a given period actually is:

A new customer in Period X is a customer that has sales in Period X but did not have any other sales ever before. If Period X spans several smaller time periods
(e.g. Period X=January contains 31 days) then there must not be any sales before the earliest smaller time period (before 1st of January) for this customer to be counted as a new customer.

According to this definition the common approach can be divided into 2 steps:
1) find all customers that have sales till the last day in the selected period
2) subtract the number of customers that have sales till the day before the first day in the
selected period

First of all we need to create a measure that calculates our distinct customers.
For tabular it may be a simple calculated measure on your fact-table:

Total Customers:=DISTINCTCOUNT(‘Internet Sales’[CustomerKey])

For multidimensional models it should be a physical distinct count measure in your fact-table, ideally in a separate measure group.

How to solve 1) in tabular models

This is also straight forward as DAX has built-in functions that can do aggregation from the beginning of time. We use MAX(‘Date’[Date]) to get the last day in the current filter context:

Customers Till Now:=CALCULATE(
[Total Customers],
DATESBETWEEN(
‘Date’[Date],
BLANK(),
MAX(‘Date’[Date])))

How to solve 2) in tabular models

This is actually the same calculation as above, we only use MIN to get the first day in the current filter context and also subtractt “1” to get the day before the first day.

Previous Customers:=CALCULATE(
[Total Customers],
DATESBETWEEN(
‘Date’[Date],
BLANK(),
MIN(‘Date’[Date])-1))

To calculate our new customers we can simply subtract those two values:

New Customers OLD:=[Customers Till Now]-[Previous Customers]

How to solve 1) + 2) in multidimensional models

Please refer to Chris Webb’s blog here. The solution is pure MDX and is based on a combination of the range-operator “{null:[Date].[Calendar].currentmember}”, NONEMPTY() and COUNT().

Well, so far nothing new.

So lets describe the solution that I came up with. It is based on a different approach. To make the approach easily understandable, we have to rephrase the answer to our original question “What are new customers”?”:

A new customer in Period X is a customer that has his first sales in Period X.

According to this new definition we again have 2 steps:
1) Find the first date with sales for each customer
2) count the customers that had their first sales in the selected period

I will focus on tabular models. For multidimensional models most of the following steps have to be solved during ETL.

How to solve 1) in tabular models

This is pretty easy, we can simply create a calculated column in our Customer-table and get the first date on which the customer had sales:

=CALCULATE(MIN(‘Internet Sales’[Order Date]))

How to solve 2) in tabular models

The above create calculated column allows us to relate our ‘Date’-table directly to our ‘Customer’-table. As there is already an existing relationship between those tables via ‘Internet Sales’ we have to create an inactive relationship at this point:

Using this new relationship we can very easy calculate customers that had their first sales in the selected period:

New Customers:=CALCULATE(
COUNTROWS(Customer),
USERELATIONSHIP(Customer[FirstOrderDate], ‘Date’[Date]))

Pretty neat, isn’t it?
We can use COUNTROWS() opposed to a distinct count measure as our ‘Customer’-table only contains unique customers – so we can count each row in the current filter context.
Another nice thing is that we do not have to use any Time-Intelligence function like DATESBETWEEN which are usually resolved using FILTER that would iterate over the whole table. Further it also works with all columns of our ‘Date’-table, no matter whether it is [Calendar Year], [Fiscal Semester] or [Day Name of Week]. (Have you ever wondered how many new customers you acquired on Tuesdays? 🙂 )   And finally, using USERELATIONSHIP allows us to use the full power of xVelocity as native relationships are resolved there.

The results are of course the same as for [New Customers OLD]:

Though, there are still some issues with this calculation if there are filters on other tables:

As you can see, our new [New Customers] measure does not work in this situation as it is only related to our ‘Date’-table but not to ‘Product’.

I will address this issue in a follow-up post where I will also show how the final solution can be used for multidimensional models – Stay tuned!

UPDATE: Part2 can be found here