Another Post about Calculating New and Returning Customers – Part 2

In my previous post I showed a new approach to calculating new (and returning) customers in PowerPivot/tabular using DAX. We ended up with a solution where we added the customer's first order date as a calculated column to our customer-table. This column was then linked to our date-table with an inactive relationship. The final calculation used USERELATIONSHIP() to make use of this relationship as follows:

New Customers:=CALCULATE(
COUNTROWS(Customer),
USERELATIONSHIP(Customer[FirstOrderDate], 'Date'[Date]))

This calculation performs really well as it does not have to touch the fact-table to get the count of new customers. This is also its weakness: filters on other tables are not reflected in the result:

Issue_Old_vs_New

Take row 2 as an example: we have 8 “Total Customers” of which 12 are “New Customers”, which is obviously wrong. The PivotTable is filtered to Subcategory=”Road Bikes” and 8 customers bought a road bike on the 2nd of February. The “New Customers” calculation, on the other hand, is not related to the Subcategory and shows 12 because there were 12 new customers in total across all products.

 

To make our calculation work with other filters as well, we have to somehow relate it to our fact-table. So far we have calculated the customer's first order date only in the customer table. The customer's first order may be related to several fact-rows, e.g. one row for each product the customer bought. Our “New Customers” calculation should only include customers that are active under all other filters as well.

To identify a customer's first order in our fact-table we can again use a calculated column and re-use our previous calculated column in the Customer-table that holds the customer's first order date:

=NOT(
ISBLANK(
LOOKUPVALUE(
Customer[CustomerKey],
Customer[CustomerKey],
[CustomerKey],
Customer[FirstOrderDate],
[Order Date]
)))

This returns True for all fact-rows associated with a customer's first order and False for all other rows.

The final “New Customers v2” calculation is then quite simple – in addition to the active filters we add a further filter to only select rows that are associated with a customer's first order:

New Customers v2:=CALCULATE(
[Total Customers],
'Internet Sales'[IsCustomersFirstOrder] = TRUE())
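To make the logic of the flag and the filtered count easier to follow, here is a minimal sketch in plain Python (not DAX). The sample rows, the customer keys C1 to C3 and the helper names are made up for illustration; only the idea of flagging fact rows that fall on the customer's first order date and then counting distinct customers comes from the calculations above.

```python
from datetime import date

# Hypothetical fact rows: (customer_key, order_date, subcategory).
internet_sales = [
    ("C1", date(2006, 6, 22), "Mountain Bikes"),
    ("C1", date(2008, 2, 2), "Road Bikes"),
    ("C2", date(2008, 2, 2), "Road Bikes"),
    ("C2", date(2008, 2, 2), "Helmets"),
    ("C3", date(2008, 2, 2), "Helmets"),
]

# Counterpart of Customer[FirstOrderDate]: earliest order date per customer.
first_order_date = {}
for customer, order_date, _ in internet_sales:
    known = first_order_date.get(customer, order_date)
    first_order_date[customer] = min(known, order_date)


def is_customers_first_order(row):
    """True for every fact row that falls on the customer's first order date."""
    customer, order_date, _ = row
    return first_order_date[customer] == order_date


def total_customers(rows):
    """Distinct customers in the filtered fact rows."""
    return len({customer for customer, _, _ in rows})


def new_customers_v2(rows):
    """Distinct customers among the filtered rows flagged as first order."""
    return total_customers([row for row in rows if is_customers_first_order(row)])


# Filter context: Subcategory = "Road Bikes" on 2 February 2008.
road_bikes = [
    row for row in internet_sales
    if row[2] == "Road Bikes" and row[1] == date(2008, 2, 2)
]
print(total_customers(road_bikes), new_customers_v2(road_bikes))
```

In this sample, C1 bought a road bike on that day but had already ordered in 2006, so the filtered count of new customers can never exceed the filtered count of total customers.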

 

And these are the results:

Final_Old_vs_NewV2

As you can see there are still differences between “New Customers OLD” and “New Customers v2”. But is this really a problem with the new calculation? Let's analyze the issue using customer “Desiree Dominguez”, where we encounter the first difference, as an example:

Issue_Analyzed

“Desiree Dominguez” had her first order on the 22nd of June in 2006, so she is actually not a “new customer” in 2008. The reason why the old calculation counts her as a “new customer” is that it was the first time she bought a product of subcategory “Road Bikes”. Whether this is correct or not depends on your business definition of a “new customer”. In my experience it is more likely that “Desiree Dominguez” should not be counted as a new customer in 2008, so “New Customers v2” actually returns the more accurate results.

 

Additional stuff:

Another option for this calculation is to rank the [Order Date] or [Sales Order Number] for each customer within the fact-table using the calculation below:

=RANKX(
FILTER(
ALL('Internet Sales'),
[CustomerKey] = EARLIER([CustomerKey])),
[Order Date],
[Order Date],
1,
DENSE
)

[Order Date] could be replaced by [Sales Order Number]. This makes sense if a customer can have multiple orders per day and you also want to distinguish further by [Sales Order Number]. The new field would also allow new analysis. For example the increase/decrease in sales from the second order compared to the first order and so on.
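As a rough illustration of what the ranking column produces, the following plain-Python sketch assigns a dense, ascending rank to the order dates of each customer. The sample rows and the function name customers_order_nr are assumptions made for this example; it is not a translation of how RANKX is evaluated internally.

```python
from datetime import date

# Hypothetical fact rows: (customer_key, order_date).
orders = [
    ("C1", date(2008, 1, 5)),
    ("C1", date(2008, 1, 5)),  # second fact row of the same order date
    ("C1", date(2008, 3, 1)),
    ("C2", date(2008, 2, 2)),
    ("C1", date(2008, 7, 9)),
]


def customers_order_nr(rows):
    """Dense ascending rank of the order date within each customer."""
    dates_by_customer = {}
    for customer, order_date in rows:
        dates_by_customer.setdefault(customer, set()).add(order_date)
    # Rank 1 = earliest distinct order date of the customer, no gaps on ties.
    rank_of = {
        customer: {d: nr for nr, d in enumerate(sorted(dates), start=1)}
        for customer, dates in dates_by_customer.items()
    }
    return [rank_of[customer][order_date] for customer, order_date in rows]


order_numbers = customers_order_nr(orders)
print(order_numbers)
```

Rows that share the customer's earliest order date all get the value 1, which is what the filter on the new column relies on.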

The “New Customer” calculation in this case would still be similar. We just have to filter on the new calculated column instead:

New Customers v3:=CALCULATE(
[Total Customers],
'Internet Sales'[CustomersOrderNr] = 1)

 

Download Final Model (Office 2013!)

 

 

The multidimensional model:

The whole logic of extending the fact-table to identify rows that can be associated with a customer's first order can also be used in a multidimensional model. Once we have prepared the fact-table accordingly, the calculations are quite easy. The biggest issue here does not lie in the multidimensional model itself but in the ETL/relational layer, as this kind of operation can be quite complex, or rather time-consuming in terms of ETL time.

At this point I will not focus on the necessary ETL steps but start with an already prepared fact-table and highlight the extensions that have to be made in the multidimensional model. The fact-table has already been extended by a new column called [IsCustomersFirstOrder], similar to the one we created in tabular using a DAX calculated column. It has a value of 1 for rows associated with a customer's first order and 0 for all other rows.

The next thing we have to do is to create a new table in our DSV to base our new dimension on. For simplicity I used this named query:

MD_NamedQuery_Dim

This table is then joined to the new fact-table:

MD_DSV

The new dimension is quite simple as it only contains one attribute:

MD_Dimension

You may want to hide the whole dimension in the end: it is only used to calculate our “new customers” and would otherwise only confuse the end-user.

 

Once we have added the dimension also to our cube we can create a new calculated measure to calculate our “new customers” as follows:

CREATE MEMBER CURRENTCUBE.[Measures].[New Customers] AS (
[Measures].[Customer Count],
[Is Customers First Order].[Is Customers First Order].&[1]
), ASSOCIATED_MEASURE_GROUP = 'Internet Customers'
, FORMAT_STRING = '#,##0';

The calculation is based on the existing [Customer Count]-measure which uses DistinctCount-aggregation. Similar to DAX, we just extend the calculation by further limiting the cube-space to where “Is Customers First Order” = 1.

This approach also allows you to create aggregations if necessary to further improve performance. So this is probably also the best way in terms of query-performance to calculate the count of new customers in a multidimensional model.

Another Post about Calculating New and Returning Customers

I know, this topic has already been addressed by quite a lot of people. Chris Webb blogged about it here(PowerPivot/DAX) and here(SSAS/MDX), Javier Guillén here, Alberto Ferrari mentions it in his video here and also PowerPivotPro blogged about it here. Still I think that there are some more things to say about it. In this post I will review the whole problem and come up with a new approach on how to solve this issue for both, tabular and multidimensional models with the best possible performance I could think of (hope I am not exaggerating here  🙂 )

OK, let's face the problem of calculating new customers first and define what a new customer for a given period actually is:

A new customer in Period X is a customer that has sales in Period X but did not have any other sales ever before. If Period X spans several smaller time periods (e.g. Period X=January contains 31 days) then there must not be any sales before the earliest smaller time period (before 1st of January) for this customer to be counted as a new customer.

According to this definition the common approach can be divided into 2 steps:
1) find all customers that have sales till the last day in the selected period
2) subtract the number of customers that have sales till the day before the first day in the selected period

 

First of all we need to create a measure that calculates our distinct customers.
For tabular it may be a simple calculated measure on your fact-table:

Total Customers:=DISTINCTCOUNT('Internet Sales'[CustomerKey])

For multidimensional models it should be a physical distinct count measure in your fact-table, ideally in a separate measure group.

How to solve 1) in tabular models

This is also straightforward as DAX has built-in functions that can aggregate from the beginning of time. We use MAX('Date'[Date]) to get the last day in the current filter context:

Customers Till Now:=CALCULATE(
[Total Customers],
DATESBETWEEN(
'Date'[Date],
BLANK(),
MAX('Date'[Date])))

 

 

How to solve 2) in tabular models

This is actually the same calculation as above; we only use MIN to get the first day in the current filter context and subtract 1 to get the day before the first day.

Previous Customers:=CALCULATE(
[Total Customers],
DATESBETWEEN(
'Date'[Date],
BLANK(),
MIN('Date'[Date])-1))

 

To calculate our new customers we can simply subtract those two values:

New Customers OLD:=[Customers Till Now]-[Previous Customers]
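The two-step subtraction can be sketched in plain Python as follows. The sample sales rows and the function names are hypothetical and only mirror the idea of [Customers Till Now] minus [Previous Customers]; they say nothing about how the DAX engine evaluates the measures.

```python
from datetime import date, timedelta

# Hypothetical fact rows: (customer_key, order_date).
sales = [
    ("C1", date(2007, 12, 30)),
    ("C1", date(2008, 1, 10)),
    ("C2", date(2008, 1, 15)),
    ("C3", date(2008, 2, 1)),
]


def customers_until(rows, last_day):
    """Distinct customers with sales from the beginning of time until last_day."""
    return {customer for customer, order_date in rows if order_date <= last_day}


def new_customers_old(rows, period_start, period_end):
    """Customers till the end of the period minus customers before its start."""
    customers_till_now = customers_until(rows, period_end)
    previous_customers = customers_until(rows, period_start - timedelta(days=1))
    return len(customers_till_now) - len(previous_customers)


january = new_customers_old(sales, date(2008, 1, 1), date(2008, 1, 31))
february = new_customers_old(sales, date(2008, 2, 1), date(2008, 2, 29))
print(january, february)
```

Note that both sets have to be built by scanning the sales rows up to a date, which is the part the approach described further below tries to avoid.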

 

 

How to solve 1) + 2) in multidimensional models

Please refer to Chris Webb’s blog here. The solution is pure MDX and is based on a combination of the range-operator “{null:[Date].[Calendar].currentmember}”, NONEMPTY() and COUNT().

 

 

Well, so far nothing new.

 

So let's describe the solution that I came up with. It is based on a different approach. To make the approach easily understandable, we have to rephrase the answer to our original question “What are new customers?”:

A new customer in Period X is a customer that has his first sales in Period X.

According to this new definition we again have 2 steps:
1) Find the first date with sales for each customer
2) count the customers that had their first sales in the selected period

I will focus on tabular models. For multidimensional models most of the following steps have to be solved during ETL.

 

How to solve 1) in tabular models

This is pretty easy, we can simply create a calculated column in our Customer-table and get the first date on which the customer had sales:

=CALCULATE(MIN('Internet Sales'[Order Date]))

 

How to solve 2) in tabular models

The calculated column created above allows us to relate our ‘Date’-table directly to our ‘Customer’-table. As there is already an existing relationship between those tables via ‘Internet Sales’, the new relationship has to be created as an inactive one:

Customer_Date_Relationship

Using this new relationship we can very easily calculate the customers that had their first sales in the selected period:

New Customers:=CALCULATE(
COUNTROWS(Customer),
USERELATIONSHIP(Customer[FirstOrderDate], 'Date'[Date]))

 

Pretty neat, isn’t it?
We can use COUNTROWS() as opposed to a distinct count measure because our ‘Customer’-table only contains unique customers, so we can count each row in the current filter context.
Another nice thing is that we do not have to use any Time-Intelligence function like DATESBETWEEN which are usually resolved using FILTER that would iterate over the whole table. Further it also works with all columns of our ‘Date’-table, no matter whether it is [Calendar Year], [Fiscal Semester] or [Day Name of Week]. (Have you ever wondered how many new customers you acquired on Tuesdays? 🙂 )   And finally, using USERELATIONSHIP allows us to use the full power of xVelocity as native relationships are resolved there.
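To illustrate why this works with any attribute of the date table, here is a small plain-Python sketch. It stores the first order date per customer once and then simply groups customers by an attribute of that date. The sample data, the customer keys and the WEEKDAY_NAMES list are assumptions made for this example.

```python
from collections import Counter
from datetime import date

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Hypothetical fact rows: (customer_key, order_date).
sales = [
    ("C1", date(2007, 12, 30)),
    ("C1", date(2008, 1, 10)),
    ("C2", date(2008, 1, 15)),
    ("C3", date(2008, 2, 1)),
    ("C4", date(2008, 1, 22)),
]

# Step 1: first order date per customer (the calculated column).
first_order_date = {}
for customer, order_date in sales:
    known = first_order_date.get(customer, order_date)
    first_order_date[customer] = min(known, order_date)

# Step 2: count customers by any attribute of their first order date.
new_by_year = Counter(d.year for d in first_order_date.values())
new_by_weekday = Counter(WEEKDAY_NAMES[d.weekday()]
                         for d in first_order_date.values())

print(new_by_year[2008], new_by_weekday["Tuesday"])
```

Once the first order date is stored per customer, the fact rows are no longer needed for the count. This is the same reason the DAX measure does not have to touch the fact-table.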

 

The results are of course the same as for [New Customers OLD]:

Result_Old_vs_New

 

Though, there are still some issues with this calculation if there are filters on other tables:

Issue_Old_vs_New

As you can see, our new [New Customers] measure does not work in this situation as it is only related to our ‘Date’-table but not to ‘Product’.

I will address this issue in a follow-up post where I will also show how the final solution can be used for multidimensional models – Stay tuned!

Download Final Model (Office 2013!)

 

UPDATE: Part2 can be found here

Fiscal Periods, Tabular Models and Time-Intelligence

I recently had to build a tabular model for a financial application and I would like to share my findings on this topic in this post. Financial applications tend to have “Periods” instead of dates, months, etc. However, those Periods are usually tied to months – e.g. January = “Period01”, February = “Period02” and so on. In addition to those “monthly periods” there are usually also further periods like “Period13”, “Period14” etc. to store manually booked values that are necessary for closing a fiscal year. To get the year's closing value (for a P&L account) you have to add up all periods (Period01 to Period14). In DAX this is usually done by using TOTALYTD() or any similar Time-Intelligence Function.

 

Here is what we want to achieve in the end. The final model should allow the End-user to create a report like this:

DesiredResult

This model allows us to analyze data by Year, by Month and of course also by Period. As you can see also the YTD is calculated correctly using DAX’s built-in Time-Intelligence functions.

However, to make use of Time-Intelligence functions a Date-table is required (more information: Time Intelligence Functions in DAX) but this will be covered later. Let's start off with a basic model without a Date-table.

For testing purposes I created this simple PowerPivot model:

BaseModel 

Sample of table ‘Facts’:

AccountID PeriodID Value
4 201201 41,155.59
2 201201 374,930.01
3 201211 525,545.15
5 201211 140,440.40
1 201212 16,514.36
5 201212 639,998.94
3 201213 -100,000.00
4 201213 20,000.00
5 201214 500,000.00

 

 

The first thing we need to do is to add a Date-table. This table should follow these rules:
– granularity=day –> one row for each date
– no gaps between the dates –> a contiguous range of dates
– do not use the fact-table as your date-table –> always use an independent date-table
– the table must contain a column with the data type “Date” with unique values
– “Mark as Date-table”

A Date-table can be created using several approaches:
– Linked Table
– SQL view/table
– Azure Datamarket (e.g. Boyan Penev’s DateStream)
– …

(Creating an appropriate Date-table is not part of this post – for simplicity I used a Linked Table from my Excel workbook).

I further created calculated columns for Year, Month and MonthOfYear.

 

At this point we cannot link this table to our facts. We first have to create some kind of mapping between Periods and “real dates”. I decided to create a separate table for this purpose that links one Period to one Date. (Note: You may also put the whole logic into a calculated column of your fact-table.) This logic is straightforward for periods 1 to 11, which are simply mapped to the last (or first) date in that period. For Periods 12 and later this is a bit more tricky as we have to ensure that these periods are in the right order to make our Time-Intelligence functions work correctly. So Period12 has to be before Period13, Period13 has to be before Period14, etc.

So I mapped Period16 (my sample has 16 Periods) to the 31st of December – the last date in the year as this is also the last period. Period 15 is mapped to the 30th of December – the second to last date. And so on, ending with Period12 mapped to the 27th of December:

PeriodID Date
201101 01/31/2011
201102 02/28/2011
201111 11/30/2011
201112 12/27/2011
201113 12/28/2011
201114 12/29/2011
201115 12/30/2011
201116 12/31/2011
201201 01/31/2012
201202 02/29/2012

I called the table ‘MapPeriodDate’.
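The mapping rule described above can be sketched in plain Python like this. The function name map_period_dates and the period_count parameter are made up for this example; the rule itself (Periods 1 to 11 on the last day of their month, Period12 and later counted backwards from the 31st of December) follows the table above.

```python
import calendar
from datetime import date, timedelta


def map_period_dates(year, period_count=16):
    """Map each PeriodID (YYYYPP) of a year to one 'real' date."""
    mapping = {}
    # Periods 1 to 11: last day of the corresponding month.
    for period in range(1, 12):
        last_day = calendar.monthrange(year, period)[1]
        mapping[year * 100 + period] = date(year, period, last_day)
    # Period 12 and later: consecutive days ending on 31 December,
    # so the last period is the last date of the year.
    year_end = date(year, 12, 31)
    for period in range(12, period_count + 1):
        offset = timedelta(days=period_count - period)
        mapping[year * 100 + period] = year_end - offset
    return mapping


map_2011 = map_period_dates(2011)
print(map_2011[201112], map_2011[201116])
```

With 16 periods this reproduces the dates listed above, e.g. 201112 on the 27th of December. A different number of closing periods would shift where Period12 lands.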

This table is then added to the model and linked to our already existing Period-table (Note: The table could also be linked to the Facts-table directly using PeriodID). This allows us to create a new calculated column in our Facts-table to get the mapping-date for the current Period:

=RELATED(MapPeriodDate[Date])

 

The new column can now be used to link our Facts-table to our Date-Table:

FinalModel

Please take care in which direction you create the relationship between ‘Periods’ and ‘MapPeriodDate’ as otherwise the RELATED()-function may not work!

Once the Facts-table and the Date-table are connected you may consider hiding the unnecessary tables ‘Periods’ and ‘MapPeriodDate’ as all queries should now use the Date-table. Also the Date-column should be hidden so the lowest level of our Date-table should be [Period].

 

To get a [Period]-column in our Date-table we have to create some more calculated columns:

[Period_LookUp]
= LOOKUPVALUE(MapPeriodDate[PeriodID], MapPeriodDate[Date], [Date])

This returns the PeriodID if the current date also exists in the MapPeriodDate-table. Note that we only get a value for dates that are used in the mapping, e.g. the last date of a month for Periods 1 to 11.

 

[Period]
= CALCULATE(MIN([Period_LookUp]), DATESBETWEEN('Date'[Date], [Date], BLANK()))

Our final [Period]-calculation returns the first populated value of [Period_LookUp] on or after the current date. For dates in January the first populated value is found on the 31st, which has a value of 201101 – our PeriodID!
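The effect of the two calculated columns can be imitated in plain Python as shown below. The excerpt of the mapping and the function name period_of are assumptions for this sketch. Like the DAX version, it takes the minimum PeriodID among all mapped dates on or after the given date, which works because PeriodIDs increase together with their mapped dates.

```python
from datetime import date

# Hypothetical excerpt of MapPeriodDate: PeriodID -> mapped date.
map_period_date = {
    201111: date(2011, 11, 30),
    201112: date(2011, 12, 27),
    201113: date(2011, 12, 28),
    201114: date(2011, 12, 29),
    201115: date(2011, 12, 30),
    201116: date(2011, 12, 31),
    201201: date(2012, 1, 31),
}

# Counterpart of [Period_LookUp]: only mapped dates carry a PeriodID.
period_lookup = {day: period for period, day in map_period_date.items()}


def period_of(day):
    """Smallest PeriodID among all mapped dates on or after the given day."""
    later = [period for mapped_day, period in period_lookup.items()
             if mapped_day >= day]
    return min(later) if later else None


print(period_of(date(2011, 12, 1)), period_of(date(2011, 12, 28)))
```

All days from the 1st to the 27th of December end up in Period12, while the 28th to the 31st each get their own closing period.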

 

The last step is to create our YTD-measures. This is now very easy as we can again use the built-in Time-Intelligence functions with this new Date-table:

ValueYTD:=TOTALYTD(SUM([Value]), 'Date'[Date])

And of course also all other Time-Intelligence functions now work out of the box:

ValuePYTD:=CALCULATE([ValueYTD], DATEADD('Date'[Date], -1, YEAR))

 

All those calculations work with Years, Months and also Periods and offer the same flexibility that you are used to from the original financial application.
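To see why mapping the closing periods to the last days of December makes a year-to-date total pick them up in the right order, here is a small plain-Python sketch. It uses some of the sample fact rows from above. The 2012 mapping dates and the function name value_ytd are assumptions that follow the 16-period scheme.

```python
from datetime import date
from decimal import Decimal

# Assumed mapping for 2012, following the scheme used for 2011 above.
period_date = {
    201211: date(2012, 11, 30),
    201212: date(2012, 12, 27),
    201213: date(2012, 12, 28),
    201214: date(2012, 12, 29),
}

# Sample fact rows from above: (AccountID, PeriodID, Value).
facts = [
    (5, 201211, Decimal("140440.40")),
    (5, 201212, Decimal("639998.94")),
    (3, 201213, Decimal("-100000.00")),
    (4, 201213, Decimal("20000.00")),
    (5, 201214, Decimal("500000.00")),
]


def value_ytd(rows, as_of):
    """Sum of all values mapped to a date in the same year up to as_of."""
    total = Decimal("0")
    for _, period, value in rows:
        mapped = period_date[period]
        if mapped.year == as_of.year and mapped <= as_of:
            total += value
    return total


account_5 = [row for row in facts if row[0] == 5]
print(value_ytd(account_5, date(2012, 12, 27)),
      value_ytd(account_5, date(2012, 12, 31)))
```

The manually booked value of Period14 only enters the running total once the mapped date of that period has been reached.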

 

Download Final Model (Office 2013!)

Dynamic ABC Analysis in PowerPivot using DAX

An ABC Analysis is a very common requirement for business users. It classifies e.g. Items, Products or Customers into groups based on their sales and how much impact they had on the cumulated overall sales. This is done in several steps.

I just published a new version of the Dynamic ABC Analysis at www.daxpatterns.com. The article can be found here.

 

1) Order products by their sales in descending order
2) Cumulate the sales beginning with the best selling product till the current product
3) Calculate the percentage of the cumulated sales vs. total sales
4) Assign a Class according to the cumulated percentage

Marco Russo already blogged about this here. He does the classification in a calculated column based on the overall sales of each product. As calculated columns are processed when the data is loaded, this is not dynamic in terms of your filters that you may apply in the final report. If, for example, a customer was within Class A regarding total sales but had no sales last year then a report for last year that uses this classification may give you misleading results.

In this blog I will show how to do this kind of calculation on-the-fly always in the context of the current filters. I am using Adventure Works DW 2008 R2 (download) as my sample data and create a dynamic ABC analysis of the products.

The first thing we notice is that our product table is a slowly changing dimension of type 2 and there are several entries for the same product as every change is traced in the same table.

SCD2Table

So we want to do our classification on the ProductAlternateKey (=Business Key) column instead of our ProductKey (=Surrogate Key) column.

First we have to create a ranking of our products:

Rank CurrentProducts:=IF(HASONEVALUE(DimProduct[ProductAlternateKey]),
IF(NOT(ISBLANK([SUM SA])),
RANKX(
CALCULATETABLE(
VALUES(DimProduct[ProductAlternateKey]),
ALL(DimProduct[ProductAlternateKey])),
[SUM SA])))

The measure first checks whether there is only one product in the current context and whether this product also has sales. If this is the case we calculate our rank. We need the CALCULATETABLE to do the ranking within the currently applied filters on other columns of the DimProduct-table, e.g. if a filter is applied to DimProduct[ProductSubcategoryKey] we want to see our ranking within that selected Subcategory and not against all Products.

I also created a measure [SUM SA] just to simplify the following expressions:

SUM SA:=SUM(FactInternetSales[SalesAmount])

 

The second step is to create a running total starting with the best-selling product/the product with the lowest rank:

CumSA CurrentProducts:=SUMX(
TOPN(
[Rank CurrentProducts],
CALCULATETABLE(
VALUES(DimProduct[ProductAlternateKey]),
ALL(DimProduct[ProductAlternateKey])),
[SUM SA]),
[SUM SA])

We use a combination of SUMX() and TOPN() here. TOPN() returns the top products ordered by [SUM SA]. Further we use our previously calculated rank to only get the products that have the same or more sales than the current product. For example if the current product has rank 3 we sum up the top 3 products to get our cumulated sum (=sum of the first 3 products) for this product. Again we need to use CALCULATETABLE() to retain other filters on the DimProduct-table.

 

The third step is pretty easy – we need to calculate the percentage of the cumulated sales vs. the total sales:

CumSA% CurrentProducts:=
[CumSA CurrentProducts]
/
CALCULATE([SUM SA], ALL(DimProduct[ProductAlternateKey]))

This calculation is straightforward and should not need any further explanation.
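The first three steps (rank, running total, cumulated percentage) can be followed with a few made-up numbers in plain Python. The product codes, the sales amounts and the function name abc_running_totals are hypothetical. The sketch also assumes there are no ties in sales, which RANKX and TOPN would handle differently.

```python
# Hypothetical sales per ProductAlternateKey in the current filter context.
sales_by_product = {"BK-R": 500, "BK-M": 300, "HL-U": 150, "TI-T": 50}


def abc_running_totals(sales):
    """Return {product: (rank, cumulated sales, cumulated share of total)}."""
    total = sum(sales.values())
    # Step 1: order products by their sales in descending order.
    ordered = sorted(sales.items(), key=lambda item: item[1], reverse=True)
    result = {}
    cumulated = 0
    for rank, (product, amount) in enumerate(ordered, start=1):
        # Step 2: running total from the best-selling product downwards.
        cumulated += amount
        # Step 3: share of the running total in the overall total.
        result[product] = (rank, cumulated, cumulated / total)
    return result


running_totals = abc_running_totals(sales_by_product)
for product, values in running_totals.items():
    print(product, values)
```

A product with rank 3 gets the sum of the top 3 products as its cumulated sales, which is what the SUMX/TOPN combination computes per product.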

The result of those calculations can be seen here:

Theory

 

To do our final classification we have to extend our model with a new table that holds our classes and their border-values:

Class LowerBoundary UpperBoundary
A 0 0.7
B 0.7 0.9
C 0.9 1

Class A should contain products whose cumulated sales are between 0 and 0.7 – between 0% and 70%.
Class B should contain products whose cumulated sales are between 0.7 and 0.9 – between 70% and 90%.
etc.

(This table can later be extended to support any number of classes and any boundaries between 0% and 100%.)

To get the boundaries of the selected class we create two measures that are later used in our final calculation:

MinLowerBoundary:=MIN([LowerBoundary])
MaxUpperBoundary:=MAX([UpperBoundary])

 

Our final calculation looks like this:

SA Classified Current:=IF(NOT(ISCROSSFILTERED(Classification[Class])),
[SUM SA],
CALCULATE(
[SUM SA],
FILTER(
VALUES(DimProduct[ProductAlternateKey]),
[MinLowerBoundary] < [CumSA% CurrentProducts]
&& [CumSA% CurrentProducts] <= [MaxUpperBoundary])))

If our Classification-table is not filtered, we just show our [SUM SA]-measure. Otherwise we extend the filter on our DimProduct[ProductAlternateKey] using our classification filtering out all products that do not fall within the borders of the currently selected class.
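As a rough plain-Python counterpart of the final measure, the sketch below filters products by the boundaries of the selected classes. The product codes, amounts and cumulated percentages are hypothetical sample values, and the function name sa_classified is made up for this example. The boundary test (greater than the lower bound, less than or equal to the upper bound) mirrors the measure above.

```python
# Classification table from above: class -> (LowerBoundary, UpperBoundary).
classification = {"A": (0.0, 0.7), "B": (0.7, 0.9), "C": (0.9, 1.0)}

# Hypothetical sales and cumulated percentages per product.
sum_sa = {"BK-R": 500, "BK-M": 300, "HL-U": 150, "TI-T": 50}
cum_sa_pct = {"BK-R": 0.5, "BK-M": 0.8, "HL-U": 0.95, "TI-T": 1.0}


def sa_classified(selected_classes=None):
    """Sales of all products whose cumulated share falls into the selection."""
    if not selected_classes:
        # No filter on the classification table: plain [SUM SA].
        return sum(sum_sa.values())
    # Counterparts of [MinLowerBoundary] and [MaxUpperBoundary].
    lower = min(classification[name][0] for name in selected_classes)
    upper = max(classification[name][1] for name in selected_classes)
    return sum(amount for product, amount in sum_sa.items()
               if lower < cum_sa_pct[product] <= upper)


print(sa_classified(["A"]), sa_classified(["B"]), sa_classified(["C"]))
```

One property of working with a minimum lower and a maximum upper boundary shows up in this sketch: selecting two non-adjacent classes such as A and C spans the range of B as well.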

This measure allows us to see the changes of the classification of a specific product e.g. over time:

FinalResults_1

In 2006 our selected product was in Class C. For 2007 and 2008 it improved and is now in Class A. Still, overall it resides in Class B.

We may also analyze the impact of our promotions on the sales of our classified products:

FinalResults_2

Our Promotion “Touring-1000 Promotion” only had an impact on products in Class C, so we may consider stopping that promotion and investing more in the other promotions that affect all classes.

 

The classification can be used everywhere you need it – in the filter, on rows or on columns, even slicers work. The only drawback is that the on-the-fly calculation can take quite some time. If I find some time in the future I may try to further tune the calculations and update this blog-post.

 

The example workbook can be downloaded here:

Note that it is already in Office 2013 format and cannot be opened with any previous versions of Excel/PowerPivot.
It also includes a second set of calculations that use the same logic as described above but do all the calculations without retaining any filters on the DimProduct-table. This allows you to filter on Class “A” and ProductSubcategory “Bike Racks” and realize that “Bike Racks” are not a Class “A” product, or to see which Subcategories or Categories actually contain Class A, B or C products!

Relationships on columns with non-unique values and how to tune them

If you follow my blog frequently, you may have realized that many of my posts cover scenarios where I need to create relationships on columns with non-unique values, for example when handling SCD2 facts and dimensions or dealing with parallel hierarchies. In all those examples I use a combination of the FILTER and CONTAINS functions to create those non-unique relationships between tables. Recently I ran into performance issues using this approach and started thinking about how to tune this kind of "relationship".

First of all I took a look at the DAX Query Plan using SQL Server Profiler to get some more information on what’s going on inside xVelocity and DAX. I used a calculation from my previous post as an example:

Value_Sender:=
CALCULATE([Value_SUM],
FILTER(
     'Facts',
     CONTAINS(
          'Sender',
          'Sender'[SenderID],
          'Facts'[SenderID]))
)

We had to use this calculation as SenderID was not unique in our Sender-table (and of course also not in our Facts-table). When we tried to create a relationship we got the error

Error

"The relationship cannot be created because each column contains duplicate values. Select at least one column that contains only unique values."

This makes perfect sense as tabular models natively only support 1:n / n:1 relationships and therefore at least one column has to contain only unique values.

The calculation above, using FILTER and CONTAINS, still allowed us to link our tables on those non-unique column values. The drawback is that it may not perform very well, as it cannot use the full power of xVelocity, which relies heavily on predefined relationships between columns and tables. In our case we create the relationship on-the-fly and therefore do not use xVelocity’s full potential. Further, because CONTAINS is not a native xVelocity function, the engine has to call back to the DAX formula engine, as we can see in the profiler:

VertiPaq SE Query / 0 – VertiPaq Scan (8 times of which 4 are of subclass Internal)

VertiPaq_SE_Query_CONTAINS

[CallbackDataID(….)] is always an indicator that the xVelocity engine could not resolve the query on its own but had to call functions from the DAX formula engine as Jeffrey Wang described here. This in the end results in poor query performance (at least compared to a pure xVelocity query).

DAX Query Plan / 1 – DAX VertiPaq Logical Plan (just for sake of completeness)

QueryPlan_CONTAINS

 

So the key to tuning this kind of relationship is to relate the tables at design time to get full xVelocity performance at query time. As the tables cannot be related directly, we have to add an intermediate table containing only unique values – in our case unique SenderIDs. This table can then be used to create 1:n relationships with our Sender and our Facts table:

SenderID
TOT
EUROPE
GER
FR
NA
USA
CAN
EXTERNAL
TOT_2
JV
HOLD

If you are using a SQL database as datasource you can simply create a view using SELECT DISTINCT or write the select statement directly in PowerPivot.
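If the source is not a SQL database, the same distinct list can be derived with a few lines of code. The following plain-Python sketch is only an illustration: the (ID, SenderID) pairs are taken from the Sender table of the parallel hierarchies post, and the variable names are made up.

```python
# (ID, SenderID) pairs of the Sender dimension with two parallel hierarchies.
sender_rows = [
    (1, "TOT"), (2, "EUROPE"), (3, "GER"), (4, "FR"), (5, "NA"),
    (6, "USA"), (7, "CAN"), (-99, "EXTERNAL"), (8, "TOT_2"), (9, "JV"),
    (10, "FR"), (11, "USA"), (12, "HOLD"), (13, "GER"), (14, "CAN"),
]

# dict.fromkeys keeps the first occurrence of each SenderID in order,
# which gives the same result as a SELECT DISTINCT on that column.
sender_bridge = list(dict.fromkeys(sender_id for _, sender_id in sender_rows))

print(sender_bridge)
```

Every SenderID now appears exactly once, so the new table can serve as the "one" side of a relationship to both the Sender table and the Facts table.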

Once the table has been added we can create relationships as shown below:

Relationships

This design is similar to a many-to-many design with the only difference that we combined our intermediate table and our bridge table into one single table. The rest is similar to resolving many-to-many relationships as I described here. So once we have created these relationships we can change our calculations as follows:

Value_Sender_Bridge:=
CALCULATE(
     [Value_SUM],
     'SenderBridge',
     'Sender'
)

This calculation can be resolved by the xVelocity engine without any callbacks to DAX formula engine. Taking a look at the profiler trace proves this:

VertiPaq SE Query / 0 – VertiPaq Scan (4 times of which 2 are of subclass Internal)

VertiPaq_SE_Query_CROSSTABLE

None of the VertiPaq SE queries used [CallbackDataID(….)] or any other complex function. Only joins have been used which can be handled by the xVelocity engine very easily:

DAX Query Plan / 1 – DAX VertiPaq Logical Plan (just for sake of completeness)

QueryPlan_CROSSTABLE

Also the logical plan could be drastically simplified containing only Sum_Vertipaq and Scan_Vertipaq operators telling us that the query can be executed in pure Vertiscan mode.

Details on Vertipaq operators and Vertiscan mode can be found here.

 

 

I have not run a lot of tests on bigger tables yet where I could have compared the overall performance but this may follow in a later post. With this post I just wanted to show how to link tables on columns containing non-unique values.

It would be great if tabular models supported this kind of relationship out of the box so that you could create them like any other relationship using drag&drop in the diagram view. This would have made many of my previous workshops at customers much easier.

Consolidation and Intercompany Elimination with parallel hierarchies

In my last post I described an approach on how to calculate consolidated values (adjusted by intercompany eliminations) using PowerPivot and DAX. In this post I will extend this technique so it can also be used with parallel hierarchies as this is also very common in financial applications.

Switching from a single hierarchy to multiple parallel hierarchies has several drawbacks for tabular models:
1) business keys cannot be used to build the PC-hierarchy as they are not unique anymore
2) artificial keys and parent-keys are required
3) as business keys are not unique anymore, we also cannot create a relationship between facts and our dimension
4) PowerPivot tables do not support "UNION" to load several single hierarchies and combine them

All the issues described above have to be handled before the hierarchies can be loaded into PowerPivot. This is usually done in a relational database but this will not be covered in this post.

For simplicity I created a table in Excel that already has the correct format:

ID ParentID SenderID Name
1   TOT My Whole Company by Region
2 1 EUROPE Europe
3 2 GER Germany Company
4 2 FR French Company
5 1 NA North America
6 5 USA US Company
7 5 CAN Canadian Company
-99   EXTERNAL EXTERNAL
8   TOT_2 My Whole Company Legal View
9 8 JV Joint Ventures
10 9 FR French Company
11 9 USA US Company
12 8 HOLD My cool Holding
13 12 GER Germany Company
14 12 CAN Canadian Company

We have 2 parallel hierarchies – "My Whole Company by Region" and "My Whole Company Legal View". In PowerPivot this looks like this:

FullHierarchy

Each Company can be assigned to different nodes in every hierarchy.

As I already noted above, the SenderIDs are not unique anymore, and therefore we cannot create a relationship to our fact-table. So we have to handle this in our calculation. The approach is similar to handling SCD2 dimensions in PowerPivot, which I described in previous posts (Part1 and Part2). We use the CONTAINS function to filter our fact-table based on the currently active SenderIDs in the dimension-table:

Value_Sender:=
CALCULATE([Value_SUM],
FILTER(
     'Facts',
     CONTAINS(
          'Sender',
          'Sender'[SenderID],
          'Facts'[SenderID]))
)

Actually this is the only difference in the calculation when we are dealing with multiple parallel hierarchies. The other calculations are the same with the only difference that we have to use [Value_Sender] instead of [Value_SUM]:

Value_Internal:=
CALCULATE([Value_Sender],
FILTER(
     'Facts',
     CONTAINS(
          'Sender',
          'Sender'[SenderID],
          'Facts'[ReceiverID]))
)

and

Value_External:=
CALCULATE([Value_Sender],
FILTER(
     'Facts',
     NOT(
          CONTAINS(
               'Sender',
               'Sender'[SenderID],
               'Facts'[ReceiverID])))
)

As you can see this approach also works very well with multiple hierarchies. In the end we get the desired results:

FinalResults

Taking a look at "My cool Holding" we see that a value of 50 is eliminated as there have been sales from CAN to GER. On the top level of the hierarchy we see the same values being eliminated (170) as both hierarchies contain the same companies and therefore only sales to EXTERNAL can be summed up to get the correct value.

As this technique operates on leaf-level it works with all possible hierarchies regardless of how many companies are assigned to which node or how deep the hierarchies are!

 

(Just to compare the values from above – the fact-table is the same as in the previous post: )

SenderID ReceiverID Value
FR GER 100
CAN GER 50
GER EXTERNAL 70
USA CAN 30
USA FR 10
USA EXTERNAL 90
CAN USA 50
CAN EXTERNAL 10
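
To cross-check the figures outside of PowerPivot, the leaf-level filtering can be sketched in a few lines of Python. This is only an illustration of the idea, not the workbook itself: the dimension rows for the legal view (which companies sit below "My cool Holding", and the node "Other Holding") are assumptions made up for this sketch, and the function names simply mirror the measures above.

```python
# Fact rows from the table above: (SenderID, ReceiverID, Value)
FACTS = [
    ("FR", "GER", 100),
    ("CAN", "GER", 50),
    ("GER", "EXTERNAL", 70),
    ("USA", "CAN", 30),
    ("USA", "FR", 10),
    ("USA", "EXTERNAL", 90),
    ("CAN", "USA", 50),
    ("CAN", "EXTERNAL", 10),
]

# Dimension rows: (hierarchy, node, SenderID). SenderID is not unique,
# because every company appears once per hierarchy.
# The "Legal" assignments are ASSUMED for illustration only.
SENDER_DIM = [
    ("Region", "Europe", "GER"),
    ("Region", "Europe", "FR"),
    ("Region", "North America", "USA"),
    ("Region", "North America", "CAN"),
    ("Legal", "My cool Holding", "CAN"),  # assumed
    ("Legal", "My cool Holding", "GER"),  # assumed
    ("Legal", "Other Holding", "FR"),     # hypothetical node
    ("Legal", "Other Holding", "USA"),    # hypothetical node
]


def active_sender_ids(hierarchy, node=None):
    """SenderIDs visible for the current selection (node=None is the top level)."""
    return {s for h, n, s in SENDER_DIM
            if h == hierarchy and (node is None or n == node)}


def value_sender(ids):
    """Facts whose sender is one of the active SenderIDs."""
    return sum(v for s, r, v in FACTS if s in ids)


def value_internal(ids):
    """Facts where sender and receiver are both active SenderIDs."""
    return sum(v for s, r, v in FACTS if s in ids and r in ids)


def value_external(ids):
    """Facts sent by an active SenderID to a receiver outside the selection."""
    return sum(v for s, r, v in FACTS if s in ids and r not in ids)


holding = active_sender_ids("Legal", "My cool Holding")
print(value_sender(holding), value_internal(holding), value_external(holding))
```

Because only sets of leaf IDs are compared, the top level of either hierarchy yields the same result, no matter how the companies are grouped below it.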

Consolidation and Intercompany Elimination made easy using PowerPivot and DAX

In my past years as a consultant for Microsoft BI I had to deal with financial models quite a lot. They are usually small in size but very complex in terms of calculations like currency conversion, margin-calculations, benchmarks and so on. Another topic I came across very frequently was consolidation and intercompany eliminations. Recently I had to deal with this topic again and thought of how this could be solved in PowerPivot/DAX.

First of all a little background information on what consolidation and intercompany eliminations are. In bigger companies with several subsidiaries those subsidiaries are usually structured in some way – e.g. by legal entity, by country etc.

Here is a little example of a corporate group called "My Whole Company" with subsidiaries in Germany, France, USA and Canada, structured by geography:

SampleHierarchy

These companies also do business with each other and with external customers. The French Company and the Canadian Company sell something to the Germany Company, the US Company sells goods to the Canadian Company and the French Company, and so on.

To get a consolidated value for higher nodes (e.g. Europe), transactions between companies within Europe have to be eliminated. Otherwise sales within Europe would be aggregated up even though the goods never left Europe and therefore cannot be added to Europe’s sales.

(I am not a financial guy so I focused on the technical stuff rather than on the business background. This whole example just demonstrates a general approach to solve this issue technically.)

 

Our fact-table contains columns for the SenderCompany, ReceiverCompany and Value:

SenderID ReceiverID Value
FR GER 100
CAN GER 50
GER EXTERNAL 70
USA CAN 30
USA FR 10
USA EXTERNAL 90
CAN USA 50
CAN EXTERNAL 10

For our Sender and Receiver-Table we can use the same source-table (details on this follow):

SenderID  ParentSenderID  Name
TOT                       My Whole Company
EUROPE    TOT             Europe
GER       EUROPE          Germany Company
FR        EUROPE          French Company
NA        TOT             North America
USA       NA              US Company
CAN       NA              Canadian Company
EXTERNAL                  EXTERNAL

Note that we use a parent-child hierarchy here as most financial applications use them to represent hierarchical structures. The parent-child hierarchy is resolved using PATH(), PATHITEM() and LOOKUPVALUE() functions as described here by Alberto Ferrari. The final table is added twice to our model, for Sender and Receiver:

PowerPivotModel

(for simplicity I removed all other columns of the fact-table and their corresponding dimension-tables).
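
For readers who have not resolved a parent-child hierarchy before, here is a rough Python sketch of what that resolution does with the source-table above. It is not the DAX from the referenced article; `path` and `level_name` are names chosen for this sketch, and they only imitate the idea of walking from a node up to its root and then picking the name at a given level.

```python
# Parent-child rows from the table above: SenderID -> (ParentSenderID, Name)
COMPANIES = {
    "TOT": (None, "My Whole Company"),
    "EUROPE": ("TOT", "Europe"),
    "GER": ("EUROPE", "Germany Company"),
    "FR": ("EUROPE", "French Company"),
    "NA": ("TOT", "North America"),
    "USA": ("NA", "US Company"),
    "CAN": ("NA", "Canadian Company"),
    "EXTERNAL": (None, "EXTERNAL"),
}


def path(node_id):
    """Root-to-node list of IDs, found by following the parent links upwards."""
    chain = []
    current = node_id
    while current is not None:
        chain.append(current)
        current = COMPANIES[current][0]
    return list(reversed(chain))


def level_name(node_id, level):
    """Name at a 1-based level of the path, or None if the path is shorter."""
    ids = path(node_id)
    if level > len(ids):
        return None
    return COMPANIES[ids[level - 1]][1]


print(path("GER"))           # IDs from the root down to GER
print(level_name("GER", 2))  # name of the second level above GER
```

Flattening every row into one column per level in this way is what allows a regular hierarchy with levels such as Level2 to be built on top of the table.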

The next step is to implement the business logic for consolidation and intercompany elimination. From the description above we can break this logic down into smaller pieces. To get a consolidated value, we have to:

  1. Select only Senders below the currently selected SenderHierarchy-Node
  2. Select only Receivers, that are NOT below the ReceiverHierarchy-Node that is equal to the currently selected SenderHierarchy-Node
  3. Sum up the remaining fact-rows
    Step 1 is done automatically for us because we defined a relationship between our Sender- and our Facts-table.
    Step 3 is easily done with the following DAX calculation combined with the right filters (defined in Step 2 below):
    Value_SUM:=SUM(Facts[Value])
    Step 2 is the tricky one.
    If you think of multidimensional models this is usually solved by a combination of AGGREGATE() and LINKMEMBER(). LINKMEMBER() is used to select the same node in the ReceiverHierarchy that is currently selected in the SenderHierarchy and AGGREGATE() is then used to aggregate all except the leaves below that node. This complex calculation may result in bad performance and is also not easily readable. There are also other implementations, for example using many-to-many relationships, that perform better but make the model even more complex.
    Fortunately DAX can deal with this problem very easily, as we will see now. All of the following calculations use the CONTAINS-function. It can be used to filter a table based on another column’s values, similar to the IN-operator in T-SQL. First of all we will start with an example where we calculate all sales within a selected node. If the node "Europe" is selected, we want to see all sales within "Europe", meaning those where both Sender and Receiver are below the node "Europe":
    Value_Internal:=
    CALCULATE(
         [Value_SUM],
         FILTER(
              'Receiver',
              CONTAINS(
                   'Sender', 
                   'Sender'[SenderID], 
                   'Receiver'[ReceiverID])))

    The Receiver-table is filtered using the CONTAINS-function. As FILTER operates row-by-row, each ‘Receiver’[ReceiverID] is checked whether its value can also be found in the currently selected SenderIDs. In T-SQL this would be similar to:

    SELECT SUM([Value])
    FROM Facts
    INNER JOIN Sender
        ON Facts.SenderID = Sender.SenderID
    INNER JOIN Receiver
        ON Facts.ReceiverID = Receiver.ReceiverID
    WHERE Receiver.ReceiverID IN (SELECT SenderID 
                                  FROM Sender 
                                  WHERE Level2='Europe')
    AND Sender.Level2 = 'Europe'

    Once you understand how the T-SQL works, the DAX-calculation should also be clear.

      This is the calculation we can use to get our internal sales. To calculate the consolidated sales for our "Europe"-node we have to do the opposite – select only Receivers that are NOT below the currently selected Sender-node. In T-SQL we would simply use "NOT IN" – in DAX this is similar. All we have to do is to wrap a NOT() around our CONTAINS-function:
      Value_External:=
      CALCULATE(
           [Value_SUM],
           FILTER(
                'Receiver',
                NOT(
                     CONTAINS(
                          'Sender', 
                          'Sender'[SenderID], 
                          'Receiver'[ReceiverID]))))

      And this is already our final calculation to eliminate intercompany sales and get a consolidated value for any Sender-node we select!

      ConsolidatedValues1

    Europe has total sales of 170 of which 100 have been internal (FR to GER) and 70 have been external (GER to EXTERNAL). North America has a total value of 240 of which 80 have been internal (USA to CAN 30 and CAN to USA 50) and 160 have been sold externally, either to Europe or to external customers.
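
These figures can be reproduced with a small Python sketch that mirrors the two-table logic: first the Receiver rows are filtered against the selected SenderIDs, then the remaining facts are summed. The leaf sets per node are taken from the sample hierarchy; the function names are chosen to match the measures and are otherwise just part of this illustration.

```python
# Fact rows from the fact-table: (SenderID, ReceiverID, Value)
FACTS = [
    ("FR", "GER", 100),
    ("CAN", "GER", 50),
    ("GER", "EXTERNAL", 70),
    ("USA", "CAN", 30),
    ("USA", "FR", 10),
    ("USA", "EXTERNAL", 90),
    ("CAN", "USA", 50),
    ("CAN", "EXTERNAL", 10),
]

# Leaf rows of the Receiver table and the leaves below each Sender node
RECEIVERS = {"GER", "FR", "USA", "CAN", "EXTERNAL"}
NODES = {
    "Europe": {"GER", "FR"},
    "North America": {"USA", "CAN"},
}


def value_sum(senders, receivers=RECEIVERS):
    """Sum of all facts that survive the sender and receiver filters."""
    return sum(v for s, r, v in FACTS if s in senders and r in receivers)


def value_internal(senders):
    # Keep only receivers that are also among the selected senders
    kept = {r for r in RECEIVERS if r in senders}
    return value_sum(senders, kept)


def value_external(senders):
    # Keep only receivers that are NOT among the selected senders
    kept = {r for r in RECEIVERS if r not in senders}
    return value_sum(senders, kept)


for name, leaves in NODES.items():
    print(name, value_sum(leaves), value_internal(leaves), value_external(leaves))
```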

     

    The calculation can be further simplified in terms of usability. You may have already realized that we did not refer to our hierarchy within any of our calculations. We always used the leaf-level (SenderID/ReceiverID) to calculate our values. To solve the original problem in a multidimensional model we had to create a Receiver-dimension and select/deselect members based on the current selection in the Sender-hierarchy. This is not necessary anymore as we only use leaf-elements in our calculation. So we do not need a Receiver-hierarchy, and we do not even need a Receiver-table in our PowerPivot-model, as we already have the ReceiverID in our Facts-table:

    Value_External_Facts:=
    CALCULATE(
         [Value_SUM]; 
         FILTER(
              'Facts'; 
              NOT(
                   CONTAINS(
                        'Sender'; 
                        'Sender'[SenderID];
                        'Facts'[ReceiverID]))))

     

    There may also be cases where the user wants to select the Receiver-Node on its own, for example to see sales from European companies to external customers (EXTERNAL). To do this you have to adapt the calculation as follows:

    Value_Selected:=CALCULATE(
         [Value_SUM];
          FILTER(
              ALL(Receiver); 
              NOT(
                   CONTAINS(
                        'Receiver'; 
                        'Receiver'[ReceiverID];
                        'Receiver'[ReceiverID]))))

    This calculation may look strange at first sight but does exactly what we need. It removes all filters from the Receiver-table and applies a new filter that is the opposite of the currently existing filter using a combination of NOT() and CONTAINS(). The 3rd parameter of the CONTAINS-function always refers to the row-context of the FILTER-function whereas the 1st and the 2nd parameter are always based on the current query-context.

    These are the results:

    ConsolidatedValues_Selected

    You could also use the Value_SUM-calculation to get the opposite value – what has been sold from the current Sender-node to any Receiver located in Europe. Our Canadian Company for example has sold 50 to GER which is below Europe and 60 to companies outside Europe – in this case 50 to USA and 10 to EXTERNAL.
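
The Canadian example can be checked with a short Python sketch of the "opposite filter" idea. It assumes that the user has selected the node Europe on the Receiver side, so the selected receiver leaves are GER and FR; `value_selected` then sums over the complement of that selection, while `value_sum` respects it.

```python
# Fact rows from the fact-table: (SenderID, ReceiverID, Value)
FACTS = [
    ("FR", "GER", 100),
    ("CAN", "GER", 50),
    ("GER", "EXTERNAL", 70),
    ("USA", "CAN", 30),
    ("USA", "FR", 10),
    ("USA", "EXTERNAL", 90),
    ("CAN", "USA", 50),
    ("CAN", "EXTERNAL", 10),
]

ALL_RECEIVERS = {"GER", "FR", "USA", "CAN", "EXTERNAL"}


def value_sum(senders, selected_receivers):
    """Facts from the selected senders to the selected receivers."""
    return sum(v for s, r, v in FACTS
               if s in senders and r in selected_receivers)


def value_selected(senders, selected_receivers):
    """Facts from the selected senders to every receiver that is NOT selected."""
    # Start from all receivers (like ALL(Receiver)) and drop the selected ones
    complement = {r for r in ALL_RECEIVERS if r not in selected_receivers}
    return sum(v for s, r, v in FACTS if s in senders and r in complement)


europe = {"GER", "FR"}
print(value_sum({"CAN"}, europe))       # sold by CAN to receivers below Europe
print(value_selected({"CAN"}, europe))  # sold by CAN to everyone else
```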

     

    As you can see using DAX makes these kinds of calculations very easy compared to multidimensional models. Another important aspect is that financial models are usually very small and therefore should fit into memory very easily. The independence from hierarchies is also an important aspect here.

    You can also download the Excel-Workbook with the sample-calculations here:

     

    UPDATE:

    In Part2 I will show how to handle multiple parallel hierarchies.

    Resolving Many to Many relationships leveraging DAX Cross Table Filtering

    If you ever had to deal with many-to-many relationships in PowerPivot then I am quite sure that you came across the blog-posts Many-to-Many relationships in PowerPivot by Marco Russo and PowerPivot and Many to Many Relationships by Alberto Ferrari. Both posts describe how this issue can be solved using DAX and provide examples and also very good background information on that topic.

    I recently had to struggle with many-to-many relationships again at one of my customers who complained that many-to-many relationships are just too complex. So I rethought the problem and searched for alternatives. During my investigations I also found Jeffrey Wang’s blog-post The Logic behind the Magic of DAX Cross Table Filtering again – a must-read for all people interested in BISM (tabular and multidimensional), DAX or MDX. In the middle of the post he describes the individual operations the CALCULATE-function performs:

    Calculate function performs the following operations:
    1. Create a new filter context by cloning the existing one.
    2. Move current rows in the row context to the new filter context one by one and apply blocking semantics against all previous tables.
    3. Evaluate each setfilter argument in the old filter context and then add setfilter tables to the new filter context one by one and apply blocking semantics against all tables that exist in the new filter context before the first setfilter table is added.
    4. Evaluate the first argument in the newly constructed filter context.

    (the single steps are described in more details in his post)

    Important for us is the fact that you can pass tables as arguments to the CALCULATE-function and those tables are automatically filtered by the current context. Even more important is that this filtering works in both directions of a relationship. So adding a table that has an existing relationship with any table in the current context is similar to a JOIN in terms of SQL. Filters applied to the newly joined tables are also propagated through all other tables, regardless of the direction of the relationship.

    In his Question #1 Jeffrey counts the number of Subcategories for a given Product as an example (which is always 1 as there is a 1:n relationship between DimProductSubcategory and DimProduct). To get the correct value he uses the following calculation to extend the filter context by DimProduct and thereby also filter DimProductSubcategory indirectly:

    CALCULATE(COUNTROWS(DimProductSubcategory), DimProduct)

     

    Knowing that we can use CALCULATE to resolve 1:n relationships in both directions we can also use this approach to solve m:n relationships pretty easily!

    Alberto Ferrari uses an example where the facts are related to Individuals. Those Individuals can be assigned to 1 or more Targets. This mapping is done using a bridge table to model the many-to-many relationship:

    M2M_AlbertoFerrari

    As you can see there is no "real" many-to-many relationship in the model as it has already been split up into a bridge-table using 1:n and n:1 relationships. Adding the information from above to this model we come up with a pretty easy DAX calculation which resolves the many-to-many relationship. Here is a little example where we simply count the rows in our Audience table:

    RowCount:=COUNTROWS('Audience')

    This RowCount is currently not filtered by table Targets as there is no chain of 1:n relationships between Targets and Audience. Only filters applied to directly related tables (Individuals, Time, Calendar and Networks)  are considered when the calculation is evaluated.

    By wrapping a CALCULATE-function around our calculation and adding the tables that participate in the many-to-many relationship as parameters we explicitly extend the filter context for our calculation. As filters on those "extended tables" also impact the current filter-context, the value for our Targets also changes according to the Individuals belonging to the current Target:

    RowCount_M2M:=CALCULATE(
         [RowCount], 
         'Individuals', 
         'TargetsForIndividuals', 
         'Targets')

    Finally, to resolve the many-to-many relationship for our calculation all we have to do is to explicitly add all tables of the many-to-many relationship to the filter-context of our calculation by using the CALCULATE-function. The rest is done automatically by DAX’s Cross Table Filtering Logic!

    The calculation can be further extended to only apply this logic when there is a filter on table Targets, otherwise we do not have to resolve the many-to-many relationship:

    RowCount_M2M:=IF(ISCROSSFILTERED('Targets'[Target]),
    CALCULATE(
         [RowCount], 
         'Individuals', 
         'TargetsForIndividuals'),
    [RowCount])

    Doing this ensures that the more complex calculation is only executed when there is a filter on table Targets. Furthermore, in that case Targets is already part of the current context and therefore does not have to be added again for our calculation.

    In the end we come up with a solution where we only have to add the intermediate table (Individuals) and the bridge table (TargetsForIndividuals) to our CALCULATE-functions to resolve the many-to-many relationship – pretty nice, isn’t it?

    I think this approach should also be very easy to understand for people that are familiar with SQL and relational databases and just switched to tabular modeling.
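
For readers coming from a programming background, the effect of passing the bridge table to CALCULATE can be imitated in a few lines of Python. All rows below are made up for this sketch (they are not taken from Alberto Ferrari's sample), and the two functions only mimic the behaviour of the measures: the Target filter is first translated into a set of Individuals via the bridge rows, and only then are the fact rows counted.

```python
# All data below is made up for illustration.
# Fact rows of the Audience table: (individual, some measured value)
AUDIENCE = [
    ("ind1", 10),
    ("ind1", 5),
    ("ind2", 7),
    ("ind3", 3),
]

# Bridge rows of TargetsForIndividuals: (target, individual)
TARGETS_FOR_INDIVIDUALS = [
    ("Male", "ind1"),
    ("Young", "ind1"),
    ("Young", "ind2"),
    ("Female", "ind3"),
]


def row_count(individuals=None):
    """Count Audience rows, optionally restricted to a set of individuals."""
    return sum(1 for ind, _ in AUDIENCE
               if individuals is None or ind in individuals)


def row_count_m2m(target=None):
    """Count Audience rows for a Target by walking through the bridge rows."""
    if target is None:
        # No filter on Targets: the plain count is enough
        return row_count()
    individuals = {ind for t, ind in TARGETS_FOR_INDIVIDUALS if t == target}
    return row_count(individuals)


for target in (None, "Male", "Young", "Female"):
    print(target, row_count_m2m(target))
```

Note that the per-target counts add up to more than the unfiltered count, because an Individual that belongs to several Targets is counted once for each of them. That is the expected, non-additive behaviour of a many-to-many relationship.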

    The PowerPivot workbook with samples for all approaches can be downloaded here:

    ISO 8601 Week in DAX

    I recently built a PowerPivot model where I had to display weeks according to ISO 8601. As I came across this frequently in the past when I developed SQL Server databases (prior to 2008) I was aware that ISO weeks can be calculated using some complex logic. When I discussed this with some of my colleagues during a training, one of the attendees told me that this can be solved in Excel very easily using Excel’s WEEKNUM()-function. This function takes two arguments:

    Serial_num is a date within the week. Dates should be entered by using the DATE function, or as results of other formulas or functions. For example, use DATE(2008,5,23) for the 23rd day of May, 2008. Problems can occur if dates are entered as text.

    Return_type is a number that determines on which day the week begins. The default is 1.

    Return_type Week Begins
    1 Week begins on Sunday. Weekdays are numbered 1 through 7.
    2 Week begins on Monday. Weekdays are numbered 1 through 7.

    According to Excel’s Online-Help the second parameter only supports values 1 and 2. But this is not 100% true. You can also use the value 21 as second parameter, and guess what – it now returns week numbers according to ISO 8601. As many DAX-functions are derived from Excel (including the WEEKNUM()-function), this also works with DAX’s WEEKNUM()-function!

    So by creating a calculated column as

    =WEEKNUM([Date];21)

    you get the ISO week number for the current [Date].
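
To see what the "complex logic" behind an ISO week number looks like, and to have something to compare the calculated column against, here is a small Python sketch. It uses the common rule that an ISO week belongs to the year that contains its Thursday; `iso_week` is a name chosen for this sketch, and Python's built-in `date.isocalendar()` serves as the reference.

```python
from datetime import date, timedelta


def iso_week(d):
    """ISO 8601 week number of a date, derived from the Thursday of its week."""
    # weekday(): Monday = 0 ... Sunday = 6, so 3 is Thursday
    thursday = d + timedelta(days=3 - d.weekday())
    # The week number is the position of that Thursday within its own year
    return (thursday.timetuple().tm_yday - 1) // 7 + 1


for sample in (date(2011, 12, 31), date(2012, 1, 1), date(2012, 1, 2)):
    # Compare the hand-written rule with the standard library
    print(sample, iso_week(sample), sample.isocalendar()[1])
```

Dates around New Year are the interesting cases, as the first days of January can still belong to the last ISO week of the previous year.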

    If you also want to calculate “ISO years” to build clean hierarchies you may want to use this formula in another calculated column:

    =IF([ISOWeek]<5 && [CalendarWeek] > 50;
         [Year]+1;
         IF([ISOWeek]>50 && [CalendarWeek]<5; 
              [Year]-1; 
              [Year]))
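
The ISO-year rule can be verified against Python's calendar functions. The sketch below assumes that [CalendarWeek] is the regular week number where week 1 contains the 1st of January and weeks start on Sunday (the default WEEKNUM behaviour described above); `calendar_week` and `iso_year` are names chosen for this illustration.

```python
from datetime import date, timedelta


def calendar_week(d):
    """Week number where week 1 contains 1 January and weeks start on Sunday.

    ASSUMPTION: this is what the [CalendarWeek] column holds.
    """
    jan1 = date(d.year, 1, 1)
    # Position of 1 January within its Sunday-based week (0 = Sunday)
    jan1_offset = (jan1.weekday() + 1) % 7
    return ((d - jan1).days + jan1_offset) // 7 + 1


def iso_year(d):
    """Same rule as the calculated column: shift the year at the boundaries."""
    iso_week = d.isocalendar()[1]
    cal_week = calendar_week(d)
    if iso_week < 5 and cal_week > 50:
        return d.year + 1  # late December already in ISO week 1
    if iso_week > 50 and cal_week < 5:
        return d.year - 1  # early January still in the last ISO week
    return d.year


for sample in (date(2008, 12, 29), date(2012, 1, 1), date(2012, 6, 15)):
    # Compare the rule with the ISO year reported by the standard library
    print(sample, iso_year(sample), sample.isocalendar()[0])
```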

     

    I think we can learn a lot from experienced Excel-users here as most tricks also work in PowerPivot/DAX!

    Handling SCD2 snowflake-schemas with PowerPivot

    In my previous post I wrote about how to use PowerPivot on top of a relational database that is modeled as a star-schema with slowly changing dimension type 2 (SCD2) historization. In this post I will extend this approach by adding other tables that are not related to our facts but have a relationship to our current dimension tables based on another attribute.

    I will take the example from my previous post and extend it by a new table called "WarehouseOwner". This table looks as follows:

    WarehouseOwner

    WarehouseOwner_SID  WarehouseOwner_BK  Manager  VALID_FROM  VALID_TO
    1                   John               Peter    2012-01-01  2012-01-03
    2                   Dave               Peter    2012-01-01
    3                   John               Dan      2012-01-03

    This table holds information about the warehouse owner and their manager. In this example the manager of John changed from Peter to Dan on the 3rd of January. Remember that we also have a WarehouseOwner-Column in our Warehouse-table.

    Logically this new table can be related to our Warehouse-table, but as PowerPivot only supports 1:n relationships this is not supported natively and we have to use DAX to simulate this relationship. Most of the steps are similar to the steps that we have already done for our other SCD2-tables. First of all we create an ‘IsValid’-measure to identify rows that are valid for the selected reference date:

    WarehouseOwner_IsValid:=COUNTROWS(
    FILTER(
         WarehouseOwner; 
         (WarehouseOwner[VALID_FROM] <= 'Date'[ReferenceDate] 
            || ISBLANK(WarehouseOwner[VALID_FROM]))
         && (WarehouseOwner[VALID_TO] > 'Date'[ReferenceDate] 
            || ISBLANK(WarehouseOwner[VALID_TO]))))>0
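
The validity check itself is simple interval logic, which can be tried out with the rows of the WarehouseOwner table in a short Python sketch. Blank dates are represented as `None` here, and `is_valid` / `valid_sids` are names chosen for this sketch only.

```python
from datetime import date

# Rows of the WarehouseOwner table: (SID, BK, Manager, VALID_FROM, VALID_TO)
# None stands for a blank date
WAREHOUSE_OWNER = [
    (1, "John", "Peter", date(2012, 1, 1), date(2012, 1, 3)),
    (2, "Dave", "Peter", date(2012, 1, 1), None),
    (3, "John", "Dan", date(2012, 1, 3), None),
]


def is_valid(valid_from, valid_to, reference_date):
    """True if the reference date falls into [VALID_FROM, VALID_TO)."""
    starts_ok = valid_from is None or valid_from <= reference_date
    ends_ok = valid_to is None or valid_to > reference_date
    return starts_ok and ends_ok


def valid_sids(reference_date):
    """SIDs of all owner rows that are valid on the reference date."""
    return [sid for sid, _, _, v_from, v_to in WAREHOUSE_OWNER
            if is_valid(v_from, v_to, reference_date)]


print(valid_sids(date(2012, 1, 2)))
print(valid_sids(date(2012, 1, 3)))
```

Because VALID_TO is compared with a strict "greater than", the old row for John stops being valid on exactly the day the new row starts, so the two versions never overlap.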

    The next step would be to mimic a relationship between our Inventory-table and our WarehouseOwner-table. But unlike in our old star-schema, where each table was directly related to the Inventory-table, this is not possible with our new snowflake-schema anymore. So we have to relate our WarehouseOwner-table to our facts using the Warehouse-table. Without any relationship the result would look as follows:

    WarehouseOwner_unrelated

    This result is similar to a crossjoin of our Managers and our Warehouse-Owners regardless of time and validity. But we need to analyze which Manager was responsible for which warehouse owners and their respective warehouses at any given point in time. To relate our tables we have to identify the combinations that are valid for the selected reference date. If you reconsider these requirements, you will realize that this is similar to the relationships that we had to define to link our other tables to the inventory-table. Again we can use the CONTAINS()-function in combination with FILTER() and an IsValid-Measure, but in this case we use them in our Warehouse-table to extend our Warehouse_IsValid-Measure:

    Warehouse_IsValid_New:=COUNTROWS(
    FILTER(
         Warehouse; 
         (Warehouse[VALID_FROM] <= 'Date'[ReferenceDate] 
            || ISBLANK(Warehouse[VALID_FROM]))
         && (Warehouse[VALID_TO] > 'Date'[ReferenceDate] 
            || ISBLANK(Warehouse[VALID_TO]))
         && CONTAINS(
              FILTER(
                   VALUES(WarehouseOwner[WarehouseOwner_BK]);
                   WarehouseOwner[WarehouseOwner_IsValid]); 
              WarehouseOwner[WarehouseOwner_BK]; 
              Warehouse[WarehouseOwner_BK])))>0

    Using our new Warehouse_IsValid-measure in a PivotTable reveals which combinations are valid for a given time:

    WarehouseOwner_IsValid

    Actually this is already all we have to do here to get the desired result, we do not have to touch any other formula in our model! By replacing our Warehouse_IsValid-Measure with our previous Amount-Measure we can see the final result:

    WarehouseOwner_related

    When Dan becomes the Manager of John on the 3rd of January, the facts that are related to warehouses belonging to John are also associated with Dan. As John also becomes the owner of WH2 on the 4th of January (now owning both/all warehouses), Dave and his manager Peter do not have any facts related to them anymore.

    As you can see this approach can be extended very easily by new tables even if they are modeled in a snowflake-schema. All you have to do is to modify the IsValid-Measure of the main-table that relates to the new table.
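
The chain from a warehouse to its owner to the owner's manager can also be followed step by step in Python. The WarehouseOwner rows are the ones shown earlier in this post, but the Warehouse rows are an ASSUMPTION reconstructed from the description above (WH1 owned by John, WH2 moving from Dave to John on the 4th of January); the real table from the previous post may look different. `manager_of` is a name chosen for this sketch.

```python
from datetime import date

# (SID, BK, Manager, VALID_FROM, VALID_TO); None stands for a blank date
WAREHOUSE_OWNER = [
    (1, "John", "Peter", date(2012, 1, 1), date(2012, 1, 3)),
    (2, "Dave", "Peter", date(2012, 1, 1), None),
    (3, "John", "Dan", date(2012, 1, 3), None),
]

# (Warehouse, WarehouseOwner_BK, VALID_FROM, VALID_TO)
# ASSUMED rows, reconstructed from the text for illustration only
WAREHOUSE = [
    ("WH1", "John", None, None),
    ("WH2", "Dave", None, date(2012, 1, 4)),
    ("WH2", "John", date(2012, 1, 4), None),
]


def is_valid(valid_from, valid_to, reference_date):
    """True if the reference date falls into [VALID_FROM, VALID_TO)."""
    return ((valid_from is None or valid_from <= reference_date)
            and (valid_to is None or valid_to > reference_date))


def manager_of(warehouse, reference_date):
    """Manager responsible for a warehouse on the reference date, or None."""
    # Owner versions that are valid on the reference date: BK -> Manager
    valid_owners = {bk: manager
                    for _, bk, manager, v_from, v_to in WAREHOUSE_OWNER
                    if is_valid(v_from, v_to, reference_date)}
    for name, owner_bk, v_from, v_to in WAREHOUSE:
        # The warehouse row must be valid AND point to a valid owner row
        if (name == warehouse
                and is_valid(v_from, v_to, reference_date)
                and owner_bk in valid_owners):
            return valid_owners[owner_bk]
    return None


for day in (2, 3, 4):
    ref = date(2012, 1, day)
    print(ref, manager_of("WH1", ref), manager_of("WH2", ref))
```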

     

    Stay tuned for the next post where I will analyze the performance of this solution with large amounts of data!

    Download Final Model