Another Post about Calculating New and Returning Customers – Part 2

In my previous post I showed a new approach on how to calculate new (and returning) customers in PowerPivot/tabular using DAX. We ended up with a solution where we added the customers first order date as a calculated column to our customer-table. This column was then linked to our date-table with an inactive relationship. The final calculation used USERELATIONSHIP() to make use of this relationship as follows:

New Customers:=CALCULATE(
    COUNTROWS(Customer),
    USERELATIONSHIP(Customer[FirstOrderDate], 'Date'[Date]))

This calculation performs really well as it does not have to touch the fact-table to get the count of new customers. And this is also the issue with the calculation as other filters are not reflected in the calculation:

Issue_Old_vs_New

Take row 2 as an example: we have 8 “Total Customers” of which 12 are “New Customers”. Obviously an error in the calculation. The PivotTable is filtered to Category=”Road Bikes” and we have 8 customers on the 2nd of February that bought a road bike. The “New Customers” calculation on the other hand is not related to the Subcategory and shows 12 as in total there were 12 new customers for all products.

 

To get our calculation working also with other filters we have to somehow relate it to our fact-table. So far we calculated the customers first order date only in the customer table. The customers first order may be related to several fact-rows, e.g. one row for each product the customer bought. Our “New Customers” calculation should only include customers that are active considering also all other filters.

To identify a customers first order in our fact-table we can again use a calculated column and also re-use our previous calculated column in our Customer-table that holds the customers first order date:

=NOT(
    ISBLANK(
        LOOKUPVALUE(
            Customer[CustomerKey],
            Customer[CustomerKey],
            [CustomerKey],
            Customer[FirstOrderDate],
            [Order Date]
)))

This returns True for all fact-rows associated with a customers first order and False for all other rows.

The final “New Customers v2” calculation is quite simple then – in addition to the active filters we add a further filter to only select rows that are associated to a customers first order:

New Customers v2:=CALCULATE(
    [Total Customers],
    'Internet Sales'[IsCustomersFirstOrder] = TRUE())

 

And this are the results:

Final_Old_vs_NewV2

As you can see there are still differences between “New Customers OLD” and “New Customers v2”. But is this really a problem with the new calculation? Lets analyze the issue taking customer “Desiree Dominguez” where we encounter the first difference as an example:

Issue_Analyzed

“Desiree Dominguez” had her first order on the 22th of June in 2006. So she is actually no “new customer” in 2008. The reason why the old calculation counts her as “new customer” is that it was the first time that she bought a product of subcategory “Road Bikes”. Whether this is correct or not is up to your business definition of a “new customer”. According to my experience it is more likely that “Desiree Dominguez” is not counted as a new customer in 2008 and so the “New Customer v2” actually returns the more accurate results.

 

Additional stuff:

An other option for this calculation is to rank the [Order Date] or [Sales Order Number] for each customer within the fact-table using the calculation below:

=RANKX(
    FILTER(
        ALL('Internet Sales'),
        [CustomerKey] = EARLIER([CustomerKey])),
    [Order Date],
    [Order Date],
    1,
    DENSE
)

[Order Date] could be replaced by [Sales Order Number]. This makes sense if a customer can have multiple orders per day and you also want to distinguish further by [Sales Order Number]. The new field would also allow new analysis. For example the increase/decrease in sales from the second order compared to the first order and so on.

The “New Customer” calculation in this case would still be similar. We just have to filter on the new calculated column instead:

New Customers v3:=CALCULATE(
    [Total Customers],
    'Internet Sales'[CustomersOrderNr] = 1)

 

Download Final Model (Office 2013!)

 

The multidimensional model:

The whole logic of extending the fact-table to identify rows that can be associated with a customers first order can also be used in a multidimensional model. Once we prepared the fact-table accordingly the calculations are quite easy. The biggest issues here does not reside in the multidimensional model itself but in the ETL/relational layer as this kind of operation can be quite complex – or better say time-consuming in terms of ETL time.

At this point I will not focus on the necessary ETL steps but start with an already prepared fact-table and highlight the extensions that have to be made in the multidimensional model. The fact-table already got extended by a new column called [IsCustomersFirstOrder] similar to the one we created in tabular using a DAX calculated column. It has a value of 1 for rows associated with a customers first order and 0 for all other rows.

The next thing we have to do is to create a new table in our DSV to base our new dimension on. For simplicity I used this named query:

MD_NamedQuery_Dim

This table is then joined to the new fact-table:

MD_DSV

The new dimension is quite simple as it only contains one attribute:

MD_Dimension

You may hide the whole dimension in the end as it may only be used to calculate our “new customers” and nowhere else and may only confuse the end-user.

 

Once we have added the dimension also to our cube we can create a new calculated measure to calculate our “new customers” as follows:

CREATE MEMBER CURRENTCUBE.[Measures].[New Customers] AS (
[Measures].[Customer Count],
[Is Customers First Order].[Is Customers First Order].&[1]
), ASSOCIATED_MEASURE_GROUP = 'Internet Customers'
, FORMAT_STRING = '#,##0';

The calculation is based on the existing [Customer Count]-measure which uses DistinctCount-aggregation. Similar to DAX with just extend the calculation by further limiting the cube-space where “Is customers First Order” = 1.

This approach also allows you to create aggregations if necessary to further improve performance. So this is probably also the best way in terms of query-performance to calculate the count of new customers in a multidimensional model.

Another Post about Calculating New and Returning Customers

I know, this topic has already been addressed by quite a lot of people. Chris Webb blogged about it here(PowerPivot/DAX) and here(SSAS/MDX), Javier Guillén here, Alberto Ferrari mentions it in his video here and also PowerPivotPro blogged about it here. Still I think that there are some more things to say about it. In this post I will review the whole problem and come up with a new approach on how to solve this issue for both, tabular and multidimensional models with the best possible performance I could think of (hope I am not exaggerating here  :) )

OK, lets face the problem of calculating new customers first and define what a new customer for a given period actually is:

A new customer in Period X is a customer that has sales in Period X but did not have any other sales ever before. If Period X spans several smaller time periods
(e.g. Period X=January contains 31 days) then there must not be any sales before the earliest smaller time period (before 1st of January) for this customer to be counted as a new customer.

According to this definition the common approach can be divided into 2 steps:
1) find all customers that have sales till the last day in the selected period
2) subtract the number of customers that have sales till the day before the first day in the
     selected period

 

First of all we need to create a measure that calculates our distinct customers.
For tabular it may be a simple calculated measure on your fact-table:

Total Customers:=DISTINCTCOUNT('Internet Sales'[CustomerKey])

For multidimensional models it should be a physical distinct count measure in your fact-table, ideally in a separate measure group.

How to solve 1) in tabular models

This is also straight forward as DAX has built-in functions that can do aggregation from the beginning of time. We use MAX(‘Date’[Date]) to get the last day in the current filter context:

Customers Till Now:=CALCULATE(
    [Total Customers],
    DATESBETWEEN(
        'Date'[Date],
        BLANK(),
        MAX('Date'[Date])))

How to solve 2) in tabular models

This is actually the same calculation as above, we only use MIN to get the first day in the current filter context and also subtractt “1” to get the day before the first day.

Previous Customers:=CALCULATE(
    [Total Customers],
    DATESBETWEEN(
        'Date'[Date],
        BLANK(),
        MIN('Date'[Date])-1))

To calculate our new customers we can simply subtract those two values:

New Customers OLD:=[Customers Till Now]-[Previous Customers]

How to solve 1) + 2) in multidimensional models

Please refer to Chris Webb’s blog here. The solution is pure MDX and is based on a combination of the range-operator “{null:[Date].[Calendar].currentmember}”, NONEMPTY() and COUNT().

 

 

Well, so far nothing new.

 

So lets describe the solution that I came up with. It is based on a different approach. To make the approach easily understandable, we have to rephrase the answer to our original question “What are new customers"?”:

A new customer in Period X is a customer that has his first sales in Period X.

According to this new definition we again have 2 steps:
1) Find the first date with sales for each customer
2) count the customers that had their first sales in the selected period

I will focus on tabular models. For multidimensional models most of the following steps have to be solved during ETL.

 

How to solve 1) in tabular models

This is pretty easy, we can simply create a calculated column in our Customer-table and get the first date on which the customer had sales:

=CALCULATE(MIN('Internet Sales'[Order Date]))

 

How to solve 2) in tabular models

The above create calculated column allows us to relate our ‘Date’-table directly to our ‘Customer’-table. As there is already an existing relationship between those tables via ‘Internet Sales’ we have to create an inactive relationship at this point:

Customer_Date_Relationship

Using this new relationship we can very easy calculate customers that had their first sales in the selected period:

New Customers:=CALCULATE(
    COUNTROWS(Customer),
    USERELATIONSHIP(Customer[FirstOrderDate], 'Date'[Date]))

 

Pretty neat, isn’t it?
We can use COUNTROWS() opposed to a distinct count measure as our ‘Customer’-table only contains unique customers – so we can count each row in the current filter context.
Another nice thing is that we do not have to use any Time-Intelligence function like DATESBETWEEN which are usually resolved using FILTER that would iterate over the whole table. Further it also works with all columns of our ‘Date’-table, no matter whether it is [Calendar Year], [Fiscal Semester] or [Day Name of Week]. (Have you ever wondered how many new customers you acquired on Tuesdays? :) )   And finally, using USERELATIONSHIP allows us to use the full power of xVelocity as native relationships are resolved there.

 

The results are of course the same as for [New Customers OLD]:

Result_Old_vs_New

 

Though, there are still some issues with this calculation if there are filters on other tables:

Issue_Old_vs_New

As you can see, our new [New Customers] measure does not work in this situation as it is only related to our ‘Date’-table but not to ‘Product’.

I will address this issue in a follow-up post where I will also show how the final solution can be used for multidimensional models – Stay tuned!

Download Final Model (Office 2013!)

 

UPDATE: Part2 can be found here