Calculating Pearson Correlation Coefficient using DAX

The original request for this calculation came from one of my blog readers who dropped me a mail asking if it possible to calculated the Pearson Correlation Coefficient (PCC or PPMCC) in his PowerPivot model. In case you wonder what the Pearson Correlation Coefficient is and how it can be calculated – as I did in the beginning –  these links What is PCC, How to calculate PCC are very helpful and also offer some examples and videos explaining everything you need to know about it. I highly recommend to read the articles before you proceed here as I will not go into the mathematical details of the calculation again in this blog which is dedicated to the DAX implementation of the PCC.

Anyway, as I know your time is precious, I will try to sum up its purpose for you: “The Pearson Correlation Coefficient calculates the correlation between two variables over a given set of items. The result is a number between -1 and 1. A value higher than 0.5 (or lower than –0.5) indicate a strong relationship whereas numbers towards 0 imply weak to no relationship.”
Pearson_Graphic
The two values we want to correlate are our axes, whereas the single dots represent our set of items. The PCC calculates the trend within this chart represented as an arrow above.

The mathematical formula that defines the Pearson Correlation Coefficient is the following:
Pearson_Formula

The PCC can be used to calculate the correlation between two measures which can be associated with the same customer. A measure can be anything here, the age of a customer, it’s sales, the number of visits, etc. but also things like sales with red products vs. sales with blue products. As you can imagine, this can be a very powerful statistical KPI for any analytical data model. To demonstrate the calculation we will try to correlate the order quantity of a customer with it’s sales amount. The order quantity will be our [MeasureX] and the sales will be our [MeasureY], and the set that we will calculate the PCC over are our customers. To make the whole calculation more I split it up into separate measures:

  1. MeasureX := SUM(‘Internet Sales’[Order Quantity])
  2. MeasureY := SUM(‘Internet Sales’[Sales Amount])

Based on these measures we can define further measures which are necessary for the calculation of our PCC. The calculations are tied to a set if items, in our case the single customers:

  1. Sum_XY := SUMX(VALUES(Customer[Customer Id]), [MeasureX] * [MeasureY])
  2. Sum_X2 := SUMX(VALUES(Customer[Customer Id]), [MeasureX] * [MeasureX])
  3. Sum_Y2 := SUMX(VALUES(Customer[Customer Id]), [MeasureY] * [MeasureY])
  4. Count_Items := DISTINCTCOUNT(Customer[Customer Id])

Now that we have calculated the various summations over our base measures, it is time to create the numerator and denominator for our final calculation:

  1. Pearson_Numerator :=
  2.     ([Count_Items] * [Sum_XY]) – ([MeasureX] * [MeasureY])
  3. Pearson_Denominator_X :=
  4.     ([Count_Items] * [Sum_X2]) – ([MeasureX] * [MeasureX])
  5. Pearson_Denominator_Y :=
  6.     ([Count_Items] * [Sum_Y2]) – ([MeasureY] * [MeasureY])
  7. Pearson_Denominator :=
  8.     SQRT([Pearson_Denominator_X] * [Pearson_Denominator_Y])

Having these helper-measures in place the final calculation for our PCC is straight forward:

  1. Pearson := DIVIDE([Pearson_Numerator], [Pearson_Denominator])

This [Pearson]-measure can then be used together with any attribute in our model – e.g. the Calendar Year in order to track the changes of the Pearson Correlation Coefficient over years:
Pearson_by_Year
For those of you who are familiar with the Adventure Works sample DB, this numbers should not be surprising. In 2005 and 2006 the Adventure Works company only sold bikes and usually a customer only buys one bike – so we have a pretty strong correlation here. However, in 2007 they also started selling Clothing and Accessories which are in general cheaper than Bikes but are sold more often.
Pearson_Sales_Categories_Years

This has impact on our Pearson-value which is very obvious in the screenshots above.

As you probably also realized, the Grand Total of our Pearson calculation cannot be directly related to the single years and may also be the complete opposite of the single values. This effect is called Simpson’s Paradox and is the expected behavior here.

[MeasuresX] and [MeasureY] can be exchanged by any other DAX measures which makes this calculation really powerful. Also, the set of items over which we want to calculated the correlation can be exchanged quite easily. Below you can download the sample Excel workbook but also a DAX query which could be used in Reporting Services or any other tool that allows execution of DAX queries.

Sample Workbook (Excel 2013): Pearson.xlsx
DAX Query: Pearson_SSRS.dax

6 thoughts on “Calculating Pearson Correlation Coefficient using DAX

  1. Dear Gerhard,

    Great job!!!!

    Thank very much for your post, it has been very useful to me.
    I have combined your correlation calculation with a calculation of the linear regression slope, I created myself, with great benefits for my analysis.

    I wonder if there is any way to select in DAX the two variable you want to calculate and the n term of the correlation from a Pivot Table/ Slicer/ Parametric table to make the correlation calculation more flexible and dynamic.

    Thank very much

    • Hi Alberto,

      I am afraid there is not much we can do here in order to make it more flexible – you always need two fixed measures and a fixed column for the calculation.
      However, you could create a a dummy-table for which holds all your measures and use SWITCH() in order to select the value of the measure which you want to return
      something similar to this:
      http://blogs.msdn.com/b/analysisservices/archive/2010/04/12/time-intelligence-functions-in-dax.aspx
      =IF( COUNTROWS( VALUES( DimPeriod[Period]))=1,
      IF( VALUES( DimPeriod[Period]) = “Current”, [Sales],
      IF( VALUES( DimPeriod[Period]) = “MTD”, [Sales](DATESMTD(DimDate[Datekey])),
      IF( VALUES( DimPeriod[Period]) = “QTD”, [Sales](DATESQTD(DimDate[Datekey])),
      IF( VALUES( DimPeriod[Period]) = “YTD”, [Sales](DATESYTD(DimDate[Datekey])),
      IF( VALUES( DimPeriod[Period]) = “LastYear”, [Sales](DATEADD(DimDate[Datekey],-1,YEAR)),
      IF( VALUES( DimPeriod[Period]) = “PriorYearMTD”, [Sales](DATEADD(DATESMTD(DimDate[Datekey]),-1,YEAR)),
      IF( VALUES( DimPeriod[Period]) = “PriorYearQTD”, [Sales](DATEADD(DATESQTD(DimDate[Datekey]),-1,YEAR)),
      IF( VALUES( DimPeriod[Period]) = “PriorYearYTD”, [Sales](DATEADD(DATESYTD(DimDate[Datekey]),-1,YEAR)),
      BLANK())))))))),[Sales])

      you can then use this measure to drive your [MeasureX]

      hope that helps,
      -gerhard

  2. Gerhard, thank you very much for this article — high quality and very helpful.

    I’ve noticed that the Pearson measure doesn’t work right if measures X and Y are not SUMs. For example, if you change SUM to AVERAGE, the correlation coefficient will be wrong.

    To address that, I would introduce two new measures:
    Sum_X := SUMX(VALUES(Customer[Customer Id]), [MeasureX])
    Sum_Y := SUMX(VALUES(Customer[Customer Id]), [MeasureX])

    I would also modify the existing measures slightly:
    Pearson_Numerator:=([Count_Items] * [Sum_XY]) – ([Sum_X] * [Sum_Y])
    Pearson_Denominator_X:=([Count_Items] * [Sum_X2]) – ([Sum_X] ^ 2)
    Pearson_Denominator_Y:=([Count_Items] * [Sum_Y2]) – ([Sum_Y] ^ 2)

    This way, the correlation coefficient will be calculated correctly even if measures X and Y are not SUMs.

    • yes, you are of course right
      i just wanted to show the general approach of how PCC can be calculated in DAX.
      With DAX 2.0 and the support of variables, it would be even simpler to put everything into a single measure and only have to specify every input parameter (items, MeasureX, MeasureY) once

      I might update the post in the future when I have some more time but for the time being I will leave it up to the reader to do this

      thanks,
      -gerhard

Leave a Reply