Call for Speakers: SQLSaturday#280 Vienna

The SQL PASS Austria chapter is organizing a SQL Saturday in Vienna. It will be held on March 6th, 2014. (For those of you who are checking their calendars right now: yes, it is actually a Thursday!) So register and be part of the first SQLSaturday that is held on a Thursday!


The Call for Speakers is also open until January 5th, so make sure to submit your sessions by then.

Hope to see you there!

Using Power Query to analyze SSAS Disk Usage

Some time ago Bob Duffy blogged about how to use Power Pivot to analyze the disk usage of multidimensional Analysis Services models (here). He uses a VBA macro to pull metadata like filename, path, extension, etc. from the file system or, to be more specific, from the data directory of Analysis Services. Analysis Services stores all its data in different files with specific extensions, so it is possible to link those files to multidimensional objects such as attributes, facts, aggregations, etc. Based on this data we can analyze how our data is distributed. Are our dimensions too big? Which attribute uses the most space? Do our facts consume most of the space (very likely)? If yes, how much of it is real data and how big are our aggregations, if they are processed at all? These are very common and also important things to know for an Analysis Services developer.

So Bob Duffy’s solution can be really useful. The only thing I did not like about it was the fact that it uses a VBA macro to get the data. This made me think, and I came up with the idea of using Power Query to get this data instead. Btw, make sure to check out the latest release of Power Query, there have been a lot of improvements recently!

With Power Query you have the option to load multiple files from a folder and also from its subfolders. When we do this on our Analysis Services data directory, we get a list of ALL files together with their full path, filename, extension and, most important in this case, their size, which can be found by expanding the Attributes-record:
PQ_Source_FileList
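
As a minimal sketch of this first step (not the final query from the workbook), the file list with sizes could be built like this. The folder path is an assumption: it is the default data directory of a local default SSAS 2012 instance and has to be adjusted to your installation.

```powerquery
let
    // Assumption: default data directory of a local default SSAS 2012 instance
    DataDir = "C:\Program Files\Microsoft SQL Server\MSAS11.MSSQLSERVER\OLAP\Data",
    // Folder.Files returns the files of the folder and of all its subfolders
    Source = Folder.Files(DataDir),
    // The file size is a field of the Attributes record, so expand only that field
    WithSize = Table.ExpandRecordColumn(Source, "Attributes", {"Size"}, {"Size"}),
    // Keep only the columns needed for the disk usage analysis
    Result = Table.SelectColumns(WithSize, {"Folder Path", "Name", "Extension", "Size"})
in
    Result
```

Dropping the Content column early keeps the query light, since only metadata is needed and the files themselves never have to be read.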

The final Power Query also does a lot of other things to prepare the data so it can later be joined to our FileExtensions-table, which holds detailed information for each file extension. This table currently looks like below but can be extended by any other columns that may be necessary and/or useful for you:

| FileType | FileType_Description | ObjectType | ObjectTypeSort | ObjectTypeDetails |
|---|---|---|---|---|
| ahstore | Attribute Hash Store | Dimensions | 20 | Attribute |
| asstore | Attribute String Store | Dimensions | 20 | Attribute |
| astore | Attribute Store | Dimensions | 20 | Attribute |
| bsstore | BLOB String Store | Dimensions | 20 | BLOB |
| bstore | BLOB Store | Dimensions | 20 | BLOB |
| dstore | Hierarchy Decoding Store | Dimensions | 20 | Hierarchy |
| khstore | Key Hash Store | Dimensions | 20 | Key |
| ksstore | Key String Store | Dimensions | 20 | Key |
| kstore | Key Store | Dimensions | 20 | Key |
| lstore | Structure Store | Dimensions | 20 | Others |
| ostore | Order Store | Dimensions | 20 | Others |
| sstore | Set Store | Dimensions | 20 | Others |
| ustore | ustore | Dimensions | 20 | Others |
| xml | XML | Configuration | 999 | Configuration |
| fact.data | Basedata | Facts | 10 | Basedata |
| fact.data.hdr | Basedata Header | Facts | 10 | Basedata |
| fact.map | Basedata Index | Facts | 10 | Basedata |
| fact.map.hdr | Basedata Index Header | Facts | 10 | Basedata |
| rigid.data | Rigid Aggregation Data | Facts | 10 | Aggregations |
| rigid.data.hdr | Rigid Aggregation Data Header | Facts | 10 | Aggregations |
| rigid.map | Rigid Aggregation Index | Facts | 10 | Aggregations |
| rigid.map.hdr | Rigid Aggregation Index Header | Facts | 10 | Aggregations |
| flex.data | Flexible Aggregation Data | Facts | 10 | Aggregations |
| flex.data.hdr | Flexible Aggregation Data Header | Facts | 10 | Aggregations |
| flex.map | Flexible Aggregation Index | Facts | 10 | Aggregations |
| flex.map.hdr | Flexible Aggregation Index Header | Facts | 10 | Aggregations |
| string.data | String Data (Distinct Count?) | Facts | 10 | Basedata |
| cnt.bin | Binary | Configuration | 999 | Binaries |
| mrg.ccmap | mrg.ccmap | DataMining | 999 | DataMining |
| mrg.ccstat | mrg.ccstat | DataMining | 999 | DataMining |
| nb.ccmap | nb.ccmap | DataMining | 999 | DataMining |
| nb.ccstat | nb.ccstat | DataMining | 999 | DataMining |
| dt | dt | DataMining | 999 | DataMining |
| dtavl | dtavl | DataMining | 999 | DataMining |
| dtstr | dtstr | DataMining | 999 | DataMining |
| dmdimhstore | dmdimhstore | DataMining | 999 | DataMining |
| dmdimstore | dmdimstore | DataMining | 999 | DataMining |
| bin | Binary | Configuration | 999 | Binaries |
| OTHERS | Others | Others | 99999 | Others |

As you can see, the extension may contain 1, 2 or 3 parts. The more parts, the more specific the file extension is. If you check the result of the Power Query, you will see that it also contains 3 columns: FileExtension1, FileExtension2 and FileExtension3. To join the two tables we first need to load both of them into Power Pivot. The next step is to create a calculated column on which we can base our relationship. If the 3-part extension is found in the file extensions table (called FileTypes in the formula below), we use it; otherwise we check the 2-part extension and after that the 1-part extension, and in case nothing matches we use “OTHERS”:

=SWITCH(TRUE(),
CONTAINS(FileTypes, FileTypes[FileType], [FileExtension3]), [FileExtension3],
CONTAINS(FileTypes, FileTypes[FileType], [FileExtension2]), [FileExtension2],
CONTAINS(FileTypes, FileTypes[FileType], [FileExtension1]), [FileExtension1],
"OTHERS")

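The FileExtension1 to FileExtension3 columns used in the formula above come from the Power Query. One way they could be derived is sketched below; this is an illustration under assumptions, not necessarily how the query in the workbook does it. The sample file names and the helper LastParts are hypothetical.

```powerquery
let
    // Hypothetical sample file names; in the real query they come from Folder.Files
    Source = Table.FromRecords({
        [Name = "1.fact.data.hdr"],
        [Name = "2.rigid.map"],
        [Name = "Customer.ksstore"],
        [Name = "info.7.xml"]
    }),
    // Hypothetical helper: the last n dot-separated parts of a file name,
    // or null if the name has no more than n parts
    LastParts = (name as text, n as number) as nullable text =>
        let
            Parts = Text.Split(name, ".")
        in
            if List.Count(Parts) > n then Text.Combine(List.LastN(Parts, n), ".") else null,
    Ext1 = Table.AddColumn(Source, "FileExtension1", each LastParts([Name], 1), type nullable text),
    Ext2 = Table.AddColumn(Ext1, "FileExtension2", each LastParts([Name], 2), type nullable text),
    Ext3 = Table.AddColumn(Ext2, "FileExtension3", each LastParts([Name], 3), type nullable text)
in
    Ext3
```

For "1.fact.data.hdr" this yields "hdr", "data.hdr" and "fact.data.hdr", so the most specific extension that exists in the lookup table can be picked afterwards.
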
Then we can create a relationship between our PQ table and our file extension table. I also created some other calculated columns, hierarchies and measures for usability. And this is the final outcome:
PivotTable_Report

You can very easily see how big your facts are, how the space is distributed between base data and aggregations, and how big your dimensions are, and you can drill down to each individual file! You can of course also create a Power View report if you want to. All visualizations are up to you; this is just a very simple example of a report.

Enjoy playing around with it!

Downloads:
(Please note that I added a filter on the database name as the last step of the Power Query so that only Adventure Works databases are shown! To get all databases, you need to remove this filter!)

SSAS Disk Analysis Workbook: SSAS_DiskAnalysis.xlsx

SAP HANA’s Big Data Scenario with Power BI

While browsing the web for BI-related topics I recently came across this blog post about SAP HANA and how it can be used to analyze Big Data. It's actually pretty cool: SAP, together with Amazon Web Services (AWS), offers a free 4-hour trial of their tools, which you can use to rebuild a predefined demo. The demo itself is very well explained and documented with videos and scripts and gives some good insights into how to deal with Big Data in SAP HANA. Basically it is divided into 3 steps:
1) Load Wikipedia data (Pagehits, etc.) from Hadoop/Hive (~2GB of flatfiles)
2) Create a data mart with SAP HANA
3) Analyze results with SAP Lumira

Having done a lot with Power BI and its tools recently, I asked myself whether this would also be possible with Power BI. So I first did the demo on SAP HANA, and afterwards I was quite sure that I could do the same with Power BI.

That was the starting point for this blog post, in which we will do the same demo using only tools of the Power BI suite instead of SAP HANA. Basically we only use 3 of our Power tools:
1) Power Query to load the data
2) Power Pivot to build the “data mart”
3) Power View to analyze the data

First of all we need to load the data. The demo uses data from Wikipedia, where the number of page hits and the bytes downloaded are recorded per page for each year, month, day and hour: http://dumps.wikimedia.org/other/pagecounts-raw/

This data was then moved to a Hadoop/Hive cluster on an AWS S3 store, which was made publicly available. The transformed files can be downloaded here (~250MB per file):
http://wikipedia-pagecounts-hive-results.s3.amazonaws.com/year=2013/month=03/000000
http://wikipedia-pagecounts-hive-results.s3.amazonaws.com/year=2013/month=03/000001
http://wikipedia-pagecounts-hive-results.s3.amazonaws.com/year=2013/month=03/000002
http://wikipedia-pagecounts-hive-results.s3.amazonaws.com/year=2013/month=04/000000
http://wikipedia-pagecounts-hive-results.s3.amazonaws.com/year=2013/month=04/000001
http://wikipedia-pagecounts-hive-results.s3.amazonaws.com/year=2013/month=04/000002
http://wikipedia-pagecounts-hive-results.s3.amazonaws.com/year=2013/month=05/000000
http://wikipedia-pagecounts-hive-results.s3.amazonaws.com/year=2013/month=05/000001
http://wikipedia-pagecounts-hive-results.s3.amazonaws.com/year=2013/month=05/000002

FileList

Once we have downloaded the files we can start loading them using Power Query’s “Load from Folder” source. The files contain plain text data which needs to be split into several columns first. The delimiter used is the control character 0x0001, which in Power Query’s M-language has to be written as the escape sequence “#(0001)”:

= Table.SplitColumn(ImportedText ,"Column1",
    Splitter.SplitTextByDelimiter("#(0001)"),
    {"Column1.1", "Column1.2", "Column1.3", "Column1.4",
     "Column1.5", "Column1.6", "Column1.7", "Column1.8"})

The columns are defined as PROJECTCODE, PAGENAME, YEAR, MONTH, DAY, HOUR, PAGEHITCOUNTFORHOUR and BYTESDOWNLOADEDFORHOUR, where PROJECTCODE can be further split into the language code and the real project code. There is some more logic that I will not explain in detail, such as error handling and data conversions, which is similar to what is done in the SAP HANA demo.
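
To illustrate the error handling mentioned above, here is a small self-contained sketch of the try ... otherwise pattern on made-up values. The column name Raw and the sample values are hypothetical and only stand in for the real hit-count column.

```powerquery
let
    // Hypothetical raw values, including one that cannot be converted to a number
    Source = Table.FromRecords({
        [Raw = "42"],
        [Raw = "n/a"],
        [Raw = "1300"]
    }),
    // Number.From raises an error for "n/a"; try ... otherwise replaces that error with 0
    Converted = Table.AddColumn(Source, "PageHitsPerHour",
        each try Number.From([Raw]) otherwise 0, type number)
in
    Converted
```

The new column contains 42, 0 and 1300, so a malformed value no longer breaks the load but simply contributes nothing to the totals.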

This is the final M-script I came up with:

let
    Source = Folder.Files("E:\SAP\Big Data"),
    FilteredRows = Table.SelectRows(Source, each ([Extension] = "")),
    CombinedBinaries = Binary.Combine(FilteredRows[Content]),
    ImportedText = Table.FromColumns({Lines.FromBinary(CombinedBinaries)}),
    SplitColumnDelimiter = Table.SplitColumn(ImportedText ,"Column1",Splitter.SplitTextByDelimiter("#(0001)"),{"Column1.1", "Column1.2", "Column1.3", "Column1.4", "Column1.5", "Column1.6", "Column1.7", "Column1.8"}),
    ChangedType = Table.TransformColumnTypes(SplitColumnDelimiter,{{"Column1.1", type text}, {"Column1.2", type text}, {"Column1.3", type number}, {"Column1.4", type number}, {"Column1.5", type number}, {"Column1.6", type number}, {"Column1.7", type number}, {"Column1.8", type number}}),
    SplitColumnDelimiter1 = Table.SplitColumn(ChangedType,"Column1.1",Splitter.SplitTextByDelimiter("."),{"Column1.1.1", "Column1.1.2"}),
    ChangedType1 = Table.TransformColumnTypes(SplitColumnDelimiter1,{{"Column1.1.1", type text}, {"Column1.1.2", type text}}),
    ReplacedValue = Table.ReplaceValue(ChangedType1,null,"wp",Replacer.ReplaceValue,{"Column1.1.2"}),
    InsertedCustom = Table.AddColumn(ReplacedValue, "PageHitsPerHour", each try Number.From([Column1.7]) otherwise 0),
    InsertedCustom1 = Table.AddColumn(InsertedCustom, "BytesDownloadedPerHour", each try Number.From([Column1.8]) otherwise 0),
    RemovedColumns = Table.RemoveColumns(InsertedCustom1,{"Column1.7", "Column1.8"}),
    RenamedColumns = Table.RenameColumns(RemovedColumns,{{"Column1.3", "Year"}, {"Column1.4", "Month"}, {"Column1.5", "Day"}, {"Column1.6", "Hour"}, {"Column1.2", "PageName"}, {"Column1.1.1", "LanguageCode"}, {"Column1.1.2", "ProjectCode"}}),
    InsertedCustom2 = Table.AddColumn(RenamedColumns, "Date", each #date([Year],[Month],[Day])),
    ChangedType2 = Table.TransformColumnTypes(InsertedCustom2,{{"PageHitsPerHour", type number}, {"BytesDownloadedPerHour", type number}, {"Date", type date}}),
    FilteredRows1 = Table.SelectRows(ChangedType2, each [PageHitsPerHour] < 53000000)
in
    FilteredRows1

If you also did the demo on SAP HANA, you will have found some quality issues in the data, which is why rows with more than 53,000,000 PageHitsPerHour need to be removed: these 5 rows mess up the whole analysis. The above M-script already handles this (last statement).

There are some more important things to mention here. First of all, we are dealing with about 37M rows, so loading them into an Excel worksheet, which is limited to 1,048,576 rows, will not work. Instead we need to load the data directly into Power Pivot. I had several memory issues when using 32-bit Excel, but after switching to 64-bit Excel everything went just fine. The import itself takes about 15 minutes, which I think is OK for roughly 2GB of data and 37M rows.

Once the data is loaded into Power Pivot we need to load some additional data, which are basically our lookup/dimension tables. These are also part of the Wikipedia data set and can be imported directly from the web:
http://wikipedia-proj-lang-codes.s3.amazonaws.com/UniqueProjectCodes.csv
http://wikipedia-proj-lang-codes.s3.amazonaws.com/UniqueLanguageCodes.csv

Both are very simple tables with only two columns. Again we can use Power Query to import them and add them to our Power Pivot data model.
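
A rough sketch of such an import is shown below. Only the URL is taken from this post; the comma delimiter, the missing header row and the column names LanguageCode and Language are assumptions that need to be checked against the actual file.

```powerquery
let
    // URL as listed above
    Url = "http://wikipedia-proj-lang-codes.s3.amazonaws.com/UniqueLanguageCodes.csv",
    // Assumption: the file is comma-delimited
    Raw = Csv.Document(Web.Contents(Url), [Delimiter = ","]),
    // Assumption: there is no header row, so the two default columns are renamed
    // (LanguageCode and Language are hypothetical names)
    Named = Table.RenameColumns(Raw, {{"Column1", "LanguageCode"}, {"Column2", "Language"}})
in
    Named
```

If the file does contain a header row, the rename step would be replaced by Table.PromoteHeaders.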

The last thing to add is a time-dimension. SAP HANA has its predefined time-dimension. In the case of Power BI I simply used a linked table in Excel that holds all necessary days, or actually all days of 2013. Adding all those tables and linking them, we end up with this pretty simple Power Pivot model:

PowerPivot_Diagram

This is very similar to the analytical view you create during the demo in SAP HANA, so the two approaches do not differ much here.
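
As an alternative to the linked Excel table used here for the time-dimension, the days of 2013 could also be generated in Power Query. This is only a sketch of that alternative, and the choice of columns is an assumption.

```powerquery
let
    // All 365 days of 2013 (not a leap year), one day apart
    Dates = List.Dates(#date(2013, 1, 1), 365, #duration(1, 0, 0, 0)),
    AsTable = Table.FromList(Dates, Splitter.SplitByNothing(), {"Date"}),
    Typed = Table.TransformColumnTypes(AsTable, {{"Date", type date}}),
    // Add the parts that the fact table also carries as separate columns
    WithYear = Table.AddColumn(Typed, "Year", each Date.Year([Date]), Int64.Type),
    WithMonth = Table.AddColumn(WithYear, "Month", each Date.Month([Date]), Int64.Type),
    WithDay = Table.AddColumn(WithMonth, "Day", each Date.Day([Date]), Int64.Type)
in
    WithDay
```

The Date column can then be related to the Date column that the main query adds to the fact table.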

Now that the model is set up we can do our analysis using Power View and classic Excel Pivot Tables:

PowerView_Analysis

PivotTable_Report

Altogether it took me about 2 hours to build the whole solution, of which about 1 hour was spent downloading and processing the data. The final workbook containing all the data is ~500MB in size, mainly because of the PageName column, which contains a high number of unique values that cannot be compressed very well. After removing the PageName column and some further tuning of the data model for compression I could bring the size down to 148 MB. This is quite OK: it corresponds to a compression ratio of about 12:1 compared to the original 1.74 GB.

I hope you understand that I could not attach the whole workbook. Instead I created a smaller workbook with only 50k rows by adding a TOP filter in the Power Query. This workbook can be downloaded below.
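
Such a TOP filter can be sketched with Table.FirstN. In the example below a generated table stands in for the full query result (FilteredRows1 in the M-script above), so the table and its RowId column are hypothetical.

```powerquery
let
    // Hypothetical stand-in for the full query result (FilteredRows1 in the script above)
    FullResult = Table.FromColumns({List.Numbers(1, 100000)}, {"RowId"}),
    // Keep only the first 50,000 rows
    TopRows = Table.FirstN(FullResult, 50000)
in
    TopRows
```

In the real query this would simply be one more step at the end, applied to FilteredRows1.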

As I showed in this post, Power BI is capable of handling this kind of data very well, both in terms of data volume and in terms of the type of data (unstructured/semi-structured data). Once the data is loaded into Power Pivot it can be analyzed just like any other data source using all Excel reporting capabilities.

Downloads:
Sample Excel Sheet: Wikidata_small.xlsx
Demo Script SAP HANA: Demo Script SAP HANA.docx
Power Query M-Script: Power Query M-Script.docx