Visualizing Spark Execution Plans

I recently found myself in a situation where I had to optimize a Spark query. Coming from a SQL world originally I knew how valuable a visual representation of an execution plan can be when it comes to performance tuning. Soon I realized that there is no easy-to-use tool or snippet which would allow me to do that. Though, there are tools like DataFlint, the ubiquitous Spark monitoring UI or the Spark explain() function but they are either hard to use or hard to get up running especially as I was looking for something that works in both of my two favorite Spark engines being Databricks and Microsoft Fabric.

During my research I found these two excellent blog posts (1, 2) by Semyon Sinchenko who was already dealing with Spark execution plans and how to extract valuable information from them. I took a lot of inspiration and ideas from there to build my show_plan function.

In the end I wanted to achieve three goals:

  1. and easy to use function that can be used with any Spark Dataframe (including SparkSQL)
  2. a lightweight setup that works with all Spark engines
  3. an interactive, visual representation of the execution plan (still working on the interactive part)

Installation as of now is via sc.addPyFile from my GitHub repository Fabric.Toolbox. For now thats fine I think and if the function gets more popular I will probably create a PIP package for it.

sc.addPyFile("https://raw.githubusercontent.com/gbrueckl/Fabric.Toolbox/main/DataEngineering/Library/VisualizeExecutionPlan.py")
from VisualizeExecutionPlan import show_plan

Next would be the definition of your Spark dataframe. As mentioned above, you can use any Spark dataframe regardless of how you created it (PySpark, SQL, …). For simplicity and transparency I used a SQL query in my example:

my_df = spark.sql("""
SELECT fs.*, dc.CurrencyName, ds.StoreName
FROM contoso.factsales_part fs
INNER JOIN contoso.dimcurrency dc
    ON fs.CurrencyKey = dc.CurrencyKey
LEFT JOIN contoso.dimstore ds
    ON fs.StoreKey = ds.StoreKey
WHERE fs.DateKey >= to_timestamp('2008-06-13', 'yyyy-MM-dd')
""")

display(my_df)

You can now simply pass the variable that represents your dataframe into the show_plan function:

show_plan(my_df)

As you can see, the function is very easy to install and use, its basically just 3 lines of code to give you a visual representation of your execution plan!

For Databricks, the code is slightly different to missing preinstalled libraries and limited capabilities of display() function. First we need to install graphviz using %sh and %pip. This is also partially documented in the official Databricks documentation.

%sh
sudo apt-get install -y python3-dev graphviz libgraphviz-dev pkg-config

Instead of pygraphviz as described in the docs, we install the regular graphviz package:

%pip install graphviz

Adding the library and creating the test dataframe is the same as in Fabric.

sc.addPyFile("https://raw.githubusercontent.com/gbrueckl/Fabric.Toolbox/main/DataEngineering/Library/VisualizeExecutionPlan.py")
from VisualizeExecutionPlan import *
my_df = spark.sql("""
SELECT fs.*, dc.CurrencyName, ds.StoreName
FROM contoso.factsales_part fs
INNER JOIN contoso.dimcurrency dc
    ON fs.CurrencyKey = dc.CurrencyKey
LEFT JOIN contoso.dimstore ds
    ON fs.StoreKey = ds.StoreKey
WHERE fs.DateKey >= to_timestamp('2008-06-13', 'yyyy-MM-dd')
""")

display(my_df)

Finally we need to pass the displayHTML function as a second parameter to the show_plan function:

show_plan(my_df, displayHTML)

Information for the final output is take from the physical execution plan and is enriched with data from the optimized logical execution plan which for example contains the estimated sizes. Things like the type of join (e.g. BroadcastHasJoin) is taken from the physical plan.

It is worth mentioning that the sizes are based on the table statistics and become unreliable after joins are involved. However, I think they still play in import role in performance tuning so it made sense to me to also include them in the visual representation of the plan.

There is still a lot of room for improvements like installation via PIP, interactive visualization, highlighting of important things like partition filters, etc. and I could not yet test all potential scenarios (I mainly used Delta Lake tables for my tests). So I would really appreciate any feedback to make the show_plan function more robust and user friendly. Feedback is best provided via the underlying GitHub repository Fabric.Toolbox.

Announcing the MS Fabric Users Slack Channel

Are you a Microsoft Fabric user looking to supercharge your collaboration and networking? I am excited to announce the new Slack channel MS Fabric Users which I just created, tailor-made for people who want to engage with a dynamic and supportive community around Microsoft Fabric. With this channel, I want to create a space that’s built for speed, ease, and connectivity.

Why Did I Choose Slack For Communication:

I am already using different Slack channels for other technologies I use on my daily basis (Delta Lake, Databricks Users, VSCode Development, …) and when it comes to interactive collaboration, I think Slack is much more efficient in getting answers to your immediate questions than traditional forums etc. Here are some other reasons why I think that’s the case:

1. Real-Time Communication: 

Slack’s instant messaging platform means you can get answers to your questions and feedback on your projects without waiting for forum replies. Connect with peers in seconds, not minutes or hours.

2. Better Organization with Channels:

Create and participate in topic-specific channels that keep discussions focused and relevant. Whether you’re interested in theming, components, or best practices, channels make it easy to find and partake in conversations that matter to you.

3. Enhanced Collaboration:

With Slack, collaboration is not just about talking – it’s about doing. Share code snippets, files, and resources effortlessly with the drag-and-drop interface. Pair that with integrations like GitHub and Trello, and you’ve got a powerful toolkit right at your fingertips.

4. Accessibility On-the-Go:

Stay connected with the community wherever you are. The Slack mobile app provides a seamless experience, ensuring you never miss out on important discussions no matter where you work from.

5. Advanced Search Capabilities:

Slack’s powerful search function makes it easy to find relevant conversations, shared files, and announcements. No more sifting through pages of forum posts to find the information you need.

Join the Fabric Slack Community:

I believe community is key to learning and growing as a developer. By using Slack, you can enhance the way you connect, engage, and support each other as Microsoft Fabric users.

To get started, join the Slack channel today! Simply visit MS Fabric Users to sign up and dive into the conversations happening right now!

Using VARCHAR() in Microsoft Fabric Lakehouses and SQL Endpoints

Defining data types and knowing the schema of your data has always been a crucial factor for performant data platforms, especially when it comes to string datatypes which can potentially consume a lot of space and memory. For Lakehouses in general (not only Fabric Lakehouses), there is usually only one data type for text data which is a generic STRING of an arbitrary length. In terms of Apache Spark, this is StringType(). While this applies to Spark dataframes, this is not entirely true for Spark tables – here is what the docs say:

  • String type
    • StringType: Represents character string values.
    • VarcharType(length): A variant of StringType which has a length limitation. Data writing will fail if the input string exceeds the length limitation. Note: this type can only be used in table schema, not functions/operators.
    • CharType(length): A variant of VarcharType(length) which is fixed length. Reading column of type CharType(n) always returns string values of length n. Char type column comparison will pad the short one to the longer length.

As stated, there are multiple ways to define a text column and while you cannot use VarcharType/CharType in your Spark dataframe, you can still use it to define the output tables of your lakehouse. Inspired by this blog post by Kyle Hale, I was running some similar tests on Microsoft Fabric.

In general I can say that the results are basically the same as the ones that Kyle got. This was kind of expected as in both cases Spark and Delta Lake was used to run the tests. However, for me it was also interesting to see what impact this data type change had on other components of the Fabric ecosystem, particularly the SQL Endpoint associated with my Lakehouse. Here is the very simple Spark code I used for testing. I am writing a dataframe with an IntegerType() and a StringType() to a new table which will create this table in the lakehouse for you:

schema = T.StructType([
    T.StructField("charLength", T.IntegerType()), 
    T.StructField("value_max_len_10", T.StringType())
])

df = spark.createDataFrame([[(10),('abcdefghij')]], schema)

df.write.mode("append").saveAsTable("datatypes_default")

If you have a look at the table created in your Lakehouse via the SQL Endpoint using e.g. SQL Server Management Studio or Azure Data Studio, you will realize that text column is associated with a datatype VARCHAR(8000). For most columns, this is utterly oversized and can have a huge impact on performance as the size of the columns is used by optimizer to built an efficient execution plan.

Lets see what happens if we write the very same dataframe into an already existing table that was created using VARCHAR(10) instead:

%%sql
CREATE TABLE datatypes_typed (
    charLength INT,
    value_max_len_10 VARCHAR(10)
)
# writing the same dataframe as before but now to the pre-defined table
df.write.mode("append").saveAsTable("datatypes_typed")

Checking the SQL Endpoint again you will see that the data type of the column in the new table is now VARCHAR(40) – quite a big improvement over VARCHAR(8000) !

The reason why it is VARCHAR(40) and not VARCHAR(10) is described in this excellent post from Greg Low and I have to admit, reading it completely changed the way I look at string datatypes in SQL Server!

I have to admit that I do not yet know why it is VARCHAR(40) and not VARCHAR(10) but for the time being I am already happy with those results. I also tried other lengths but it seems to always show 4 times the defined length in the lakehouse table. Once I find out more about this, I will update the blog post!