I recently found myself in a situation where I had to optimize a Spark query. Coming originally from a SQL world, I knew how valuable a visual representation of an execution plan can be for performance tuning. I soon realized that there is no easy-to-use tool or snippet that would allow me to do that. There are tools like DataFlint, the ubiquitous Spark monitoring UI, or the Spark explain() function, but they are either hard to use or hard to get up and running, especially as I was looking for something that works in both of my two favorite Spark engines, Databricks and Microsoft Fabric.
During my research I found these two excellent blog posts (1, 2) by Semyon Sinchenko, who was already dealing with Spark execution plans and how to extract valuable information from them. I took a lot of inspiration and ideas from there to build my show_plan function.
In the end I wanted to achieve three goals:
- an easy-to-use function that can be used with any Spark DataFrame (including Spark SQL)
- a lightweight setup that works with all Spark engines
- an interactive, visual representation of the execution plan (still working on the interactive part)
Installation as of now is via sc.addPyFile from my GitHub repository Fabric.Toolbox. For now that's fine I think, and if the function gets more popular I will probably create a PIP package for it.
sc.addPyFile("https://raw.githubusercontent.com/gbrueckl/Fabric.Toolbox/main/DataEngineering/Library/VisualizeExecutionPlan.py") from VisualizeExecutionPlan import show_plan
Next comes the definition of your Spark DataFrame. As mentioned above, you can use any Spark DataFrame regardless of how you created it (PySpark, SQL, …). For simplicity and transparency I used a SQL query in my example:
my_df = spark.sql("""
    SELECT fs.*, dc.CurrencyName, ds.StoreName
    FROM contoso.factsales_part fs
    INNER JOIN contoso.dimcurrency dc ON fs.CurrencyKey = dc.CurrencyKey
    LEFT JOIN contoso.dimstore ds ON fs.StoreKey = ds.StoreKey
    WHERE fs.DateKey >= to_timestamp('2008-06-13', 'yyyy-MM-dd')
""")
display(my_df)
You can now simply pass the variable that represents your DataFrame into the show_plan function:
show_plan(my_df)
As you can see, the function is very easy to install and use; it's basically just 3 lines of code to give you a visual representation of your execution plan!
For Databricks, the code is slightly different due to missing preinstalled libraries and the limited capabilities of the display() function. First we need to install graphviz using %sh and %pip. This is also partially documented in the official Databricks documentation.
%sh sudo apt-get install -y python3-dev graphviz libgraphviz-dev pkg-config
Instead of pygraphviz as described in the docs, we install the regular graphviz package:
%pip install graphviz
Adding the library and creating the test dataframe is the same as in Fabric.
sc.addPyFile("https://raw.githubusercontent.com/gbrueckl/Fabric.Toolbox/main/DataEngineering/Library/VisualizeExecutionPlan.py") from VisualizeExecutionPlan import *
my_df = spark.sql("""
    SELECT fs.*, dc.CurrencyName, ds.StoreName
    FROM contoso.factsales_part fs
    INNER JOIN contoso.dimcurrency dc ON fs.CurrencyKey = dc.CurrencyKey
    LEFT JOIN contoso.dimstore ds ON fs.StoreKey = ds.StoreKey
    WHERE fs.DateKey >= to_timestamp('2008-06-13', 'yyyy-MM-dd')
""")
display(my_df)
Finally we need to pass the displayHTML function as a second parameter to the show_plan function:
show_plan(my_df, displayHTML)
Information for the final output is taken from the physical execution plan and is enriched with data from the optimized logical execution plan, which for example contains the estimated sizes. Things like the type of join (e.g. BroadcastHashJoin) are taken from the physical plan.
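If you want to peek at those two plans yourself, a minimal sketch could look like this (assuming the my_df DataFrame from above; _jdf is an internal PySpark accessor and may change between Spark versions):
qe = my_df._jdf.queryExecution()
print(qe.optimizedPlan().toString())   # optimized logical plan, contains the size estimates
print(qe.executedPlan().toString())    # physical plan, contains e.g. the join types
my_df.explain(mode="formatted")        # the physical plan as formatted text via explain()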
It is worth mentioning that the sizes are based on the table statistics and become unreliable once joins are involved. However, I think they still play an important role in performance tuning, so it made sense to me to also include them in the visual representation of the plan.
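If the estimated sizes look implausible, it can help to refresh the table statistics first. A hedged example using the demo tables from above (the exact behavior depends on your catalog and table format):
spark.sql("ANALYZE TABLE contoso.factsales_part COMPUTE STATISTICS")               # table-level row count and size
spark.sql("ANALYZE TABLE contoso.dimcurrency COMPUTE STATISTICS FOR ALL COLUMNS")  # adds per-column statistics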
There is still a lot of room for improvement, like installation via PIP, interactive visualization, highlighting of important things like partition filters, etc., and I could not yet test all potential scenarios (I mainly used Delta Lake tables for my tests). So I would really appreciate any feedback to make the show_plan function more robust and user friendly. Feedback is best provided via the underlying GitHub repository Fabric.Toolbox.
Please add support for Azure Synapse
it should also work for Azure Synapse in the very same way
basically for every Spark environment
if it is not, please report back in the github repository
Hi Gerhard,
This looks good, can this be used in Databricks? I am getting an error while using it.
ValueError: 'EB, ColumnStat: N/A' is not in list
File , line 1
----> 1 show_plan(src_ais_message_location_df)
File /local_disk0/spark-af0febb1-f7f6-46f8-ab1b-598aa8907cec/userFiles-b8e5059f-33b6-4683-8526-d9ee96e55fb0/VisualizeExecutionPlan.py:120, in PlanNode.get_size_in_bytes(self)
    117 units = units.replace("i", "")
    119 size_names = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
--> 120 i = size_names.index(units)
    121 p = math.pow(1024, i)
    123 size = size * p
thanks for the feedback – that particular issue is fixed now
However, there is still an issue displaying the tree properly in Databricks due to some missing dependencies which are not installed by default
I will investigate and update the blog post/code once I find a solution
ok, so the code is fixed now to also work on Databricks
However, you first need to install `graphviz` as described here:
https://kb.databricks.com/libraries/install-pygraphviz.html
instead of `pip install pygraphviz` you need to use `%pip install graphviz`
also the final function call is slightly different:
`show_plan(my_df, displayHTML)`
where you need to pass the displayHTML function as a second argument
there is now also a `get_plan_viz(my_df)` which returns the graphviz object in case you want to further modify it
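for example, a rough sketch (assuming the returned object is a regular graphviz.Digraph):
plan_viz = get_plan_viz(my_df)
plan_viz.graph_attr["rankdir"] = "LR"                          # lay the plan out left-to-right
plan_viz.render("execution_plan", format="svg", cleanup=True)  # writes execution_plan.svg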
I also updated the blog post to include the instructions to make it work on Databricks
How to use it on an on-premise Spark cluster? How can I visualize the plan in the PySpark shell? Could you please share some demo? This is a really cool feature you created.
well, you just need to make sure that the required packages are installed in your environment – similar to what I described for Databricks
in a PySpark shell you cannot visualize it this way, obviously; you could however simply run df.explain()
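for example (plain text output only, no graphics involved):
df.explain()                  # physical plan as plain text
df.explain(mode="formatted")  # same plan, formatted with per-node details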
-gerhard
Hi Gerhard, I’m currently trying to understand why some spark SQL queries aren’t working as fast as I’d like, so this is a great article.
One question I have – once you see the explain plan, is there some way to turn that into action items? Assuming the queries can’t really be changed, I’m hoping for something that could make “environment” recommendations like:
– change environment to have “fewer worker threads with more memory” or “more worker threads with less memory”
– partition table X using columns x.1, x.2 and table Y using y.13 and y.14
– …etc. Sorry, I'm not Spark proficient, so that's why I'm looking for this!
Do you know of any good way to take the explain plan and use it for these types of decisions?
Thanks!
Scott
That's a tough question and I am afraid there is no easy answer – performance tuning of Spark is a difficult topic.
However, I try to answer your specific questions:
– the execution plan does not know on which cluster it will be running hence this information is not available nor can an optimal configuration be derived from the plan
– you can partition by columns used in JOINs or GROUP BYs, but this might have a negative impact on other queries (see the sketch below)
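as a rough, purely illustrative sketch (table and column names are just the demo tables from the post; whether partitioning by DateKey actually helps depends entirely on your queries):
(my_df.write
    .format("delta")
    .partitionBy("DateKey")                          # column used in joins/filters
    .mode("overwrite")
    .saveAsTable("contoso.factsales_part_by_date"))  # hypothetical target table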
the key to good performance is to know your data, its metadata and its structure
-gerhard