Visualizing Spark Execution Plans

I recently found myself in a situation where I had to optimize a Spark query. Coming originally from the SQL world, I knew how valuable a visual representation of an execution plan can be for performance tuning. I soon realized that there is no easy-to-use tool or snippet that would give me one. There are tools like DataFlint, the ubiquitous Spark monitoring UI, or the Spark explain() function, but they are either hard to use or hard to get up and running, especially as I was looking for something that works in both of my favorite Spark engines: Databricks and Microsoft Fabric.

During my research I found these two excellent blog posts (1, 2) by Semyon Sinchenko, who had already been working with Spark execution plans and how to extract valuable information from them. I took a lot of inspiration and ideas from there to build my show_plan function.

In the end I wanted to achieve three goals:

  1. an easy-to-use function that can be used with any Spark DataFrame (including Spark SQL)
  2. a lightweight setup that works with all Spark engines
  3. an interactive, visual representation of the execution plan (still working on the interactive part)

Installation, as of now, is via sc.addPyFile from my GitHub repository Fabric.Toolbox. I think that's fine for now, and if the function gets more popular I will probably create a pip package for it.

sc.addPyFile("https://raw.githubusercontent.com/gbrueckl/Fabric.Toolbox/main/DataEngineering/Library/VisualizeExecutionPlan.py")
from VisualizeExecutionPlan import show_plan

Next comes the definition of your Spark DataFrame. As mentioned above, you can use any Spark DataFrame regardless of how you created it (PySpark, SQL, …). For simplicity and transparency I used a SQL query in my example:

my_df = spark.sql("""
SELECT fs.*, dc.CurrencyName, ds.StoreName
FROM contoso.factsales_part fs
INNER JOIN contoso.dimcurrency dc
    ON fs.CurrencyKey = dc.CurrencyKey
LEFT JOIN contoso.dimstore ds
    ON fs.StoreKey = ds.StoreKey
WHERE fs.DateKey >= to_timestamp('2008-06-13', 'yyyy-MM-dd')
""")

display(my_df)

You can now simply pass the variable that represents your DataFrame into the show_plan function:

show_plan(my_df)

As you can see, the function is very easy to install and use: it's basically just three lines of code to get a visual representation of your execution plan!

For Databricks, the code is slightly different due to missing preinstalled libraries and the limited capabilities of the display() function. First we need to install graphviz using %sh and %pip. This is also partially documented in the official Databricks documentation.

%sh
sudo apt-get install -y python3-dev graphviz libgraphviz-dev pkg-config

Instead of pygraphviz as described in the docs, we install the regular graphviz package:

%pip install graphviz

Adding the library and creating the test dataframe is the same as in Fabric.

sc.addPyFile("https://raw.githubusercontent.com/gbrueckl/Fabric.Toolbox/main/DataEngineering/Library/VisualizeExecutionPlan.py")
from VisualizeExecutionPlan import *
my_df = spark.sql("""
SELECT fs.*, dc.CurrencyName, ds.StoreName
FROM contoso.factsales_part fs
INNER JOIN contoso.dimcurrency dc
    ON fs.CurrencyKey = dc.CurrencyKey
LEFT JOIN contoso.dimstore ds
    ON fs.StoreKey = ds.StoreKey
WHERE fs.DateKey >= to_timestamp('2008-06-13', 'yyyy-MM-dd')
""")

display(my_df)

Finally we need to pass the displayHTML function as a second parameter to the show_plan function:

show_plan(my_df, displayHTML)

Information for the final output is taken from the physical execution plan and is enriched with data from the optimized logical execution plan, which for example contains the estimated sizes. Things like the type of join (e.g. BroadcastHashJoin) are taken from the physical plan.
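
If you want to inspect the raw plans that feed this output, both are accessible from any DataFrame. Below is a minimal sketch using standard PySpark; note that _jdf goes through Spark's internal Java objects, so treat it as an implementation detail rather than a stable API:

# formatted explain output (Spark 3.0+): physical plan plus per-node details
my_df.explain(mode="formatted")

# the same information via the underlying QueryExecution object
qe = my_df._jdf.queryExecution()
print(qe.optimizedPlan().toString())   # optimized logical plan, contains the size estimates
print(qe.executedPlan().toString())    # physical plan, contains e.g. BroadcastHashJoin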

It is worth mentioning that the sizes are based on the table statistics and become unreliable once joins are involved. However, I think they still play an important role in performance tuning, so it made sense to me to also include them in the visual representation of the plan.
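
The quality of these estimates depends on the table statistics available to the optimizer. As a hedged sketch (assuming Delta Lake tables or another format that supports ANALYZE TABLE), you could refresh them with standard Spark SQL before inspecting the plan:

# collect table-level statistics so the optimized logical plan can expose size estimates
spark.sql("ANALYZE TABLE contoso.factsales_part COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE contoso.dimcurrency COMPUTE STATISTICS FOR ALL COLUMNS")

# check what the optimizer will actually see
spark.sql("DESCRIBE EXTENDED contoso.dimcurrency").show(truncate=False)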

There is still a lot of room for improvement, like installation via pip, interactive visualization, highlighting of important things like partition filters, etc., and I could not yet test all potential scenarios (I mainly used Delta Lake tables for my tests). So I would really appreciate any feedback to make the show_plan function more robust and user-friendly. Feedback is best provided via the underlying GitHub repository Fabric.Toolbox.

11 Replies to “Visualizing Spark Execution Plans”

  1. Pingback: Visualizing a Spark Execution Plan – Curated SQL

    • it should also work for Azure Synapse in the very same way
      basically for every Spark environment
      if it does not, please report back in the GitHub repository

  2. Hi Gerhard,
    This looks good, can this be used in Databricks? I am getting an error while using it.

    ValueError: 'EB, ColumnStat: N/A' is not in list
    File , line 1
    ----> 1 show_plan(src_ais_message_location_df)
    File /local_disk0/spark-af0febb1-f7f6-46f8-ab1b-598aa8907cec/userFiles-b8e5059f-33b6-4683-8526-d9ee96e55fb0/VisualizeExecutionPlan.py:120, in PlanNode.get_size_in_bytes(self)
    117 units = units.replace("i", "")
    119 size_names = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
    --> 120 i = size_names.index(units)
    121 p = math.pow(1024, i)
    123 size = size * p

    • thanks for the feedback – that particular issue is fixed now
      However, there is still some issue displaying the tree properly in Databricks due to missing dependencies which are not installed by default
      I will investigate and update the blog post/code once I find a solution

    • ok, so the code is fixed now to also work on Databricks
      However, you first need to install `graphviz` as described here:
      https://kb.databricks.com/libraries/install-pygraphviz.html
      instead of `pip install pygraphviz` you need to use `%pip install graphviz`

      also the final function call is slightly different:
      `show_plan(my_df, displayHTML)`
      where you need to pass the displayHTML function as a second argument

      there is now also a `get_plan_viz(my_df)` function which returns the graphviz object in case you want to further modify it (see the sketch below)

      • I also updated the blog post to include the instructions to make it work on Databricks
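
      for example, assuming the returned object is a standard graphviz Digraph/Source, it could be rendered straight to a file:

      # render the plan graph to an SVG file (writes execution_plan.svg)
      viz = get_plan_viz(my_df)
      viz.render("execution_plan", format="svg")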

  3. How to use in on-premise spark cluster? How can visualise the plan in pyspark shell? Could you please share some demo? This is really cool feature you created.

    • well, you just need to make sure that the required packages are installed in your environment – similar to what I described for Databricks
      in a PySpark shell you cannot visualize it this way, obviously, but you could simply run df.explain()

      -gerhard

  4. Hi Gerhard, I’m currently trying to understand why some spark SQL queries aren’t working as fast as I’d like, so this is a great article.

    One question I have – once you see the explain plan, is there some way to turn that into action items? Assuming the queries can’t really be changed, I’m hoping for something that could make “environment” recommendations like:

    – change environment to have “fewer worker threads with more memory” or “more worker threads with less memory”
    – partition table X using columns x.1, x.2 and table Y using y.13 and y.14
    – …etc. Sorry I’m not spark proficient so that’s why I’m looking for this!

    Do you know of any good way to take the explain plan and use it for these types of decisions?

    Thanks!
    Scott

    • That's a tough question and I am afraid there is no easy answer – performance tuning of Spark is a difficult topic.
      However, I will try to answer your specific questions:
      – the execution plan does not know on which cluster it will be running hence this information is not available nor can an optimal configuration be derived from the plan
      – you can partition by columns used in JOINs or GROUP BYs, but this might have a negative impact on other queries (see the sketch below)

      the key to good performance is to know your data, metadata and structure of your data

      -gerhard
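
      as a hedged sketch of that second point (the table and column names here are hypothetical examples, not from the post), partitioning a Delta table on a join/filter column looks like this:

      # write a fact table partitioned by a column frequently used in joins/filters;
      # a partitioning choice that helps one query can easily hurt others, so treat
      # it as a trade-off, not a recipe
      (spark.table("contoso.factsales")
            .write.format("delta")
            .partitionBy("DateKey")
            .mode("overwrite")
            .saveAsTable("contoso.factsales_part"))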
