How to optimize PySpark pivot operations

Context

Before starting to work with Azure Synapse Analytics, I primarily used pandas for data wrangling, so I am still learning a lot about PySpark. Also, I believe more people are going to start using PySpark, especially with the new kid on the block (Microsoft Fabric).

One of the things I encountered when using PySpark is that simple pivoting operations were taking a very long time, even on a small DataFrame (fewer than 30,000 rows). That’s when I discovered a small but powerful trick to drastically improve the performance of pivoting operations in PySpark.

Disclaimer: I am still learning PySpark and I believe this post can be beneficial for others on their learning journey. If you are a PySpark expert, please leave a comment sharing your experience handling this problem :)

Understanding the problem

Let’s first understand the problem and what we want to achieve. Below is the initial data and what the result should look like:
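To make this concrete, here is a rough sketch of what such a DataFrame could look like. The grouping columns match the ones used later in the post; the value column “Quantity” and the sample rows are placeholders I made up so the example runs end to end:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Illustrative sample rows; the real DataFrame has close to 30,000 rows.
    # "Quantity" is an assumed name for the numeric column that gets summed.
    df = spark.createDataFrame(
        [
            (1, 10, "2024-01-02", "2024-01-05", "Delivered", "NL", 5),
            (1, 10, "2024-01-02", "2024-01-05", "Delivered", "BE", 3),
            (2, 11, "2024-01-03", "2024-01-06", "Pending", "NL", 7),
        ],
        ["ProductId", "StoreId", "OrderData", "DeliveryDate", "Status", "Country", "Quantity"],
    )

The goal is to turn each country into its own column, holding the summed values per group.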

I know you might be thinking this is a very simple data transformation, and I feel you! Those were my thoughts as well, until I ran into performance problems in PySpark.

Let’s get into the performance problem and how to solve it.

Pivoting in PySpark

See below the PySpark code to pivot the DataFrame “df”. It reshapes the DataFrame by grouping it on the columns “ProductId”, “StoreId”, “OrderData”, “DeliveryDate” and “Status”, then pivots on the “Country” column, and finally aggregates the values in each group using the sum function.
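A sketch of that pivot, assuming the aggregated column is called “Quantity”:

    # Group by the key columns, pivot on "Country" and sum the values.
    # Without an explicit list of values, Spark first has to scan the data
    # to find every distinct country before it can build the pivoted columns.
    df_pivot = (
        df.groupBy("ProductId", "StoreId", "OrderData", "DeliveryDate", "Status")
          .pivot("Country")
          .sum("Quantity")
    )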

Optimizing pivot operations in PySpark

But why is the code above taking so long?
Well, before it can pivot the DataFrame, Spark first needs to discover all unique values in the pivot column, which means an extra pass over the data behind the scenes. This extra work becomes more expensive when the column to pivot contains many distinct values, leading to slower performance.

So, what if we provide a list with the unique values from the column we want to pivot?
In that case, you need to know all the unique values in the column beforehand. See the code below, which passes the unique values to the pivot:
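A sketch of the same pivot with the values supplied up front (the country codes in the list are placeholders; in practice the list must contain every value that actually appears in the column):

    # Pass the distinct values of "Country" explicitly to pivot(),
    # so Spark can skip the extra job that computes them.
    columns_to_pivot = ["NL", "BE", "DE"]

    df_pivot = (
        df.groupBy("ProductId", "StoreId", "OrderData", "DeliveryDate", "Status")
          .pivot("Country", columns_to_pivot)
          .sum("Quantity")
    )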

In this case, we tell Spark exactly which columns to pivot by passing the “columns_to_pivot” list, resulting in a faster and more efficient pivot operation (less than 0.2 seconds in my tests).

Conclusion

In this post we looked at how to optimize pivot operations in PySpark. By telling Spark explicitly which values to pivot on, we achieved much better performance.

Last updated on January 18, 2024
