DATA TRAVELING - Guilherme Joaquim's Blog


How to add a custom Python package in Azure Synapse Analytics

What I particularly appreciate about Azure Synapse is its capability to execute notebooks, allowing you to write code in your preferred programming language, such as Python, Scala, SQL, and R.

In this post we are going to talk about:

  • Why you should create a custom Python package, and its advantages and disadvantages

  • How to create a custom Python package (.whl file)

  • How to add the custom Python package in Azure Synapse manually using the User Interface

  • How to add the custom Python package in Azure Synapse through Azure PowerShell cmdlets and REST APIs

Special thanks to Sabri Aidi, who came up with the idea of creating a custom Python package during our project.

Why should I create a custom Python package?

Before we begin showing how to do it, let’s first talk about the benefits of having your own custom Python package in Azure Synapse.

  • If you have multiple notebooks that use the same Python function(s) over and over again, it might be a good idea to create your custom package and just import your package with all your functions. No more copy and paste and you avoid potential errors when copying the functions

  • You can import all your Python libraries (such as pandas, NumPy, Polars, etc.) inside your custom package, so a notebook that imports everything from the package does not have to import each library individually

  • You can centralize the function(s) you are using in your notebooks. Here I recommend keeping your Python package in your repository (Azure DevOps, GitHub…)
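To illustrate the second point, here is a small sketch of how a wildcard import re-exports the libraries a module has imported. The module name shared_funcs and the function circle_area are hypothetical, and standard-library modules stand in for pandas or NumPy so that the sketch runs anywhere.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical shared module: it imports libraries once and defines a helper.
MODULE_SOURCE = textwrap.dedent(
    '''
    import json
    import math


    def circle_area(radius):
        """Return the area of a circle with the given radius."""
        return math.pi * radius ** 2
    '''
)


def star_import_names(module_name, source):
    """Write `source` to a temporary module and return the names that
    `from <module_name> import *` would bring into a notebook."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, f"{module_name}.py").write_text(source, encoding="utf-8")
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            namespace = {}
            exec(f"from {module_name} import *", namespace)
        finally:
            sys.path.remove(tmp)
            sys.modules.pop(module_name, None)
    # Drop the entry that exec() adds to the namespace automatically.
    namespace.pop("__builtins__", None)
    return namespace


names = star_import_names("shared_funcs", MODULE_SOURCE)
print(sorted(names))  # ['circle_area', 'json', 'math']
```

The imported modules come along because the module defines no `__all__`. If you prefer to expose only your own functions, defining `__all__` in the package restricts what the wildcard import brings in.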

One possible disadvantage that I noticed when using a custom Python package is that it takes a few minutes to upload the package and apply it to a Spark pool (more on that later in this post). The good news is that this process can be automated through Azure PowerShell cmdlets and REST APIs.

How to create the custom Python package

There are multiple ways of doing this, but the focus here is on creating a .whl file, also known as a wheel. A Python wheel is a file format used to distribute Python packages. It is a ZIP archive that contains the package's code and its metadata, including the list of dependencies the package declares (the dependencies themselves are not bundled).
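Because a wheel is a ZIP archive, you can inspect one with the standard library. The following is a minimal sketch; the function name describe_wheel is my own, and the file name in the comment is only an example of what a built wheel might be called.

```python
import zipfile
from email.parser import Parser


def describe_wheel(wheel_path):
    """Return (file_names, metadata) for a wheel.

    The core metadata lives in <name>-<version>.dist-info/METADATA
    and uses an email-header style format.
    """
    with zipfile.ZipFile(wheel_path) as archive:
        file_names = archive.namelist()
        metadata_name = next(
            name for name in file_names if name.endswith(".dist-info/METADATA")
        )
        raw = archive.read(metadata_name).decode("utf-8")
    headers = Parser().parsestr(raw)
    metadata = {
        "name": headers["Name"],
        "version": headers["Version"],
        # Requires-Dist lists the declared dependencies; they are not bundled.
        "requires": headers.get_all("Requires-Dist") or [],
    }
    return file_names, metadata


# Example call (hypothetical file name):
# files, meta = describe_wheel("dist/pypackages-0.1.0-py3-none-any.whl")
```

This is a quick way to confirm that the module you expect is actually inside the wheel before uploading it.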

There are already very good articles describing how to do this in greater detail. For this reason, I will assume that you have all the required Python libraries installed (setuptools and wheel); if not, please refer to here or here.

You can find below the steps to create the .whl file.

  1. Create a folder in your local computer (for example: pypackages)

  2. Open the created folder in Visual Studio Code

  3. Create two Python files, setup.py and pypackages.py (you can choose the name of your package here)

    1. The setup.py and pypackages.py files below are just an example. Feel free to extend the setup.py file if needed; pypackages.py is where you import all your libraries and modules and create your functions

  4. Open your terminal in that folder and run the following command: python setup.py bdist_wheel

  5. Go to the “dist” folder, where you will find the .whl file

See the screenshots below:
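As a text version of steps 1 to 3, the sketch below writes a minimal setup.py and pypackages.py into a folder. The contents of both files are my own illustrative assumptions (the version number, the description, and the to_snake_case helper are hypothetical), not the exact files used in the project.

```python
import textwrap
from pathlib import Path

# Hypothetical example content for setup.py; name and version are placeholders.
SETUP_PY = textwrap.dedent(
    '''
    from setuptools import setup

    setup(
        name="pypackages",
        version="0.1.0",
        description="Shared helper functions for Synapse notebooks",
        py_modules=["pypackages"],  # single-module package: pypackages.py
        install_requires=[],        # declare runtime dependencies here
    )
    '''
).lstrip()

# Hypothetical example content for pypackages.py.
MODULE_PY = textwrap.dedent(
    '''
    """Shared helper functions used across notebooks."""
    import re


    def to_snake_case(name):
        """Convert a column name such as 'Order ID' to 'order_id'."""
        return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()
    '''
).lstrip()


def scaffold(folder):
    """Create the folder with setup.py and pypackages.py; return their paths."""
    root = Path(folder)
    root.mkdir(parents=True, exist_ok=True)
    paths = [root / "setup.py", root / "pypackages.py"]
    paths[0].write_text(SETUP_PY, encoding="utf-8")
    paths[1].write_text(MODULE_PY, encoding="utf-8")
    return paths
```

After calling scaffold("pypackages"), running python setup.py bdist_wheel from inside that folder (step 4) should produce the wheel in the “dist” folder, provided setuptools and wheel are installed.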


Before moving on to Azure Synapse, I would add the files you created to your repository for version control.

How to add the custom Python package in Azure Synapse manually using the User Interface

Open Azure Synapse Studio, go to the Manage hub, select Workspace packages, and upload the .whl file manually, as in the screenshot below:

After uploading the .whl file to your workspace, go to your Apache Spark pools and open the Packages settings of the pool to assign the workspace package to it. See the screenshots below:

Applying the package to the pool can take a few minutes to complete. After that, you can go to your notebook and run “from pypackages import *”, replacing “pypackages” with the name of your package.
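Before relying on the import, a notebook cell can check whether the package is actually available in the session. This is a minimal sketch using the standard library; the helper name installed_version is my own, and “pypackages” is the example distribution name from this post.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


# "pypackages" is the example name from this post; use your own package name.
version = installed_version("pypackages")
if version is None:
    print("pypackages is not installed in this session yet")
else:
    print(f"pypackages {version} is available")
```

If the check reports that the package is missing right after you assigned it, the pool update may still be running, or the notebook session may have been started before the update finished; starting a new session is usually worth trying first.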

How to add the custom Python package in Azure Synapse through Azure PowerShell cmdlets and REST APIs

I know what you might be thinking: “I don’t want to go to the Azure Synapse user interface every time I need to upload or update the .whl file.” Well, this process can be automated, so you can manage libraries and packages without opening Synapse Studio.

You can find the steps below:

  1. Make sure you have Azure PowerShell installed. If not, please refer to the Microsoft documentation

  2. Open PowerShell (I use Windows PowerShell) as administrator

  3. Connect to your Azure account with the following command: Connect-AzAccount -TenantId YourTenantId

  4. After that, you can upload your .whl file using the following command: New-AzSynapseWorkspacePackage -WorkspaceName ContosoWorkspace -Package ".\ContosoPackage.whl"

    1. For more details, refer to the Microsoft documentation. It is already explained very well there, so there is no reason to repeat it here
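If you want to trigger the upload from a Python script (for example, as part of a build step), the cmdlet from step 4 can be wrapped as follows. This is a sketch under assumptions: the function names are my own, it assumes PowerShell 7 is available as pwsh (pass shell="powershell" for Windows PowerShell), and it assumes that you have already signed in with Connect-AzAccount in a way that the new PowerShell process can reuse.

```python
import shutil
import subprocess


def build_upload_command(workspace_name, wheel_path, shell="pwsh"):
    """Build the argument list that runs New-AzSynapseWorkspacePackage."""
    # Single quotes make PowerShell treat the values as literal strings;
    # values that themselves contain a single quote are not handled here.
    ps_command = (
        "New-AzSynapseWorkspacePackage "
        f"-WorkspaceName '{workspace_name}' -Package '{wheel_path}'"
    )
    return [shell, "-NoProfile", "-Command", ps_command]


def upload_wheel(workspace_name, wheel_path, shell="pwsh"):
    """Run the upload and return the completed process.

    Assumes an Azure sign-in that this PowerShell process can reuse.
    """
    if shutil.which(shell) is None:
        raise RuntimeError(f"{shell} was not found on PATH")
    command = build_upload_command(workspace_name, wheel_path, shell)
    return subprocess.run(command, check=True, capture_output=True, text=True)


# Show the command that would be executed for the example values above.
print(build_upload_command("ContosoWorkspace", r".\ContosoPackage.whl"))
```

Keeping the command construction separate from the execution makes it easy to review or log exactly what will be run before calling upload_wheel.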

See below the screenshot of an example where I added the .whl file to my workspace packages:

You can follow the same approach to assign the uploaded package to your Apache Spark pools. This time you would use a different cmdlet (Update-AzSynapseSparkPool), which is described in the Microsoft documentation. There you can also find many other examples of how to automate tasks.
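As a rough sketch of that second step, the helper below generates the PowerShell lines that fetch an uploaded workspace package and add it to a Spark pool. The helper name and the Contoso values are placeholders of mine, and the parameters (-PackageAction, -Package) reflect my reading of the Az.Synapse cmdlets, so please verify them against the Microsoft documentation before using them.

```python
def build_assign_script(workspace_name, pool_name, package_name):
    """Return PowerShell that attaches an uploaded workspace package to a pool."""
    return "\n".join(
        [
            # Look up the package that was uploaded to the workspace.
            "$package = Get-AzSynapseWorkspacePackage "
            f"-WorkspaceName '{workspace_name}' -Name '{package_name}'",
            # Add that package to the Spark pool's package list.
            "Update-AzSynapseSparkPool "
            f"-WorkspaceName '{workspace_name}' -Name '{pool_name}' "
            "-PackageAction Add -Package $package",
        ]
    )


print(
    build_assign_script("ContosoWorkspace", "ContosoSparkPool", "ContosoPackage.whl")
)
```

You can paste the printed lines into a PowerShell session where you are already connected to Azure, or save them to a .ps1 file that runs as part of your deployment.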

Conclusion

In this article, we discussed the advantages of having a custom Python package in Azure Synapse. We explored the process of creating the .whl file and demonstrated how to add it to Azure Synapse, both through manual steps and using Azure PowerShell cmdlets and REST APIs.

Using a custom Python package is highly beneficial, especially when you find yourself using the same Python function(s) across multiple notebooks, a practice I use in my daily work. However, be aware that it can take a few minutes to upload the .whl file and apply it to your Apache Spark pool.


Thanks for reading! Let me know your thoughts in the comments below.

Last updated on September 13, 2023