azure, pyspark, databricks, azure-databricks

Any benefits of using PySpark code over SQL in Azure Databricks?


I am working on something where I already have SQL code in place. Now we are migrating to Azure, so I set up Azure Databricks for this piece of the transformation and used the same SQL code with some minor changes.

I want to know: is there any recommended way or best practice for working with Azure Databricks? Should we re-write the code in PySpark for better performance?

Note: the end results from the previous SQL code have no bugs. It's just that we are migrating to Azure, so instead of spending time re-writing the code, I reused the same SQL code. Now I am looking for suggestions to understand the best practices and how a rewrite would make a difference.

Looking for your help. Thanks!

Expecting: along with the migration from on-prem to Azure, I am looking for some best practices for better performance.


Solution

  • After getting help on the posted question and doing some research, I came up with the response below:

    • It does not matter much which language you choose (SQL or Python). Both run on a Spark cluster, so Spark distributes the work across the cluster either way; which one to use depends on the specific use case.
    • Intermediate results are kept in memory for both SQL and PySpark DataFrames.
    • In the same notebook we can use both languages, depending on the situation (see the sketch below).
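
    For illustration, here is a minimal sketch of the same aggregation written both ways in one notebook. It assumes the Databricks-provided `spark` session and a hypothetical `sales` table with `region` and `amount` columns; both versions go through the same Catalyst optimizer, so the physical plans come out equivalent.

    ```python
    from pyspark.sql import functions as F

    # Spark SQL version of the aggregation.
    sql_df = spark.sql("""
        SELECT region, SUM(amount) AS total_amount
        FROM sales
        GROUP BY region
    """)

    # Equivalent PySpark DataFrame version.
    py_df = (
        spark.table("sales")
             .groupBy("region")
             .agg(F.sum("amount").alias("total_amount"))
    )

    # Both are planned by the same Catalyst optimizer; compare the plans.
    sql_df.explain()
    py_df.explain()
    ```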

    Use Python for heavy transformations (more complex data processing) or for analytical / machine learning work.

    Use SQL when dealing with a relational data source (focused on querying and manipulating structured data stored in a relational database).
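
    As an illustration of the kind of "heavy" transformation that is easier to express in Python, the sketch below cleans every string column of the hypothetical `sales` table in a loop, something that would need one hand-written expression per column in SQL.

    ```python
    from pyspark.sql import functions as F

    df = spark.table("sales")  # hypothetical table, as above

    # Trim and lower-case every string column programmatically instead of
    # writing a separate SQL expression for each one.
    string_cols = [f.name for f in df.schema.fields
                   if f.dataType.simpleString() == "string"]
    for c in string_cols:
        df = df.withColumn(c, F.lower(F.trim(F.col(c))))
    ```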

    Note: there are optimization techniques in both languages that we can use to improve performance further.
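
    For example, two common techniques work the same way in either language: caching an intermediate result that is reused, and hinting a broadcast join for a small dimension table. The minimal sketch below assumes the hypothetical `sales` and `regions` tables share a `region` column.

    ```python
    from pyspark.sql import functions as F

    # Cache an intermediate result that several downstream queries reuse.
    sales_df = spark.table("sales").filter(F.col("amount") > 0).cache()

    # Broadcast-join a small dimension table; the SQL equivalent is a
    # /*+ BROADCAST(regions) */ hint in the SELECT clause.
    joined = sales_df.join(F.broadcast(spark.table("regions")), on="region")
    joined.explain()
    ```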

    Summary: choose the language based on the use case. Both get distributed processing, because both run on the Spark cluster.

    Thank you!