Tags: apache-spark-sql, parquet, azure-databricks

Azure Databricks - Write to parquet file using spark.sql with union and subqueries


Issue:

I'm trying to write to a parquet file using spark.sql, but I run into errors whenever the query contains a union or a subquery. There must be some syntax I can't figure out.

Ex.

%python
    
df = spark.sql("SELECT
      sha2(Code, 256) as COUNTRY_SK,
      Code as COUNTRY_CODE,
      Name as COUNTRY_NAME,
      current_date() as EXTRACT_DATE
    FROM raw.EXTR_COUNTRY

    UNION ALL

    SELECT
      -1 as COUNTRY_SK,
      'Unknown' as COUNTRY_CODE,
      'Unknown' as COUNTRY_NAME,
      current_date() as EXTRACT_DATE")
          
df.write.parquet("dbfs:/mnt/devstorage/landing/companyx/country", 
   mode="overwrite")

When running a simple query I have no issues at all, such as:

%python
    
df = spark.sql("select * from raw.EXTR_COUNTRY")
df.write.parquet("dbfs:/mnt/devstorage/landing/companyx/country/", 
   mode="overwrite")

Solution

  • The quotes were the issue, not the SQL script itself: a regular double-quoted Python string cannot span multiple lines. Using triple quotes (""" or ''') solved it.

     %python
            
        df = spark.sql("""SELECT
                  sha2(Code, 256) as COUNTRY_SK,
                  Code as COUNTRY_CODE,
                  Name as COUNTRY_NAME,
                  current_date() as EXTRACT_DATE
                FROM raw.EXTR_COUNTRY

            UNION ALL

            SELECT
                  -1 as COUNTRY_SK,
                  'Unknown' as COUNTRY_CODE,
                  'Unknown' as COUNTRY_NAME,
                  current_date() as EXTRACT_DATE""")
                  
        df.write.parquet("dbfs:/mnt/devstorage/landing/companyx/country", 
           mode="overwrite")
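
For reference, the failure is a plain Python string-literal rule rather than anything Spark-specific: a string in ordinary quotes must end on the same line, so a multi-line SQL text needs triple quotes. A minimal, Spark-free sketch (none of this touches Spark; the SQL text is just an illustrative string):

```python
# Triple quotes keep the embedded newlines, so multi-line SQL is one string.
sql_text = """SELECT
      Code as COUNTRY_CODE
    FROM raw.EXTR_COUNTRY"""

# The same text in plain quotes is a syntax error, which compile() shows
# without having to run a notebook cell.
try:
    compile('q = "SELECT\n  Code"', "<snippet>", "exec")
    failed = False
except SyntaxError:  # "unterminated string literal" at the line break
    failed = True

print(failed)              # plain quotes cannot cross a line break
print("\n" in sql_text)    # the triple-quoted text really spans lines
```

This is why the simple one-line query worked with ordinary quotes while the multi-line UNION ALL query did not.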