Tags: pyspark, databricks, azure-databricks

Databricks PySpark readStream to read data from Azure Blob Storage written in a partitioned structure


I have a process that fetches data from an RDBMS source and writes it to Azure Blob Storage. The data is written in a partitioned structure like:

Storage account
   |_ container
      |_ data-load
         |_ updateddate_p=yyyymmdd
            |_ timestamp_p=16382882
               |_ data-file.orc
            |_ timestamp_p=16382885
               |_ data-file-1.orc

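For context, a layout like this is what a partitioned Spark batch write produces. A minimal sketch, assuming the load process also uses Spark (source_df and the output path are placeholders, not part of the original pipeline):

    # hypothetical batch writer that produces the layout above;
    # source_df and the output path are placeholders
    (
        source_df
        .write
        .mode("append")
        .partitionBy("updateddate_p", "timestamp_p")  # creates updateddate_p=.../timestamp_p=... folders
        .orc("wasbs://container@storageaccount.blob.core.windows.net/data-load/")
    )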
Now in Databricks, I mount the Azure Blob Storage account (classic) to the cluster and use readStream to read the ORC files as a stream.
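A minimal sketch of that mount step, assuming a classic Blob Storage account accessed with an account key (the account, container, scope, and key names are placeholders):

    # hypothetical mount of the classic Blob Storage container; all names are placeholders
    dbutils.fs.mount(
        source="wasbs://container@storageaccount.blob.core.windows.net/data-load",
        mount_point="/mnt/data-load",
        extra_configs={
            "fs.azure.account.key.storageaccount.blob.core.windows.net":
            dbutils.secrets.get(scope="my-scope", key="storage-account-key")
        }
    )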

Is there a way to read the ORC data together with the partition information, so that the streaming DataFrame carries the partition columns when reading?

That way, when using writeStream to write the data out, I could pick up updateddate_p and timestamp_p (with their values) from the readStream DataFrame.


Solution

    • Added the partition columns as part of the schema (a complete schema-file sketch appears after this list):
    {
        "metadata": {},
        "name": "updateddate_p",
        "nullable": true,
        "type": "integer"
    },
    {
        "metadata": {},
        "name": "timestamp_p",
        "nullable": true,
        "type": "long"
    },
    
    
    • Used readStream as follows, and the partition columns were loaded as expected:
    import json

    from pyspark.sql import types as T

    # load the schema definition, which now includes the partition columns
    with open('custom-schema.json', 'r') as f:
        schema_definition = T.StructType.fromJson(json.load(f))

    data_readstream = (
        spark.readStream
        .format("orc")
        .schema(schema_definition)                 # includes updateddate_p and timestamp_p
        .option("pathGlobFilter", "*.orc")         # only pick up the ORC data files
        .load(f'{mounted_input_data_path_loc}/')   # mount root above the partition dirs
    )

    # the write stream then used partitionBy(column1, column2, ...); see the sketch below
    
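For reference, a complete custom-schema.json in the shape StructType.fromJson expects could look like the following; the id and payload columns are made-up examples, and only the two partition columns come from the question:

    {
        "type": "struct",
        "fields": [
            {"metadata": {}, "name": "id", "nullable": true, "type": "long"},
            {"metadata": {}, "name": "payload", "nullable": true, "type": "string"},
            {"metadata": {}, "name": "updateddate_p", "nullable": true, "type": "integer"},
            {"metadata": {}, "name": "timestamp_p", "nullable": true, "type": "long"}
        ]
    }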

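And a sketch of the writeStream side that keeps the same partitioning; the output and checkpoint paths are placeholders:

    # hypothetical write stream; the output and checkpoint paths are placeholders
    query = (
        data_readstream
        .writeStream
        .format("orc")
        .option("checkpointLocation", "/mnt/checkpoints/data-load")
        .partitionBy("updateddate_p", "timestamp_p")   # same partition columns as the input
        .outputMode("append")
        .start("/mnt/output/data-load")
    )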
    PySpark Reference Documentation