I have a process which fetches data from an RDBMS source and writes it to Azure Blob Storage. The data is written in a partitioned structure like:
Storage account
|_ container
   |_ data-load
      |_ updateddate_p=yyyymmdd
         |_ timestamp_p=16382882
            |_ data-file.orc
         |_ timestamp_p=16382885
            |_ data-file-1.orc
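For reference, the partition values are encoded in the path as Hive-style `key=value` segments. A minimal pure-Python sketch of pulling them out of such a path (the `parse_partition_values` helper is my own illustration, not part of the pipeline):

```python
import re

def parse_partition_values(path):
    """Extract Hive-style key=value partition segments from a file path."""
    return dict(re.findall(r"([^/=]+)=([^/]+)", path))

path = "data-load/updateddate_p=20211201/timestamp_p=16382882/data-file.orc"
print(parse_partition_values(path))
# {'updateddate_p': '20211201', 'timestamp_p': '16382882'}
```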
Now, in Databricks, I mount the Azure storage (classic) to the cluster and use readStream to read the ORC file data as a stream.

Is there a way to read the ORC data together with the partition information, so that the streaming DataFrame carries the partition columns when reading? Then, when using writeStream to write the data, I would be able to fetch updateddate_p and timestamp_p, with their values, from the readStream DataFrame. The partition fields in my custom schema JSON look like this:
{
"metadata": {},
"name": "updateddate_p",
"nullable": true,
"type": "integer"
},
{
"metadata": {},
"name": "timestamp_p",
"nullable": true,
"type": "long"
},
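As a sanity check on the schema file format, the fragment above can be embedded in a full StructType JSON document and parsed with the standard library; a minimal sketch, where the `value` data column is my own placeholder alongside the two partition fields:

```python
import json

# Hypothetical full schema file contents: the two partition fields from
# above plus one example data column ("value" is a placeholder).
schema_json = """
{
  "type": "struct",
  "fields": [
    {"metadata": {}, "name": "value", "nullable": true, "type": "string"},
    {"metadata": {}, "name": "updateddate_p", "nullable": true, "type": "integer"},
    {"metadata": {}, "name": "timestamp_p", "nullable": true, "type": "long"}
  ]
}
"""
parsed = json.loads(schema_json)
field_names = [f["name"] for f in parsed["fields"]]
print(field_names)
# ['value', 'updateddate_p', 'timestamp_p']
```

This is the same shape that `T.StructType.fromJson` expects in the code below.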
import json
from pyspark.sql import types as T

# Load the schema definition (which includes the partition columns) from JSON.
with open('custom-schema.json', 'r') as f:
    schema_definition = T.StructType.fromJson(json.loads(f.read()))

data_readstream = (
    spark.readStream
         .format("orc")
         .schema(schema_definition)
         .option("pathGlobFilter", "*.orc")
         .load(f'{mounted_input_data_path_loc}/')
)
# the writeStream side uses .partitionBy(column1, column2, ...)
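For context, a minimal sketch of what the writeStream side looks like, assuming the partition columns are present in the streaming DataFrame; the checkpoint and output paths are placeholders, not from my actual setup:

```python
# Sketch only: paths are placeholders; requires an active Spark session
# and the data_readstream defined above.
query = (
    data_readstream.writeStream
        .format("orc")
        .option("checkpointLocation", "/mnt/checkpoints/data-load")  # placeholder
        .partitionBy("updateddate_p", "timestamp_p")
        .start("/mnt/output/data-load")  # placeholder
)
```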