I am using Databricks Auto Loader with Azure Blob Storage. A column name changed in the source data, and I am curious about the best way to handle this change without changing the downstream column names.
What are some methods of handling these changes?
Thanks in advance.
I have not attempted any changes yet, as I am trying to figure out the best way to handle this, or whether Auto Loader has any built-in features I do not know about.
Below are some approaches you can follow, depending on what you know about the incoming data.
If you know exactly which columns are changing, you can map them explicitly, like below.
from pyspark.sql.functions import col

column_mapping = {
    "new_column_name": "old_column_name",
    # ... other mappings
}

df_transformed = df.select(
    *[col(new).alias(old) for new, old in column_mapping.items()]
)
Here, you build a dictionary that maps each new column name to the old name your downstream code expects, then select each column with an alias. Note that this select keeps only the columns listed in the mapping.
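If the mapping covers only some of the columns and you want to keep the rest untouched, a variant is to iterate over df.columns and fall back to the original name. A minimal sketch, assuming the same column_mapping dictionary as above:

from pyspark.sql.functions import col

# Rename only the columns present in the mapping; pass everything else through unchanged.
df_transformed = df.select(
    *[col(c).alias(column_mapping.get(c, c)) for c in df.columns]
)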
If there are only a few column name changes, you can simply use withColumnRenamed, like below.
df_renamed = df.withColumnRenamed("new_column_name", "old_column_name")
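For a handful of renames you can simply chain the calls. A minimal sketch with placeholder column names; withColumnRenamed is a no-op if the column does not exist, so the calls are safe to keep even before the source change lands:

df_renamed = (
    df.withColumnRenamed("cust_id", "customer_id")
      .withColumnRenamed("order_ts", "order_timestamp")
)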
Lastly, if you don't know which columns will change, you can set the Auto Loader option cloudFiles.schemaEvolutionMode to addNewColumns, which adds new columns to the inferred schema when incoming files don't match it (this requires cloudFiles.schemaLocation), and set mergeSchema to true when writing to your Delta table so the sink can pick up the new columns. Then use either of the approaches above to rename them.
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "csv") \
    .option("cloudFiles.schemaLocation", "/path/to/schema") \
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns") \
    .load("/path/to/blob/storage")
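Putting it together, a minimal end-to-end sketch (the schema/checkpoint paths and target table name are placeholders) that reads with Auto Loader, applies the rename mapping before the sink, and enables mergeSchema on the Delta write so genuinely new columns can still be added downstream:

from pyspark.sql.functions import col

column_mapping = {"new_column_name": "old_column_name"}  # hypothetical mapping

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/path/to/schema")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("/path/to/blob/storage")
)

# Rename mapped columns, keep all others as-is
df_out = df.select(*[col(c).alias(column_mapping.get(c, c)) for c in df.columns])

(
    df_out.writeStream
    .option("checkpointLocation", "/path/to/checkpoint")
    .option("mergeSchema", "true")  # allow the Delta sink to add new columns
    .toTable("target_table")
)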
This approach will solve your issue.