In PySpark, I have two DataFrames. The first, df1, is an empty DataFrame created from a schema. The second, df2, is a non-empty DataFrame loaded from a CSV file. I want to merge them so that both scenarios are covered: columns that exist only in df2 should be dropped, and columns that exist only in the schema should be added.
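For reference, this is roughly how the two DataFrames are built. The schema, column names, and file path here are illustrative stand-ins, and spark is an existing SparkSession:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Illustrative reference schema with columns A, B, D
schema = StructType([
    StructField("A", StringType(), True),
    StructField("B", IntegerType(), True),
    StructField("D", StringType(), True),
])

# df1: empty DataFrame created from the schema
df1 = spark.createDataFrame([], schema)

# df2: non-empty DataFrame loaded from a CSV file (path is a placeholder)
df2 = spark.read.csv("input.csv", header=True, inferSchema=True)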
I tried iterating over the fields of the reference schema and casting the matching columns from df2:

schema = ...  # the reference schema with a few columns

for field in schema.fields:
    if field.name in df2.columns:
        final_df = df1.withColumn(field.name, df2[field.name].cast(field.dataType))

This doesn't work: final_df is rebuilt from df1 on every iteration, so at most one column survives, and df2[field.name] cannot be resolved against df1 because the column belongs to a different DataFrame.
Basically, you have an input DataFrame with columns A, B, and C, and a reference schema with columns A, B, D. You want your output DataFrame to match this "reference" schema by dropping column C and adding a NULL column D.
Assuming df_ref is your reference DataFrame and df is the DataFrame loaded from your CSV file:
from pyspark.sql import functions as F

# Add any reference column that is missing from the CSV DataFrame as a
# NULL column, cast to the type declared in the reference schema.
for field in df_ref.schema.fields:
    if field.name not in df.columns:
        df = df.withColumn(field.name, F.lit(None).cast(field.dataType))

# Keep only the reference schema's columns; this drops the extras (e.g. C).
df = df.select(df_ref.columns)
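Since the question also casts with field.dataType, you can align the types of the existing columns at the same time by folding everything into a single select. A sketch, using the same df_ref and df names as above:

from pyspark.sql import functions as F

# For each field in the reference schema: reuse the column if the CSV has it,
# otherwise substitute NULL, then cast to the reference type in both cases.
df = df.select([
    (F.col(f.name) if f.name in df.columns else F.lit(None))
        .cast(f.dataType)
        .alias(f.name)
    for f in df_ref.schema.fields
])

print(df.columns)  # now matches df_ref.columns, e.g. ['A', 'B', 'D']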