python azure pyspark databricks azure-databricks

Merge 2 dataframes in Pyspark

In Pyspark, I have 2 dataframe. 1st dataframe say df1 is empty dataframe created from a schema. 2nd dataframe df2 is non-empty dataframe filled from a csv file. Now I want to merge such that all below scenarios are covered.

If both dataframes contain same number if columns, merge them.
If 2nd dataframe contains additional columns than drop those columns
If 2nd dataframe conatains lesser columns then populate those column with null values.

I tried iterating through the empty dataframe columns' field property. schema = some schema of few column

for field in scehma.fields:
    if field.name in df2.columns:
       final_df = df1.withColumn(field.name, df2[field.name].cast(field.dataType))

Solution

Basically, you have an input dataframe with columns A, B and C and a reference schema with columns A, B, D. And you want your output dataframe to match this "reference" schema by dropping column C and adding a "NULL" column D.

Assuming df_ref is your reference dataframe and df is your csv file.

from pyspark.sql import functions as F


# Adding missing columns
for col in df_ref.columns: 
    if col not in df.columns:
        df = df.withColumn(col, F.lit(None))

# select only ref's columns
df.select(df_ref.columns)