python, apache-spark, pyspark, apache-spark-sql

Comparing the schemas of two DataFrames using PySpark


I have a DataFrame (df1). To show its schema I use:

from pyspark.sql.functions import *
df1.printSchema()

And I get the following result:

#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)

Sometimes the schema changes (the column type or name):

df2.printSchema()


#root
# |-- name: array (nullable = true)
# |-- gender: integer (nullable = true)
# |-- age: long (nullable = true)

I would like to compare the two schemas (df1 and df2) and get only the differences in column names and types (a column can also move to another position). The result should be a table (or DataFrame) something like this:

    column     df1       df2        diff
    name       string    array      type
    gender     N/A       integer    new column

(The age column is the same and didn't change. If a column is omitted, it should be marked 'omitted'.) How can I do this efficiently if I have many columns in each DataFrame?


Solution

  • You can create two pandas DataFrames holding the column metadata of df1 and df2, like below:

    import pandas as pd

    # Collect (column name, data type) pairs from each Spark DataFrame's schema
    pd_df1 = pd.DataFrame(df1.dtypes, columns=['column', 'data_type'])
    pd_df2 = pd.DataFrame(df2.dtypes, columns=['column', 'data_type'])


    and then merge the two pandas DataFrames with an 'outer' join on the column name, so columns that exist in only one schema are kept, and compare the types row by row. A sketch of the full comparison follows below.
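
    Here is a minimal sketch of that idea, assuming df1 and df2 are the Spark DataFrames from the question; the diff labels ('type', 'new column', 'omitted') follow the table the question asks for, and the classify helper is just illustrative:

    import pandas as pd

    # (column name, data type) pairs from each Spark DataFrame's schema
    pd_df1 = pd.DataFrame(df1.dtypes, columns=['column', 'df1'])
    pd_df2 = pd.DataFrame(df2.dtypes, columns=['column', 'df2'])

    # Outer join on the column name keeps columns present in only one schema
    report = pd_df1.merge(pd_df2, on='column', how='outer')

    # Classify each column: unchanged, changed type, new column, or omitted column
    def classify(row):
        if pd.isna(row['df1']):
            return 'new column'
        if pd.isna(row['df2']):
            return 'omitted'
        return 'same' if row['df1'] == row['df2'] else 'type'

    report['diff'] = report.apply(classify, axis=1)

    # Keep only the rows where something actually changed
    print(report[report['diff'] != 'same'].fillna('N/A'))

    Since only the schema metadata (one row per column) is pulled into pandas, this stays cheap even when the DataFrames themselves are large.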