I have a data frame (df1). To show its schema I use:
from pyspark.sql.functions import *
df1.printSchema()
And I get the following result:
#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
Sometimes the schema changes (a column's type or name):
df2.printSchema()
#root
# |-- name: array (nullable = true)
# |-- gender: integer (nullable = true)
# |-- age: long (nullable = true)
I would like to compare the two schemas (df1 and df2) and get only the differences in types and column names (a column can also move to another position).
The results should be a table (or data frame) something like this:
column    df1      df2       diff
name      string   array     type
gender    N/A      integer   new column
(The age column is the same and didn't change. If a column is missing from df2, the diff should say 'omitted'.)
How can I do it efficiently if I have many columns in each?
You can try creating two pandas DataFrames with the schema metadata from both df1 and df2, like below:
import pandas as pd

# df.dtypes returns a list of (column name, type) tuples
pd_df1 = pd.DataFrame(df1.dtypes, columns=['column', 'data_type'])
pd_df2 = pd.DataFrame(df2.dtypes, columns=['column', 'data_type'])
and then merge those two pandas DataFrames with an outer join on the column name.
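Here is a minimal sketch of that idea end to end, assuming df1 and df2 are the Spark DataFrames from the question and pandas is available on the driver; the classify helper and the output column names (df1, df2, diff) are just illustrative choices:

import pandas as pd

# One row per column: PySpark's DataFrame.dtypes yields (name, type) tuples
pd_df1 = pd.DataFrame(df1.dtypes, columns=['column', 'df1'])
pd_df2 = pd.DataFrame(df2.dtypes, columns=['column', 'df2'])

# Outer merge on the column name, so columns present in only one schema survive
merged = pd_df1.merge(pd_df2, on='column', how='outer')

# Classify each column: new, omitted, type change, or unchanged
def classify(row):
    if pd.isna(row['df1']):
        return 'new column'
    if pd.isna(row['df2']):
        return 'omitted'
    if row['df1'] != row['df2']:
        return 'type'
    return ''  # unchanged

merged['diff'] = merged.apply(classify, axis=1)

# Keep only the rows that actually differ
result = merged[merged['diff'] != ''].fillna('N/A')
print(result)

Because the comparison runs only on the schema metadata (a handful of rows per DataFrame), this stays cheap even when the Spark DataFrames themselves are large; note that matching is by column name, so a renamed column shows up as one 'omitted' row and one 'new column' row rather than a rename.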