I am trying to compare the schemas of two DataFrames. The columns and the types are the same, but the "nullable" flag can differ:
Dataframe A
StructType(List(
StructField(ClientId,StringType,True),
StructField(PublicId,StringType,True),
StructField(ExternalIds,ArrayType(StructType(List(
StructField(AppId,StringType,True),
StructField(ExtId,StringType,True),
)),True),True),
....
Dataframe B
StructType(List(
StructField(ClientId,StringType,True),
StructField(PublicId,StringType,False),
StructField(ExternalIds,ArrayType(StructType(List(
StructField(AppId,StringType,True),
StructField(ExtId,StringType,False),
)),True),True),
....
When I do df_A.schema == df_B.schema, the result is False, obviously.
But I would like to ignore the "nullable" parameter: whether it is False or True, if the structure is otherwise the same, the comparison should return True.
Is it possible?
Using your example of the following two DataFrame schemas:
df_A.printSchema()
#root
# |-- ClientId: string (nullable = true)
# |-- PublicId: string (nullable = true)
# |-- PartyType: string (nullable = true)
df_B.printSchema()
#root
# |-- ClientId: string (nullable = true)
# |-- PublicId: string (nullable = true)
# |-- PartyType: string (nullable = false)
and assuming that the fields are in the same order, you can access the name and dataType of each of the fields in the schemas and zip them to compare:
print(
    all(
        (a.name, a.dataType) == (b.name, b.dataType)
        for a, b in zip(df_A.schema, df_B.schema)
    )
)
#True
If they are not in the same order, you can compare the sorted fields:
print(
    all(
        (a.name, a.dataType) == (b.name, b.dataType)
        for a, b in zip(
            sorted(df_A.schema, key=lambda x: (x.name, x.dataType)),
            sorted(df_B.schema, key=lambda x: (x.name, x.dataType))
        )
    )
)
#True
If it's possible that the two DataFrames have a differing number of columns, first compare the schema lengths as a short-circuiting check. zip stops at the shorter schema, so extra columns would otherwise go unnoticed, and if the lengths differ there is no need to iterate through the fields:
print(len(df_A.schema) == len(df_B.schema))
#True
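One caveat: comparing dataType only ignores nullable on the top-level fields. For nested types such as the ExternalIds array of structs in the question, the equality check on dataType still takes the inner nullable flags (and the array's containsNull) into account. A minimal sketch of one way around this, assuming you work on the dict form of the schema that df.schema.jsonValue() returns: strip every nullability key recursively and compare what is left. The function names and the field and example_schema helpers are made up for illustration, and the two dicts are hand-written stand-ins for the question's schemas so that the snippet runs without a Spark session.

```python
# Keys that hold nullability flags in the dict form of a Spark schema:
# "nullable" (struct fields), "containsNull" (arrays), "valueContainsNull" (maps).
NULLABILITY_KEYS = {"nullable", "containsNull", "valueContainsNull"}


def strip_nullability(node):
    """Return a copy of a schema dict with all nullability flags removed."""
    if isinstance(node, dict):
        return {
            # Field metadata is user data, so it is copied without descending into it.
            key: value if key == "metadata" else strip_nullability(value)
            for key, value in node.items()
            if key not in NULLABILITY_KEYS
        }
    if isinstance(node, list):
        return [strip_nullability(item) for item in node]
    return node  # leaf values such as the type name "string"


def same_schema_ignoring_nullable(schema_a, schema_b):
    """Compare two schema dicts, ignoring nullability at every nesting level."""
    return strip_nullability(schema_a) == strip_nullability(schema_b)


def field(name, data_type, nullable):
    """Hypothetical helper: build one field entry in the jsonValue() layout."""
    return {"name": name, "type": data_type, "nullable": nullable, "metadata": {}}


def example_schema(public_id_nullable, ext_id_nullable):
    """Hand-written stand-in for df.schema.jsonValue(), based on the question."""
    external_ids = {
        "type": "array",
        "containsNull": True,
        "elementType": {
            "type": "struct",
            "fields": [
                field("AppId", "string", True),
                field("ExtId", "string", ext_id_nullable),
            ],
        },
    }
    return {
        "type": "struct",
        "fields": [
            field("ClientId", "string", True),
            field("PublicId", "string", public_id_nullable),
            field("ExternalIds", external_ids, True),
        ],
    }


schema_a = example_schema(public_id_nullable=True, ext_id_nullable=True)
schema_b = example_schema(public_id_nullable=False, ext_id_nullable=False)

print(schema_a == schema_b)
#False
print(same_schema_ignoring_nullable(schema_a, schema_b))
#True
```

With real DataFrames the call would be same_schema_ignoring_nullable(df_A.schema.jsonValue(), df_B.schema.jsonValue()). Because the field lists are compared as lists, this version is order-sensitive and also returns False when the number of columns differs.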