Tags: apache-spark, pyspark

Compare schemas ignoring nullable


I am trying to compare the schemas of two DataFrames. The columns and types are the same, but the "nullable" flag can differ:

Dataframe A

StructType(List(
StructField(ClientId,StringType,True),
StructField(PublicId,StringType,True),
StructField(ExternalIds,ArrayType(StructType(List(
    StructField(AppId,StringType,True),
    StructField(ExtId,StringType,True),
)),True),True),
....

Dataframe B

StructType(List(
StructField(ClientId,StringType,True),
StructField(PublicId,StringType,False),
StructField(ExternalIds,ArrayType(StructType(List(
    StructField(AppId,StringType,True),
    StructField(ExtId,StringType,False),
)),True),True),
....

When I do df_A.schema == df_B.schema, the result is False, obviously. But I would like to ignore the "nullable" parameter: whether it is False or True, if the structure is the same, the comparison should return True.

Is it possible?


Solution

  • Using your example of the following two DataFrame schemas:

    df_A.printSchema()
    #root
    # |-- ClientId: string (nullable = true)
    # |-- PublicId: string (nullable = true)
    # |-- PartyType: string (nullable = true)
    
    df_B.printSchema()
    #root
    # |-- ClientId: string (nullable = true)
    # |-- PublicId: string (nullable = true)
    # |-- PartyType: string (nullable = false)
    

    and assuming that the fields are in the same order, you can access the name and dataType of each of the fields in the schemas and zip them to compare:

    print(
        all(
            (a.name, a.dataType) == (b.name, b.dataType) 
            for a,b in zip(df_A.schema, df_B.schema)
        )
    )
    #True
    

    If they are not in the same order, you can compare the sorted fields:

    print(
        all(
            (a.name, a.dataType) == (b.name, b.dataType) 
            for a,b in zip(
                sorted(df_A.schema, key=lambda x: (x.name, x.dataType)), 
                sorted(df_B.schema, key=lambda x: (x.name, x.dataType))
            )
        )
    )
    #True
    
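    Note that for nested schemas like the question's `ExternalIds` (an array of structs), comparing `dataType` directly still sees the inner nullable flags, so the checks above would return False. One option, sketched below (the helper name `remove_nullability` is my own, not from any library), is to recursively rebuild each `dataType` with every `nullable`/`containsNull` flag forced to True before comparing:

    ```python
    from pyspark.sql.types import (
        StructType, StructField, StringType, ArrayType, MapType
    )

    def remove_nullability(dt):
        """Return a copy of dt with every nullable/containsNull flag set to True."""
        if isinstance(dt, StructType):
            return StructType([
                StructField(f.name, remove_nullability(f.dataType), True)
                for f in dt.fields
            ])
        if isinstance(dt, ArrayType):
            return ArrayType(remove_nullability(dt.elementType), True)
        if isinstance(dt, MapType):
            return MapType(remove_nullability(dt.keyType),
                           remove_nullability(dt.valueType), True)
        return dt

    # Trimmed-down versions of the question's two schemas, differing
    # only in nullable flags (including a nested one):
    schema_a = StructType([
        StructField("PublicId", StringType(), True),
        StructField("ExternalIds", ArrayType(StructType([
            StructField("ExtId", StringType(), True),
        ]), True), True),
    ])
    schema_b = StructType([
        StructField("PublicId", StringType(), False),
        StructField("ExternalIds", ArrayType(StructType([
            StructField("ExtId", StringType(), False),
        ]), True), True),
    ])

    print(remove_nullability(schema_a) == remove_nullability(schema_b))
    #True
    ```

    Since this only touches the schema objects, it does not need a running SparkSession; you would call it as `remove_nullability(df_A.schema) == remove_nullability(df_B.schema)`.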

    If it's possible that the two DataFrames have a differing number of columns, first compare the schema lengths as a short-circuiting check; if the lengths differ, there is no need to iterate through the fields at all:

    print(len(df_A.schema) == len(df_B.schema))
    #True
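
    Putting the pieces together, a small helper (the name `schemas_match` is hypothetical; this handles flat schemas only, since `dataType` equality still sees nested nullable flags) might look like:

    ```python
    from pyspark.sql.types import StructType, StructField, StringType

    def schemas_match(s1, s2):
        """Compare two flat schemas by (name, dataType), ignoring top-level nullable."""
        if len(s1) != len(s2):  # short-circuit on differing column counts
            return False
        return all(
            (a.name, a.dataType) == (b.name, b.dataType)
            for a, b in zip(sorted(s1, key=lambda x: x.name),
                            sorted(s2, key=lambda x: x.name))
        )

    # Same columns, different nullable flags and different field order:
    s_a = StructType([StructField("ClientId", StringType(), True),
                      StructField("PublicId", StringType(), True)])
    s_b = StructType([StructField("PublicId", StringType(), False),
                      StructField("ClientId", StringType(), True)])

    print(schemas_match(s_a, s_b))
    #True
    ```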