Tags: apache-spark, pyspark

Compare schemas ignoring nullable


I am trying to compare the schemas of two DataFrames. The columns and types are the same, but the "nullable" flag can differ:

Dataframe A

StructType(List(
StructField(ClientId,StringType,True),
StructField(PublicId,StringType,True),
StructField(ExternalIds,ArrayType(StructType(List(
    StructField(AppId,StringType,True),
    StructField(ExtId,StringType,True),
)),True),True),
....

Dataframe B

StructType(List(
StructField(ClientId,StringType,True),
StructField(PublicId,StringType,False),
StructField(ExternalIds,ArrayType(StructType(List(
    StructField(AppId,StringType,True),
    StructField(ExtId,StringType,False),
)),True),True),
....

When I do df_A.schema == df_B.schema, the result is False, obviously. But I would like to ignore the "nullable" parameter: whether it is False or True, if the structure is the same, the comparison should return True.

Is it possible?


Solution

  • Using your example of the following two DataFrame schemas:

    df_A.printSchema()
    #root
    # |-- ClientId: string (nullable = true)
    # |-- PublicId: string (nullable = true)
    # |-- PartyType: string (nullable = true)
    
    df_B.printSchema()
    #root
    # |-- ClientId: string (nullable = true)
    # |-- PublicId: string (nullable = true)
    # |-- PartyType: string (nullable = false)
    

    and assuming that the fields are in the same order, you can access the name and dataType of each of the fields in the schemas and zip them to compare:

    print(
        all(
            (a.name, a.dataType) == (b.name, b.dataType) 
            for a,b in zip(df_A.schema, df_B.schema)
        )
    )
    #True
    

    If they are not in the same order, you can compare the sorted fields:

    print(
        all(
            (a.name, a.dataType) == (b.name, b.dataType) 
            for a,b in zip(
                sorted(df_A.schema, key=lambda x: (x.name, x.dataType)), 
                sorted(df_B.schema, key=lambda x: (x.name, x.dataType))
            )
        )
    )
    #True
    
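    Note that for nested schemas like the question's `ExternalIds` (an array of structs), comparing `dataType` directly still sees the inner nullable flags, so the checks above would return False. One option, sketched below (the helper name `remove_nullability` is my own, not from any library), is to recursively rebuild each `dataType` with every `nullable`/`containsNull` flag forced to True before comparing:

    ```python
    from pyspark.sql.types import (
        StructType, StructField, StringType, ArrayType, MapType
    )

    def remove_nullability(dt):
        """Return a copy of dt with every nullable/containsNull flag set to True."""
        if isinstance(dt, StructType):
            return StructType([
                StructField(f.name, remove_nullability(f.dataType), True)
                for f in dt.fields
            ])
        if isinstance(dt, ArrayType):
            return ArrayType(remove_nullability(dt.elementType), True)
        if isinstance(dt, MapType):
            return MapType(remove_nullability(dt.keyType),
                           remove_nullability(dt.valueType), True)
        return dt

    # Trimmed-down versions of the question's two schemas, differing
    # only in nullable flags (including a nested one):
    schema_a = StructType([
        StructField("PublicId", StringType(), True),
        StructField("ExternalIds", ArrayType(StructType([
            StructField("ExtId", StringType(), True),
        ]), True), True),
    ])
    schema_b = StructType([
        StructField("PublicId", StringType(), False),
        StructField("ExternalIds", ArrayType(StructType([
            StructField("ExtId", StringType(), False),
        ]), True), True),
    ])

    print(remove_nullability(schema_a) == remove_nullability(schema_b))
    #True
    ```

    Since this only touches the schema objects, it does not need a running SparkSession; you would call it as `remove_nullability(df_A.schema) == remove_nullability(df_B.schema)`.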

    If it's possible that the two DataFrames have a differing number of columns, first compare the schema lengths as a short-circuiting check; if the lengths differ, there is no need to iterate through the fields at all:

    print(len(df_A.schema) == len(df_B.schema))
    #True
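
    Putting the pieces together, a small helper (the name `schemas_match` is hypothetical; this handles flat schemas only, since `dataType` equality still sees nested nullable flags) might look like:

    ```python
    from pyspark.sql.types import StructType, StructField, StringType

    def schemas_match(s1, s2):
        """Compare two flat schemas by (name, dataType), ignoring top-level nullable."""
        if len(s1) != len(s2):  # short-circuit on differing column counts
            return False
        return all(
            (a.name, a.dataType) == (b.name, b.dataType)
            for a, b in zip(sorted(s1, key=lambda x: x.name),
                            sorted(s2, key=lambda x: x.name))
        )

    # Same columns, different nullable flags and different field order:
    s_a = StructType([StructField("ClientId", StringType(), True),
                      StructField("PublicId", StringType(), True)])
    s_b = StructType([StructField("PublicId", StringType(), False),
                      StructField("ClientId", StringType(), True)])

    print(schemas_match(s_a, s_b))
    #True
    ```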