scala, apache-spark

Compare two schemas (column name + nullable) in Spark


I know how to compare two lists in Scala using zip + forall.

My question is how to compare two DataFrame schemas. That is, I want to match column names together with their nullable property.

My idea is to use a hash map to store {column name: nullable} and then compare the maps. I guess it works, but is there a more idiomatic way?
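The map-based idea can be sketched in plain Scala; here the tuples stand in for the `(name, nullable)` pairs you would extract from each `df.schema.fields`, and the helper name `toNullableMap` is made up for illustration:

```scala
// Hypothetical sketch of the question's hash-map idea:
// build a Map(columnName -> nullable) per schema and compare the maps.
def toNullableMap(fields: Seq[(String, Boolean)]): Map[String, Boolean] =
  fields.toMap

// Stand-ins for df1.schema.fields / df2.schema.fields:
val fields1 = Seq(("id", false), ("name", true))
val fields2 = Seq(("name", true), ("id", false))

// Maps are unordered, so this comparison also ignores column order:
val sameSchema = toNullableMap(fields1) == toNullableMap(fields2)  // true
```

Note that because a `Map` keeps one entry per key, this silently collapses duplicate column names, which the list-based approaches below do not.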


Solution

  • First, retrieve the elements you want to compare, as Tom Lous suggested in his answer:

    val s1 = df1.schema.fields.map(f => (f.name, f.nullable))
    val s2 = df2.schema.fields.map(f => (f.name, f.nullable))
    

    Then you can just use the diff method available on Scala collections, which returns the differences between the two sequences: if it returns an empty collection, there is no difference; otherwise there is:

    s1.diff(s2).isEmpty
    

    This returns true if no difference was found, false otherwise.

    Keep in mind that diff only reports elements of the first sequence that are missing from the second, so a field present in s2 but not in s1 goes undetected. You may therefore need a second condition comparing lengths:

    s1.diff(s2).isEmpty && s1.length == s2.length
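Putting the pieces together, here is a self-contained sketch in plain Scala (no Spark needed), where the tuples stand in for the `(f.name, f.nullable)` pairs extracted from `df.schema.fields`:

```scala
// Stand-ins for the arrays built from df1.schema.fields / df2.schema.fields;
// s2 has an extra "age" column that s1 lacks.
val s1 = Seq(("id", false), ("name", true))
val s2 = Seq(("id", false), ("name", true), ("age", true))

// diff alone misses the extra field: every element of s1 is also in s2,
// so s1.diff(s2) is empty even though the schemas differ.
val diffOnly = s1.diff(s2).isEmpty                              // true

// Adding the length check catches the asymmetry:
val sameSchema = s1.diff(s2).isEmpty && s1.length == s2.length  // false
```

An equivalent alternative is to check the difference in both directions, `s1.diff(s2).isEmpty && s2.diff(s1).isEmpty`, which also handles duplicate field names more precisely than the length check alone.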