Search code examples
dataframescalaapache-sparkbigdata

Looping thorough a list of columns and enriching datastet


I have a dataset, and I want to iterate over the list of columns and enrich the dataset with two new columns - status and message.

If for any row, any column is null, then status should be set as Failure, and message should contain the list of columns which are null:

Col : list[String] = df.coulmns.toList

df = df.withColumn("status" ,functions.lit("Success"))
       .withColumn("message",functions.lit(" "))
col.forEach(f=>{

val SuccessDf = df.filter(col(f).isNotNull)
var failureDf = df.filter(col(f).isNull)
                 
failureDf = failureDf.withColumn("status" ,functions.lit("Failure"))
.withColumn("message", concat(col("message"),f))

df = SuccessDf.union(failureDf) 
}
)

Problem with above code is, it's taking forever to run dataset is about 40K and about 18 columns

Sample message:

" Column : " + columnName + " is null"

Solution

  • Please check below solution.

    scala> df.show(30, false) // sample data
    +----------------------------+----------+------+---+----------+----------+------+-----------+
    |birthDate                   |firstName |gender|id |lastName  |middleName|salary|ssn        |
    +----------------------------+----------+------+---+----------+----------+------+-----------+
    |1955-07-02T04:00:00.000+0000|Pennie    |null  |1  |Hirschmann|Carry     |56172 |981-43-9345|
    |1992-02-08T05:00:00.000+0000|An        |Female|2  |Cowper    |Amira     |40203 |null       |
    |null                        |Quyen     |Female|3  |Dome      |Marlen    |53417 |957-57-8246|
    |1990-04-11T04:00:00.000+0000|Coralie   |Female|4  |null      |null      |94727 |963-39-4885|
    |1980-01-16T05:00:00.000+0000|Terrie    |Female|5  |Bonar     |Wava      |79908 |964-49-8051|
    |1990-11-24T05:00:00.000+0000|null      |Female|6  |null      |null      |64652 |954-59-9172|
    |1970-12-19T05:00:00.000+0000|Geri      |Female|7  |Mosby     |Tambra    |38195 |968-16-4020|
    |null                        |Patria    |null  |8  |null      |Nancy     |null  |null       |
    |1967-11-17T05:00:00.000+0000|Terese    |Female|9  |Tocque    |Alfredia  |91294 |967-48-7309|
    |null                        |Wava      |null  |10 |null      |null      |56521 |null       |
    |1979-09-17T04:00:00.000+0000|Sophie    |Female|11 |Hearn     |Emerita   |90920 |977-66-4483|
    |1959-01-31T05:00:00.000+0000|Jodie     |Female|12 |Laneham   |Tabetha   |90634 |923-24-9769|
    |1974-02-19T04:00:00.000+0000|Marietta  |Female|13 |Yansons   |Mandi     |93162 |900-34-8083|
    |1960-09-26T04:00:00.000+0000|Caridad   |Female|14 |Snelle    |Maire     |38859 |992-11-7062|
    |1960-01-29T05:00:00.000+0000|Yasmine   |Female|15 |Edworthye |Meg       |76220 |922-12-9862|
    |1986-12-05T05:00:00.000+0000|Chan      |Female|16 |Hartas    |Jani      |75050 |995-51-3115|
    |1961-09-29T04:00:00.000+0000|Evangeline|Female|17 |Casserley |Wanetta   |62814 |926-61-3526|
    |1980-02-14T05:00:00.000+0000|Elnora    |Female|18 |Lipman    |Kecia     |71350 |950-23-9739|
    |1978-11-14T05:00:00.000+0000|Adelle    |Female|19 |Grigoriev |Kathyrn   |60600 |923-23-5984|
    |1973-11-24T05:00:00.000+0000|Mica      |Female|20 |Challens  |Zandra    |51071 |918-66-1232|
    +----------------------------+----------+------+---+----------+----------+------+-----------+
    
    scala> val statusExpr = when(
            df.columns.map(c => col(c).isNull).reduce(_ or _), 
            lit("Failure")
        )
        .otherwise(lit(""))
    
    scala> val messageExpr = concat_ws(
            ", ", 
            df.columns.map(c => when(col(c).isNull, lit(s"Column: ${c} isNull"))):_*
        )
    
    scala> df
           .withColumn("status", statusExpr)
           .withColumn("message", messageExpr)
           .show(30, false)
    

    Final Output

    +----------------------------+----------+------+---+----------+----------+------+-----------+-------+-----------------------------------------------------------------------------------------------------------------------+
    |birthDate                   |firstName |gender|id |lastName  |middleName|salary|ssn        |status |message                                                                                                                |
    +----------------------------+----------+------+---+----------+----------+------+-----------+-------+-----------------------------------------------------------------------------------------------------------------------+
    |1955-07-02T04:00:00.000+0000|Pennie    |null  |1  |Hirschmann|Carry     |56172 |981-43-9345|Failure|Column: gender isNull                                                                                                  |
    |1992-02-08T05:00:00.000+0000|An        |Female|2  |Cowper    |Amira     |40203 |null       |Failure|Column: ssn isNull                                                                                                     |
    |null                        |Quyen     |Female|3  |Dome      |Marlen    |53417 |957-57-8246|Failure|Column: birthDate isNull                                                                                               |
    |1990-04-11T04:00:00.000+0000|Coralie   |Female|4  |null      |null      |94727 |963-39-4885|Failure|Column: lastName isNull, Column: middleName isNull                                                                     |
    |1980-01-16T05:00:00.000+0000|Terrie    |Female|5  |Bonar     |Wava      |79908 |964-49-8051|       |                                                                                                                       |
    |1990-11-24T05:00:00.000+0000|null      |Female|6  |null      |null      |64652 |954-59-9172|Failure|Column: firstName isNull, Column: lastName isNull, Column: middleName isNull                                           |
    |1970-12-19T05:00:00.000+0000|Geri      |Female|7  |Mosby     |Tambra    |38195 |968-16-4020|       |                                                                                                                       |
    |null                        |Patria    |null  |8  |null      |Nancy     |null  |null       |Failure|Column: birthDate isNull, Column: gender isNull, Column: lastName isNull, Column: salary isNull, Column: ssn isNull    |
    |1967-11-17T05:00:00.000+0000|Terese    |Female|9  |Tocque    |Alfredia  |91294 |967-48-7309|       |                                                                                                                       |
    |null                        |Wava      |null  |10 |null      |null      |56521 |null       |Failure|Column: birthDate isNull, Column: gender isNull, Column: lastName isNull, Column: middleName isNull, Column: ssn isNull|
    |1979-09-17T04:00:00.000+0000|Sophie    |Female|11 |Hearn     |Emerita   |90920 |977-66-4483|       |                                                                                                                       |
    |1959-01-31T05:00:00.000+0000|Jodie     |Female|12 |Laneham   |Tabetha   |90634 |923-24-9769|       |                                                                                                                       |
    |1974-02-19T04:00:00.000+0000|Marietta  |Female|13 |Yansons   |Mandi     |93162 |900-34-8083|       |                                                                                                                       |
    |1960-09-26T04:00:00.000+0000|Caridad   |Female|14 |Snelle    |Maire     |38859 |992-11-7062|       |                                                                                                                       |
    |1960-01-29T05:00:00.000+0000|Yasmine   |Female|15 |Edworthye |Meg       |76220 |922-12-9862|       |                                                                                                                       |
    |1986-12-05T05:00:00.000+0000|Chan      |Female|16 |Hartas    |Jani      |75050 |995-51-3115|       |                                                                                                                       |
    |1961-09-29T04:00:00.000+0000|Evangeline|Female|17 |Casserley |Wanetta   |62814 |926-61-3526|       |                                                                                                                       |
    |1980-02-14T05:00:00.000+0000|Elnora    |Female|18 |Lipman    |Kecia     |71350 |950-23-9739|       |                                                                                                                       |
    |1978-11-14T05:00:00.000+0000|Adelle    |Female|19 |Grigoriev |Kathyrn   |60600 |923-23-5984|       |                                                                                                                       |
    |1973-11-24T05:00:00.000+0000|Mica      |Female|20 |Challens  |Zandra    |51071 |918-66-1232|       |                                                                                                                       |
    +----------------------------+----------+------+---+----------+----------+------+-----------+-------+-----------------------------------------------------------------------------------------------------------------------+