scala, dataframe, apache-spark, null, drop

Drop records from a DataFrame based on nulls present in any column


How can I select only the rows whose AccountID has no rows containing nulls?

  +---------+------+-----+
  |AccountID|  Name|Price|
  +---------+------+-----+
  |       11|miguel| null|
  |       11| luisa|   21|
  |       12|  hary|   90|
  |       13|  null|   99|
  |       14|  marg|   90|
  |       14|  null|   99|
  +---------+------+-----+

  DF.na.drop().show()

  +---------+------+-----+
  |AccountID|  Name|Price|
  +---------+------+-----+
  |       11| luisa|   21|
  |       12|  hary|   90|
  |       14|  marg|   90|
  +---------+------+-----+

na.drop() removed every row that has a null value in any column of the DataFrame. What I want is to also drop the other record with AccountID 11, i.e. delete all records for an AccountID if a null is present in any column of any record for that AccountID.

In this case the output should contain only AccountID 12, since AccountIDs 11 and 14 both have records with null values.

  +---------+------+-----+
  |AccountID|  Name|Price|
  +---------+------+-----+
  |       12|  hary|   90|
  +---------+------+-----+
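
A minimal sketch to reproduce the sample DataFrame above (the SparkSession setup, the toDF call and the use of Option for the nullable columns are my assumptions, not part of the original post):

  import org.apache.spark.sql.SparkSession

  // assumed local session, only needed to reproduce the example
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("drop-null-accounts")
    .getOrCreate()
  import spark.implicits._

  // nullable columns are modelled with Option so the null cells can be expressed
  val data: Seq[(Int, Option[String], Option[Int])] = Seq(
    (11, Some("miguel"), None),
    (11, Some("luisa"),  Some(21)),
    (12, Some("hary"),   Some(90)),
    (13, None,           Some(99)),
    (14, Some("marg"),   Some(90)),
    (14, None,           Some(99))
  )
  val DF = data.toDF("AccountID", "Name", "Price")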

Solution

  • To paraphrase your question: you want to keep only the rows whose AccountID never appears in a row that contains a null.

    import org.apache.spark.sql.functions.col

    // select the rows which have a null in any column
    val idsWithNulls = DF.filter(col("Name").isNull || col("Price").isNull)
    // a left_anti join keeps only the AccountIDs that never appear among those 'bad' rows
    DF.join(idsWithNulls, Seq("AccountID"), "left_anti").show()

    The result will be:

    +---------+----+-----+
    |AccountID|Name|Price|
    +---------+----+-----+
    |       12|hary|   90|
    +---------+----+-----+
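
    An alternative sketch that yields the same result with a window over AccountID instead of an anti-join (the helper column name hasNull is mine, and which variant performs better depends on your data):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, max, when}

    // flag each AccountID that has at least one null in Name or Price
    val hasNull = max(when(col("Name").isNull || col("Price").isNull, 1).otherwise(0))
      .over(Window.partitionBy("AccountID"))

    // keep only the accounts whose flag is 0, then drop the helper column
    DF.withColumn("hasNull", hasNull)
      .filter(col("hasNull") === 0)
      .drop("hasNull")
      .show()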