Search code examples
scalaapache-sparknullnullablenosuchelementexception

Getting undefined type of value in a Spark Dataframe


Unable to filter values containing null. I am trying multiple operations on an empty Spark DataSet.

case class SourceWithoutFlag( id:String, phone:String, name:String)
case class Target(id:String, phone:String, name:String, start_date:String, end_date:String, flag:String)

The code is described below :-

var target = spark.emptyDataset[Target]
val source: Dataset[SourceWithoutFlag] = spark
    .read.option("header", true).csv(sourceFile).as[SourceWithoutFlag]
println("New Data Read")
source.show(Int.MaxValue)

var operationRecordCheck = source
    .select("id")
    .withColumnRenamed("id","ids")
operationRecordCheck = target
    .join(operationRecordCheck, target("id") ===
                    operationRecordCheck("ids"),"full_outer")

operationRecordCheck.show
var insertRecordId = operationRecordCheck
    .where(isnull($"id"))
    .select("ids")
insertRecordId.show

Here I am reading source Dataset which contains these values

New Data Read
+---+---------+------+
| id|    phone|  name|
+---+---------+------+
|999|987654321|Jhoney|
|888|876543210|Stuart|
|444|576543210|Brocli|
|555|487654321|Advock|
+---+---------+------+

and another Dataset target which is an empty Dataset

+---+-----+----+----------+--------+----+
| id|phone|name|start_date|end_date|flag|
+---+-----+----+----------+--------+----+
+---+-----+----+----------+--------+----+

Now I am performing a join of these two DataSets, getting this result operationRecordCheck

+----+-----+----+----------+--------+----+---+
|  id|phone|name|start_date|end_date|flag|ids|
+----+-----+----+----------+--------+----+---+
|null| null|null|      null|    null|null|999|
|null| null|null|      null|    null|null|888|
|null| null|null|      null|    null|null|444|
|null| null|null|      null|    null|null|555|
+----+-----+----+----------+--------+----+---+

But when I am checking the cell value is null or not it is giving an exception.

Exception in thread "main" java.util.NoSuchElementException: None.get

The cause of the exception is

operationRecordCheck
        .where(isnull($"id"))
        .select("ids")

I just want to apply the sql query SELECT ids FROM operationRecordCheck WHERE id IS null; on the operationRecordCheck Dataset but it is not considering my Dataset Values as null.

I have also tried isnan($"id") , $"id".isNull , $"id".isNaN , $"id".isNotNull , $"id" === "" , $"id" === null but it is not giving me the proper result.

Appreciate the help :)


Solution

  • I ran into a very similar-looking problem recently (same error message, similar spark-based data manipulation with a join and then later a filter, and the failure traceable to the filter step). In my case, the failure was averted by adding a Dataset.cache() call just before the filter/'where' call. I think an analogous change in your code would look like this:

    operationRecordCheck
            .cache()
            .where(isnull($"id"))
            .select("ids")