Unable to filter values containing null
.
I am trying multiple operations on an empty Spark DataSet.
case class SourceWithoutFlag( id:String, phone:String, name:String)
case class Target(id:String, phone:String, name:String, start_date:String, end_date:String, flag:String)
The code is described below :-
var target = spark.emptyDataset[Target]
val source: Dataset[SourceWithoutFlag] = spark
.read.option("header", true).csv(sourceFile).as[SourceWithoutFlag]
println("New Data Read")
source.show(Int.MaxValue)
var operationRecordCheck = source
.select("id")
.withColumnRenamed("id","ids")
operationRecordCheck = target
.join(operationRecordCheck, target("id") ===
operationRecordCheck("ids"),"full_outer")
operationRecordCheck.show
var insertRecordId = operationRecordCheck
.where(isnull($"id"))
.select("ids")
insertRecordId.show
Here I am reading source
Dataset which contains these values
New Data Read
+---+---------+------+
| id| phone| name|
+---+---------+------+
|999|987654321|Jhoney|
|888|876543210|Stuart|
|444|576543210|Brocli|
|555|487654321|Advock|
+---+---------+------+
and another Dataset target
which is an empty Dataset
+---+-----+----+----------+--------+----+
| id|phone|name|start_date|end_date|flag|
+---+-----+----+----------+--------+----+
+---+-----+----+----------+--------+----+
Now I am performing a join of these two DataSets, getting this result operationRecordCheck
+----+-----+----+----------+--------+----+---+
| id|phone|name|start_date|end_date|flag|ids|
+----+-----+----+----------+--------+----+---+
|null| null|null| null| null|null|999|
|null| null|null| null| null|null|888|
|null| null|null| null| null|null|444|
|null| null|null| null| null|null|555|
+----+-----+----+----------+--------+----+---+
But when I am checking the cell value is null or not it is giving an exception.
Exception in thread "main" java.util.NoSuchElementException: None.get
The cause of the exception is
operationRecordCheck
.where(isnull($"id"))
.select("ids")
I just want to apply the sql query SELECT ids FROM operationRecordCheck WHERE id IS null;
on the operationRecordCheck Dataset but it is not considering my Dataset Values as null
.
I have also tried isnan($"id")
, $"id".isNull
, $"id".isNaN
, $"id".isNotNull
, $"id" === ""
, $"id" === null
but it is not giving me the proper result.
Appreciate the help :)
I ran into a very similar-looking problem recently (same error message, similar spark-based data manipulation with a join and then later a filter, and the failure traceable to the filter step). In my case, the failure was averted by adding a Dataset.cache() call just before the filter/'where' call. I think an analogous change in your code would look like this:
operationRecordCheck
.cache()
.where(isnull($"id"))
.select("ids")