Some string columns in my dataset contain "None" or "NA" values, and I want to count how many of these null-like values the dataset contains so that I can decide which approach to use for handling the missing values.
I tried a very inefficient way of filtering the DataFrame: one or-expression per column. I want to filter them in a more efficient and neat way. It would be better to do this without converting to an RDD, but if that is not possible with the DataFrame API, doing this kind of filtering the RDD way is also acceptable.
I found the thread Spark SQL filter multiple fields, which is similar to my question, but I want a neater and more elegant way to write this because I have so many columns.
// trainDataFull is my dataframe
val nullValues = Array("NA", "None")
val filtered = trainDataFull.filter(trainDataFull("Alley").isin(nullValues:_*) ||
trainDataFull("MSZoning").isin(nullValues:_*) ||
trainDataFull("Street").isin(nullValues:_*) ||
trainDataFull("LotShape").isin(nullValues:_*) ||
trainDataFull("LandContour").isin(nullValues:_*) ||
trainDataFull("Utilities").isin(nullValues:_*) ||
trainDataFull("LotConfig").isin(nullValues:_*) ||
trainDataFull("LandSlope").isin(nullValues:_*) ||
trainDataFull("Neighborhood").isin(nullValues:_*) ||
trainDataFull("Condition1").isin(nullValues:_*) ||
trainDataFull("Condition2").isin(nullValues:_*) ||
trainDataFull("BldgType").isin(nullValues:_*) ||
trainDataFull("HouseStyle").isin(nullValues:_*) ||
trainDataFull("RoofStyle").isin(nullValues:_*) ||
trainDataFull("RoofMatl").isin(nullValues:_*) ||
trainDataFull("Exterior1st").isin(nullValues:_*) ||
trainDataFull("Exterior2nd").isin(nullValues:_*) ||
trainDataFull("MasVnrType").isin(nullValues:_*) ||
trainDataFull("MasVnrArea").isin(nullValues:_*) ||
trainDataFull("ExterQual").isin(nullValues:_*) ||
trainDataFull("MasVnrArea").isin(nullValues:_*) ||
trainDataFull("ExterQual").isin(nullValues:_*) ||
trainDataFull("ExterCond").isin(nullValues:_*) ||
trainDataFull("Foundation").isin(nullValues:_*) ||
trainDataFull("BsmtQual").isin(nullValues:_*) ||
trainDataFull("BsmtCond").isin(nullValues:_*) ||
trainDataFull("BsmtExposure").isin(nullValues:_*)
)
I want to see how many of these null-like values each column contains.
You can always generate the query programmatically:
import org.apache.spark.sql.functions.{col, sum, when}
import org.apache.spark.sql.types.{StringType, StructField}

val nullValues = Array("NA", "None")
val df = Seq(("NA", "Foo"), ("None", "NA")).toDF("MSZoning", "Street")

// For each string column, count the rows whose value is one of the null-like markers
val columns = df.schema.collect {
  case StructField(name, StringType, _, _) =>
    sum(when(col(name).isInCollection(nullValues), 1)).as(name)
}

df.select(columns: _*).show()
Output:
+--------+------+
|MSZoning|Street|
+--------+------+
| 2| 1|
+--------+------+
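If you also want the row-level filter from the question without listing every column by hand, the same schema-driven trick applies. Here is a minimal sketch (assuming a row should be kept when any of its string columns holds one of the markers) that builds one predicate per string column and ORs them together with reduce:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField}

// One predicate per string column, combined with || so a row matches
// when any string column contains a null-like marker
val predicates = df.schema.collect {
  case StructField(name, StringType, _, _) => col(name).isInCollection(nullValues)
}
val filtered = df.filter(predicates.reduce(_ || _))

reduce(_ || _) works because || on a Column returns another Column, so the fold yields a single boolean expression equivalent to the hand-written chain of ||.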