Tags: scala, apache-spark, apache-spark-sql, spark-streaming

Spark (Scala): How to turn an Array[Row] into either a Dataset[Row] or a DataFrame?


I have an Array[Row] and I want to turn it into either a Dataset[Row] or DataFrame.

How did I come up with an Array of Rows?

Well, I was trying to clear nulls from my dataset:

  • without having to filter EACH column (I have a lot) and..
  • without using the .na.drop() function from DataFrameNaFunctions because it fails to detect when a cell actually has the string "null".

So, I came up with the following line to filter out nulls in all columns:

val outDF = inputDF.columns.flatMap { col =>
  inputDF.filter(s"$col != '' AND $col != 'null'").collect()
}

Problem is, outDF is an Array[Row], hence the question! Any ideas welcome!
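To answer the question as asked: a local `Array[Row]` can be turned back into a `DataFrame` with `SparkSession.createDataFrame`, provided you still have the schema. A minimal sketch, assuming a `SparkSession` named `spark` and that `inputDF` (from the question) supplies the original schema:

```scala
import org.apache.spark.sql.{Row, SparkSession}

// `outDF` is the Array[Row] produced above; `inputDF.schema` is the
// StructType of the original DataFrame.
val rows: Array[Row] = outDF

val restoredDF = spark.createDataFrame(
  spark.sparkContext.parallelize(rows.toSeq), // distribute the local array as an RDD[Row]
  inputDF.schema                              // reuse the original schema
)
```

Note that collecting to the driver and re-parallelizing defeats Spark's distributed execution; it is usually better to express the filter so the result stays a `DataFrame`, as the accepted solution below does.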


Solution

  • I'm posting the answer as per my comment.

    df.na.drop(df.columns).where(s"'null' not in (${df.columns.mkString(",")})")
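This works in two steps: `na.drop(df.columns)` removes rows containing real nulls in any column, and the `where` clause then removes rows in which any column holds the literal string `"null"` (SQL's `x not in (c1, c2, …)` is false as soon as one column equals `x`). A minimal self-contained sketch, assuming a local `SparkSession` (the data and column names are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("a", "b"), ("null", "c"), ("d", null)).toDF("c1", "c2")

// Drop rows with real nulls, then rows where any column is the string "null".
val cleaned = df.na.drop(df.columns)
  .where(s"'null' not in (${df.columns.mkString(",")})")
// Only the ("a", "b") row should survive.
```

Unlike the `flatMap`/`collect` approach in the question, this stays a `DataFrame` throughout, so no conversion back from `Array[Row]` is needed. Note it does not also filter empty strings; if you need that too, extend the `where` condition accordingly.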