Search code examples
apache-sparkapache-spark-sqlspark-java

How to concatenate null columns in spark dataframe in java?


I am working with spark in java and I want to create a column which is a concatenation of all other column values separated by comma. I have tried few ways to do this but couldn't find a solution.

For example-

col1 | col2 | col3 | result
 1   | john | 2.3  | 1,john,2.3

The problem that I'm facing is if a column value is null then simply nothing should be there but so far I have not achieved this.

For example-

col1 | col2 | col3 | result
 1   | null | 2.3  | 1,,2.3  -> this is what I'm trying to achieve

What I have tried-

1.) I tried concat function of spark sql but if a column value is null then entire value of concat function will be null, which is not something i want.

2.) concat_ws function simply ignore the null values hence can't be used. Hence in my 2nd example the result would be something like 1,2.3

3.) using coalesce require matching data types so if I do something like coalesce(col_name,'') under concat function and col_name is number data type then it throws data types mismatch error.

4.) using case when statement also require same data type in THEN & ELSE conditions. So if I say case when column_1 is null THEN '' ELSE column1 END will throw error if column1 is number because '' is empty string and column_1 is number.

one way it can be achieved is creating map_function. But is there a way to do it in spark sql way?


Solution

  • You can cast all your fields to String so that you can run NVL on them and set them to empty string '' if they're null.

    To generate an expression for all the columns in your dataframe automatically, you can use map function:

    df.show()
    //+----+----+----+
    //|col1|col2|col3|
    //+----+----+----+
    //|   1|John| 2.3|
    //|   2|null| 2.3|
    //+----+----+----+
    
    
    List<String> columns = df.columns() ; 
    // List("col1", "col2", "col3")
    
    String nvlExpr = columns.stream().map(i -> "nvl(cast ("+i+" as string),'')").collect(joining(", ", "concat_ws(','," , ")"));
    
    //concat_ws(',',nvl(cast (col1 as string), ''), nvl(cast (col2 as string), ''),nvl(cast (col3 as string), ''))
    
    df.withColumn("result", expr(nvlExpr)).show()
    
    //+----+----+----+----------+
    //|col1|col2|col3|    result|
    //+----+----+----+----------+
    //|   1|John| 2.3|1,John,2.3|
    //|   2|null| 2.3|    2,,2.3|
    //+----+----+----+----------+