Tags: scala, apache-spark, apache-spark-sql, rdd, case-when

How to split one column into two columns, dividing the data between them, in Spark?


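import org.apache.spark.sql.functions.when
import spark.implicits._  // already in scope in spark-shell

val rdd = sc.parallelize(List(41,42,43,44,45,46,47,48,49,50))

val df = rdd.toDF("numbers")

// Each row is evaluated on its own, so every row fills only one of the two columns
df.select(when($"numbers" % 2 === 0, $"numbers").otherwise("").as("Even"),
          when($"numbers" % 2 === 1, $"numbers").otherwise("").as("Odd"))
  .orderBy("Even", "Odd").show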
+----+---+
|Even|Odd|
+----+---+
|    | 41|
|    | 43|
|    | 45|
|    | 47|
|    | 49|
|  42|   |
|  44|   |
|  46|   |
|  48|   |
|  50|   |
+----+---+

I want to remove the empty values in both the Even and Odd columns. How can I do that?
Expected Output:

+----+---+
|Even|Odd|
+----+---+
|  42| 41|
|  44| 43|
|  46| 45|
|  48| 47|
|  50| 49|
+----+---+

Solution

  • Not sure what your use case is here, but you can create separate dataframes of the even and odd values, zip them together using the RDD API, and then convert the result back to a dataframe. It's clunky, but it's not a problem that's really in Spark's wheelhouse.

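    import org.apache.spark.sql.Row
    import spark.implicits._  // already in scope in spark-shell
    
    val df = List(41,42,43,44,45,46,47,48,49,50).toDF("numbers")
    
    // Split the rows into evens and odds and drop down to the RDD API
    val evenRDD = df.where($"numbers" % 2 === 0).rdd
    val oddRDD = df.where($"numbers" % 2 === 1).rdd
    
    // Pair the rows by position, then pull the Int out of each Row
    val df2 = evenRDD.zip(oddRDD).map {
      case (x: Row, y: Row) => (x.getInt(0), y.getInt(0))
    }.toDF("even", "odd")
    
    df2.show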
    +----+---+
    |even|odd|
    +----+---+
    |  42| 41|
    |  44| 43|
    |  46| 45|
    |  48| 47|
    |  50| 49|
    +----+---+
    

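    zip only works if both RDDs have the same number of partitions and the same number of elements in each partition, so at a minimum you need equal numbers of odd and even values in your initial dataframe. If the counts differ, you'll have to make them equal by either trimming the excess from the larger set or padding the smaller one with zeroes or some other indicator of nullity.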
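
If the odd and even counts can differ, or the data is spread over several partitions, one alternative is to number the rows of each subset and join on that number instead of zipping. The following is only a minimal sketch, not the answerer's method: it assumes a local SparkSession, and the names `byValue`, `idx`, `evens`, `odds`, and `paired` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Assumption: a local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("even-odd-columns")
  .getOrCreate()
import spark.implicits._

val df = List(41, 42, 43, 44, 45, 46, 47, 48, 49, 50).toDF("numbers")

// Number the rows of each subset 1, 2, 3, ... in ascending order.
// A window without partitionBy moves all rows into a single partition,
// which is fine for small data but will not scale to large inputs.
val byValue = Window.orderBy("numbers")

val evens = df.where(col("numbers") % 2 === 0)
  .select(col("numbers").as("even"), row_number().over(byValue).as("idx"))
val odds = df.where(col("numbers") % 2 === 1)
  .select(col("numbers").as("odd"), row_number().over(byValue).as("idx"))

// A full outer join keeps unmatched rows, so the shorter column
// is padded with null rather than causing a failure.
val paired = evens.join(odds, Seq("idx"), "full_outer")
  .orderBy("idx")
  .select("even", "odd")

paired.show()
```

For the ten numbers above this should give the same five rows as the expected output. With unequal counts, the extra rows would appear with null in the shorter column, which is one way of doing the padding described above.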