Search code examples
sqlscalaapache-sparkapache-spark-dataset

create rows from columns in a apache spark dataset


I'm trying from a dataset to create a row from existing columns. Here is my case:

InputDataset

accountid payingaccountid billedaccountid startdate enddate
0011t00000MY1U3AAL 0011t00000MY1U3XXX 0011t00000ZZ1U3AAL 2020-06-10 00:00:00.000000 NULL

And I would like to have sometthing like this

accountid startdate enddate
0011t00000MY1U3AAL 2021-06-10 00:00:00.000000 NULL
0011t00000MY1U3XXX 2021-06-10 00:00:00.000000 NULL
0011t00000ZZ1U3AAL 2021-06-10 00:00:00.000000 NULL

In the input dataset the columns billedaccounid and payingaccounid are now also in accountid column.

Thank you in advance.


Solution

  • You can create an array column with 3 id columns then use explode function to get desired result.

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    import spark.implicits._
    
    val df = List(("0011t00000MY1U3AAL", "0011t00000MY1U3XXX", "0011t00000ZZ1U3AAL", "2020-06-10 00:00:00.000000", "NULL"))
        .toDF("accountid", "payingaccountid", "billedaccountid", "startdate", "enddate")
    
    df.select(
        explode_outer(array("accountid","payingaccountid","billedaccountid")).as("accountid"),
        'startdate,'enddate).show(false)
    /*
    +------------------+--------------------------+-------+
    |accountid         |startdate                 |enddate|
    +------------------+--------------------------+-------+
    |0011t00000MY1U3AAL|2020-06-10 00:00:00.000000|NULL   |
    |0011t00000MY1U3XXX|2020-06-10 00:00:00.000000|NULL   |
    |0011t00000ZZ1U3AAL|2020-06-10 00:00:00.000000|NULL   |
    +------------------+--------------------------+-------+ */