scala, apache-spark, apache-spark-sql, rdd, apache-spark-dataset

Process each row to get date


I have a file with a Year column and monthly columns MON01, MON02, ...


The month can be extracted from the last two characters of the column name (e.g. 01 from MON01). The length of the text value in each month column (MON01, MON02, ...) equals the number of days in that month. I need to derive a date for every position where a 1 occurs.

For example:

2018-01-02 (day 02 because a 1 occurs at the 2nd position)
2018-01-03
2018-01-07

How can this be done in Spark with Scala?
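To make the expected mapping concrete, here is a minimal sketch of the position-to-date logic for a single value (the variable names are illustrative only, not from the original post):

    // "01110" in MON01 of 2018 -> a 1 at positions 2, 3 and 4 -> days 02, 03, 04
    val year  = "2018"
    val month = "01"          // last two characters of "MON01"
    val value = "01110"       // one character per day of the month
    val dates = value.zipWithIndex.collect {
      case ('1', idx) => f"$year-$month-${idx + 1}%02d"
    }
    // dates contains "2018-01-02", "2018-01-03", "2018-01-04"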


Solution

  •     UDF:

        import scala.collection.mutable.ListBuffer

        // For each month column: emit "year-month-day" for every '1' in the
        // value string, where the day is the character's 1-based position.
        def myudf = (month: String, year: String, value: String) => {
          val month1 = month.replaceAll("[A-Za-z]+", "")   // e.g. "Mon01" -> "01"
          var date = 1
          val dateList = ListBuffer[String]()
          for (char <- value) {
            if (char == '1') {
              dateList += year + "-" + month1 + "-" + date
            }
            date += 1
          }
          dateList.toList
        }
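        // Quick check (illustrative, not from the original post):
        //   myudf("Mon01", "2018", "01110")  ==> List("2018-01-2", "2018-01-3", "2018-01-4")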
        // Main method: read the CSV with a header row.
        import org.apache.spark.sql.functions.{lit, udf}

        val data = spark.read.option("header", "true").csv("data.csv")
        data.show()
        +----+-----+-----+
        |Year|Mon01|Mon02|
        +----+-----+-----+
        |2018|01110|00111|
        |2019|01100|00001|
        +----+-----+-----+
        // Wrap the function as a Spark UDF and apply it once per month column.
        val myCustomUdf = udf(myudf)
        val monthCols = data.columns.drop(1)   // every column except "Year"
        val requiredDF = monthCols.foldLeft(data) {
          case (df, month) =>
            df.withColumn("Date_" + month, myCustomUdf(lit(month), data("Year"), data(month)))
        }
        requiredDF.show(false)
        +----+-----+-----+---------------------------------+---------------------------------+
        |Year|Mon01|Mon02|Date_Mon01                       |Date_Mon02                       |
        +----+-----+-----+---------------------------------+---------------------------------+
        |2018|01110|00111|[2018-01-2, 2018-01-3, 2018-01-4]|[2018-02-3, 2018-02-4, 2018-02-5]|
        |2019|01100|00001|[2019-01-2, 2019-01-3]           |[2019-02-5]                      |
        +----+-----+-----+---------------------------------+---------------------------------+
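    The question's expected output lists one date per line; if that flat shape is needed, the per-month array columns could be flattened with explode. A minimal sketch building on the requiredDF and monthCols from above (this step is not part of the original answer):

        import org.apache.spark.sql.functions.{col, explode}

        // One row per date: explode each Date_MonXX array and union the results.
        val datesDF = monthCols
          .map(m => requiredDF.select(col("Year"), explode(col("Date_" + m)).as("Date")))
          .reduce(_ union _)
        datesDF.show(false)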
    

    I hope this will help you.