scala, apache-spark, rdd

Create multiple RDDs from a single file based on row value (header record in sample file) using Spark Scala


I am trying to create multiple RDDs from the file below, one per data format, so that each can be processed independently.

Here is the file, which mixes several data formats:

custid,starttime,rpdid,catry,auapp,sppp,retatype,status,process,fileavil
4fgdfg,00:56:30.034,BM_-unit1,GEN,TRUE,FALSE,NONE,A,45,TRUE
X95GEK,00:56:32.083,CBM_OMDD_RSVCM0CBM-unit0,GEN,TRUE,FALSE,NONE,A,GWC,TRUE
XWZ08K,00:57:01.947,GWC-0-UNIT-1,GEN,TRUE,FALSE,NONE,A,GWC,TRUE
custid,relstatus
fg3-03,R
dfsdf4-01,V
56fbfg,R
devid,reg,hold,devbrn,lname,lcon
CTUTANCM0CBM,TRUE,FALSE,13:17:36.934,CBM_BMI_25_5_2,13:43:21.370

The file above contains three different data formats, and I want to split it into three RDDs, one per format.

Could you please suggest how to implement this using Spark (Scala)?


Solution

  • Your file looks like three different CSV files concatenated into one.

    You can read it as a single file and extract three RDDs from it based on the number of fields in each row. This works here because each format has a distinct field count (10, 2, and 6).

    // Cache the RDD because it is filtered three times below
    val topRdd = sc.textFile("file").cache()
    topRdd.count()
    // res0: Long = 10
    
    // The -1 limit makes split keep trailing empty fields,
    // so a row that ends in an empty value is still counted correctly
    val rdd1 = topRdd.filter(_.split(",", -1).length == 10)
    val rdd2 = topRdd.filter(_.split(",", -1).length == 2)
    val rdd3 = topRdd.filter(_.split(",", -1).length == 6)
    
    rdd1.collect.foreach(println)
    // custid,starttime,rpdid,catry,auapp,sppp,retatype,status,process,fileavil
    // 4fgdfg,00:56:30.034,BM_-unit1,GEN,TRUE,FALSE,NONE,A,45,TRUE
    // X95GEK,00:56:32.083,CBM_OMDD_RSVCM0CBM-unit0,GEN,TRUE,FALSE,NONE,A,GWC,TRUE
    // XWZ08K,00:57:01.947,GWC-0-UNIT-1,GEN,TRUE,FALSE,NONE,A,GWC,TRUE
    
    rdd2.collect.foreach(println)
    // custid,relstatus
    // fg3-03,R
    // dfsdf4-01,V
    // 56fbfg,R
    
    rdd3.collect.foreach(println)
    // devid,reg,hold,devbrn,lname,lcon
    // CTUTANCM0CBM,TRUE,FALSE,13:17:36.934,CBM_BMI_25_5_2,13:43:21.370
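
Each of the three RDDs above still contains its header row. If the data rows need to be processed on their own, one possible follow-up is sketched below. It is only an illustration under some assumptions: `section` is a hypothetical helper (not a Spark API), the header is assumed to be the first row with a given field count, and the sample rows are supplied through `sc.parallelize` as a stand-in for `sc.textFile("file")` so the snippet runs without the file. In spark-shell you would reuse the existing `sc` rather than build a session.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Local session for illustration only; spark-shell already provides `sc`.
val spark = SparkSession.builder().master("local[*]").appName("split-by-format").getOrCreate()
val sc = spark.sparkContext

// Stand-in for sc.textFile("file"): the sample rows from the question.
val topRdd: RDD[String] = sc.parallelize(Seq(
  "custid,starttime,rpdid,catry,auapp,sppp,retatype,status,process,fileavil",
  "4fgdfg,00:56:30.034,BM_-unit1,GEN,TRUE,FALSE,NONE,A,45,TRUE",
  "X95GEK,00:56:32.083,CBM_OMDD_RSVCM0CBM-unit0,GEN,TRUE,FALSE,NONE,A,GWC,TRUE",
  "XWZ08K,00:57:01.947,GWC-0-UNIT-1,GEN,TRUE,FALSE,NONE,A,GWC,TRUE",
  "custid,relstatus",
  "fg3-03,R",
  "dfsdf4-01,V",
  "56fbfg,R",
  "devid,reg,hold,devbrn,lname,lcon",
  "CTUTANCM0CBM,TRUE,FALSE,13:17:36.934,CBM_BMI_25_5_2,13:43:21.370"
)).cache()

// Hypothetical helper: keep the rows with `fieldCount` fields, treat the first
// such row as the header, and split the remaining rows into fields.
def section(lines: RDD[String], fieldCount: Int): (Array[String], RDD[Array[String]]) = {
  val rows = lines.filter(_.split(",", -1).length == fieldCount)
  val header = rows.first() // assumption: the header precedes its data rows
  val data = rows.filter(_ != header).map(_.split(",", -1))
  (header.split(",", -1), data)
}

val (header1, data1) = section(topRdd, 10)
val (header2, data2) = section(topRdd, 2)
val (header3, data3) = section(topRdd, 6)

// Print the data rows of the second format, fields joined with "|"
data2.collect().foreach(fields => println(fields.mkString("|")))
```

Note that `first()` throws if no row has the requested field count, and that this approach, like the filters above, cannot tell apart two formats that happen to have the same number of fields.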