Create multiple RDDs from single file based on row value ( header record in sample file) using Spark scala

I am trying to create multiple RDDs to process independently from below file based on the similar format of data .

Please find the file with different data formats

custid,starttime,rpdid,catry,auapp,sppp,retatype,status,process,fileavil
4fgdfg,00:56:30.034,BM_-unit1,GEN,TRUE,FALSE,NONE,A,45,TRUE
X95GEK,00:56:32.083,CBM_OMDD_RSVCM0CBM-unit0,GEN,TRUE,FALSE,NONE,A,GWC,TRUE
XWZ08K,00:57:01.947,GWC-0-UNIT-1,GEN,TRUE,FALSE,NONE,A,GWC,TRUE
custid,relstatus
fg3-03,R
dfsdf4-01,V
56fbfg,R
devid,reg,hold,devbrn,lname,lcon
CTUTANCM0CBM,TRUE,FALSE,13:17:36.934,CBM_BMI_25_5_2,13:43:21.370

In the above file, we have three different type of data formats exist and I want to split the file into three different RDDs as per the format.

Could you please suggest how to implement using Spark (Scala)?

Solution

Your file looks like it has 3 different csv files in it.

You can read it as a single file and extract 3 RDDs from it based on the number of fields you have in each row.

// Caching because you'll be filtering it thrice
val topRdd = sc.textFile("file").cache
topRdd.count
//res0: Long = 10

val rdd1 = topRdd.filter(_.split(",", -1).length == 10 )
val rdd2 = topRdd.filter(_.split(",", -1).length ==  2 )
val rdd3 = topRdd.filter(_.split(",", -1).length ==  6 )

rdd1.collect.foreach(println)
// custid,starttime,rpdid,catry,auapp,sppp,retatype,status,process,fileavil
// 4fgdfg,00:56:30.034,BM_-unit1,GEN,TRUE,FALSE,NONE,A,45,TRUE
// X95GEK,00:56:32.083,CBM_OMDD_RSVCM0CBM-unit0,GEN,TRUE,FALSE,NONE,A,GWC,TRUE
// XWZ08K,00:57:01.947,GWC-0-UNIT-1,GEN,TRUE,FALSE,NONE,A,GWC,TRUE

rdd2.collect.foreach(println)
// custid,relstatus
// fg3-03,R
// dfsdf4-01,V
// 56fbfg,R

rdd3.collect.foreach(println)
// devid,reg,hold,devbrn,lname,lcon
// CTUTANCM0CBM,TRUE,FALSE,13:17:36.934,CBM_BMI_25_5_2,13:43:21.370