Tags: scala, apache-spark, string-formatting, rdd

How to create an RDD[String] by selecting specific fields from an existing RDD?


I have a scenario where I need to capture some (not all) of the data from an existing RDD and pass it to another Scala class for the actual operations. Consider this example data (empnum, empname, emplocation, empsal) in a text file:

11,John,Paris,1000
12,Daniel,UK,3000 

As a first step, I create an RDD[String] with the code below:

val empRDD = spark
  .sparkContext
  .textFile("empInfo.txt")

My requirement is to create another RDD containing only empnum, empname and emplocation, again as an RDD[String]. I tried the code below, but it gives me an RDD[(String, String, String)], i.e. an RDD of tuples, instead.

val empReqRDD = empRDD
  .map(a=> a.split(","))
  .map(x=> (x(0), x(1), x(2)))

I have also tried slice, but that gives me an RDD[Array[String]]. The resulting RDD must be an RDD[String] so that I can pass it to the Scala class that performs the operations.

The expected output is:

11,John,Paris
12,Daniel,UK

Can anyone help me achieve this?


Solution

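  • I would try this: keep the tuple step you already have, then format each tuple back into a single comma-separated String.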

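    // Split each line into fields and keep the first three as a tuple.
    val empReqRDD = empRDD
      .map(a => a.split(","))
      .map(x => (x(0), x(1), x(2)))

    // Format each tuple as one comma-separated String, giving an RDD[String].
    val rddString = empReqRDD.map { case (id, name, city) => "%s,%s,%s".format(id, name, city) }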
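
As a possible alternative to going through a tuple, the same result can probably be reached by splitting each line, taking the first few fields, and joining them again with mkString. The sketch below is plain Scala so it can be run without a Spark session. The object name EmpProjection and the helper firstFields are hypothetical names introduced here for illustration; they are not part of the original answer. The trim call is an extra assumption, added only to tolerate stray whitespace in the input.

```scala
// Hypothetical helper (not from the original answer): keep the first `n`
// comma-separated fields of a line and join them back into one String.
object EmpProjection {
  def firstFields(line: String, n: Int): String =
    line
      .split(",")      // Array[String] of all fields
      .map(_.trim)     // assumption: strip stray whitespace around fields
      .take(n)         // keep only the first n fields
      .mkString(",")   // join back into a single String

  def main(args: Array[String]): Unit = {
    // Sample lines copied from the question's example file.
    val sample = Seq("11,John,Paris,1000", "12,Daniel,UK,3000")
    sample.map(firstFields(_, 3)).foreach(println)
  }
}
```

With the RDD from the question, this would be applied as empRDD.map(EmpProjection.firstFields(_, 3)), which should yield an RDD[String] directly, since each input String is mapped to exactly one output String. Because take(n) works for any n, the number of retained columns can be changed without rewriting a tuple pattern.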