scala mapreduce amazon-emr information-retrieval

How can I save the file name in tuples in Scala?

I have a folder that contains many text files. I have to read these files into one RDD and save the file name together with each word in it.

Example:

doc1.txt:
" hello my name sam "

doc2.txt:
"hello world"

I need to pass the folder path and get results like:

(hello,doc1), (my,doc1), (world,doc2), ... etc.

I tried this :

val rddWhole = spark.sparkContext.wholeTextFiles("C:/tmp/files/*")
rddWhole.foreach(f => {
  println(f._1 + "=>" + f._2)
})

but it treats the whole text of each file as one string. Does anyone have an idea how I can solve this?

Solution

As I understand it, you want to extract every word in a file and pair it with the name of the file that contains it. As you mentioned, Spark's wholeTextFiles gives you the whole content of a file as a single string. For example, if this is the file content:

hello
my name    is
John Doe

The value you get would be:

val fileString = "hello\nmy name    is\nJohn Doe"

So you need to split the string on any run of whitespace (spaces, tabs, or newlines), like so:

val wordsSeparated = fileString.split("\\s+") // in a regex, \s matches any whitespace character, newlines included, so a separate \n+ alternative is not needed
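
One caveat with a plain split: if the content starts with whitespace, as in the question's " hello my name sam ", the first element of the result is an empty string. Below is a minimal sketch of one way to handle that. The name `tokenize` is a hypothetical helper chosen for illustration, not part of Spark or the original answer.

```scala
// Hypothetical helper: split text on runs of whitespace and drop empty tokens.
// A leading space would otherwise produce "" as the first element.
def tokenize(content: String): Seq[String] =
  content.split("\\s+").filter(_.nonEmpty).toSeq

// The doc1.txt content from the question, with its leading and trailing spaces.
val words = tokenize(" hello my name sam ")
// words contains: hello, my, name, sam
```

Filtering after the split, rather than trimming first, also covers content that consists only of whitespace.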

So in the end, you'll need something like this:

rddWhole.foreach { f =>
  // f._1 is the file path, f._2 is the whole file content
  f._2.split("\\s+").foreach(word => println(f._1 + " => " + word))
}

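With two sample files, the result looks like this (the order is not guaranteed, because the files are processed in parallel):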

file:/tmp/spark-test/two.txt => and
file:/tmp/spark-test/two.txt => this
file:/tmp/spark-test/two.txt => would
file:/tmp/spark-test/one.txt => so
file:/tmp/spark-test/one.txt => hello
file:/tmp/spark-test/one.txt => my
file:/tmp/spark-test/one.txt => name
file:/tmp/spark-test/one.txt => is
file:/tmp/spark-test/one.txt => John
file:/tmp/spark-test/one.txt => Doe
file:/tmp/spark-test/two.txt => be
file:/tmp/spark-test/two.txt => the
file:/tmp/spark-test/two.txt => second
file:/tmp/spark-test/two.txt => text
file:/tmp/spark-test/two.txt => file
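
The question asks for tuples such as (hello,doc1) rather than printed lines. Here is a sketch of how that could look, assuming the keys returned by wholeTextFiles are full paths separated by "/" (as in the output above) and that the wanted document name is the file name without its extension. `docName` and `wordDocPairs` are hypothetical helper names introduced for illustration.

```scala
// Hypothetical helper: reduce "file:/C:/tmp/files/doc1.txt" to "doc1".
def docName(path: String): String = {
  val fileName = path.substring(path.lastIndexOf('/') + 1)
  val dot = fileName.lastIndexOf('.')
  // Keep the full name when there is no extension (or only a leading dot).
  if (dot > 0) fileName.substring(0, dot) else fileName
}

// Hypothetical helper: pair every word of one file with that file's short name.
def wordDocPairs(path: String, content: String): Seq[(String, String)] = {
  val doc = docName(path)
  content.split("\\s+").filter(_.nonEmpty).map(word => (word, doc)).toSeq
}

// Applied to the RDD from the question (requires a Spark session):
//   val pairs = rddWhole.flatMap { case (path, content) => wordDocPairs(path, content) }
// pairs would then be an RDD[(String, String)] of (word, document name) tuples.
```

Using flatMap instead of foreach keeps the result as an RDD that can be transformed further. Note also that println inside foreach runs on the executors, so on a cluster the output goes to the executor logs rather than the driver console.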