scala, loops, apache-spark, key-value

Spark Scala - Split Map, Getkey and so on


I have a text file which contains the following:

A>B,C,D
B>A,C,D,E
C>A,B,D,E
D>A,B,C,E
E>B,C,D

I would like to write a Spark-Scala script to obtain the following (for each left-hand member, list all of its right-hand members):

(A,B)
(A,C)
(A,D)
(B,A)
(B,C)
(B,D)
(B,E)
...

I tried to iterate over the map and use the keys to fill a new map with my results, but it did not work.

Here is my code (more like pseudo code):

import scala.io.Source

// Loading file
val file = sc.textFile("friends.txt")

// MAP
// A;B
// A;C
// ...

var associations_persons_friends:Map[Char,Char] = Map()

var lines = file.map(line=>line.split(">"))

for (line <- lines)
{
    val person = line.key
    
    for (friend <- line.value.split(","))
    {
        associations_persons_friends += (person -> friend)
    }
}

associations_persons_friends.collect()

val rdd = sc.parallelize(associations_persons_friends)
rdd.foreach(println)


// GROUP
// For each possible pair, all associated values
// AB;B-C-D-A-C-D-E


// REDUCE
// For each pair we keep occurrences >= 2
// AB;C-D

I wonder whether it is possible to write basic code like this in Spark-Scala, because I can't find any answer to my needs on the web. Thanks for your help.


Solution

  • You can achieve your requirement with a combination of map and flatMap:

    val rdd = sc.textFile("path to the text file")
    
    rdd.map(line => line.split(">"))
      .flatMap(array => array(1).split(",").map(friend => (array(0), friend)))
      .foreach(println)
    

    You should get the following output:

    (A,B)
    (A,C)
    (A,D)
    (B,A)
    (B,C)
    (B,D)
    (B,E)
    (C,A)
    (C,B)
    (C,D)
    (C,E)
    (D,A)
    (D,B)
    (D,C)
    (D,E)
    (E,B)
    (E,C)
    (E,D)
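
    If you want to reuse the pairs for further processing rather than only print them, you can keep them in a value (the name pairs here is just an example):

    val pairs = rdd
      .map(line => line.split(">"))
      .flatMap(array => array(1).split(",").map(friend => (array(0), friend)))

    pairs.collect().foreach(println)   // the same pairs, collected back to the driver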
    

    I hope the answer is helpful.
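
  • The question also sketches GROUP and REDUCE steps: for each pair of persons, keep only the friends that appear in both of their lists, i.e. their mutual friends. A minimal sketch of one possible continuation, assuming the same friends.txt input and reusing the parsing idea above (only an illustration of the idea, not a tested solution):

    // Parse each line into (person, set of that person's friends)
    val friendSets = rdd.map { line =>
      val Array(person, friends) = line.split(">")
      (person, friends.split(",").toSet)
    }

    // GROUP: key every (person, friend) relation by the pair in a canonical
    // order, carrying the person's full friend set
    val grouped = friendSets.flatMap { case (person, friends) =>
      friends.map { friend =>
        val key = if (person < friend) (person, friend) else (friend, person)
        (key, friends)
      }
    }

    // REDUCE: a friend that occurs at least twice for a pair is present in both
    // sets, so intersecting the two sets per pair yields the mutual friends
    val mutualFriends = grouped.reduceByKey(_ intersect _)

    mutualFriends.collect().foreach(println)   // e.g. ((A,B),Set(C, D))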