
Splitting an RDD row into different columns in PySpark


This is a continuation of my previous question.

I am trying to find the index of 'e' in the following RDD using PySpark:

['a,b,c,d,e,f']

I am using the method:

rdd.zipWithIndex().lookup('e')

But I get []

since the RDD is actually in the form [['a,b,c,d,e,f']] (a list nested inside the RDD).

I tried

rdd.flatMap(lambda x: x)

so that I can use lookup to get the index, but I am still getting [].
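What `flatMap(lambda x: x)` does here can be reproduced in plain Python (no Spark needed): flattening one level of the nested list still leaves the comma-separated string as a single element, which is why the lookup finds nothing. A minimal local sketch of the same steps:

```python
# Local stand-in for the nested RDD contents: [['a,b,c,d,e,f']]
data = [['a,b,c,d,e,f']]

# flatMap(lambda x: x) flattens exactly one level of nesting
flat = [item for sublist in data for item in sublist]
print(flat)  # ['a,b,c,d,e,f'] -- still one string, not six values

# zipWithIndex() + lookup('e') equivalent: no element equals 'e'
indexed = list(zip(flat, range(len(flat))))
result = [i for value, i in indexed if value == 'e']
print(result)  # []
```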

How do I get the RDD into the form:

['a','b','c','d','e','f']

so that I can use this method:

    rdd.zipWithIndex().lookup('e')

Solution

  • The issue is that the whole string is being treated as a single element:

    ['a,b,c,d,e,f']
    

    Here, a,b,c,d,e,f is all one string, so no element of the RDD equals 'e'. You need to split it into separate rows of the RDD. You can use flatMap with split(",") to turn the string into individual RDD rows, and then apply zipWithIndex() and lookup():

    print(rdd.flatMap(lambda x: x.split(",")).zipWithIndex().lookup("e"))
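    The same flatten-split-index logic can be checked without a cluster. A minimal pure-Python equivalent of the pipeline above (local lists standing in for the RDD):

    ```python
    # Local stand-in for the RDD contents
    data = ['a,b,c,d,e,f']

    # flatMap(lambda x: x.split(",")): split each element, flatten one level
    flat = [token for element in data for token in element.split(",")]
    print(flat)  # ['a', 'b', 'c', 'd', 'e', 'f']

    # zipWithIndex(): pair each element with its position
    indexed = list(zip(flat, range(len(flat))))

    # lookup("e"): collect every index paired with the key 'e'
    result = [i for value, i in indexed if value == "e"]
    print(result)  # [4]
    ```

    With the string split into six separate rows, zipWithIndex() assigns 'e' the index 4, and lookup("e") returns [4].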