This is a continuation of my previous question.
I am trying to find the index of 'e' in the following RDD using PySpark:
['a,b,c,d,e,f']
I am using the method:
rdd.zipWithIndex().lookup('e')
But I get []
because the RDD is in the form: [ ['a,b,c,d,e,f']
I tried
rdd.flatMap(lambda x: x)
so that I can use lookup to get the index, but I am still getting []
Please help me. How do I get the RDD into this form:
['a','b','c','d','e','f']
so that I can call:
rdd.zipWithIndex().lookup('e')
The issue is that the whole string is stored as a single element of the RDD:
['a,b,c,d,e,f']
Here a,b,c,d,e,f is treated as one string, so 'e' never appears as an element of its own. You need to split the string into separate rows of the RDD. You can use flatMap with split(",") to do that, and then call zipWithIndex() and lookup():
print(rdd.flatMap(lambda x: x.split(",")).zipWithIndex().lookup("e"))
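To see why the split matters without starting a Spark session, here is a minimal plain-Python sketch that imitates the three RDD operations on an ordinary list. The helper names flat_map, zip_with_index and lookup are hypothetical stand-ins written for this illustration. They are not part of the PySpark API and only approximate its behaviour on a single in-memory list.

```python
# Plain-Python stand-ins for the RDD operations (hypothetical helper names,
# not part of the PySpark API).

def flat_map(rows, fn):
    """Apply fn to each row and flatten the results, like RDD.flatMap."""
    return [item for row in rows for item in fn(row)]

def zip_with_index(rows):
    """Pair each element with its position, like RDD.zipWithIndex."""
    return [(value, index) for index, value in enumerate(rows)]

def lookup(pairs, key):
    """Return all values stored under key, like RDD.lookup."""
    return [value for k, value in pairs if k == key]

rows = ['a,b,c,d,e,f']  # one element: the whole comma-separated string

# Without splitting, the only key is the full string, so 'e' is not found.
unsplit = lookup(zip_with_index(rows), 'e')

# Splitting on commas first gives one element per letter.
letters = flat_map(rows, lambda x: x.split(','))
split_result = lookup(zip_with_index(letters), 'e')

print(unsplit)       # []
print(letters)       # ['a', 'b', 'c', 'd', 'e', 'f']
print(split_result)  # [4]
```

The index is zero-based, so 'e' ends up at position 4, which is what the PySpark one-liner above should print as well.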