Tags: apache-spark, rdd

Convert an RDD of strings into individual elements using the split function


Suppose I have an RDD of strings, created as:

inputRDD=sc.parallelize('2596,51,3,258,0,510,221,232,148,6279,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5')

I want to convert this RDD as:

inputRDD= [2596, 51, 3,.....]  

I implemented the following code:

inputRDD.flatMap(lambda line: line.split(',')).collect()

But I am getting the output as:

['2',
 '5',
 '9',
 '6',
 '',
 '',
 '5',
 '1',
 '',
 '',
 '3',
 '',
 '',
 '2',
 '5',
 '8',
 '',
 '',
 '0',
 '',
 '',
 '5',
 '1',
 '0',
 '',....]  

May I know where I am going wrong in my code?


Solution

  • The problem actually lies in the RDD creation. All you need to do is wrap the input string in a list when passing it to the parallelize method, as shown here:

    inputRDD=sc.parallelize(['2596,51,3,258,0,510,221,232,148,6279,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5'])
    

    The rest of the code will then work as expected.

    What was happening before is that sc.parallelize treated the bare string as a sequence of characters, so each character became a separate RDD element, i.e. a new row. Splitting a one-character element on ',' then simply returned that character, while each ',' element split into two empty strings, which is exactly the output shown above. A short end-to-end sketch follows below.
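
For completeness, here is a minimal end-to-end sketch. It assumes a local SparkContext named sc and uses a shorter sample string purely for illustration. Note also that split(',') yields strings, so if you want the integer list [2596, 51, 3, ...] from the question, an extra map(int) is needed:

    from pyspark import SparkContext

    # Assumed local setup; in a notebook or shell, sc may already exist.
    sc = SparkContext('local', 'split-example')

    # A bare string is distributed character by character:
    sc.parallelize('2596,51').collect()
    # ['2', '5', '9', '6', ',', '5', '1']

    # Wrapping the string in a list makes it a single RDD element:
    inputRDD = sc.parallelize(['2596,51,3,258,0,510'])
    inputRDD.flatMap(lambda line: line.split(',')).collect()
    # ['2596', '51', '3', '258', '0', '510']

    # Cast each token to int to obtain the desired integer output:
    inputRDD.flatMap(lambda line: line.split(',')).map(int).collect()
    # [2596, 51, 3, 258, 0, 510]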