Suppose I create an RDD from a string as follows:
inputRDD=sc.parallelize('2596,51,3,258,0,510,221,232,148,6279,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5')
I want to convert this RDD as:
inputRDD= [2596, 51, 3,.....]
I implemented the following code:
inputRDD.flatMap(lambda line: line.split(',')).collect()
But I am getting this output:
['2', '5', '9', '6', '', '', '5', '1', '', '', '3', '', '', '2', '5', '8', '', '', '0', '', '', '5', '1', '0', '', ...]
Where am I going wrong in my code?
The problem actually lies in the RDD creation, not in the flatMap. All you need to do is wrap the input string in a list before passing it to the parallelize method, like this:
inputRDD=sc.parallelize(['2596,51,3,258,0,510,221,232,148,6279,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5'])
The rest of your code will then work as expected.
What was happening before is that parallelize iterates over its argument, and iterating over a bare string yields individual characters, so each character became a separate RDD element (row). flatMap then called split(',') on each one-character row: ordinary characters passed through unchanged, while each ',' row split into two empty strings, which is why the output is full of ''.
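You can reproduce the same behavior in plain Python without Spark, since parallelize simply iterates over whatever you give it (the shortened sample string below is just for illustration):

```python
# Plain-Python analogy for parallelize + flatMap(lambda line: line.split(','))
data = '2596,51,3,258'  # shortened sample of the original input

# Without the list wrapper: iterating a string yields characters,
# so each character becomes one "row".
rows_without_list = list(data)
# split(',') on a single character: ',' splits into ['', ''],
# every other character passes through unchanged.
flat_without_list = [field for row in rows_without_list
                     for field in row.split(',')]
# → ['2', '5', '9', '6', '', '', '5', '1', '', '', '3', '', '', '2', '5', '8']

# With the list wrapper: the whole string is a single "row",
# and split(',') produces the intended fields.
rows_with_list = [data]
flat_with_list = [field for row in rows_with_list
                  for field in row.split(',')]
# → ['2596', '51', '3', '258']
```

If you also want numeric values rather than strings, you can map int over the result afterwards, e.g. inputRDD.flatMap(lambda line: line.split(',')).map(int).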