I am trying to read a text file into an RDD.
My sample data is below:
"1" "Hai How are you!" "56"
"2" "0213"
There are 3 columns separated by a tab delimiter. The data itself can also contain the same delimiter (e.g. Hai\tHow are you!). Can someone help me parse the data properly in PySpark?
my_Rdd = spark.sparkContext.textFile("Mytext.txt").map(lambda line: line.split('\t'))
When I run the above code I get the output below:
ColA,ColB,Colc
"1","Hai,How are you!"
"2","0123"
The 2nd column is split into the 3rd because the data itself contains the delimiter, and in the 2nd row the 3rd value is mapped to the 2nd column.
My expected output is:
ColA,ColB,Colc
"1","Hai How are you!","56"
"2",,"0123"
With a DataFrame I can set the quote option, but how can I do the same with an RDD?
Use shlex.split(), which does not split on delimiters that appear inside quotes:
import shlex
sc.textFile('Mytext.txt').map(lambda line: shlex.split(line))
Another example, using an in-memory string instead of a file:
import shlex
rdd = sc.parallelize(['"1"\t"Hai\tHow are you!"\t"56"']).map(lambda line: shlex.split(line))
>>> rdd.collect()
[['1', 'Hai\tHow are you!', '56']]
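One caveat: shlex.split() splits on any run of whitespace, so an empty column (two tabs in a row) is dropped rather than returned as an empty string, and unquoted spaces would also split a field. If you need the empty 2nd column from the expected output, a possible alternative is the standard-library csv module with a tab delimiter. This is only a sketch: parse_line is a hypothetical helper name, and I am assuming the fields are quoted with double quotes and that no field contains a newline.

```python
import csv

def parse_line(line, delimiter="\t", quotechar='"'):
    """Split one line into fields, keeping quoted delimiters and empty fields."""
    # csv.reader accepts any iterable of strings, so wrap the single line in a list.
    # Fall back to an empty list if the reader produces no row.
    return next(csv.reader([line], delimiter=delimiter, quotechar=quotechar), [])

rows = [
    '"1"\t"Hai\tHow are you!"\t"56"',  # tab embedded inside a quoted field
    '"2"\t\t"0123"',                    # empty second column
]
parsed = [parse_line(r) for r in rows]
print(parsed)

# With Spark, the same function would be passed to map, e.g.:
# sc.textFile('Mytext.txt').map(parse_line)
```

Here the second row comes back as ['2', '', '0123'], so the values stay in their own columns.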