Tags: python, apache-spark, pyspark, rdd

How to read a delimited file using a Spark RDD, when the actual data contains the same delimiter


I am trying to read a text file into an RDD.

My sample data is below:

"1" "Hai    How are you!"   "56"
"2"                         "0213"

3 columns with a tab delimiter. My data also contains the embedded delimiter (`Hai\tHow are you!`). Can someone help me parse this data properly in PySpark?

my_Rdd = spark.sparkContext.textFile("Mytext.txt").map(lambda line: line.split('\t'))

When I run the above code I get the output below:

ColA,ColB,Colc
"1","Hai,How are you!"
"2","0123"

The 2nd column gets split into a 3rd because it contains the same delimiter in the actual data, and for the 2nd row the 3rd value gets mapped to the 2nd column.
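To see why, here is a minimal demo of what a plain tab split does to the first sample row (pure Python, no Spark needed):

```python
line = '"1"\t"Hai\tHow are you!"\t"56"'

# split('\t') cuts on every tab, including the one inside the quotes,
# so the quoted second column is broken into two tokens:
tokens = line.split('\t')
print(tokens)
# ['"1"', '"Hai', 'How are you!"', '"56"']
```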

My expected output is:

ColA,ColB,Colc
"1","Hai    How are you!","56"
"2",,"0123"

With a DataFrame I can set the quote option, but how can we do the same with an RDD?


Solution

  • Use shlex.split(), which does not split on delimiters that appear inside quotes:

    import shlex
    
    sc.textFile('Mytext.txt').map(lambda line: shlex.split(line))
    

    Another example, with an inline string:

    import shlex
    
    rdd = sc.parallelize(['"1"\t"Hai\tHow are you!"\t"56"']).map(lambda line: shlex.split(line))
    
    >>> rdd.collect()
    [['1', 'Hai\tHow are you!', '56']]
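One caveat: `shlex.split()` splits on runs of whitespace, so the empty middle column in the 2nd sample row simply disappears instead of producing the expected empty value for ColB. A sketch using the stdlib `csv` module instead, which also honours the quotes but preserves empty fields (the filename `Mytext.txt` is the one from the question):

```python
import csv

def parse_line(line):
    """Split one tab-delimited line, honouring double quotes.

    csv.reader expects an iterable of lines, so we wrap the single
    line in a list and take the first (only) parsed row.
    """
    return next(csv.reader([line], delimiter='\t', quotechar='"'))

# Embedded tab inside quotes is kept as part of the field:
print(parse_line('"1"\t"Hai\tHow are you!"\t"56"'))
# ['1', 'Hai\tHow are you!', '56']

# An empty middle column is preserved, unlike with shlex.split():
print(parse_line('"2"\t\t"0213"'))
# ['2', '', '0213']

# It plugs into the RDD pipeline the same way:
#   sc.textFile('Mytext.txt').map(parse_line)
```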