Tags: python, apache-spark, pyspark, apache-spark-sql, rdd

How to enable multiline reading of a csv file in pyspark


I am reading a CSV file with PySpark. It is a caret-delimited file with 5 columns, of which I need only 3.

rdd = sc.textFile("test.csv").map(lambda x: x.split("^")).filter(lambda x: len(x) > 1).map(lambda x: (x[0], x[2], x[3]))

print(rdd.take(5))

As shown below, the 4th record in the CSV file contains multiline data in its second-to-last column. Because of this, although the file holds only 5 records, Spark treats it as 6, and I get an index out of range error.

Data in file.csv:

a1^b1^c1^d1^e1
a2^b2^c2^d2^e2
a3^b3^c3^d3^e3
a4^b4^c4^d4 is 
multiline^e4
a5^b5^c5^d5^e5

How can I enable multiline reading while creating the RDD through sc.textFile()?


Solution

  • From my analysis, this cannot be done through sc.textFile(). As soon as the file is loaded into an RDD, each physical line of the file becomes a separate element, so the lines of a multiline record have already been split into different records at that point. Multiline handling therefore cannot be achieved with sc.textFile(). A possible workaround is sketched below.
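  • One workaround is sc.wholeTextFiles(), which loads each file as a single (path, content) pair, so the embedded newlines survive and you can re-join the broken lines yourself. Below is a minimal sketch, assuming the file fits in memory on a single executor and every logical record contains exactly 4 caret delimiters (5 fields); the merge_multiline helper is illustrative, not part of any Spark API.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def merge_multiline(text):
    # Re-join physical lines into logical records: a line that has not
    # yet accumulated 4 carets is treated as a continuation of the
    # current record.
    records, buf = [], ""
    for line in text.splitlines():
        buf = line if not buf else buf + "\n" + line
        if buf.count("^") >= 4:   # record is complete (5 fields)
            records.append(buf)
            buf = ""
    if buf:                       # flush a trailing partial record, if any
        records.append(buf)
    return records

whole = sc.wholeTextFiles("test.csv")   # RDD of (path, full file content)
rdd = (whole.flatMap(lambda kv: merge_multiline(kv[1]))
            .map(lambda rec: rec.split("^"))
            .map(lambda x: (x[0], x[2], x[3])))

print(rdd.take(5))

  • Note that wholeTextFiles() gives up line-level parallelism, so this approach suits small-to-medium files. For large files whose multiline fields are quoted, the DataFrame reader's multiLine option (spark.read.option("multiLine", True).csv(...)) is the usual route, but it relies on the multiline values being enclosed in quotes, which is not the case in this data.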