apache-spark · pyspark · rdd · bcp

How to read a text file separated by multiple characters in PySpark?


I have a file in .bcp format and am trying to read it. The rows are separated by "|;;|", and a row may extend over several lines in the file.

rdd = sc.textFile("test.bcp") splits the file into lines, but I need it split on "|;;|". How can I do this without changing the Hadoop configuration?

Example .bcp:

A1|;|B1|;|C1|;|
D1|;;|A2|;|B2|;|
C2|;|D2|;;|

should be converted to: [["A1", "B1", "C1", "D1"], ["A2", "B2", "C2", "D2"]]


Solution

  • For a custom delimiter with multiple characters, change the Hadoop configuration:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # let Hadoop split records on our custom delimiter instead of newlines
    conf = sc._jsc.hadoopConfiguration()
    conf.set("textinputformat.record.delimiter", "|;;|")

    # create an RDD with one element per '|;;|'-delimited record
    rows = sc.textFile('/PATH/TO/FILE/test.bcp')

    # split each record into columns; strip the newlines that remain inside
    # records and drop empty entries (e.g. after the trailing '|;;|')
    rows = rows.map(lambda row: [col.strip() for col in row.split('|;|') if col.strip()])
    rows = rows.filter(lambda row: row)

    # reset the record delimiter so later reads split on newlines again
    conf.set("textinputformat.record.delimiter", "\n")