I have a file in .bcp format and am trying to read it. The rows are separated by "|;;|". A row may extend over several lines in the file.
rdd = sc.textFile("test.bcp")
splits the file into lines, but I need it separated by "|;;|". How can I do this without changing the Hadoop configuration?
Example .bcp file:
A1|;|B1|;|C1|;|
D1|;;|A2|;|B2|;|
C2|;|D2|;;|
should be converted to:
[["A1", "B1", "C1", "D1"], ["A2", "B2", "C2", "D2"]]
For a custom multi-character record delimiter, change the Hadoop configuration:
sc = SparkContext.getOrCreate()
# let hadoop separate files by our custom delimiter
conf = sc._jsc.hadoopConfiguration()
conf.set("textinputformat.record.delimiter", '|;;|')
# create RDD of .bcp file
rows = sc.textFile('/PATH/TO/FILE/test.bcp')  # split file into rows at '|;;|'
rows = rows.filter(lambda row: row.strip())   # drop the empty record after the final delimiter
rows = rows.map(lambda row: row.replace('\n', '').split('|;|'))  # strip layout newlines, then split into columns
# reset hadoop delimiter
conf.set("textinputformat.record.delimiter", "\n")
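If touching the Hadoop configuration really is off the table, one alternative (a sketch, only suitable for files small enough to fit in memory on one executor) is to read each file whole and do the record splitting in plain Python. The parsing logic itself is Spark-independent; `parse_bcp` is a hypothetical helper name:

```python
def parse_bcp(content):
    """Split raw .bcp text into rows and columns.

    Records are separated by '|;;|' and fields by '|;|'; newlines
    inside a record are layout only, so they are stripped first.
    """
    rows = []
    for record in content.replace('\n', '').split('|;;|'):
        if not record:  # skip the empty chunk after the final '|;;|'
            continue
        rows.append(record.split('|;|'))
    return rows

sample = 'A1|;|B1|;|C1|;|\nD1|;;|A2|;|B2|;|\nC2|;|D2|;;|\n'
print(parse_bcp(sample))
# → [['A1', 'B1', 'C1', 'D1'], ['A2', 'B2', 'C2', 'D2']]
```

In Spark this could then be applied with `sc.wholeTextFiles('/PATH/TO/FILE/test.bcp').flatMap(lambda kv: parse_bcp(kv[1]))`, since `wholeTextFiles` yields `(path, content)` pairs with the full file content as one string.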