apache-spark · pyspark · rdd · bcp

How to read a text file separated by multiple characters in PySpark?


I have a file in .bcp format and am trying to read it. The rows are separated by "|;;|", and a row may extend over several lines in the file.

rdd = sc.textFile("test.bcp") splits the file into lines, but I need it split on "|;;|". How can I do this without changing the Hadoop configuration?

Example .bcp:

A1|;|B1|;|C1|;|
D1|;;|A2|;|B2|;|
C2|;|D2|;;|

should be converted to: [["A1", "B1", "C1", "D1"], ["A2", "B2", "C2", "D2"]]


Solution

  • For a custom delimiter with multiple characters, change the Hadoop configuration:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # let Hadoop split records on our custom delimiter instead of newlines
    conf = sc._jsc.hadoopConfiguration()
    conf.set("textinputformat.record.delimiter", "|;;|")

    # create an RDD with one element per '|;;|'-delimited record
    rows = sc.textFile('/PATH/TO/FILE/test.bcp')

    # split each record into columns; strip the newlines that remain inside
    # records and drop empty entries (e.g. after the trailing '|;;|')
    rows = rows.map(lambda row: [col.strip() for col in row.split('|;|') if col.strip()])
    rows = rows.filter(lambda row: row)

    # reset the record delimiter so later reads split on newlines again
    conf.set("textinputformat.record.delimiter", "\n")