python apache-beam apache-beam-io

How can we read CSV files with enclosed fields in Apache Beam using the Python SDK?


I am reading a comma-separated CSV file where the fields are enclosed in double quotes, and some of them also have commas within their values, like: "abc","def,ghi","jkl"

Is there a way we can read this file into a PCollection using Apache Beam?


Solution

  • Sample CSV file with fields enclosed in double quotes:

    "AAA", "BBB", "Test, Test", "CCC"
    "111", "222, 333", "XXX", "YYY, ZZZ"
    

    You can use the csv module from the standard library:

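    import csv

    import apache_beam as beam

    def print_row(element):
      print(element)

    def parse_file(element):
      # Wrap the line in a list so csv.reader sees one line, not individual characters.
      for line in csv.reader([element], quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True):
        return line

    # input_filename is assumed to hold the path of the CSV file.
    with beam.Pipeline() as p:
      parsed_csv = (
          p
          | 'Read input file' >> beam.io.ReadFromText(input_filename)
          | 'Parse file' >> beam.Map(parse_file)
          | 'Print output' >> beam.Map(print_row)
      )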
    

    This gives the following output:

    ['AAA', 'BBB', 'Test, Test', 'CCC']
    ['111', '222, 333', 'XXX', 'YYY, ZZZ']
    

    One thing to watch out for: csv.reader expects an iterable that yields one string per line, not a single string. If you pass a bare string, the reader iterates over its characters and treats each one as a separate line, so wrap the string in a list as above. You then iterate over the reader to get the parsed row, which is a list of field strings.
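
The difference between passing a list and passing a bare string can be checked with the standard library alone, without running a pipeline. The following is a minimal sketch; `parse_line` is a hypothetical helper name, not part of Beam or the csv module, and it assumes each element is a single complete line, which is what ReadFromText emits.

```python
import csv


def parse_line(line):
    """Parse one CSV line into a list of fields (hypothetical helper)."""
    # Wrapping the line in a list gives csv.reader an iterable of lines,
    # so the first row it yields is the parsed version of this line.
    return next(csv.reader([line], quotechar='"', delimiter=',',
                           skipinitialspace=True))


line = '"abc","def,ghi","jkl"'

# Correct usage: the comma inside the quoted field is preserved.
row = parse_line(line)
print(row)

# Incorrect usage: a bare string is iterated character by character,
# so the reader does not produce the single row we want.
rows_from_chars = list(csv.reader(line))
print(rows_from_chars)
```

Note that a quoted field containing a line break would not survive this approach, because the input is split into lines before the parser sees it.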