I am reading a comma-separated CSV file where the fields are enclosed in double quotes, and some of them also have commas within their values, like: "abc","def,ghi","jkl"
Is there a way we can read this file into a PCollection using Apache Beam?
Sample CSV file with the fields enclosed in double quotes:
"AAA", "BBB", "Test, Test", "CCC"
"111", "222, 333", "XXX", "YYY, ZZZ"
You can use the csv module from the standard library:
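import csv

import apache_beam as beam

def print_row(element):
    print(element)

def parse_file(element):
    # csv.reader expects an iterable of lines, so wrap the single line in a list.
    for line in csv.reader([element], quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True):
        return line

# p is an existing beam.Pipeline; input_filename is the path to the CSV file.
parsed_csv = (
    p
    | 'Read input file' >> beam.io.ReadFromText(input_filename)
    | 'Parse file' >> beam.Map(parse_file)
    | 'Print output' >> beam.Map(print_row)
)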
This gives the following output:
['AAA', 'BBB', 'Test, Test', 'CCC']
['111', '222, 333', 'XXX', 'YYY, ZZZ ']
The one thing to watch out for is that csv.reader expects an iterable that yields strings, one per line. This means that you can't pass a single string straight to reader(): the reader would iterate over it character by character and treat each character as its own line. You can enclose the string in a list, as above, and then iterate over the reader's output to get the parsed row as a list of fields.
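The list-wrapping point can be checked without Beam. Here is a minimal sketch using only the standard csv module; parse_line is a hypothetical helper name, not something from the answer above, and it uses next() instead of a for loop to take the single parsed row:

```python
import csv


def parse_line(line):
    """Parse one CSV-formatted line into a list of fields."""
    # csv.reader takes an iterable of lines; a bare string would be iterated
    # character by character, so wrap the single line in a list.
    reader = csv.reader([line], quotechar='"', delimiter=',', skipinitialspace=True)
    # The reader yields one row per input line; take the first (and only) row.
    return next(reader)


row = parse_line('"AAA", "BBB", "Test, Test", "CCC"')
print(row)  # ['AAA', 'BBB', 'Test, Test', 'CCC']
```

A function like this could be passed to beam.Map in place of parse_file. Note that it only sees one line at a time, so it assumes no quoted field contains a line break.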