How to handle read errors in pyarrow read_csv

I'm trying out Apache Arrow, but I get a column count error when reading a CSV file. How can I skip the offending rows? In pandas this is pretty easy, but I can't see how to do the same thing in pyarrow. I just want to skip the problem rows.


from pyarrow import csv

test_arrow = csv.read_csv(test_file)

This yields:

ArrowInvalid                              Traceback (most recent call last)
Cell In[6], line 1
----> 1 test_arrow = csv.read_csv( core_test_file)

File ~/Desktop/tont_24/data/spasm_1/lib/python3.10/site-packages/pyarrow/_csv.pyx:1261, in pyarrow._csv.read_csv()

File ~/Desktop/tont_24/data/spasm_1/lib/python3.10/site-packages/pyarrow/_csv.pyx:1270, in pyarrow._csv.read_csv()

File ~/Desktop/tont_24/data/spasm_1/lib/python3.10/site-packages/pyarrow/error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()

File ~/Desktop/tont_24/data/spasm_1/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowInvalid: CSV parse error: Expected 340 columns, got 679: rtee,2024-01-02,0,10.89,10.89,408,46,1409,5,11460,43.73,37.693,83.51,50.68,43.89,58.05,103.217,6 ...

Solution

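You can skip malformed rows by passing an invalid_row_handler to csv.ParseOptions and handing the result to csv.read_csv through its parse_options argument. The handler is called for each row whose column count does not match the expected one, and it must return either 'skip' or 'error':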

from pyarrow import csv

def skip_comment(row):
    if row.text.startswith("# "):
        return 'skip'
    else:
        return 'error'

parse_options = csv.ParseOptions(invalid_row_handler=skip_comment)
test_arrow = csv.read_csv(test_file, parse_options=parse_options)

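This example is taken from the pyarrow documentation. Note that it skips only rows that begin with "# " and still raises an error for any other malformed row.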