How to handle read errors in pyarrow read_csv


I'm trying out Apache Arrow, but reading a CSV file fails with a column-count error on some rows. How can I skip those rows? In pandas this is pretty easy, but I can't see how to do the same thing in pyarrow. I just want to skip the problem rows.


from pyarrow import csv

test_arrow = csv.read_csv(test_file)

This yields:

ArrowInvalid                              Traceback (most recent call last)
Cell In[6], line 1
----> 1 test_arrow = csv.read_csv( core_test_file)

File ~/Desktop/tont_24/data/spasm_1/lib/python3.10/site-packages/pyarrow/_csv.pyx:1261, in pyarrow._csv.read_csv()

File ~/Desktop/tont_24/data/spasm_1/lib/python3.10/site-packages/pyarrow/_csv.pyx:1270, in pyarrow._csv.read_csv()

File ~/Desktop/tont_24/data/spasm_1/lib/python3.10/site-packages/pyarrow/error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()

File ~/Desktop/tont_24/data/spasm_1/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowInvalid: CSV parse error: Expected 340 columns, got 679: rtee,2024-01-02,0,10.89,10.89,408,46,1409,5,11460,43.73,37.693,83.51,50.68,43.89,58.05,103.217,6 ...

Solution

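  • You can skip rows with errors by passing an invalid_row_handler through the parse_options argument of csv.read_csv. The handler is called for each row whose column count does not match the expected count, and must return 'skip' to drop the row or 'error' to raise:

    from pyarrow import csv

    def skip_comment(row):
        # row.text holds the raw text of the offending row.
        if row.text.startswith("# "):
            return 'skip'   # drop comment rows silently
        else:
            return 'error'  # any other malformed row still raises ArrowInvalid

    parse_options = csv.ParseOptions(invalid_row_handler=skip_comment)
    test_arrow = csv.read_csv(test_file, parse_options=parse_options)

    This example is taken from the pyarrow documentation. As written, it skips only rows that start with "# ". To skip every row with the wrong column count, as in the question, have the handler return 'skip' unconditionally.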