I'm trying out Apache Arrow, but reading a CSV file fails with a column count error on some rows. How can I skip these rows? In pandas it's pretty easy, but I can't see how to do the same thing in pyarrow. I just want to skip the problem rows.
from pyarrow import csv
test_arrow = csv.read_csv(test_file)
yields:
ArrowInvalid Traceback (most recent call last)
Cell In[6], line 1
----> 1 test_arrow = csv.read_csv( core_test_file)
File ~/Desktop/tont_24/data/spasm_1/lib/python3.10/site-packages/pyarrow/_csv.pyx:1261, in pyarrow._csv.read_csv()
File ~/Desktop/tont_24/data/spasm_1/lib/python3.10/site-packages/pyarrow/_csv.pyx:1270, in pyarrow._csv.read_csv()
File ~/Desktop/tont_24/data/spasm_1/lib/python3.10/site-packages/pyarrow/error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()
File ~/Desktop/tont_24/data/spasm_1/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()
ArrowInvalid: CSV parse error: Expected 340 columns, got 679: rtee,2024-01-02,0,10.89,10.89,408,46,1409,5,11460,43.73,37.693,83.51,50.68,43.89,58.05,103.217,6 ...
You can skip rows with errors using the parse_options argument in csv.read_csv:
from pyarrow import csv

def skip_comment(row):
    if row.text.startswith("# "):
        return 'skip'
    else:
        return 'error'

parse_options = csv.ParseOptions(invalid_row_handler=skip_comment)
test_arrow = csv.read_csv(test_file, parse_options=parse_options)
This example is taken from the pyarrow documentation.