I have a dataset of 100 million rows that I need to analyze. I use this function to read the file:
csv2020 = pd.read_csv('filename.txt',
                      sep="\t",
                      error_bad_lines=False,
                      usecols=['field1', 'field2', 'field3', 'field4'],
                      dtype={'field1': int, 'field2': float, 'field3': float, 'field4': float})
But I'm getting an error saying that a value in one of the lines cannot be converted to a float:
ValueError: could not convert string to float: 'ORCH'
I would like to omit any lines where this error occurs, but I don't know how to do that other than with the error_bad_lines argument. Help?
Thanks!
The error_bad_lines option is not meant for this purpose; it only applies to lines with an incorrect number of fields.
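Read your file without the dtype option and do the conversion afterwards using pandas.to_numeric with the errors='coerce' option, which turns values that cannot be converted into NaN (the affected rows can then be removed with dropna):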
df = pd.read_csv(…)
df['field1'] = pd.to_numeric(df['field1'], errors='coerce')
df['field2'] = …
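For completeness, here is a minimal, self-contained sketch of the whole approach, including the step that actually drops the bad rows. The in-memory sample data and the extra column named other are made up for illustration and stand in for your real file; the column names are taken from the question.

```python
import io
import pandas as pd

# Hypothetical sample standing in for 'filename.txt' (tab-separated).
sample = (
    "field1\tfield2\tfield3\tfield4\tother\n"
    "1\t1.5\t2.5\t3.5\ta\n"
    "2\tORCH\t4.0\t5.0\tb\n"
    "3\t6.0\t7.0\t8.0\tc\n"
)

cols = ['field1', 'field2', 'field3', 'field4']

# Read without dtype so that unparseable values do not raise an error.
df = pd.read_csv(io.StringIO(sample), sep="\t", usecols=cols)

# Coerce every column to numeric; values that cannot be converted become NaN.
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

# Drop rows where any column failed to convert, then restore the integer type.
df = df.dropna(subset=cols).astype({'field1': int})
```

With 100 million rows, reading everything as text first may use noticeably more memory than reading with dtype, so it may be worth processing the file in pieces (read_csv accepts a chunksize argument) if that becomes a problem.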