python pandas csv large-data

Errors reading CSV with Pandas


I have a dataset of 100 million rows that I need to analyze. I use this function to read the file:

csv2020=pd.read_csv('filename.txt',
                    sep="\t",
                    error_bad_lines=False,
                    usecols=['field1', 'field2', 'field3', 'field4'],
                    dtype={'field1': int,'field2': float, 'field3': float, 'field4': float})

But I'm getting an error saying that a value in one of the lines cannot be converted to a float:

ValueError: could not convert string to float: 'ORCH'

I would like to omit any lines where this error occurs, but I don't know how to do that other than with the error_bad_lines argument. Help?

Thanks!


Solution

  • The error_bad_lines option is not meant for this purpose; it only applies to lines with an incorrect number of fields.

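    Read your file without the dtype option and do the conversion afterwards using pandas.to_numeric with the errors='coerce' option, which replaces values that cannot be parsed (such as 'ORCH') with NaN. The affected rows can then be removed with dropna(). The conversion looks like this: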

    df = pd.read_csv(…)
    df['field1'] = pd.to_numeric(df['field1'], errors='coerce')
    df['field2'] = …
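
Putting the approach above together, here is a minimal sketch of the whole read, coerce, and drop sequence. It makes a few assumptions: the small in-memory sample (including its extra column) is hypothetical and only stands in for the real file, every column is read as a string first, and rows are dropped if any of the four fields fails to convert.

```python
import io

import pandas as pd

# Hypothetical tab-separated sample standing in for 'filename.txt'.
sample = io.StringIO(
    "field1\tfield2\tfield3\tfield4\textra\n"
    "1\t1.5\t2.5\t3.5\ta\n"
    "2\tORCH\t2.0\t3.0\tb\n"
    "3\t4.5\t5.5\t6.5\tc\n"
)

cols = ['field1', 'field2', 'field3', 'field4']

# Read the wanted columns as plain strings so parsing cannot fail on bad values.
df = pd.read_csv(sample, sep="\t", usecols=cols, dtype=str)

# Convert every column; values that are not numeric become NaN.
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

# Drop rows in which any conversion failed, then apply the intended dtypes.
df = df.dropna(subset=cols).astype(
    {'field1': int, 'field2': float, 'field3': float, 'field4': float}
)

print(df)
```

Note that dropna also removes rows whose fields were empty in the file, not only rows with text like 'ORCH'. The sketch leaves out error_bad_lines because that argument was deprecated in pandas 1.3 and removed in 2.0, where on_bad_lines='skip' takes its place. For a file of this size, the same steps could probably be applied chunk by chunk using the chunksize parameter of read_csv, though that is not shown here.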