python pandas csv tokenize parse-error

Reading bad csv files with garbage values


I want to use pandas to read a CSV file that has the following format:

    atrrth
    sfkjbgksjg
    airuqghlerig
    Name         Roll
    airuqgorqowi
    awlrkgjabgwl
    AAA          67
    BBB          55
    CCC          07

If I call pd.read_csv on this file directly, I get the expected error:

 ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 2

But I want to load the actual table (the header row and the data rows below it) into a DataFrame. Using error_bad_lines=False does not help: it drops the two-field rows, which are the ones I need, and keeps only the garbage lines.
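
The behaviour can be reproduced with a minimal sketch. It assumes the file is comma-separated (the sample above is displayed with spaces) and pandas 1.3 or later, where on_bad_lines='skip' is the replacement for error_bad_lines=False. The in-memory string raw is only a stand-in for the real file.

```python
import io
import pandas as pd

# Assumed comma-separated version of the sample file.
raw = """atrrth
sfkjbgksjg
airuqghlerig
Name,Roll
airuqgorqowi
awlrkgjabgwl
AAA,67
BBB,55
CCC,07
"""

# The first line has one field, so pandas expects one field per row
# and treats every two-field row as a bad line to be skipped.
df = pd.read_csv(io.StringIO(raw), on_bad_lines='skip')
print(df)
```

Only the single-field garbage lines survive, with the first of them used as the column name.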

The two columns may appear under any of the following names:

Name : [Name, NAME, Name of student]
Roll : [Rollno, Roll, ROLL]

How can I achieve this?


Solution

  • Open the CSV file and find the index of the row where the column names start:

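    with open(r'data.csv') as fp:
        # Zero-based index of the first line that begins with a known
        # header name; raises StopIteration if no such line exists.
        skip = next(
            i for i, line in enumerate(fp)
            if line.startswith(('Name', 'NAME'))
        )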
    

    The zero-based index of the header row is now stored in skip. Pass it to the skiprows parameter of read_csv:

    import pandas as pd
    df = pd.read_csv('data.csv', skiprows=skip)
    

    Works in Python 3.x.
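
Two details are left open by the steps above: skiprows only removes the lines before the header, so the garbage lines between the header and the data are still read, and a header that starts with the Roll column would not be matched. The following is a minimal sketch of one way to handle both, not a definitive implementation. It assumes a comma-separated file and garbage lines that contain a single field; the names ALIASES, find_header_row and read_messy_csv are hypothetical helpers introduced for illustration, and the function takes text instead of a path so the example is self-contained.

```python
import io
import pandas as pd

# Header spellings listed in the question, mapped to canonical names.
ALIASES = {
    'Name': 'Name', 'NAME': 'Name', 'Name of student': 'Name',
    'Roll': 'Roll', 'ROLL': 'Roll', 'Rollno': 'Roll',
}

def find_header_row(lines, sep=','):
    """Index of the first line containing both a Name and a Roll header."""
    for i, line in enumerate(lines):
        found = {ALIASES.get(field.strip()) for field in line.split(sep)}
        if {'Name', 'Roll'} <= found:
            return i
    raise ValueError('no header row found')

def read_messy_csv(text, sep=','):
    """Parse text whose header is buried among single-field garbage lines."""
    skip = find_header_row(io.StringIO(text), sep)
    df = pd.read_csv(io.StringIO(text), skiprows=skip, sep=sep)
    # Normalise whichever spelling was used to the canonical column names.
    df = df.rename(columns=lambda c: ALIASES.get(c.strip(), c))
    # Garbage lines below the header have no Roll field, so Roll is NaN there.
    return df.dropna(subset=['Roll']).reset_index(drop=True)

# Assumed comma-separated version of the sample from the question.
sample = """atrrth
sfkjbgksjg
airuqghlerig
Name of student,Rollno
airuqgorqowi
awlrkgjabgwl
AAA,67
BBB,55
CCC,07
"""
df = read_messy_csv(sample)
print(df)
```

In this sketch Roll is parsed as a number, so 07 loses its leading zero. If roll numbers must be kept exactly as written, passing dtype=str to read_csv is one option.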