pandas csv dirty-data

Import a dirty CSV file with unwanted characters and strings


I would like to import CSV files with pandas. Normally my data comes in this form:

a,b,c,d
a1,b1,c1,d1
a2,b2,c2,d2

where a,b,c,d is the header, and pandas.read_csv handles this easily. However, my data is now stored like this:

"a;b;c;d"
"a1;\"b1\";\"c1\";\"d1\""
"a2;\"b2\";\"c2\";\"d2\""

How can I clean this up in the most efficient way? How can I remove the quotes wrapping each entire row so that the columns are detected, and then remove all the remaining \" sequences?

Thanks a lot for any help!!

I am not sure what to do.


Solution

  • Here is one option with read_csv (I'm sure it can be improved):

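    import pandas as pd

    df = (
        # A multi-character sep is treated as a regex and needs the python engine
        pd.read_csv("input.csv", sep=r";|;\\?", engine="python")
        # Strip the stray quotes left on the first and last header names
        .pipe(lambda df_: df_.set_axis(df_.columns.str.strip('"'), axis=1))
        # Remove the remaining backslashes and double quotes from the values
        .replace(r'[\\"]', "", regex=True)
    )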
    

    Output:

    print(df)

        a   b   c   d
    0  a1  b1  c1  d1
    1  a2  b2  c2  d2
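
Another possible route, sketched here as an assumption rather than a tested recommendation, is to clean the raw text first and only then hand it to read_csv with a plain ";" separator. The helper name read_dirty_csv and the inline SAMPLE string are made up for illustration; SAMPLE simply mirrors the three lines shown in the question.

```python
import io
import re

import pandas as pd

# Hypothetical sample mirroring the question's file; real data would be read from disk.
SAMPLE = r'''"a;b;c;d"
"a1;\"b1\";\"c1\";\"d1\""
"a2;\"b2\";\"c2\";\"d2\""
'''


def read_dirty_csv(text: str) -> pd.DataFrame:
    """Drop every backslash and double quote, then parse as ';'-separated CSV."""
    cleaned = re.sub(r'[\\"]', "", text)
    return pd.read_csv(io.StringIO(cleaned), sep=";")


df = read_dirty_csv(SAMPLE)
print(df)
```

This removes every quote and backslash before parsing, so it only suits files where no field legitimately contains a semicolon, a double quote, or a backslash. For a file on disk, the text could be obtained with something like pathlib.Path("input.csv").read_text() before calling the helper.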