Search code examples
pythonpandasimportquote

How to deal with Unmatched Quotes in .csv file while importing with read_csv(..) in pandas python


I am having a .csv file like this-

A|B|C|D 
1|"|1|"1
2|2|2|2 
3|"|3|3 
4|4|4|4
5|"five"|5|5
6|6|6|6

I import this .csv file as -

data=pd.read_csv('C:\Python Work\inverted_commas.csv',sep='|',dtype={'B':object,'C':object,'D':object},names=['A','B','C','D'],skiprows=1)

The result looks like this:

A                           B    C    D
1                        |1|1  NaN  NaN
2                           2    2    2
3  |3|3\r\n4|4|4|4\r\n5|five"    5    5
6                           6    6    6 

| is a separator for this .csv file. As we had unmatched double quotes " in line 1 and 3 (top line is header), the entire import went wrong. I wanted the result like -

A         B    C    D
1         "    1   "1
2         2    2    2
3         "    3    3
4         4    4    4
5    "five"    5    5
6         6    6    6

SAS interestingly imports this file correctly in this way. Is there a way or an option in .read_csv where we could specify that the moment one encounters a delimiter, irrespective of whether the beginning quote is matched with the closing quote or not, the column should be filled in with that value, as shown above?


Solution

  • You can use parameter quoting=3:

    import pandas as pd
    from pandas.compat import StringIO
    
    temp=u"""A|B|C|D 
    1|"|1|"1
    2|2|2|2 
    3|"|3|3 
    4|4|4|4
    5|"five"|5|5
    6|6|6|6"""
    #after testing replace 'StringIO(temp)' to 'filename.csv'
    df = pd.read_csv(StringIO(temp), sep="|", quoting=3)
    
    print (df)
       A       B  C  D 
    0  1       "  1  "1
    1  2       2  2  2 
    2  3       "  3  3 
    3  4       4  4   4
    4  5  "five"  5   5
    5  6       6  6   6