Tags: python, csv, special-characters, pandas

Problems reading CSV file with commas and characters in pandas


I am trying to read a CSV file with pandas. The file has a column called Tags, which consists of user-provided tags such as `""`, `''`, `1950's`, and `16th-century`. Since these are user-provided, many stray special characters have been entered by mistake as well. The problem is that I cannot open the file with `pandas.read_csv`; it raises `CParserError: Error tokenizing data`. Can someone help me read this CSV file into pandas?


Solution

  • Okay. Starting from a badly formatted CSV we can't read:

    >>> !cat unquoted.csv
    1950's,xyz.nl/user_003,bad, 123
    17th,red,flower,xyz.nl/user_001,good,203
    "",xyz.nl/user_239,not very,345
    >>> pd.read_csv("unquoted.csv", header=None)
    Traceback (most recent call last):
      File "<ipython-input-40-7d9aadb2fad5>", line 1, in <module>
        pd.read_csv("unquoted.csv", header=None)
    [...]
      File "parser.pyx", line 1572, in pandas._parser.raise_parser_error (pandas/src/parser.c:17041)
    CParserError: Error tokenizing data. C error: Expected 4 fields in line 2, saw 6
    
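The mismatch the parser complains about is easy to see by counting fields per line. A quick sketch using the stdlib `csv` module, with the same three rows inlined as a string rather than read from the file:

```python
import csv
import io

# The same three raw lines as unquoted.csv, inlined for illustration.
raw = """1950's,xyz.nl/user_003,bad, 123
17th,red,flower,xyz.nl/user_001,good,203
"",xyz.nl/user_239,not very,345
"""

# pandas' C parser infers 4 columns from line 1, then hits the
# 6 fields in line 2 -- exactly what the error message reports.
field_counts = [len(row) for row in csv.reader(io.StringIO(raw))]
print(field_counts)  # → [4, 6, 4]
```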

    We can make a nicer version, taking advantage of the fact that the last three columns are well-behaved (no embedded commas):

    import csv
    
    # In Python 3 the csv module wants text-mode files opened with newline="".
    with open("unquoted.csv", newline="") as infile, \
         open("quoted.csv", "w", newline="") as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        for line in reader:
            # Re-join everything before the last three well-behaved columns
            # into a single Tags field; csv.writer quotes it as needed.
            newline = [','.join(line[:-3])] + line[-3:]
            writer.writerow(newline)
    

    which produces

    >>> !cat quoted.csv
    1950's,xyz.nl/user_003,bad, 123
    "17th,red,flower",xyz.nl/user_001,good,203
    ,xyz.nl/user_239,not very,345
    

    and then we can read it:

    >>> pd.read_csv("quoted.csv", header=None)
                     0                1         2    3
    0           1950's  xyz.nl/user_003       bad  123
    1  17th,red,flower  xyz.nl/user_001      good  203
    2              NaN  xyz.nl/user_239  not very  345
    
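If you'd rather skip the intermediate file, the same repair can be done in memory with `str.rsplit`, relying on the same assumption that the last three columns never contain commas. (Note this quick version keeps any literal quote characters as-is, unlike the `csv`-based rewrite above.)

```python
# Split each raw line into exactly four fields, counting from the right,
# so everything before the last three commas becomes the Tags field.
def repair_row(line):
    return line.rstrip("\n").rsplit(",", 3)

raw_lines = [
    "1950's,xyz.nl/user_003,bad, 123",
    "17th,red,flower,xyz.nl/user_001,good,203",
    '"",xyz.nl/user_239,not very,345',
]
rows = [repair_row(l) for l in raw_lines]
print(rows[1])  # → ['17th,red,flower', 'xyz.nl/user_001', 'good', '203']
```

The resulting list of lists can be handed straight to `pandas.DataFrame(rows)` without another round trip through `read_csv`.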

    I'd look into fixing this problem at the source and getting the data in a tolerable format, though. Tricks like this shouldn't be necessary, and it could easily have been a file that was impossible to repair.
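For reference, fixing it at the source is usually just a matter of letting a CSV writer do the quoting: `csv.writer`'s default `QUOTE_MINIMAL` dialect wraps any field containing the delimiter in quotes, so comma-bearing tags survive a round trip:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
# QUOTE_MINIMAL (the default) quotes the comma-bearing Tags field.
writer.writerow(["17th,red,flower", "xyz.nl/user_001", "good", "203"])
print(buf.getvalue().strip())  # → "17th,red,flower",xyz.nl/user_001,good,203
```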