Search code examples
pythonpandascsvquotes

Pandas: Read csv with quoted values, comma as decimal separator, and period as digit grouping symbol


UPDATE: When initally posting the question, I did not realize that . is used in some entries as the digit grouping symbol. However, this information is crucial to the problem. Here is the original question:

I am currently trying to import a csv file in pandas that has a less than perfect format, as all the values including numbers are quoted. The format looks like this:

Date;Type;Amount;Currency;Category;Person;Account;Counter Account;Group;Note
"19.02.17";"Expenses";"-36,37";"EUR";"Groceries";"";"Bank account";"";"";""

Now, I have tried importing this using the following command:

import pandas

dtypes = {"Type":"string", "Amount": "float"}
table = pandas.read_csv("data.csv", delimiter = ";", decimal = ",", parse_dates = ["Date"], dtype = dtypes, quoting = 3)

So I have basically been trying to tell pandas that the decimal separator is comma, that the field delimiter is semicolon, and that the column "Amount" should be parsed as floats. However, trying to parse the file, I still get the error message:

ValueError: could not convert string to float: '689,15'"

I assume the combination of the quotes and the comma decimal separator somehow is too much for pandas, even though I think I have technically provided it with all the information it needs.

The file is an export from a third-party program, so unfortunately I have no influence on the format. Does anyone know how to get pandas to swallow this?

Bonus question: If I read this file without providing explicit data types, I don't get any columns of type "string" as I would have expected, but instead "object" is used. Why is that?


Solution

  • What about that ?

    import pandas
    
    table = pandas.read_csv("data.csv", sep=";", decimal=",")
    
    print(table["Amount"][0])  # -36.37
    print(type(table["Amount"][0]))   # <class 'numpy.float64'>
    print(table["Amount"][0] + 36.37)  # 0.0
    

    Pandas automatically detects a number and converts it to numpy.float64.


    Edit:

    As @bweber discovered, some values in data.csv ​​contained more than 3 digits, and used a digit grouping symbol '.'. In order to convert the String to Integer, the symbol used must be passed to the read_csv() method:

    table = pandas.read_csv("data.csv", sep=";", decimal=",", thousands='.')