Using numpy.genfromtxt to read a csv file with strings containing commas


I am trying to read in a csv file with numpy.genfromtxt but some of the fields are strings which contain commas. The strings are in quotes, but numpy is not recognizing the quotes as defining a single string. For example, with the data in 't.csv':

2012, "Louisville KY", 3.5
2011, "Lexington, KY", 4.0

the code

np.genfromtxt('t.csv', delimiter=',')

produces the error:

ValueError: Some errors were detected ! Line #2 (got 4 columns instead of 3)

The data structure I am looking for is:

array([['2012', 'Louisville KY', '3.5'],
       ['2011', 'Lexington, KY', '4.0']], 
      dtype='|S13')

Looking over the documentation, I don't see any options to deal with this. Is there a way to do it with numpy, or do I just need to read in the data with the csv module and then convert it to a numpy array?


Solution

  • You can use pandas (which is becoming the default library for working with dataframes, i.e. heterogeneous data, in scientific Python) for this. Its read_csv can handle quoted fields. From the docs:

    quotechar : string

    The character to used to denote the start and end of a quoted item. Quoted items 
    can include the delimiter and it will be ignored.
    

    The default value is ". An example:

    In [1]: import pandas as pd
    
    In [2]: from io import StringIO
    
    In [3]: s="""year, city, value
       ...: 2012, "Louisville KY", 3.5
       ...: 2011, "Lexington, KY", 4.0"""
    
    In [4]: pd.read_csv(StringIO(s), quotechar='"', skipinitialspace=True)
    Out[4]:
       year           city  value
    0  2012  Louisville KY    3.5
    1  2011  Lexington, KY    4.0
    

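    The trick here is that you also have to use skipinitialspace=True to deal with the spaces after the comma delimiter. Without it, the opening quote is not the first character of the field, so it is not treated as a quote character.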

    Apart from its powerful csv reader, I can also strongly advise using pandas for the heterogeneous data you have (the example output in numpy you give is all strings, although you could use structured arrays).
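
If you would rather stay with the standard library, the csv-module route mentioned in the question can be sketched roughly as follows. This is only a minimal sketch, not a definitive implementation: the helper name read_quoted_csv and the inlined SAMPLE string are made up for illustration, and the same skipinitialspace=True setting is assumed to be needed because of the spaces after the commas.

```python
import csv
import io

import numpy as np

# Same sample data as in the question, inlined so the sketch is self-contained.
SAMPLE = '''2012, "Louisville KY", 3.5
2011, "Lexington, KY", 4.0
'''


def read_quoted_csv(handle):
    """Parse CSV rows into a 2-D NumPy array of strings (hypothetical helper)."""
    # skipinitialspace=True drops the space after each comma, so the opening
    # quote is seen at the start of the field and the quoted comma is kept.
    reader = csv.reader(handle, quotechar='"', skipinitialspace=True)
    return np.array(list(reader))


arr = read_quoted_csv(io.StringIO(SAMPLE))
print(arr)
```

For a real file you would pass an open file object instead, e.g. open('t.csv', newline=''). On Python 3 the resulting array should have a Unicode string dtype rather than the byte-string dtype shown in the question.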