Search code examples
pythoncsvpandasstringio

Python: Mapping error for a StringIO converted csv dataframe versus pandas data frame?


I realised that there's a problem with mapping a column when I created a dataframe using the pd.read_csv and StringIO, to convert a string of text into a dataframe then mapping a specific column with a dictionary, as the column will give me all NaN values that are mapped. Why is this so?

In contrast, I did not experience an error when I directly map the column with a dictionary, if the column comes from a pandas dataframe I directly created.

Here is the code where the error comes in:

import pandas as pd
from io import StringIO 

df1 = unicode('''
green, M, 10.1, class1
red, L, 13.5, class2
blue, XL, 15.3, class1
''')

df1 = pd.read_csv(StringIO(df1), header = None)

df1.columns = ['color', 'size', 'price', 'classlabel']

size_mapping = {
    'XL':3,
    'L': 2,
    'M': 1,
}
df1['size'] = df1['size'].map(size_mapping)
print df1

Returns this:

   color  size  price classlabel
0  green   NaN   10.1     class1
1    red   NaN   13.5     class2
2   blue   NaN   15.3     class1

Versus this:

import pandas as pd
from io import StringIO

df1 = pd.DataFrame([
    ['green', 'M', 10.1, 'class1'],
    ['red', 'L', 13.5, 'class2'],
    ['blue', 'XL', 15.3, 'class1']
    ])

df1.columns = ['color', 'size', 'price', 'classlabel']

size_mapping = {
    'XL':3,
    'L': 2,
    'M': 1,
}
df1['size'] = df1['size'].map(size_mapping)
print df1

Where I get this returned instead:

   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1

Why is there a difference when I try to convert what I read from a unicode file?


Solution

  • You can see why if you inspect an individual value from your string-derived DataFrame:

    >>> df1['size'].iloc[0]
    ' M'
    

    Note the leading space. Your mapping doesn't work because the values in your DataFrame are not "M", "L", and "XL"; they are " M", " L" and " XL".

    CSV files should not contain spaces after the commas; such spaces will be considered part of the data. If you reformat your string to remove the spaces after the commas, it will work fine.