I realised that there's a problem with mapping a column when I created a dataframe using the pd.read_csv and StringIO, to convert a string of text into a dataframe then mapping a specific column with a dictionary, as the column will give me all NaN values that are mapped. Why is this so?
In contrast, I did not experience an error when I directly map the column with a dictionary, if the column comes from a pandas dataframe I directly created.
Here is the code where the error comes in:
import pandas as pd
from io import StringIO
df1 = unicode('''
green, M, 10.1, class1
red, L, 13.5, class2
blue, XL, 15.3, class1
''')
df1 = pd.read_csv(StringIO(df1), header = None)
df1.columns = ['color', 'size', 'price', 'classlabel']
size_mapping = {
'XL':3,
'L': 2,
'M': 1,
}
df1['size'] = df1['size'].map(size_mapping)
print df1
Returns this:
color size price classlabel
0 green NaN 10.1 class1
1 red NaN 13.5 class2
2 blue NaN 15.3 class1
Versus this:
import pandas as pd
from io import StringIO
df1 = pd.DataFrame([
['green', 'M', 10.1, 'class1'],
['red', 'L', 13.5, 'class2'],
['blue', 'XL', 15.3, 'class1']
])
df1.columns = ['color', 'size', 'price', 'classlabel']
size_mapping = {
'XL':3,
'L': 2,
'M': 1,
}
df1['size'] = df1['size'].map(size_mapping)
print df1
Where I get this returned instead:
color size price classlabel
0 green 1 10.1 class1
1 red 2 13.5 class2
2 blue 3 15.3 class1
Why is there a difference when I try to convert what I read from a unicode file?
You can see why if you inspect an individual value from your string-derived DataFrame:
>>> df1['size'].iloc[0]
' M'
Note the leading space. Your mapping doesn't work because the values in your DataFrame are not "M"
, "L", and "XL"
; they are " M"
, " L"
and " XL"
.
CSV files should not contain spaces after the commas; such spaces will be considered part of the data. If you reformat your string to remove the spaces after the commas, it will work fine.