Search code examples
pythonpandas

converting non-numeric to numeric value using Panda libraries


I am learning Pandas and I came to an interesting question. So I have a Dataframe like this:

COL1    COL2      COL3
a     9/8/2016     2
b     12/4/2016    23
         ...
n     1/1/2015     21

COL1 is a String, Col2 is a timestamp and Col3 is a number. Now I need to do some analysis on this Dataframe and I want to convert all the non-numeric data to numeric. I tried using DictVectorizer() to convert COL1 and 2 to numeric but first of all I am not sure if this is the best way doing such a thing and second I don't know what to do with the timestamp. When I use DictVectorizer the output would be like:

{u'COL3: {0:2, 1:23 , ...,n:21}, 'COL1': {0: u'a', 1:'b', ... , n:'n'}, 'COL2': {0: u'9/8/2016' , 1: u'12/4/2016' , ... , n:u'1/1/2016'}}

but from what I learned it should be like this or at least I know I need something like this:

 {COL1:'a', COL2: '9/8/2016' , COL3: 2  and so on}   

so, questions: 1-what is the best way of converting non- numeric (including date) to numeric values to use in sklearn libraries 2- what is the right way of using DictVectorize()

Any help would be appreciated.


Solution

  • To encode non-numeric data to numeric you can use scikit-learn's LabelEncoder. It will encode each category such as COL1's a, b, c to integers.

    Assuming df is your dataframe, try:

    from sklearn.preprocessing import LabelEncoder
    enc = LabelEncoder()
    enc.fit(df['COL1'])
    df['COL1'] = enc.transform(df['col1'])
    
    • enc.fit() creates the corresponding integer values.
    • enc.transform() applies the encoding to the df values.

    For the second column, using Pandas to_datetime() function should do the trick, like @quinn-weber mentioned, try:

    df['COL2'] = pd.to_datetime(df['COL2'])