Tags: python, scikit-learn, target, numerical, categorical-data

Categorical & Numerical Features - Categorical Target - Scikit Learn - Python


I have a data set containing both categorical and numerical columns, and my target column is also categorical. I am using scikit-learn in Python 3.4. I know that scikit-learn needs all categorical values to be transformed to numerical values before fitting any machine learning model.

How should I transform my categorical columns to numerical values? I tried a lot of things but keep getting different errors, such as `'str' object has no attribute 'items'` and `'numpy.ndarray' object has no attribute 'items'`.

Here is an example of my data:
 UserID  LocationID   AmountPaid    ServiceID   Target
 29876      IS345       23.9876      FRDG        JFD
 29877      IS712       135.98       WERS        KOI

My dataset is saved in a CSV file, here is the little code I wrote to give you an idea about what I want to do:

import pandas as pd

#reading my csv file
data_dir = 'C:/Users/davtalab/Desktop/data/'
train_file = data_dir + 'train.csv'
train = pd.read_csv(train_file)

#numeric columns:
x_numeric_cols = train['AmountPaid']

#categorical columns (a comma-separated list, not string concatenation):
categorical_cols = ['UserID', 'LocationID', 'ServiceID']
x_cat_cols = train[categorical_cols].values

y_target = train['Target'].values

I need x_cat_cols to be converted to numeric values and then added to x_numeric_cols, giving me my complete input (x) matrix.

Then I need to convert my target column into numeric values as well and make that my final target (y) column.
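For the target column, scikit-learn's `LabelEncoder` maps string labels to integers. A minimal sketch using target values from the sample rows above:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit on the target labels and transform them to integer codes
y = le.fit_transform(['JFD', 'KOI', 'JFD'])  # → array([0, 1, 0])
# le.classes_ holds the mapping back to the original strings
```

`le.inverse_transform` recovers the original string labels from the integer codes if needed later.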

Then I want to fit a Random Forest using these two complete sets:

from sklearn.ensemble import RandomForestClassifier as RF

rf = RF(n_estimators=n_trees, max_features=max_features, verbose=verbose, n_jobs=n_jobs)
rf.fit(x_train, y_train)

Thanks for your help!


Solution

  • The errors were caused by the way the data was enumerated. If you print the enumerated data (using another sample) you will see:

    >>> import pandas as pd
    >>> train = pd.DataFrame({'a' : ['a', 'b', 'a'], 'd' : ['e', 'e', 'f'],
    ...                       'b' : [0, 1, 1], 'c' : ['b', 'c', 'b']})
    >>> samples = [dict(enumerate(sample)) for sample in train]
    >>> samples
    [{0: 'a'}, {0: 'b'}, {0: 'c'}, {0: 'd'}]
    

    This is a list of single-entry dicts built from the column names, not from the rows. We should build one dict per row instead:

    >>> train_as_dicts = [dict(r.iteritems()) for _, r in train.iterrows()]
    >>> train_as_dicts
    [{'a': 'a', 'c': 'b', 'b': 0, 'd': 'e'},
     {'a': 'b', 'c': 'c', 'b': 1, 'd': 'e'},
     {'a': 'a', 'c': 'b', 'b': 1, 'd': 'f'}]
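As an aside, in current pandas (where `Series.iteritems` has been deprecated in favour of `Series.items`) the same list of row dicts can be produced more directly with `DataFrame.to_dict`:

```python
import pandas as pd

train = pd.DataFrame({'a': ['a', 'b', 'a'], 'd': ['e', 'e', 'f'],
                      'b': [0, 1, 1], 'c': ['b', 'c', 'b']})
# one dict per row, keyed by column name
train_as_dicts = train.to_dict(orient='records')
```

This avoids the explicit `iterrows` loop and produces the same row-wise dicts.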
    Now we need to vectorize the dicts:
    
    >>> from sklearn.feature_extraction import DictVectorizer
    
    >>> vectorizer = DictVectorizer()
    >>> vectorized_sparse = vectorizer.fit_transform(train_as_dicts)
    >>> vectorized_sparse
    <3x7 sparse matrix of type '<type 'numpy.float64'>'
        with 12 stored elements in Compressed Sparse Row format>
    
    >>> vectorized_array = vectorized_sparse.toarray()
    >>> vectorized_array
    array([[ 1.,  0.,  0.,  1.,  0.,  1.,  0.],
           [ 0.,  1.,  1.,  0.,  1.,  1.,  0.],
           [ 1.,  0.,  1.,  1.,  0.,  0.,  1.]])
    To get the meaning of each column, ask the vectorizer:
    
    >>> vectorizer.get_feature_names()
    ['a=a', 'a=b', 'b', 'c=b', 'c=c', 'd=e', 'd=f']
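Putting the pieces together for the question's columns, here is a minimal end-to-end sketch. The column names come from the question, but the two rows are made up, and the `RandomForestClassifier` parameters are illustrative, not tuned:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder

# toy frame mirroring the question's layout (values are made up)
train = pd.DataFrame({
    'UserID': [29876, 29877],
    'LocationID': ['IS345', 'IS712'],
    'AmountPaid': [23.9876, 135.98],
    'ServiceID': ['FRDG', 'WERS'],
    'Target': ['JFD', 'KOI'],
})

categorical_cols = ['UserID', 'LocationID', 'ServiceID']
# cast IDs to strings so DictVectorizer one-hot encodes them
# instead of treating numeric IDs as continuous values
cat_dicts = train[categorical_cols].astype(str).to_dict(orient='records')

vec = DictVectorizer(sparse=False)
x_cat = vec.fit_transform(cat_dicts)          # one-hot columns
x_num = train[['AmountPaid']].values          # numeric column(s)
x = np.hstack([x_cat, x_num])                 # complete input matrix

le = LabelEncoder()
y = le.fit_transform(train['Target'])         # numeric target

rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(x, y)
```

With two distinct values in each of the three categorical columns plus one numeric column, `x` ends up with seven feature columns, matching the `DictVectorizer` behaviour shown above.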