Tags: pandas, machine-learning, scikit-learn, feature-extraction, dictvectorizer

Convert string features to numeric features in sklearn and pandas


I'm currently working with sklearn (I'm a beginner) and I want to train and test a very naive classifier.

The structure of my training and testing data is the following:

 ----|----|----|----|----|----|------|----|----|----|-------  
  f1 | f2 | f3 | c1 | c2 | c3 | word | c4 | c5 | c6 | label   
 ----|----|----|----|----|----|------|----|----|----|------- 

Where:

f1: feature 1, binary numerical type like 0
f2: feature 2, binary numerical type like 1
f3: feature 3, binary numerical type like 0
c1: context 1, string type like "from"
c2: context 2, string type like "this"
c3: context 3, string type like "website"
word: central word (string) of the context like "http://.."
c4: context 4, string type
c5: context 5, string type
c6: context 6, string type
label: the label (string) that the classifier has to learn and predict, like "URL" (I have only three types of labels: REF, IRR, DATA)

What I want to do is convert my context string features into numerical features. Every string field contains at most one word.

The main goal is to assign a numeric value to every context and word string in such a way that the system works. My idea is that it should be possible to define a vocabulary like:

{ from, website, to, ... }

and provide this vocabulary to the DictVectorizer, but I don't know how to do this.
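For example, one way I could build such a vocabulary is from the unique words in the context columns (just a sketch; the column names are the ones from my table above):

    import pandas as pd

    train = pd.read_csv('train.csv')
    context_cols = ['c1', 'c2', 'c3', 'word', 'c4', 'c5', 'c6']

    # number every distinct word seen anywhere in the context columns
    vocabulary = {w: i for i, w in enumerate(pd.unique(train[context_cols].values.ravel()))}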

What I really want to do is generate a huge number of binary features: the word “from” immediately preceding the word in question would be one feature; the word “available” two positions after it would be another. But I really don't know how to do this.
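From the DictVectorizer documentation, string values are turned into binary "column=value" indicators, which looks like exactly the kind of feature I mean. A small sketch (the column names are the ones from my table, the values are made up):

    from sklearn.feature_extraction import DictVectorizer

    records = [
        {'f1': 0, 'f2': 1, 'f3': 0, 'c1': 'from', 'word': 'http://example.com', 'c4': 'available'},
        {'f1': 1, 'f2': 0, 'f3': 0, 'c1': 'this', 'word': 'http://example.org', 'c4': 'website'},
    ]

    vec = DictVectorizer(sparse=False)
    X = vec.fit_transform(records)

    print(vec.get_feature_names_out())
    # something like:
    # ['c1=from' 'c1=this' 'c4=available' 'c4=website' 'f1' 'f2' 'f3'
    #  'word=http://example.com' 'word=http://example.org']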

This is what I tried to do:

#First I read the train csv:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

train = pd.read_csv('train.csv')

#Drop the label field to keep only the features:
train_X = train.drop(['label'], axis=1)

#Take the labels separately:
train_y = train.label.values

#Then I convert the pandas DataFrame into a list of dictionaries:
train_X = train_X.to_dict('records')

#And I tried to vectorize everything:
vec = DictVectorizer()
train_X = vec.fit_transform(train_X).toarray()

Obviously it didn't work, because the context and word fields can contain very long strings such as URLs.
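One thing I haven't tried yet (just a sketch, assuming the failure comes from calling .toarray() on a matrix with one column per distinct URL): keep the DictVectorizer output sparse and feed it to a classifier that accepts sparse input, for example LogisticRegression:

    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    train = pd.read_csv('train.csv')
    train_y = train.label.values
    records = train.drop(['label'], axis=1).to_dict('records')

    vec = DictVectorizer()                 # sparse=True is the default
    train_X = vec.fit_transform(records)   # scipy sparse matrix, no .toarray()

    clf = LogisticRegression(max_iter=1000).fit(train_X, train_y)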

Any suggestions? I accept all kinds of solutions.

Thank you very much.


Solution

  • If the set of unique words is finite, you can do something like this using pandas:

    mapping_dict = {'word1': 0,
                    'word2': 1,
                    'word3': 2}

    df[col] = df[col].map(mapping_dict)
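    A minimal sketch of how this could look across all the context columns (the column names and the handling of unseen words are assumptions):

    import pandas as pd

    df = pd.read_csv('train.csv')
    context_cols = ['c1', 'c2', 'c3', 'word', 'c4', 'c5', 'c6']

    # example vocabulary; any word not in it gets a reserved "unknown" code
    mapping_dict = {'from': 0, 'this': 1, 'website': 2}
    unknown_code = len(mapping_dict)

    for col in context_cols:
        df[col] = df[col].map(mapping_dict).fillna(unknown_code).astype(int)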