sklearn LabelEncoder to combine multiple values into a single label


I want to run classification on a column that has a few possible values, but I want to consolidate them into fewer labels.

For example, a job may have several end states: success, fail, error, killed. I want to classify the jobs into two groups of end states: one group containing error and killed, and another group containing only success and fail.

I cannot find a way to do that with sklearn's LabelEncoder, other than manually changing the target column myself (by assigning 1 to success or fail and 0 to everything else).

EDIT: an example. This is the output I need:

>>> label_binarize(['success','fail','error','killed', 'success'], classes=(['success', 'fail']))
array([[1],
       [1],
       [0],
       [0],
       [1]])

Unfortunately, label_binarize (or LabelBinarizer, for that matter) produces a separate column for each class. THIS IS NOT WHAT I WANT:

>>> label_binarize(['success','fail','error','killed', 'success'], classes=['success', 'fail'])
array([[1, 0],
       [0, 1],
       [0, 0],
       [0, 0],
       [1, 0]])

Any ideas on how to do that?


Solution

  • Maybe you should check out label_binarize. You could set success as the only class, so that every other value defaults to 0. This gives the same result as changing the data before encoding, but might fit better into your pipeline.

    from sklearn.preprocessing import label_binarize
    label_binarize(['success','fail','error','killed', 'success'], classes=['success'])
    

    Output

    array([[1],
           [0],
           [0],
           [0],
           [1]])
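
Note that with classes=['success'] the value fail also maps to 0, whereas the question asks for success and fail to share one label. A minimal sketch of one way to get that grouping is below. It uses NumPy's np.isin rather than anything in sklearn.preprocessing, and the helper name group_labels and its positive parameter are hypothetical, introduced only for this illustration.

```python
import numpy as np


def group_labels(states, positive=("success", "fail")):
    """Return a 1-D int array: 1 where the state is in `positive`, else 0.

    `group_labels` and `positive` are illustrative names, not part of sklearn.
    """
    # np.isin tests each element of `states` for membership in `positive`
    return np.isin(np.asarray(states), list(positive)).astype(int)


states = ["success", "fail", "error", "killed", "success"]
y = group_labels(states)
print(y)  # [1 1 0 0 1]
```

The result is a flat array. If a column vector like the one in the question is needed, y.reshape(-1, 1) should give that shape.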