Search code examples
pythonlabel-encoding

Is it possible to apply sklearn.preprocessing.LabelEncoder() on a 2D list?


Say I have a list as given below:

l = [
       ['PER', 'O', 'O', 'GEO'],
       ['ORG', 'O', 'O', 'O'],
       ['O', 'O', 'O', 'GEO'],
       ['O', 'O', 'PER', 'O']
    ]

I want to encode the 2D list with LabelEncoder().

It should look something like:

l = [
       [1, 0, 0, 2],
       [3, 0, 0, 0],
       [0, 0, 0, 2],
       [0, 0, 1, 0]
    ]

Is it possible? If not, is there any workaround?

Thanks in advance!


Solution

  • You can flatten the list, fit the encoder with all the potential values and then use the encoder to transform each sublist, as shown below:

    from sklearn.preprocessing import LabelEncoder
    
    l = [
           ['PER', 'O', 'O', 'GEO'],
           ['ORG', 'O', 'O', 'O'],
           ['O', 'O', 'O', 'GEO'],
           ['O', 'O', 'PER', 'O']
        ]
    
    flattened_l = [e for sublist in l for e in sublist]
    
    # flattened_l is ['PER', 'O', 'O', 'GEO', 'ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'GEO', 'O', 'O', 'PER', 'O']
    
    le = LabelEncoder().fit(flattened_l)
    
    # See the mapping generated by the encoder:
    list(enumerate(le.classes_))
    # [(0, 'GEO'), (1, 'O'), (2, 'ORG'), (3, 'PER')]
    
    # And, finally, transform each sublist:
    res = [list(le.transform(sublist)) for sublist in l]
    res
    
    # Getting the result you want:
    # [[3, 1, 1, 0], [2, 1, 1, 1], [1, 1, 1, 0], [1, 1, 3, 1]]