Say I have a list as given below:
l = [
['PER', 'O', 'O', 'GEO'],
['ORG', 'O', 'O', 'O'],
['O', 'O', 'O', 'GEO'],
['O', 'O', 'PER', 'O']
]
I want to encode this 2D list with scikit-learn's LabelEncoder().
The result should look something like:
l = [
[1, 0, 0, 2],
[3, 0, 0, 0],
[0, 0, 0, 2],
[0, 0, 1, 0]
]
Is it possible? If not, is there any workaround?
Thanks in advance!
You can flatten the list, fit the encoder on all the values it contains, and then use the fitted encoder to transform each sublist:
from sklearn.preprocessing import LabelEncoder
l = [
['PER', 'O', 'O', 'GEO'],
['ORG', 'O', 'O', 'O'],
['O', 'O', 'O', 'GEO'],
['O', 'O', 'PER', 'O']
]
flattened_l = [e for sublist in l for e in sublist]
# flattened_l is ['PER', 'O', 'O', 'GEO', 'ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'GEO', 'O', 'O', 'PER', 'O']
le = LabelEncoder().fit(flattened_l)
# See the mapping generated by the encoder:
list(enumerate(le.classes_))
# [(0, 'GEO'), (1, 'O'), (2, 'ORG'), (3, 'PER')]
# And, finally, transform each sublist:
# .tolist() converts the NumPy integers to plain Python ints
res = [le.transform(sublist).tolist() for sublist in l]
res
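# The structure is what you want, but the codes differ from your example:
# LabelEncoder numbers the classes in sorted order (GEO=0, O=1, ORG=2, PER=3).
# [[3, 1, 1, 0], [2, 1, 1, 1], [1, 1, 1, 0], [1, 1, 3, 1]]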
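If every sublist has the same length, as in your example, the per-sublist loop can probably be avoided by going through NumPy. The following is only a sketch under that assumption: it flattens the array, encodes it in one call, and restores the original shape. The names arr, encoded and decoded are just illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

l = [
    ['PER', 'O', 'O', 'GEO'],
    ['ORG', 'O', 'O', 'O'],
    ['O', 'O', 'O', 'GEO'],
    ['O', 'O', 'PER', 'O'],
]

# Assumption: the list is rectangular, so it converts to a 2D array of strings
arr = np.array(l)

le = LabelEncoder()
# LabelEncoder expects 1D input: flatten, encode, then restore the 2D shape
encoded = le.fit_transform(arr.ravel()).reshape(arr.shape)
res = encoded.tolist()

# Round trip back to the original labels with the same fitted encoder
decoded = le.inverse_transform(encoded.ravel()).reshape(arr.shape).tolist()

print(res)
print(decoded == l)
```

For ragged lists (sublists of different lengths) this conversion will not give a clean 2D array, so the list-comprehension version above is the safer choice there.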