I am using the CTU-13 dataset, which consists of 13 scenarios, for botnet detection.
Here the target variable, Label, is a string. Label-encoding it directly produces roughly 52-60 unique numeric values, and the exact number varies between scenarios. But I observed that if the encoding could be done with substring matching or a regular expression, the labels could be reduced to just 3 values. The problem would then be a three-class (ternary) classification problem, and plotting the ROC curve and computing the AUC score would be much simpler.
For example, a mapping of the 3 cases like "%background%": 0, "%normal%": 1, "%botnet%": 2 could be used.
Then a label such as "to-background udp flows" would be encoded as 0, a label such as "to-normal tcp flows" would be encoded as 1, and so on. Is there any standard or custom way to encode labels like this?
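One possible way to express the mapping described above is a regular expression that pulls out the class keyword, followed by a dictionary lookup. This is only a sketch: the sample labels are made up for illustration (the first two come from the question, the others are hypothetical), and the names CLASS_CODES and family are my own.

```python
import re
import pandas as pd

# Hypothetical sample labels; real CTU-13 label strings differ per scenario.
labels = pd.Series([
    "to-background udp flows",
    "to-normal tcp flows",
    "From-Botnet-V42-UDP-DNS",
    "Background-TCP-Established",
])

# Desired codes, taken from the mapping proposed in the question.
CLASS_CODES = {"background": 0, "normal": 1, "botnet": 2}

# Extract whichever class keyword appears in the label, ignoring case.
pattern = r"(background|normal|botnet)"
family = labels.str.extract(pattern, flags=re.IGNORECASE, expand=False).str.lower()

# Map the keyword to its numeric code; labels with no keyword would become NaN.
encoded = family.map(CLASS_CODES)
print(encoded.tolist())
```

A label that matches none of the three keywords ends up as NaN, which makes unexpected label strings easy to spot before training.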
In the end, I solved the problem with the following code. First, each label is matched against the three class names with the str.contains method and replaced by the matching class name. Label-encoding the resulting column then gives the three expected values.
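from sklearn.preprocessing import LabelEncoder
# Assign to the Label column only; without the column name, every column in the matching rows is overwritten
cat_data.loc[cat_data.Label.str.contains('Normal'), 'Label'] = 'Normal'
cat_data.loc[cat_data.Label.str.contains('Background'), 'Label'] = 'Background'
cat_data.loc[cat_data.Label.str.contains('Botnet'), 'Label'] = 'Botnet'
le = LabelEncoder()
target = le.fit_transform(cat_data.Label)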
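One thing to keep in mind: LabelEncoder assigns codes in sorted class order, so the result here is Background: 0, Botnet: 1, Normal: 2, which is not the 0/1/2 order proposed in the question. If the exact order matters, a pandas Categorical with an explicit category list is one alternative. The sketch below uses a small made-up Series in place of cat_data.Label, and class_order is a name I chose for illustration.

```python
import pandas as pd

# Stand-in for cat_data.Label after the labels have been collapsed to 3 classes.
collapsed = pd.Series(["Normal", "Background", "Botnet", "Background"])

# The position in this list becomes the numeric code.
class_order = ["Background", "Normal", "Botnet"]

# Values not listed in class_order would receive the code -1.
codes = pd.Categorical(collapsed, categories=class_order).codes
target = pd.Series(codes, index=collapsed.index)
print(target.tolist())
```

Either encoding works for a classifier; the explicit order only matters when the numeric codes have to line up with a fixed convention, for example in plot legends or per-class ROC curves.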