python regex pandas scikit-learn multiclass-classification

How to encode a string target variable as numeric using substring matching or a regular expression


I am using the CTU-13 dataset, which consists of 13 scenarios for botnet detection. The target variable, Label, is a string. Label-encoding it directly produces around 52-60 unique numeric values, with the count varying between scenarios. However, I observed that if the labels could be encoded by substring matching or a regular expression, they would reduce to just 3 values. The problem would then be a three-class (ternary) classification problem, and plotting ROC curves and computing AUC scores would be much simpler.
For example, a mapping of 3 cases such as %background%: 0, %normal%: 1, %botnet%: 2 could be used. A string such as to-background udp flows would then be labeled 0, a string such as to-normal tcp flows would be labeled 1, and so on. Is there any standard or custom way to encode labels like this?


Solution

  • I finally solved the problem with the following code. First, every label containing one of the three substrings is replaced by that substring, using the string contains method. Label-encoding the resulting column then gives the expected values.

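    from sklearn.preprocessing import LabelEncoder

    # Overwrite only the Label column; omitting the column would overwrite the whole row
    cat_data.loc[cat_data.Label.str.contains('Normal'), 'Label'] = 'Normal'
    cat_data.loc[cat_data.Label.str.contains('Background'), 'Label'] = 'Background'
    cat_data.loc[cat_data.Label.str.contains('Botnet'), 'Label'] = 'Botnet'
    le = LabelEncoder()  # numbers classes in sorted order: Background, Botnet, Normal
    target = le.fit_transform(cat_data.Label)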
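
    If the exact codes from the question (background: 0, normal: 1, botnet: 2) matter, a regular expression with an explicit mapping is one possible alternative to LabelEncoder. The following is only a minimal sketch: the sample labels are made up in the style of CTU-13 and may not match the real values, and class_codes and keyword are names chosen here for illustration.

```python
import re

import pandas as pd

# Hypothetical sample labels in the style of CTU-13; real values may differ.
cat_data = pd.DataFrame({
    "Label": [
        "flow=To-Background-UDP-CVUT-DNS-Server",
        "flow=From-Normal-V42-Stribrek",
        "flow=From-Botnet-V42-TCP-Established",
        "flow=Background-TCP-Established",
    ]
})

# Class codes requested in the question.
class_codes = {"background": 0, "normal": 1, "botnet": 2}

# Pull out the first matching keyword, ignoring case.
# Rows that match none of the keywords become NaN.
keyword = cat_data["Label"].str.extract(
    r"(background|normal|botnet)", flags=re.IGNORECASE, expand=False
)

# Normalise case, then translate each keyword to its numeric code.
target = keyword.str.lower().map(class_codes)

print(target.tolist())
```

    Unlike LabelEncoder, which numbers classes in sorted order, this keeps the codes under your control. Labels that match none of the three keywords end up as NaN, so it is worth checking target.isna().sum() before training.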