I have a dataframe that looks like the following:
source_ip destination_ip malware_tag ransomware_tag brutefore_tag source_bytes destination_bytes label
ip_1 ip_2 True True False 10 20 0
ip_1 ip_2 True False False 20 60 0
ip_1 ip_2 True False False 30 100 0
ip_1 ip_2 True True False 40 300 0
ip_3 ip_4 False False True 5 20 1
ip_3 ip_4 False False True 500 9 1
ip_3 ip_4 False True True 200 15 1
I want first, for each 'label' to create a new column called 'attack type' which is created based on the three boolean tag column; by considering only the most common/frequent attack flag. So I want the resulting dataframe to look as the following:
source_ip destination_ip source_bytes destination_bytes label attack_type
ip_1 ip_2 10 20 0 malware
ip_1 ip_2 20 60 0 malware
ip_1 ip_2 30 100 0 malware
ip_1 ip_2 40 300 0 malware
ip_3 ip_4 5 20 1 bruteforre
ip_3 ip_4 500 9 1 bruteforce
ip_3 ip_4 200 15 1 bruteforce
Second summarize (sum of source and destination bytes) the resulting dataframe by source and destination IPs. So the final dataframe should look like:
source_ip destination_ip total_source_bytes total_destination_bytes attack_type
ip_1 ip_2 100 480 malware
ip_3 ip_4 705 44 bruteforce
I would appreciate any help to achieve this. Thanks.
Let's try idxmax
to extract the attack type for each label, then map
to convert the label to the most common attack type:
atk_by_labels = df.filter(like='tag').groupby(df['label']).sum().idxmax(1)
# first summary
df['attack_type'] = df['label'].map(atk_by_labels)
# second summary
(df.groupby(['source_ip', 'destination_ip', 'attack_type'], as_index=False)
[['total_source_bytes','total_destination_bytes']].sum()
)