I have a .csv file full of packet header information. Here are the first few lines:
28;03/07/2000;11:27:51;00:00:01;8609;4961;8609;097.139.024.164;131.084.001.031;0;-
29;03/07/2000;11:27:51;00:00:01;29396;4962;29396;058.106.180.191;131.084.001.031;0;-
30;03/07/2000;11:27:51;00:00:01;26290;4963;26290;060.075.194.137;131.084.001.031;0;-
31;03/07/2000;11:27:51;00:00:01;28324;4964;28324;038.087.169.169;131.084.001.031;0;-
There are about 33k lines overall (each line is the information from a different packet header). Now I need to calculate the entropy using the source and destination addresses.
Using the code I wrote:
def openFile(file_name):
    srcFile = open(file_name, 'r')
    dataset = []
    for line in srcFile:
        newLine = line.split(";")
        dataset.append(newLine)
    return dataset
I get a return value that looks like
dataset = [
['28', '03/07/2000', '11:27:51', '00:00:01', '8609', '4961', '8609', '097.139.024.164', '131.084.001.031', '0', '-\n'],
['29', '03/07/2000', '11:27:51', '00:00:01', '29396', '4962', '29396', '058.106.180.191', '131.084.001.031', '0', '-\n'],
['30', '03/07/2000', '11:27:51', '00:00:01', '26290', '4963', '26290', '060.075.194.137', '131.084.001.031', '0', '-\n'],
['31', '03/07/2000', '11:27:51', '00:00:01', '28324', '4964', '28324', '038.087.169.169', '131.084.001.031', '0', '-']
]
and I pass it to my Entropy function:
#---- Entropy += - prob * math.log(prob, 2) ---------
def Entropy(data):
    entropy = 0
    counter = 0  # -- counter for occurrences of the same ip address
    #-- For loop to iterate through every item in outer list
    for item in range(len(data)):
        #-- For loop to iterate through inner list
        for x in data[item]:
            if x == data[item][8]:
                counter += 1
        prob = float(counter) / len(data)
        entropy += -prob * math.log(prob, 2)
    print("\n")
    print("Entropy: {}".format(entropy))
The code runs without any error, but it gives a bad entropy value, and I feel it's because of a bad probability calculation (that second for loop is suspicious) or a bad entropy formula. Is there any way to find the probability of an IP occurrence without another for loop? Any editing of the code is welcome.
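The probability calculation is indeed the problem: counter is never reset, and prob and entropy are updated once per row rather than once per distinct address. Shannon entropy should be computed over the distribution of distinct addresses, i.e. p_i = (number of rows containing address i) / (total number of rows), and Entropy = sum of -p_i * log2(p_i) over the distinct addresses.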
Using numpy and the built-in collections module you can greatly simplify the code:
import numpy as np
import collections
sample_ips = [
"131.084.001.031",
"131.084.001.031",
"131.284.001.031",
"131.284.001.031",
"131.284.001.000",
]
C = collections.Counter(sample_ips)                # occurrences of each distinct address
counts = np.array(list(C.values()), dtype=float)   # list() is needed on Python 3
prob = counts / counts.sum()                       # relative frequency of each address
shannon_entropy = (-prob * np.log2(prob)).sum()
print(shannon_entropy)
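Applied to your parsed dataset, the same idea might look like the sketch below. It assumes, as in your sample rows, that the source address sits at index 7 and the destination address at index 8, and "packets.csv" is just a placeholder for your actual filename:
import collections
import numpy as np

def ip_entropy(dataset, column):
    # Count how often each distinct address occurs in the given column,
    # convert the counts to probabilities and apply the Shannon formula.
    counter = collections.Counter(row[column] for row in dataset)
    counts = np.array(list(counter.values()), dtype=float)
    prob = counts / counts.sum()
    return (-prob * np.log2(prob)).sum()

dataset = openFile("packets.csv")  # placeholder filename, use your own
print("Source entropy:      {}".format(ip_entropy(dataset, 7)))
print("Destination entropy: {}".format(ip_entropy(dataset, 8)))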