
Entropy of IP packet information


I have a .csv file full of packet header information. The first few lines:

28;03/07/2000;11:27:51;00:00:01;8609;4961;8609;097.139.024.164;131.084.001.031;0;-
29;03/07/2000;11:27:51;00:00:01;29396;4962;29396;058.106.180.191;131.084.001.031;0;-
30;03/07/2000;11:27:51;00:00:01;26290;4963;26290;060.075.194.137;131.084.001.031;0;-
31;03/07/2000;11:27:51;00:00:01;28324;4964;28324;038.087.169.169;131.084.001.031;0;- 

There are about 33k lines overall (each line holds the information from a different packet header). Now I need to calculate entropy using the source and destination addresses.

Using the code I wrote:

def openFile(file_name):
    dataset = []
    # 'with' closes the file once the loop is done
    with open(file_name, 'r') as srcFile:
        for line in srcFile:
            newLine = line.split(";")
            dataset.append(newLine)
    return dataset

I get a return that looks like

dataset = [
    ['28', '03/07/2000', '11:27:51', '00:00:01', '8609', '4961', '8609', '097.139.024.164', '131.084.001.031', '0', '-\n'], 
    ['29', '03/07/2000', '11:27:51', '00:00:01', '29396', '4962', '29396', '058.106.180.191', '131.084.001.031', '0', '-\n'], 
    ['30', '03/07/2000', '11:27:51', '00:00:01', '26290', '4963', '26290', '060.075.194.137', '131.084.001.031', '0', '-\n'],
    ['31', '03/07/2000', '11:27:51', '00:00:01', '28324', '4964', '28324', '038.087.169.169', '131.084.001.031', '0', '-']
]

and I pass it to my Entropy function:

import math

#---- Entropy += - prob * math.log(prob, 2) ---------
def Entropy(data):
    entropy = 0
    counter = 0 # -- counter for occurrences of the same IP address
    #-- For loop to iterate through every item in outer list
    for item in range(len(data)):
        #-- For loop to iterate through inner list
        for x in data[item]:
            if x == data[item][8]: 
                counter += 1
        prob = float(counter) / len(data)
        entropy += -prob * math.log(prob, 2)
    print("\n")
    print("Entropy: {}".format(entropy))

The code runs without any errors, but it gives a bad entropy value, and I feel it's because of a bad probability calculation (that second for loop is suspicious) or a bad entropy formula. Is there any way to find the probability of each IP's occurrence without another for loop? Any edits to the code are welcome.


Solution

  • Using numpy and the built-in collections module you can greatly simplify the code:

    import numpy as np
    import collections
    
    sample_ips = [
        "131.084.001.031",
        "131.084.001.031",
        "131.284.001.031",
        "131.284.001.031",
        "131.284.001.000",
    ]
    
    C = collections.Counter(sample_ips)
    # list(...) is needed on Python 3, where dict.values() returns a view
    counts  = np.array(list(C.values()), dtype=float)
    prob    = counts/counts.sum()
    shannon_entropy = (-prob*np.log2(prob)).sum()
    print(shannon_entropy)
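
To connect this back to the question's data, here is a minimal sketch that applies the same counting idea directly to rows shaped like the output of `openFile`, using only the standard library. The helper name `column_entropy` is hypothetical. The column indices (7 for source, 8 for destination) are an assumption read off the sample rows, and computing one entropy per column is only one reading of "entropy using source and destination addresses".

```python
import collections
import math


def column_entropy(rows, column):
    """Shannon entropy (in bits) of the values in one column of `rows`.

    `column_entropy` is a hypothetical helper, not part of any library.
    """
    # Count how often each distinct value (e.g. each IP address) occurs.
    counts = collections.Counter(row[column] for row in rows)
    total = sum(counts.values())
    # One probability per distinct value, so the probabilities sum to 1.
    probs = [n / total for n in counts.values()]
    return sum(-p * math.log2(p) for p in probs)


# The four sample rows from the question, as returned by openFile().
dataset = [
    ['28', '03/07/2000', '11:27:51', '00:00:01', '8609', '4961', '8609',
     '097.139.024.164', '131.084.001.031', '0', '-\n'],
    ['29', '03/07/2000', '11:27:51', '00:00:01', '29396', '4962', '29396',
     '058.106.180.191', '131.084.001.031', '0', '-\n'],
    ['30', '03/07/2000', '11:27:51', '00:00:01', '26290', '4963', '26290',
     '060.075.194.137', '131.084.001.031', '0', '-\n'],
    ['31', '03/07/2000', '11:27:51', '00:00:01', '28324', '4964', '28324',
     '038.087.169.169', '131.084.001.031', '0', '-'],
]

SRC_COLUMN = 7  # assumed position of the source address
DST_COLUMN = 8  # assumed position of the destination address

print("Source entropy: {}".format(column_entropy(dataset, SRC_COLUMN)))
print("Destination entropy: {}".format(column_entropy(dataset, DST_COLUMN)))
```

The difference from the original `Entropy` function is that the sum runs over distinct addresses rather than over rows, and each count is computed once by `Counter` instead of accumulating in a `counter` variable that is never reset between rows.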