
How to effectively separate data inputs of varying sizes?


I'm trying to write a program that takes in a pcap file, filters the packet data through the application tshark, and outputs the data into a dictionary, separating the individual packets. I'm having problems with the separation step.

Here is basically what I have so far:

#example data input
records = ["Jamie,20,12/09/1997,Henry,15,05/12/2002,Harriot,22,11/02/1995"]

dict = {}
list1 = str(records).split(',')
i = 0
#separates list into sublists of length 3
list1 = [list1[i:i + 3] for i in range(0, len(list1), 3)] 

#places the sublists into a dictionary
for i in range(0, len(list1)):
    dict[i] = list1[i][0].split(',') + list1[i][1].split(',') + list1[i][2].split(',')

print(dict)

The output looks like this:

{0: ["['Jamie", '20', '12/09/1997'], 1: ['Henry', '15', '05/12/2002'], 2: ['Harriot', '22', "11/02/1995']"]}

I understand my code is quite flawed and messy. To store more data from each row, you have to manually add each additional field to the dictionary and change where the list is split. Any help on how to better automate this process, given an input of varying size, would be greatly appreciated. If I explained my problem poorly, just ask.

EDIT: Here is the code I use to call tshark. The input for the previous code is "out" converted to a string. The name, age and date of birth in the previous example represent ip source, ip destination and protocol.

filters = ["-e","ip.src"," -e ","ip.dst"," -e ","_ws.col.Protocol] #Specifies the metadeta to be extracted

tsharkCall = ["tshark.exe", "-r", inputpcap, "-T", "fields", filters]
tsharkProc = subprocess.Popen(tsharkCall, stdout=subprocess.PIPE)

out, err = tsharkProc.communicate()

Solution

  • Consider something like the following:

    filters = ["ip.src","ip.dst","_ws.col.Protocol"] #Specifies the metadeta to be extracted
    ex_base = 'tshark.exe -r {path} -Tfields {fields}'
    ex = ex_base.format(path=myfile, fields=' '.join('-e ' + f for f in filters))
    tsharkProc = subprocess.Popen(ex.split(), stdout=subprocess.PIPE, universal_newlines=True)
    
    out, err= tsharkProc.communicate()
    
    split_records = [line.split('\t') for line in out.split('\n')]
    records = [dict(zip(filters, line)) for line in split_records]
    
    # [{'ip.src': '127.0.0.1', 'ip.dst': '192.168.0.1', '_ws.col.Protocol': 'something'}, {...}, ...]
    

    This assumes that you leave the default output delimiters, that is, newlines between records and tabs between fields. By zipping your fields array against your output records, you'll automatically expand the dictionaries to fit new fields as you add them to that array.
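
    For instance, here is a minimal sketch (with made-up addresses standing in for real tshark output) of how the same zip-based code picks up however many fields you request:

    #hypothetical two-record tshark output: tab-separated fields, newline-separated records
    out = '127.0.0.1\t192.168.0.1\tTCP\n10.0.0.5\t10.0.0.9\tUDP\n'
    filters = ["ip.src", "ip.dst", "_ws.col.Protocol"]
    
    split_records = [line.split('\t') for line in out.split('\n') if line]
    records = [dict(zip(filters, rec)) for rec in split_records]
    print(records[0]['ip.src'])  # 127.0.0.1
    
    #adding a fourth name to filters (and to the tshark call) is the only change
    #needed; zip pairs each field name with its column automatically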

    Note that you could also use pandas to solve this problem elegantly, like:

    import pandas as pd
    records = pd.DataFrame(split_records, columns=filters)
    

    This would give you a dataframe structure to work with, which might be useful depending on your application.
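
    For example, assuming the split_records and filters from above, a per-source packet count becomes a one-liner (the sample rows here are made up for illustration):

    import pandas as pd
    
    split_records = [['127.0.0.1', '192.168.0.1', 'TCP'],
                     ['10.0.0.5', '10.0.0.9', 'UDP']]  #stand-in rows
    filters = ["ip.src", "ip.dst", "_ws.col.Protocol"]
    records = pd.DataFrame(split_records, columns=filters)
    
    #count how many packets each source address sent
    print(records['ip.src'].value_counts())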