python wireshark tshark

tshark extract fields with their string representation


I have a pcap file captured with tshark that I want to analyze and export to a CSV or XLS file. According to the tshark documentation, I can either use the -z option with suitable arguments or -T together with -E and -e. I'm using Python 3.6 on a Debian machine. Currently, my command looks like this:

command="tshark -q -o tcp.relative_sequence_numbers:false -o tcp.analyze_sequence_numbers:false " \
              "-o tcp.track_bytes_in_flight:false -Q -l -z diameter,avp,272,Session-Id,Origin-Host," \
              "Origin-Realm,Destination-Realm,Auth-Application-Id,Service-Context-Id,CC-Request-Type,CC-Request-Number," \
              "Subscription-Id,CC-Session-Failover,Destination-Host,User-Name,Origin-State-Id," \
              "Multiple-Services-Credit-Control,Requested-Service-Unit,Used-Service-Unit,SN-Total-Used-Service-Unit," \
              "SN-Remaining-Service-Unit,Service-Identifier,Rating-Group,User-Equipment-Info,Service-Information," \
              "Route-Record,Credit-Control-Failure-Handling -r {}".format(args.input_file)

I then process the output with a pandas DataFrame like so:

    # add a "decode as Diameter" rule for every TCP and/or UDP port given on the command line
    if args.tcp:
        for port in args.tcp:
            command += " -d tcp.port=={},diameter".format(port)

    if args.udp:
        for port in args.udp:
            command += " -d udp.port=={},diameter".format(port)

    # calling subprocess with output redirection to task variable
    task = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)

    # a loop adding new data dictionaries to data_list
    for line in task.stdout:
        line = re.sub(r"'", "", line.decode("utf-8"))  # first, decode the byte string and strip the quotes
        # then split on every whitespace character or "=" to get a flat list of alternating keys and values
        line = re.split(r"\s|=", line)

        # pair item i (key) with item i+1 (its value) and build an ordered dictionary
        # so that the column order is preserved
        row = OrderedDict(line[i:i+2] for i in range(0, len(line)-2, 2))
        data_list.append(row)

    # remove last 4 dictionaries (last 4 lines of task.stdout)
    data_list = data_list[:-4]
    df = pd.DataFrame(data_list).fillna("-") # create data frame from list of dicts and fill each NaN with "-"
    df.to_excel("{}.xls".format(args.output_file), index=False)
    print("Please remember that 'frame' column may not correspond to row index!")
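
The splitting step above breaks as soon as a value itself contains a space or an equals sign. A more defensive sketch, assuming every output line consists of key='value' pairs, could look like the following. The names PAIR_RE and parse_avp_line are hypothetical, and the sample line is illustrative rather than real tshark output:

```python
import re
from collections import OrderedDict

# Matches key='value' pairs; a value may contain spaces or "=" but no single quote.
PAIR_RE = re.compile(r"([\w.-]+)='([^']*)'")


def parse_avp_line(line):
    """Return the key/value pairs found in one output line, in their original order."""
    return OrderedDict(PAIR_RE.findall(line))


# Illustrative line only (assumption): real output carries more columns.
sample = "frame='12' cmd='272' Session-Id='host.example;1;a=b' CC-Request-Type='3'"
row = parse_avp_line(sample)
print(row["Session-Id"])
```

Lines without any key='value' pair yield an empty dictionary, so they can be skipped with a simple truth test instead of slicing off a fixed number of trailing rows.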

When I open the output file, it looks fine except that some columns, e.g. CC-Request-Type, contain numeric codes instead of their string representation. For example, Wireshark displays the value for one packet as TERMINATION-REQUEST, but the corresponding row of the output Excel file contains 3 in that column.

My question is: how can I translate this number to its string representation while using the -z option, or (as I gather from what I've seen on the web) how can I get the fields mentioned above with their values using -T and -e? I listed all available fields with tshark -G, but there are too many of them and I can't think of a reasonable way to find the ones I want.
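
One way to narrow down the field list is to filter the report by protocol. The sketch below assumes the report is tab-separated, that field lines start with F, and that the third and fifth columns hold the field abbreviation and the parent protocol; the function name diameter_fields and the sample text are hypothetical:

```python
def diameter_fields(dump_text, needle=""):
    """Return abbreviations of diameter fields whose name contains `needle`."""
    found = []
    for line in dump_text.splitlines():
        cols = line.split("\t")
        # Assumed layout: F, descriptive name, abbreviation, type, parent protocol, ...
        if len(cols) >= 5 and cols[0] == "F" and cols[4] == "diameter":
            if needle.lower() in cols[2].lower():
                found.append(cols[2])
    return found


# Illustrative sample (assumption), shaped like a few lines of the field report.
sample = (
    "P\tDiameter Protocol\tdiameter\n"
    "F\tCC-Request-Type\tdiameter.CC-Request-Type\tFT_UINT32\tdiameter\tBASE_DEC\t0x0\t\n"
    "F\tSource Port\ttcp.srcport\tFT_UINT16\ttcp\tBASE_DEC\t0x0\t\n"
)
print(diameter_fields(sample, "request"))
```

On a real system the text could come from subprocess.run(["tshark", "-G", "fields"], stdout=subprocess.PIPE, universal_newlines=True).stdout, and the abbreviations found this way are the kind of names that -e expects.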


Solution

  • Thanks to John Zwick's suggestion, this answer, and the Python documentation on the ElementTree XML API, I implemented the code below. It keeps the -z output and translates the numeric codes to names using the enum entries of dictionary.xml and chargecontrol.xml, which I downloaded from the official Wireshark GitHub repository:

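    import re
    import subprocess
    import xml.etree.ElementTree as ET
    from collections import OrderedDict

    import pandas as pd

    # parse the Diameter dictionaries downloaded from the Wireshark repository
    chargecontrol_tree = ET.parse("chargecontrol.xml")
    dictionary_tree = ET.parse("dictionary.xml")
    chargecontrol_root = chargecontrol_tree.getroot()
    dictionary_root = dictionary_tree.getroot()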
    
    # list that will contain data dictionaries
    data_list = []
    
    # base command
    command = "tshark -q -o tcp.relative_sequence_numbers:false -o tcp.analyze_sequence_numbers:false " \
              "-o tcp.track_bytes_in_flight:false -Q -l -z diameter,avp,272,Session-Id,Origin-Host," \
              "Origin-Realm,Destination-Realm,Auth-Application-Id,Service-Context-Id,CC-Request-Type,CC-Request-Number," \
              "Subscription-Id-Data,Subscription-Id-Type,CC-Session-Failover,Destination-Host,User-Name,Origin-State-Id," \
              "Requested-Service-Unit,Used-Service-Unit,SN-Total-Used-Service-Unit," \
              "SN-Remaining-Service-Unit,Service-Identifier,Rating-Group,User-Equipment-Info,Service-Information," \
              "Route-Record,Credit-Control-Failure-Handling -r {}".format(args.input_file)
    
    # add a "decode as Diameter" rule for every TCP and/or UDP port given on the command line
    if args.tcp:
        for port in args.tcp:
            command += " -d tcp.port=={},diameter".format(port)
    
    if args.udp:
        for port in args.udp:
            command += " -d udp.port=={},diameter".format(port)
    
    # calling subprocess with output redirection to task variable
    task = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
    
    # a loop adding new data dictionaries to data_list
    for line in task.stdout:
        line = re.sub(r"'", "", line.decode("utf-8"))  # first, decode the byte string and strip the quotes
        # then split on every whitespace character or "=" to get a flat list of alternating keys and values
        line = re.split(r"\s|=", line)

        # pair item i (key) with item i+1 (its value) and build an ordered dictionary
        # so that the column order is preserved
        row = OrderedDict(line[i:i+2] for i in range(0, len(line)-2, 2))
        data_list.append(row)
    
    # remove last 4 dictionaries (last 4 lines of task.stdout)
    data_list = data_list[:-4]
    df = pd.DataFrame(data_list).fillna("-") # create data frame from list of dicts and fill each NaN with "-"
    
    # the code-to-name mappings come from the Diameter dictionaries in the Wireshark repository
    # https://github.com/boundary/wireshark/blob/master/diameter/dictionary.xml
    # https://github.com/wireshark/wireshark/blob/2832f4e97d77324b4e46aac40dae0ce898ae559d/diameter/chargecontrol.xml
    df["Auth-Application-Id"] = df["Auth-Application-Id"].map({node.attrib["code"]:node.attrib["name"] for node in
          dictionary_root.findall(".//*[@name='Auth-Application-Id']/enum")})
    
    # columns whose numeric codes have to be replaced by the corresponding enum names
    for col in ["CC-Request-Type", "CC-Session-Failover", "Credit-Control-Failure-Handling", "Subscription-Id-Type"]:
        df[col] = df[col].map({node.attrib["code"]: node.attrib["name"] for node in
              chargecontrol_root.findall((".//*[@name='{}']/enum").format(col))})
    
    
    df.to_excel("{}.xls".format(args.output_file), index=False)
    print("Please remember that 'frame' column may not correspond to row index!")
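
One caveat of mapping a column through a plain dictionary is that every value without an entry, including the "-" placeholder, becomes NaN. A minimal sketch of a lookup that falls back to the original value is shown below; the XML fragment is a hypothetical stand-in shaped like the avp/enum entries of chargecontrol.xml, and enum_names and translate are illustrative helper names:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment (assumption), shaped like the <avp>/<enum> entries in chargecontrol.xml.
XML = """
<dictionary>
  <avp name="CC-Request-Type" code="416">
    <enum name="INITIAL_REQUEST" code="1"/>
    <enum name="UPDATE_REQUEST" code="2"/>
    <enum name="TERMINATION_REQUEST" code="3"/>
  </avp>
</dictionary>
"""


def enum_names(root, avp_name):
    """Build a {code: name} dict from the <enum> children of the named AVP."""
    path = ".//*[@name='{}']/enum".format(avp_name)
    return {node.attrib["code"]: node.attrib["name"] for node in root.findall(path)}


def translate(value, mapping):
    """Return the enum name for a code, or the value itself if it has no mapping."""
    return mapping.get(value, value)


root = ET.fromstring(XML)
mapping = enum_names(root, "CC-Request-Type")
column = ["1", "3", "-"]
translated = [translate(value, mapping) for value in column]
print(translated)
```

With pandas, the same fallback can be applied per column with df[col] = df[col].map(lambda v: mapping.get(v, v)), which leaves unmapped cells unchanged.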