Tags: regex, python-3.x, expression, pcap, regular-language

Using Python to search through a PCAP file and return key information about the search query


We have been given a PCAP file, and my job is to find the following:

Before the user got infected/attacked, they used a popular search engine (not Google) to search some information. Use Python to find out 1) which search engine and 2) which keywords they used to do such searches. 3) Which website did the search engine recommend and 4) which website did the user actually access?

By opening the PCAP file in Wireshark, I have already found the answer to part 1: Bing. However, I still haven't been able to determine parts 2, 3, and 4.

Using Wireshark is obviously not the purpose of the assignment, though, as I have to use Python to return the information.

The code I have so far is:

# Read the whole capture as raw bytes; the with-block closes the file afterwards
with open('nameofpcapfile.pcap', 'rb') as pcapfile:
    x = pcapfile.read()

# iso-8859-1 maps every byte to a character, so decoding never fails
decoded = x.decode("iso-8859-1")

searchengines = ["www.google.com", "www.yahoo.com", "www.ask.com", "www.bing.com",
                 "www.aol.com", "www.baidu.com", "www.wolframalpha.com",
                 "www.duckduckgo.com", "www.yandex.ru"]

searchenginesfound = []

# Keep every known search engine hostname that appears anywhere in the capture
for i in searchengines:
    if i in decoded:
        searchenginesfound.append(i)


if len(searchenginesfound) == 0:
    print("Search engine not found")
elif len(searchenginesfound) == 1:
    print("Search Engine used: ", searchenginesfound)
else:
    print("Search Engines used: ", searchenginesfound)

This code successfully returns www.bing.com as the search engine used. However, I have no idea what to do for parts 2, 3, and 4.

Any suggestions?


Solution

  • PCAP files have a strict format that delimits the individual packets. Ideally, you would implement a pcap parser that hands you the packets one by one for studying. You took the cruder route of treating the whole file as text (which works in your very specific case :-) ), so that’s what I’ll be documenting. However, I really recommend looking into proper parsing: it’s much easier when you have each packet on its own.

    If you’re allowed to use a library, one such as scapy or dpkt can parse the pcap for you.
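To make the "one packet at a time" idea concrete, here is a minimal sketch of a reader written with only the standard library. The function name iter_pcap_packets is hypothetical, and the sketch assumes the classic pcap layout (a 24-byte global header followed by records with 16-byte headers). It does not handle pcapng, and it does not decode the Ethernet/IP/TCP headers inside each packet.

```python
import struct

PCAP_GLOBAL_HEADER_LEN = 24   # magic, version, timezone, sigfigs, snaplen, link type
PCAP_RECORD_HEADER_LEN = 16   # ts_sec, ts_frac, incl_len, orig_len


def iter_pcap_packets(data):
    """Yield (timestamp_seconds, raw_packet_bytes) for each record in a classic pcap blob."""
    magic = data[:4]
    if magic in (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1"):
        endian = "<"   # file was written little-endian
    elif magic in (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d"):
        endian = ">"   # file was written big-endian
    else:
        raise ValueError("not a classic pcap file (pcapng is not handled in this sketch)")

    offset = PCAP_GLOBAL_HEADER_LEN
    while offset + PCAP_RECORD_HEADER_LEN <= len(data):
        ts_sec, _ts_frac, incl_len, _orig_len = struct.unpack_from(
            endian + "IIII", data, offset
        )
        offset += PCAP_RECORD_HEADER_LEN
        packet = data[offset:offset + incl_len]
        if len(packet) < incl_len:
            break   # truncated final record, stop quietly
        yield ts_sec, packet
        offset += incl_len
```

Each yielded packet can then be searched separately, for example with packet.decode("iso-8859-1"), instead of searching the whole file at once.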

    First, you need to know what you are looking for. The search keywords are sent as parameters in the query string of the HTTP request URL, i.e. the part after the ? where the parameters are separated by &, as in http://www.example.org/?param1=foo&param2=bar
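As a small illustration of how the query string breaks down, the standard library's urllib.parse can split the example URL above into its parameters. This is only a sketch; the second query string (search_query) is a made-up sample, not data from the capture.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://www.example.org/?param1=foo&param2=bar"

# urlsplit isolates the part after "?"; parse_qs turns it into a dict of lists
params = parse_qs(urlsplit(url).query)
print(params)  # {'param1': ['foo'], 'param2': ['bar']}

# parse_qs also decodes "+" and %xx escapes, which matters for multi-word searches
search_query = "q=python+pcap"   # hypothetical sample query string
keywords = parse_qs(search_query)["q"]
print(keywords)  # ['python pcap']
```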

    In your case, as you’re looking for Bing, here’s the list of the parameters you could find: https://learn.microsoft.com/en-us/rest/api/cognitiveservices/bing-web-api-v5-reference#query-parameters

    To get those, you need to extract all URLs first. For that you could use a regex and the Python builtin re module. Look for a good one online; for instance, here’s one I’ve found for HTTP:

    regex = r"((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www\.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w_-]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)"
    

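    Then you’ll need to call re.finditer(regex, decoded) (re.search only returns the first match) and use group() on each match (look that up online :-) ) to collect all URLs. After that, you’ll be able to split each URL on "?" to isolate the query string, then split("&") to get the individual parameters, which include the keywords.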
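Put together, the extraction could look roughly like the sketch below. It makes several assumptions: it uses a deliberately simplified URL pattern rather than the longer regex above, it relies on the search terms being in Bing's q parameter, and the function name find_search_terms and the sample text are hypothetical. It only sees full URLs that appear in the text (for example in Referer headers).

```python
import re
from urllib.parse import urlsplit, parse_qs

# Simplified pattern (an assumption): a scheme, then everything up to
# whitespace, a quote, or an angle bracket
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def find_search_terms(text, engine_host="www.bing.com", param="q"):
    """Return the distinct values of `param` in URLs that point at `engine_host`."""
    terms = []
    for match in URL_PATTERN.finditer(text):
        try:
            parts = urlsplit(match.group(0))
        except ValueError:
            continue   # binary noise can produce strings urlsplit rejects
        if parts.hostname != engine_host:
            continue
        for value in parse_qs(parts.query).get(param, []):
            if value not in terms:
                terms.append(value)
    return terms


# Hypothetical sample text standing in for the decoded capture
sample = (
    "GET / HTTP/1.1\r\n"
    "Referer: http://www.bing.com/search?q=home+improvement&form=QBLH\r\n\r\n"
    "http://www.example.org/?q=ignored"
)
print(find_search_terms(sample))  # ['home improvement']
```

With the real capture you would pass the decoded variable from the question instead of sample.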

    For 3) and 4) you need to find the HTTP responses and requests that follow the search. This is where not implementing a pcap parser gets tricky, as you need to guess where they are in your blob of text. You can probably look for HTTP markers such as the request line (GET ... HTTP/1.1) and the Host: header, as they delimit each HTTP request, but that’s messy.
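A rough sketch of that "messy" approach is shown below. It assumes the requests are plain HTTP/1.x and that each request's headers sit contiguously in the text; if the headers are split across packets, the pcap record headers in between will break the match. The function name find_http_requests and the sample text are hypothetical.

```python
import re

# Request line, then the header block up to the first blank line
REQUEST_PATTERN = re.compile(
    r"(GET|POST) (\S+) HTTP/1\.[01]\r\n(.*?)\r\n\r\n", re.DOTALL
)


def find_http_requests(text):
    """Return one dict per HTTP request found, in order of appearance."""
    requests = []
    for method, path, header_block in REQUEST_PATTERN.findall(text):
        headers = {}
        for line in header_block.split("\r\n"):
            name, sep, value = line.partition(":")
            if sep:
                headers[name.strip().lower()] = value.strip()
        requests.append({
            "method": method,
            "host": headers.get("host", ""),
            "path": path,
            "referer": headers.get("referer", ""),
        })
    return requests


# Hypothetical sample text standing in for the decoded capture
sample_text = (
    "GET /search?q=test HTTP/1.1\r\nHost: www.bing.com\r\nUser-Agent: x\r\n\r\n"
    "junk"
    "GET /index.html HTTP/1.1\r\nHost: www.example.org\r\n"
    "Referer: http://www.bing.com/search?q=test\r\n\r\n"
)
for request in find_http_requests(sample_text):
    print(request["host"], request["path"], request["referer"])
```

For 4), a request to a non-Bing host whose Referer is the Bing results page is a plausible candidate for the site the user clicked. For 3), the recommended sites are in the body of Bing's response, which may be compressed, so a plain text search is not guaranteed to find them.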