network-programming analysis duration traffic

Using command line tools to filter and meet packet flow criteria

I have a pcap file with 8 million packets that I have reduced to a txt file with just three fields: time (in seconds), IP source address, type (of traffic).

From this 8-million-line file I need to keep only those IP addresses that have 100 packets or more, eliminating the addresses that do not meet the 100-packet criterion and so making the file smaller.

However, I need to keep all 3 fields, and all packets in the flow of the remaining addresses (those with 100+ packets), in the reduced txt file. This is because I then need to calculate the packet flow duration for each source IP address (ending time of flow - beginning time of flow) and keep only those source IP addresses whose flow duration is 60 seconds or more, reducing the file even further.

When I use command line tools to apply the first criterion (100 packets or more), I end up eliminating the whole packet flow for those addresses. How can I apply both conditions using command line tools, so that I can automate the process with a bash script? Below is a sample of the file to which I need to apply the two criteria. Thank you very much for your help!

1385957611.118522 99.61.34.145 TCP
1385957859.425248 99.61.34.145 TCP
1385958784.632631 99.61.34.145 TCP
1385959038.972602 99.61.34.145 TCP
1385959481.571627 99.61.34.145 TCP
1385860339.225421 37.139.6.111 TCP
1385860339.238402 37.139.6.111 TCP
1385860339.286538 37.139.6.111 TCP
1385860339.379029 37.139.6.111 TCP
1385860339.380669 37.139.6.111 TCP
1385860339.425247 37.139.6.111 TCP
1385860339.556737 37.139.6.111 TCP
1385860339.583913 37.139.6.111 TCP
1385860339.623861 37.139.6.111 TCP
1385857840.419300 103.248.63.253 TCP
1385857841.739372 103.248.63.253 TCP
1385857848.593171 103.248.63.253 TCP
1385857850.411457 103.248.63.253 TCP

Solution

I think you can use a combination of awk and xargs to accomplish this. The following script assumes that your data file has one record per line and that, for any given address, each timestamp is larger than the previous one for that address:

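# Stage 1: print each source address that has 100 or more packets.
awk '{
    addrcount[$2]++;
}
END {
    for (addr in addrcount) {
        if (addrcount[addr] >= 100) {
            print addr;
        }
    }
}' [DATA_FILE] |
# Stage 2: for each such address, print all of its lines from the data file.
# This rereads the data file once per address.
xargs -P [MAXPROCS] -I 'IP_ADDR' awk '{ if ($2 == "IP_ADDR") { print $0 } }' [DATA_FILE] |
# Stage 3: record the first and last timestamp (and the last traffic type)
# seen for each address, then print one summary line per address.
awk '{
    timestamp = $1;
    addr = $2;
    traffictype = $3;
    if (!(addr in minfor)) {
        minfor[addr] = timestamp;
    }
    maxfor[addr] = timestamp;
    typefor[addr] = traffictype;
}
END {
    for (addr in minfor) {
        print addr, minfor[addr], maxfor[addr], maxfor[addr] - minfor[addr], typefor[addr];
    }
}' |
# Stage 4: keep only addresses whose flow duration is 60 seconds or more.
awk '{ if ($4 >= 60) { print $1, $5 } }'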

The first awk script works out which IP addresses have 100+ records and prints them, one address per line. This is piped to xargs, which runs another awk script that prints only those lines in your file that have those IP addresses. This should prevent you from losing context when filtering for 100+ packets. The second-to-last awk script goes through each line in the filtered data and records the first and last timestamp for each address, then prints the difference between them. It also records the traffic type (the last one seen for that address). The final awk script keeps only those IP addresses whose time delta is 60 seconds or more, printing the IP address and traffic type. Note that the final output is therefore one line per address, not the individual packets.
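If you want the reduced file to still contain every packet line of the surviving addresses, one possible alternative is to let awk read the data file twice. The sketch below is not the pipeline above. It is a variation under some assumptions: the fields are whitespace-separated in the order time, address, type, and the function name filter_flows and its arguments are made up for illustration. It tracks the minimum and maximum timestamp per address, so it does not depend on the lines being sorted by time.

```shell
#!/usr/bin/env bash
# filter_flows FILE [MIN_PACKETS] [MIN_DURATION]
# Hypothetical helper (name and arguments are illustrative only).
# Prints every line of FILE whose source address (field 2) has at least
# MIN_PACKETS lines and whose last-minus-first timestamp (field 1) is at
# least MIN_DURATION seconds. Defaults: 100 packets, 60 seconds.
filter_flows() {
    local file=$1 min_packets=${2:-100} min_duration=${3:-60}
    # The file is named twice so awk reads it twice; NR == FNR is true
    # only while the first copy is being read.
    awk -v n="$min_packets" -v d="$min_duration" '
        NR == FNR {                       # pass 1: per-address statistics
            t = $1 + 0                    # force a numeric comparison
            count[$2]++
            if (!($2 in first) || t < first[$2]) first[$2] = t
            if (!($2 in last)  || t > last[$2])  last[$2]  = t
            next
        }
        # pass 2: print lines of addresses that meet both criteria
        count[$2] >= n && last[$2] - first[$2] >= d
    ' "$file" "$file"
}
```

Assuming your data is in a file called packets.txt (a placeholder name), you would call it as: filter_flows packets.txt 100 60 > reduced.txt. The file is read two times in total rather than once per address, at the cost of holding three small arrays (count, first, last) keyed by address in memory.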