linux bash shell logfile logfile-analysis

Extract unpredictable data that has its own timestamp from a log file using a shell script


log.txt looks like the sample below: each ID comes with its own timestamp (detection_time), and the file is updated continuously. The ID is an unpredictable number anywhere from 0000 to 9999, and the same ID can appear in log.txt more than once.

My goal is to use a shell script to pick out every ID that appears again in log.txt within 15 seconds of its first appearance. Can anyone help me with this?

ID = 4231
detection_time = 1595556730 
ID = 3661
detection_time = 1595556731
ID = 2654
detection_time = 1595556732
ID = 3661
detection_time = 1595556733

To be clear: in the log.txt above, ID 3661 first appears at time 1595556731 and appears again at 1595556733, just 2 seconds after the first appearance. That matches my condition, which is that the ID appears again within 15 seconds, so I would like my shell script to pick out ID 3661.

The output after running the shell script should be ID = 3661.

My problem is that I don't know how to implement this algorithm in a shell script.

Here's what I tried, using ID_new and ID_previous variables, but ID_previous=$(ID_new) and detection_previous=$(detection_new) are not working:

input="/tmp/log.txt"
ID_previous=""
detection_previous=""
while IFS= read -r line
do
    ID_new=$(echo "$line" | grep "ID =" | awk -F " " '{print $3}')
    echo $ID_new
    detection_new=$(echo "$line" | grep "detection_time =" | awk -F " " '{print $3}')
    echo $detection_new
    ID_previous=$(ID_new)
    detection_previous=$(detection_new)
done < "$input"

EDIT: log.txt actually contains sets of data, each consisting of ID, detection_time, Age and Height. Sorry for not mentioning this in the first place.

ID = 4231
detection_time = 1595556730 
Age = 25
Height = 182
ID = 3661
detection_time = 1595556731
Age = 24
Height = 182
ID = 2654
detection_time = 1595556732
Age = 22
Height = 184    
ID = 3661
detection_time = 1595556733
Age = 27
Height = 175
ID = 3852
detection_time = 1595556734
Age = 26
Height = 156
ID = 4231
detection_time = 1595556735 
Age = 24
Height = 184

I've tried the Awk solution. The result is 4231 3661 2654 3852 4231, which is every ID in log.txt. The correct output should be 4231 3661.

From this, I think the Age and Height data might affect the Awk solution, because they are inserted between the data of interest, ID and detection_time.


Solution

  • Assuming the time stamps in the log file are increasing monotonically, you only need a single pass with Awk. For each id, keep track of the latest time it was reported (use an associative array t where the key is the id and the value is the latest timestamp). If you see the same id again and the difference between the time stamps is at most 15 seconds, report it.

    For good measure, keep a second array p of the ones we have already reported so we don't report them twice.

    awk '/^ID = / { id=$3; next }
        # Skip if this line is neither ID nor detection_time
        !/^detection_time = / { next }
        (id in t) && (t[id] >= $3-15) && !(p[id]) { print id; ++p[id]; next }
        { t[id] = $3 }' /tmp/log.txt
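
The script above measures the 15 seconds from the previous time an id was seen. If the window should instead be counted from the very first appearance, as the question words it, one possible variant is sketched below. The function name repeats_since_first and the array name first are made up for this illustration, and the sketch still assumes timestamps never decrease.

```shell
# Hypothetical variant: compare every later sighting with the FIRST one only.
# Reads the log from the files given as arguments, or from standard input.
repeats_since_first() {
    awk '/^ID = / { id = $3; next }
         # Skip Age, Height and anything else that is not a timestamp
         !/^detection_time = / { next }
         # First sighting: remember its time and never overwrite it
         !(id in first) { first[id] = $3; next }
         # Later sighting: print the id once if it is within 15 seconds
         ($3 - first[id] <= 15) && !printed[id]++ { print id }' "$@"
}

# Usage (path taken from the question):
#   repeats_since_first /tmp/log.txt
```

With this variant, an id seen at 100, 120 and 130 is not reported, because both later sightings are more than 15 seconds after the first one, whereas comparing against the latest sighting would report it at 130.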
    

    If you really insist on doing this natively in Bash, I would refactor your attempt to

    declare -A dtime printed
    while read -r field _ value
    do
        case $field in
            ID) id=$value ;;
            detection_time)
                # An id not seen before defaults to 0, which never falls inside the window
                if [[ ${dtime[$id]:-0} -ge $((value - 15)) ]]; then
                    [[ -n ${printed[$id]:-} ]] || echo "$id"
                    printed["$id"]=1
                fi
                dtime["$id"]=$value ;;
        esac
    done < /tmp/log.txt
    

    Notice how read -r can easily split a line on whitespace just as well as Awk can, as long as you know how many fields you can expect. But while read -r is typically an order of magnitude slower than Awk, and you'll have to agree that the Awk attempt is more succinct and elegant, as well as portable to older systems.
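
As a small illustration of the field splitting described above, here is one line from the log fed to read through a here-document. The variable names field and value are the ones used in the loop; the underscore is just a throwaway variable that receives the "=" sign.

```shell
# Split one "key = value" line into three words on whitespace.
# field gets the key, _ swallows the "=", value gets the rest of the line.
read -r field _ value <<'EOF'
detection_time = 1595556731
EOF

printf '%s -> %s\n' "$field" "$value"
```

If a line had more than three words, everything after the second one would end up in value, which is why this works best when you know how many fields to expect.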

    (Associative arrays were introduced in Bash 4.)

    Tangentially, anything that looks like grep 'x' | awk '{ y }' can be refactored to awk '/x/ { y }'; see also useless use of grep.
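
To make the refactoring concrete, here is a minimal side-by-side sketch using one line from the question's log. The variable names with_grep and awk_only are only for this illustration.

```shell
line='detection_time = 1595556731'

# Original style: grep selects the line, then awk prints the third field.
with_grep=$(printf '%s\n' "$line" | grep 'detection_time =' | awk '{ print $3 }')

# Equivalent: the awk pattern does the selecting itself, so grep is not needed.
awk_only=$(printf '%s\n' "$line" | awk '/detection_time =/ { print $3 }')

printf '%s %s\n' "$with_grep" "$awk_only"
```

Both variables end up holding the same timestamp; the second form just starts one process fewer per line.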

    Also, notice that $(foo) attempts to run foo as a command. To simply refer to the value of the variable foo, the syntax is $foo (or, optionally, ${foo}, but the braces add no value here). Usually you will want to double-quote the expansion "$foo"; see also When to wrap quotes around a shell variable
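
Applied to the two assignments from the question, the fix might look like the sketch below. The sample values are copied from the log in the question.

```shell
ID_new=3661
detection_new=1595556731

# Plain parameter expansion copies the values.
ID_previous=$ID_new
detection_previous=$detection_new

# By contrast, ID_previous=$(ID_new) is a command substitution: the shell
# looks for a program called ID_new, which normally does not exist, so the
# result is a "command not found" error and an empty ID_previous.
printf 'previous: %s at %s\n' "$ID_previous" "$detection_previous"
```

The right-hand side of a plain assignment is not subject to word splitting, so the quotes can be left out there; they still matter wherever the value is later used as an argument, as in the printf call.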

    Your script would only remember a single earlier event; the associative array allows us to remember all the ID values we have seen previously (until we run out of memory).

    Nothing prevents us from using human-readable variable names in Awk either; feel free to substitute printed for p and dtime for t to have complete parity with the Bash alternative.
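
For reference, here is how the Awk script might look with those descriptive names, wrapped in a small shell function so the window length becomes a parameter. The function name find_repeats and the variable window are hypothetical additions for this sketch; the logic is otherwise the same as in the script above and still assumes timestamps never decrease.

```shell
# Hypothetical wrapper: find_repeats WINDOW [FILE...]
# Prints each id that shows up again within WINDOW seconds of the time
# recorded for it. Reads standard input when no file is given.
find_repeats() {
    window=$1
    shift
    awk -v window="$window" '
        /^ID = / { id = $3; next }
        # Skip if this line is neither ID nor detection_time
        !/^detection_time = / { next }
        # Seen before, inside the window, and not reported yet: report once
        (id in dtime) && (dtime[id] >= $3 - window) && !printed[id] {
            print id; printed[id] = 1; next
        }
        # Otherwise remember this timestamp for the id
        { dtime[id] = $3 }
    ' "$@"
}

# Usage (path taken from the question):
#   find_repeats 15 /tmp/log.txt
```

Run against the extended sample from the question, this prints 3661 and then 4231, in the order in which the repeats are detected.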