Tags: bash, shell, awk, grep

Optimise text file statistics calculation by combination of grep, awk, cut


My goal is to calculate statistics from the file that contains the following lines:

2024-05-08 11:02:58,731 INFO o.a.j.a.J.Some check: Closest Day: Wed, New quantity for email0987@yahoo.com is: 1

What I need is to count how many unique emails end up with each number. For example:

1 - 6575 emails
2 - 333 emails

and so on.

The file can potentially contain duplicate lines for the same email, so I filter that out using awk '!seen[$1]++'.
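
For illustration, on made-up two-field input (placeholder emails), that idiom keeps only the first line seen for each email:

    $ printf 'a@x.com 1\na@x.com 1\nb@y.com 2\n' | awk '!seen[$1]++'
    a@x.com 1
    b@y.com 2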

I have created a bash script for that purpose. It works, but for files of hundreds of MB it is tremendously slow and takes ages to finish. Is there a way to optimise it? I suppose going through the data line by line is not the best way.

#!/bin/bash

input_file="$1"

closest_date=$(grep "Closest Day:" "$input_file" | rev | cut -d' ' -f1,3 | rev | awk '!seen[$1]++')

declare -A counts

# Iterate over each line of $closest_date
while read -r line; do
    # Extract email and number
    email=$(echo "$line" | cut -d' ' -f1)
    number=$(echo "$line" | cut -d' ' -f2)
    
    # Increment count for the number
    (( counts[$number]++ ))
done <<< "$closest_date"

# Print the counts
for number in "${!counts[@]}"; do
    echo "Number $number: ${counts[$number]}"
done

Solution

  • You only posted one line of sample input and no expected output, so we're all just guessing at what you might need, but here's a GNU awk script (it uses arrays of arrays) and the output it produces from the input you provided, in case this is what you're trying to do:

    $ awk '
        /Closest Day/ { pairs[$NF][$(NF-2)] }
        END { for (nr in pairs) printf "%d - %d emails\n", nr, length(pairs[nr]) }
    ' file
    1 - 1 emails
    

    See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for why your shell script is so slow; an equivalent awk script will be orders of magnitude faster.
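
    If your awk doesn't support arrays of arrays (e.g. mawk or BSD awk), here's a sketch of the same idea using a composite key instead; like the script above, it assumes the number is the last field and the email is the third-from-last field:

        $ awk '
            /Closest Day/ {
                key = $NF SUBSEP $(NF-2)          # one entry per (number, email) pair
                if (!(key in seen)) { seen[key]; cnt[$NF]++ }
            }
            END { for (nr in cnt) printf "%d - %d emails\n", nr, cnt[nr] }
        ' file
        1 - 1 emails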