My goal is to calculate statistics from the file that contains the following lines:
2024-05-08 11:02:58,731 INFO o.a.j.a.J.Some check: Closest Day: Wed, New quantity for is: 1
What I need is to count how many unique emails share each number. For example:
1 - 6575 emails
2 - 333 emails
and so on.
The file can potentially contain duplicate lines for the same email, so I filter that out using awk '!seen[$1]++'
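To illustrate what that filter does, here is a minimal sketch on made-up data (the addresses and the helper name dedup_by_first_field are hypothetical, not from my real file). It keeps only the first line seen for each distinct value of the first field:

```shell
# Keep only the first line for each distinct first field (the email here).
dedup_by_first_field() {
    awk '!seen[$1]++'
}

# Hypothetical input: the third line repeats the first email and is dropped.
printf '%s\n' \
    'a@example.com 1' \
    'b@example.com 2' \
    'a@example.com 3' | dedup_by_first_field
# prints:
# a@example.com 1
# b@example.com 2
```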
I have created a bash script for that purpose. It works, but for files of hundreds of MB it is extremely slow and takes ages to finish. Is there a way to optimise it? I suppose going through the data line by line is not the best way.
closest_date=$(grep "Closest Day:" "$input_file" | rev | cut -d' ' -f1,3 | rev | awk '!seen[$1]++')
declare -A counts
# Iterate over each line of closest_date
while read -r line; do
    # Extract email and number
    email=$(echo "$line" | cut -d' ' -f1)
    number=$(echo "$line" | cut -d' ' -f2)
    # Increment count for the number
    (( counts[$number]++ ))
done <<< "$closest_date"
# Print the counts
for number in "${!counts[@]}"; do
    echo "Number $number: ${counts[$number]}"
done
You only posted one line of sample input and no expected output, so we're all just guessing at what you might need, but here's a GNU awk script (GNU awk is required for arrays of arrays) and the output it produces from the input you provided, in case this is what you're trying to do:
$ awk '
/Closest Day/ { pairs[$NF][$(NF-2)] }
END { for (nr in pairs) printf "%d - %d emails\n", nr, length(pairs[nr]) }
' file
1 - 1 emails
See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for why your shell script is so slow. An equivalent awk script will be orders of magnitude faster.
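If GNU awk isn't available, or you want to mirror the question's '!seen[$1]++' filter (each email counted once, using the first line it appears on), something like the following might work in any POSIX awk. It's only a sketch under assumptions: that the email is the third-from-last field and the number is the last field, as the question's cut -f1,3 implies. The function name count_by_number and the addresses in the sample input are made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes the email is the third-from-last field
# and the number is the last field of each matching line.
count_by_number() {
    awk '
        # Count each email once, on the first "Closest Day:" line it appears on.
        /Closest Day:/ && !seen[$(NF-2)]++ { counts[$NF]++ }
        END { for (nr in counts) printf "%d - %d emails\n", nr, counts[nr] }
    ' "$1" | sort -n
}

# Hypothetical sample input; the addresses are invented for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
2024-05-08 11:02:58,731 INFO o.a.j.a.J.Some check: Closest Day: Wed, New quantity for a@example.com is: 1
2024-05-08 11:02:59,102 INFO o.a.j.a.J.Some check: Closest Day: Wed, New quantity for b@example.com is: 1
2024-05-08 11:03:00,415 INFO o.a.j.a.J.Some check: Closest Day: Thu, New quantity for a@example.com is: 1
2024-05-08 11:03:01,007 INFO o.a.j.a.J.Some check: Closest Day: Fri, New quantity for c@example.com is: 2
EOF

count_by_number "$sample"
rm -f "$sample"
```

With that sample it prints "1 - 2 emails" and "2 - 1 emails", because the repeated a@example.com line is ignored. The whole file is read once by a single awk process, with no subshell spawned per line, which is where the speedup over the shell loop comes from.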