bash · shell · unix · delimiter

Counting the number of special-character delimiters in a file with a bash shell script: performance improvement


Hi, I have a script that counts the number of records in a file and finds the expected number of delimiters per record by dividing the delimiter count (rs_count) by the total record count. It works fine, but it is a little slow on large files. I was wondering if there is a way to improve performance. The RS is a special character, octal \246. I am using a bash shell script.

Some additional info:

A line is a record. The file will always have the same number of delimiters per record. The purpose of the script is to check whether the file has the expected number of fields. After calculating the value, the script just echoes it out.

for file in $SOURCE; do
    echo "executing File -"$file
    if (( $total_record_count != 0 )); then
        filename=$(basename "$file")
        total_record_count=$(wc -l < $file)
        rs_count=$(sed -n 'l' $file | grep -o $RS | wc -l)
        Delimiter_per_record=$((rs_count/total_record_count))
    fi
done

Solution

  • Counting the delimiters (not total records) in a file

    On a file with 50,000 lines, I see roughly a 10-fold speed-up by collapsing the sed, grep, and wc pipeline into a single awk process (a quick demo follows the command below):

    • awk -v RS='Delimiter' 'END{print NR -1}' input_file
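
    As a quick demo (the file name /tmp/rs_demo and its contents are made up for illustration), printf can build a single record containing three \246 delimiters, and the awk one-liner should report 3:

    printf 'a\246b\246c\246d\n' > /tmp/rs_demo
    awk -v RS=$'\246' 'END{print NR - 1}' /tmp/rs_demo    # splits into 4 records, so prints 3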

    Dealing with wc when there is no trailing line break

    If you count the instances of ^ (start of line), you will get a true count of lines. Using grep:

    • grep -co "^" input_file

    (Thankfully, even though ^ is a regex, the performance of this is on par with wc)
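
    As a quick check (the throwaway file name /tmp/no_trailing_newline is just for illustration), a two-line file whose last line is not newline-terminated shows the difference:

    printf 'line1\nline2' > /tmp/no_trailing_newline
    wc -l < /tmp/no_trailing_newline        # prints 1 -- wc only counts newline characters
    grep -c "^" /tmp/no_trailing_newline    # prints 2 -- the unterminated last line is counted too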


    Incorporating these two modifications into a trivial test based on your supplied code:

    #!/usr/bin/env bash
    
    SOURCE="$1"
    RS=$'\246'
    
    for file in $SOURCE; do
        echo "executing File -"$file
        if [[ $total_record_count != 0 ]];then
            filename=$(basename "$file")
            total_record_count=$(grep -oc "^" $file)
            rs_count="$(awk -v RS=$'\246' 'END{print NR -1}' $file)"
            Delimiter_per_record=$((rs_count/total_record_count))
        fi
    done
    
    echo -e "\$rs_count:\t${rs_count}\n\$Delimiter_per_record:\t${Delimiter_per_record}\n\$total_record_count:\t${total_record_count}" | column -t
    

    Running this on a file with 50,000 lines on my macbook:

    time ./recordtest.sh /tmp/randshort
    
    executing File -/tmp/randshort
    $rs_count:              186885
    $Delimiter_per_record:  3
    $total_record_count:    50000
    
    real    0m0.064s
    user    0m0.038s
    sys     0m0.012s

    Unit test one-liner

    (creates /tmp/recordtest, chmod +x's it, creates /tmp/testfile with 10 lines of random characters including octal \246, and then runs the script file on the testfile)

    echo $'#!/usr/bin/env bash\n\nSOURCE="$1"\nRS=$\'\\246\'\n\nfor file in $SOURCE; do\n    echo "executing File -"$file\n    if [[ $total_record_count != 0 ]];then\n        filename=$(basename "$file")\n        total_record_count=$(grep -oc "^" $file)\n        rs_count="$(awk -v RS=$\'\\246\' \'END{print NR -1}\' $file)"\n        Delimiter_per_record=$((rs_count/total_record_count))\n    fi\ndone\n\necho -e "\\$rs_count:\\t${rs_count}\\n\\$Delimiter_per_record:\\t${Delimiter_per_record}\\n\\$total_record_count:\\t${total_record_count}" | column -t' > /tmp/recordtest ; echo $'\246459ca4f23bafff1c8fc017864aa3930c4a7f2918b\246753f00e5a9278375b\nb\246a3\246fc074b0e415f960e7099651abf369\246a6f\246f70263973e176572\2467355\n1590f285e076797aa83b2ee537c7f99\24666990bb60419b8aa\246bb5b6b\2467053\n89b938a5\246560a54f2826250a2c026c320302529331229255\246ef79fbb52c2\n9042\246bb\246b942408a22f912268ffc78f08c\2462798b0c05a75439\246245be2ea5\n0ef03170413f90e\246e0\246b1b2515c4\2466bf0a1bb\246ee28b78ccce70432e6b\24653\n51229e7ab228b4518404360b31a\2463673261e3242985bf24e59bc657\246999a\n9964\246b08\24640e63fae788ea\246a1777\2460e94f89af8b571e\246e1b53e6332\246c3\246e\n90\246ae12895f\24689885e\246e736f942080f267a275132a348ec1e837b99efe94\n2895e91\246\246f506f\246c1b986a63444b4258\246bc1b39182\24630\24696be' > /tmp/testfile ; chmod +x /tmp/recordtest ; /tmp/./recordtest /tmp/testfile

    Which produces this result:

    $rs_count:              39
    $Delimiter_per_record:  3
    $total_record_count:    10
    

    Though there are a number of solutions for counting instances of characters in files, quite a few come undone when trying to process special characters like octal \246.

    awk seems to handle it reliably and quickly.
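
    As an optional cross-check (not part of the original answer, and assuming a tr that accepts octal escapes, which GNU and BSD tr both do), a byte-level count should agree with the awk figure: tr -cd deletes every byte except \246, and wc -c counts what remains.

    tr -cd '\246' < /tmp/testfile | wc -c    # should print 39 for the generated test file above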