bash · shell · unix · delimiter

Counting the number of special-character delimiters in a file with a bash shell script: performance improvement


Hi, I have a script that counts the number of records in a file and finds the expected number of delimiters per record by dividing the delimiter count (rs_count) by the total record count. It works fine, but it is a little slow on large files. I was wondering if there is a way to improve performance. The RS is a special character, octal \246. I am using a bash shell script.

Some additional info:

A line is a record. The file will always have the same number of delimiters per record. The purpose of the script is to check whether the file has the expected number of fields. After calculating the value, the script just echoes it out.

for file in $SOURCE; do
    echo "executing File -"$file
    if (( $total_record_count != 0 )); then
        filename=$(basename "$file")
        total_record_count=$(wc -l < $file)
        rs_count=$(sed -n 'l' $file | grep -o $RS | wc -l)
        Delimiter_per_record=$((rs_count/total_record_count))
    fi
done

Solution

  • Counting the delimiters (not total records) in a file

    On a file with 50,000 lines, I see roughly a 10-fold speed-up by collapsing the sed, grep, and wc pipeline into a single awk process (a quick demo follows the command below):

    • awk -v RS='Delimiter' 'END{print NR -1}' input_file
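
    As a quick demo (the file name /tmp/rs_demo and its contents are made up for illustration), printf can build a single record containing three \246 delimiters, and the awk one-liner should report 3:

    printf 'a\246b\246c\246d\n' > /tmp/rs_demo
    awk -v RS=$'\246' 'END{print NR - 1}' /tmp/rs_demo    # splits into 4 records, so prints 3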

    Dealing with wc when there is no trailing line break

    If you count the instances of ^ (start of line), you will get a true count of lines. Using grep:

    • grep -co "^" input_file

    (Thankfully, even though ^ is a regex, the performance of this is on par with wc)
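
    As a quick check (the throwaway file name /tmp/no_trailing_newline is just for illustration), a two-line file whose last line is not newline-terminated shows the difference:

    printf 'line1\nline2' > /tmp/no_trailing_newline
    wc -l < /tmp/no_trailing_newline        # prints 1 -- wc only counts newline characters
    grep -c "^" /tmp/no_trailing_newline    # prints 2 -- the unterminated last line is counted too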


    Incorporating these two modifications into a trivial test based on your supplied code:

    #!/usr/bin/env bash
    
    SOURCE="$1"
    RS=$'\246'
    
    for file in $SOURCE; do
        echo "executing File -"$file
        if [[ $total_record_count != 0 ]];then
            filename=$(basename "$file")
            total_record_count=$(grep -oc "^" $file)
            rs_count="$(awk -v RS=$'\246' 'END{print NR -1}' $file)"
            Delimiter_per_record=$((rs_count/total_record_count))
        fi
    done
    
    echo -e "\$rs_count:\t${rs_count}\n\$Delimiter_per_record:\t${Delimiter_per_record}\n\$total_record_count:\t${total_record_count}" | column -t
    

    Running this on a file with 50,000 lines on my macbook:

    time ./recordtest.sh /tmp/randshort
    
    executing File -/tmp/randshort
    $rs_count:              186885
    $Delimiter_per_record:  3
    $total_record_count:    50000
    
    real    0m0.064s
    user    0m0.038s
    sys     0m0.012s

    Unit test one-liner

    (creates /tmp/recordtest, chmod +x's it, creates /tmp/testfile with 10 lines of random characters including octal \246, and then runs the script file on the testfile)

    echo $'#!/usr/bin/env bash\n\nSOURCE="$1"\nRS=$\'\\246\'\n\nfor file in $SOURCE; do\n    echo "executing File -"$file\n    if [[ $total_record_count != 0 ]];then\n        filename=$(basename "$file")\n        total_record_count=$(grep -oc "^" $file)\n        rs_count="$(awk -v RS=$\'\\246\' \'END{print NR -1}\' $file)"\n        Delimiter_per_record=$((rs_count/total_record_count))\n    fi\ndone\n\necho -e "\\$rs_count:\\t${rs_count}\\n\\$Delimiter_per_record:\\t${Delimiter_per_record}\\n\\$total_record_count:\\t${total_record_count}" | column -t' > /tmp/recordtest ; echo $'\246459ca4f23bafff1c8fc017864aa3930c4a7f2918b\246753f00e5a9278375b\nb\246a3\246fc074b0e415f960e7099651abf369\246a6f\246f70263973e176572\2467355\n1590f285e076797aa83b2ee537c7f99\24666990bb60419b8aa\246bb5b6b\2467053\n89b938a5\246560a54f2826250a2c026c320302529331229255\246ef79fbb52c2\n9042\246bb\246b942408a22f912268ffc78f08c\2462798b0c05a75439\246245be2ea5\n0ef03170413f90e\246e0\246b1b2515c4\2466bf0a1bb\246ee28b78ccce70432e6b\24653\n51229e7ab228b4518404360b31a\2463673261e3242985bf24e59bc657\246999a\n9964\246b08\24640e63fae788ea\246a1777\2460e94f89af8b571e\246e1b53e6332\246c3\246e\n90\246ae12895f\24689885e\246e736f942080f267a275132a348ec1e837b99efe94\n2895e91\246\246f506f\246c1b986a63444b4258\246bc1b39182\24630\24696be' > /tmp/testfile ; chmod +x /tmp/recordtest ; /tmp/./recordtest /tmp/testfile

    Which produces this result:

    $rs_count:              39
    $Delimiter_per_record:  3
    $total_record_count:    10
    

    Though there are a number of solutions for counting instances of characters in files, quite a few come undone when trying to process special characters like octal \246.

    awk seems to handle it reliably and quickly.
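
    As an optional cross-check (not part of the original answer, and assuming a tr that accepts octal escapes, which GNU and BSD tr both do), a byte-level count should agree with the awk figure: tr -cd deletes every byte except \246, and wc -c counts what remains.

    tr -cd '\246' < /tmp/testfile | wc -c    # should print 39 for the generated test file above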