Hi, I have a script that counts the number of records in a file and finds the expected number of delimiters per record by dividing rs_count by the total record count. It works fine, but it is a little slow on large files. Is there a way to improve performance? The RS is a special character, octal \246. I am using a bash shell script.
Some additional info:
A line is a record. Every record in a file will always have the same number of delimiters. The purpose of the script is to check whether the file has the expected number of fields. After calculating the count, the script just echoes it out.
for file in $SOURCE; do
echo "executing File -"$file
if (( $total_record_count != 0 ));then
filename=$(basename "$file")
total_record_count=$(wc -l < $file)
rs_count=$(sed -n 'l' $file | grep -o $RS | wc -l)
Delimiter_per_record=$((rs_count/total_record_count))
fi
done
Counting the delimiters (not total records) in a file
On a file with 50,000 lines, I see roughly a tenfold speedup from replacing the sed, grep, and wc pipeline with a single awk process:
awk -v RS='Delimiter' 'END{print NR -1}' input_file
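As a small illustration of why this works: with awk's RS set to the delimiter, every delimiter ends one record, so the record count minus one equals the delimiter count. The sketch below is not part of the original script. count_delims is a hypothetical helper name, and LC_ALL=C is an assumption added so that the lone byte \246 is treated as a single character whatever the locale.

```shell
# Hypothetical helper (not part of the original script):
# count occurrences of a single-byte delimiter in a file.
count_delims() {
    # $1 = delimiter byte, $2 = file
    # LC_ALL=C is an assumption: it makes awk treat \246 as one plain byte.
    LC_ALL=C awk -v RS="$1" 'END { print NR - 1 }' "$2"
}

delim=$(printf '\246')                            # the octal \246 delimiter
sample=$(mktemp)
printf 'a\246b\246c\nd\246e\246f\n' > "$sample"   # 2 lines, 4 delimiters
count_delims "$delim" "$sample"
rm -f "$sample"
```

Two edge cases are worth keeping in mind: an empty file makes this print -1, and a file whose very last byte is the delimiter is undercounted by one, so the count relies on the file ending with a newline or other data after the final delimiter.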
Dealing with wc when there's no trailing line break
If you count the instances of ^ (start of line), you will get a true count of lines. Using grep:
grep -co "^" input_file
(Thankfully, even though ^ is a regex, the performance of this is on par with wc.)
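To make the difference concrete, here is a minimal sketch (the temporary file and the variable names are illustrative only) that writes three lines with no trailing newline and counts them both ways. wc -l counts newline characters, so it misses the unterminated last line.

```shell
f=$(mktemp)
printf 'one\ntwo\nthree' > "$f"     # three lines, last one unterminated

wc_count=$(wc -l < "$f")            # counts newline characters
grep_count=$(grep -c '^' "$f")      # counts line starts

# $((...)) strips the padding some wc implementations add
echo "wc -l:     $((wc_count))"     # 2
echo "grep -c ^: $grep_count"       # 3
rm -f "$f"
```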
Incorporating these two modifications into a trivial test based on your supplied code:
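#!/usr/bin/env bash
SOURCE="$1"
RS=$'\246'
for file in $SOURCE; do
echo "executing File -$file"
filename=$(basename "$file")
# Count lines; matching "^" also counts a final line with no trailing newline
total_record_count=$(grep -oc "^" "$file")
# Check the count after computing it, to avoid dividing by zero on an empty file
if [[ $total_record_count != 0 ]]; then
# awk splits on the delimiter, so delimiters = records - 1
rs_count="$(awk -v RS="$RS" 'END{print NR -1}' "$file")"
Delimiter_per_record=$((rs_count/total_record_count))
fi
done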
echo -e "\$rs_count:\t${rs_count}\n\$Delimiter_per_record:\t${Delimiter_per_record}\n\$total_record_count:\t${total_record_count}" | column -t
Running this on a file with 50,000 lines on my MacBook:
time ./recordtest.sh /tmp/randshort
executing File -/tmp/randshort
$rs_count: 186885
$Delimiter_per_record: 3
$total_record_count: 50000
real 0m0.064s
user 0m0.038s
sys 0m0.012s
The following one-liner reproduces the test: it creates /tmp/recordtest, makes it executable with chmod +x, creates /tmp/testfile with 10 lines of random characters including octal \246, and then runs the script on the test file:
echo $'#!/usr/bin/env bash\n\nSOURCE="$1"\nRS=$\'\\246\'\n\nfor file in $SOURCE; do\n echo "executing File -"$file\n if [[ $total_record_count != 0 ]];then\n filename=$(basename "$file")\n total_record_count=$(grep -oc "^" $file)\n rs_count="$(awk -v RS=$\'\\246\' \'END{print NR -1}\' $file)"\n Delimiter_per_record=$((rs_count/total_record_count))\n fi\ndone\n\necho -e "\\$rs_count:\\t${rs_count}\\n\\$Delimiter_per_record:\\t${Delimiter_per_record}\\n\\$total_record_count:\\t${total_record_count}" | column -t' > /tmp/recordtest ; echo $'\246459ca4f23bafff1c8fc017864aa3930c4a7f2918b\246753f00e5a9278375b\nb\246a3\246fc074b0e415f960e7099651abf369\246a6f\246f70263973e176572\2467355\n1590f285e076797aa83b2ee537c7f99\24666990bb60419b8aa\246bb5b6b\2467053\n89b938a5\246560a54f2826250a2c026c320302529331229255\246ef79fbb52c2\n9042\246bb\246b942408a22f912268ffc78f08c\2462798b0c05a75439\246245be2ea5\n0ef03170413f90e\246e0\246b1b2515c4\2466bf0a1bb\246ee28b78ccce70432e6b\24653\n51229e7ab228b4518404360b31a\2463673261e3242985bf24e59bc657\246999a\n9964\246b08\24640e63fae788ea\246a1777\2460e94f89af8b571e\246e1b53e6332\246c3\246e\n90\246ae12895f\24689885e\246e736f942080f267a275132a348ec1e837b99efe94\n2895e91\246\246f506f\246c1b986a63444b4258\246bc1b39182\24630\24696be' > /tmp/testfile ; chmod +x /tmp/recordtest ; /tmp/./recordtest /tmp/testfile
Which produces this result:
$rs_count: 39
$Delimiter_per_record: 3
$total_record_count: 10
Though there are a number of solutions for counting instances of characters in files, quite a few come undone when trying to process special characters like octal \246. awk seems to handle it reliably and quickly.
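Since awk copes well with this byte, one possible variation (a sketch only, not the approach timed above) is to let a single awk pass produce both counts by using \246 as the field separator. delims_per_record is a hypothetical helper name, LC_ALL=C is an assumption that keeps \246 a single byte in any locale, and the result uses the same integer division as the script.

```shell
# Hypothetical helper: delimiters per record, computed in one awk pass.
delims_per_record() {
    # $1 = delimiter byte, $2 = file
    LC_ALL=C awk -F "$1" '
        NF  { total += NF - 1 }                    # non-empty line: fields - 1 = delimiters
        END { print (NR ? int(total / NR) : 0) }   # avoid dividing by zero
    ' "$2"
}

sample=$(mktemp)
printf 'a\246b\246c\nd\246e\246f\n' > "$sample"    # 2 lines, 2 delimiters each
delims_per_record "$(printf '\246')" "$sample"
rm -f "$sample"
```

awk also counts a final line that lacks a trailing newline as a record, so NR here agrees with the grep count of "^".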