I want to execute a task on a very long string for which I think I need to operate on each character individually. Before doing so I'm trying to establish a baseline of the amount of time required to access all those characters in the first place.
The input I intend to use is a text file with a single line containing a single word composed of digits.
The code below is my best attempt so far, are there faster methods for accessing all characters in a string one by one?
function handle_split() {
split=$1
while read -N 1 char; do
:
done
}
while read -N 100000 split; do
((i=i%100)); ((i++==0)) && wait
handle_split $split &
done < "filename.txt"
This line is intended to avoid queuing more than a 100 tasks at the same time
((i=i%100)); ((i++==0)) && wait
The length of the split is hardcoded based on the known length of the input, in this case I'm using a string of about a billion characters.
As an aside I tried to convert my string splits into arrays looking for an increase in performance.
function handle_split() {
split=($@)
for char in ${split[@]}; do
:
done
}
while read -N 100 split; do
arr=($(echo $split | grep -o .))
((i=i%100)); ((i++==0)) && wait
handle_split "${arr[@]}" &
done < "filename.txt"
But this implementation with arrays was even slower than a simple for loop.
while read -N 1 char; do
:
done < "filename.txt"
GNU or BSD awk
is probably a good choice here:
awk 'BEGIN {FS=""}
{for(i=1;i<=NF;i++) a[$i]++}
END {for(i=0;i<10;i++) for(j=0;j<a[i];j++) printf "%d", i}' file
should be about what you want. Explanation: FS=""
in the BEGIN
block means that each single character in the file is a separate field. The {for(i=1;i<=NF;i++) a[$i]++}
block loops over all fields (1 to NF
) and for each of them it increments one of the 10 cells of the a
array. The END
block is executed at the END and it prints each digit d
as many times as a[d]
, in increasing order of digits.
Note that this does not print a final newline. If you need one add ; print ""
at the end of the block:
END {for(i=0;i<10;i++) for(j=0;j<a[i];j++) printf "%d", i; print ""}
Of course, the result is so redundant that you could prefer a more compact form, like, for instance one line per character with two fields: the character and the number of occurrences:
awk 'BEGIN {FS=""}
{for(i=1;i<=NF;i++) a[$i]++}
END {for(i in a) printf "%s %d\n", i, a[i]}' file
Just tested this last one on a 3.6 GHz Intel Core i7, with a 1.4GB input: 2m38.480s.