Search code examples
stringbashloopstexttext-files

Faster method to operate on all characters in a text file


I want to execute a task on a very long string for which I think I need to operate on each character individually. Before doing so I'm trying to establish a baseline of the amount of time required to access all those characters in the first place.

The input I intend to use is a text file with a single line containing a single word composed of digits.

The code below is my best attempt so far, are there faster methods for accessing all characters in a string one by one?

function handle_split() {
    split=$1
    while read -N 1 char; do
        :
    done
}

while read -N 100000 split; do
    ((i=i%100)); ((i++==0)) && wait
    handle_split $split &
done < "filename.txt"

This line is intended to avoid queuing more than a 100 tasks at the same time

((i=i%100)); ((i++==0)) && wait

The length of the split is hardcoded based on the known length of the input, in this case I'm using a string of about a billion characters.


As an aside I tried to convert my string splits into arrays looking for an increase in performance.

function handle_split() {
    split=($@)
    for char in ${split[@]}; do
        :
    done
}

while read -N 100 split; do
   arr=($(echo $split | grep -o .))
   ((i=i%100)); ((i++==0)) && wait
   handle_split "${arr[@]}" &
done < "filename.txt"

But this implementation with arrays was even slower than a simple for loop.

while read -N 1 char; do
    :
done < "filename.txt"

Solution

  • GNU or BSD awk is probably a good choice here:

    awk 'BEGIN {FS=""}
      {for(i=1;i<=NF;i++) a[$i]++}
      END {for(i=0;i<10;i++) for(j=0;j<a[i];j++) printf "%d", i}' file
    

    should be about what you want. Explanation: FS="" in the BEGIN block means that each single character in the file is a separate field. The {for(i=1;i<=NF;i++) a[$i]++} block loops over all fields (1 to NF) and for each of them it increments one of the 10 cells of the a array. The END block is executed at the END and it prints each digit d as many times as a[d], in increasing order of digits.

    Note that this does not print a final newline. If you need one add ; print "" at the end of the block:

    END {for(i=0;i<10;i++) for(j=0;j<a[i];j++) printf "%d", i; print ""}
    

    Of course, the result is so redundant that you could prefer a more compact form, like, for instance one line per character with two fields: the character and the number of occurrences:

    awk 'BEGIN {FS=""}
      {for(i=1;i<=NF;i++) a[$i]++}
      END {for(i in a) printf "%s %d\n", i, a[i]}' file
    

    Just tested this last one on a 3.6 GHz Intel Core i7, with a 1.4GB input: 2m38.480s.