Search code examples
arraysloopsawksemantics

Deleting array element in awk while looping through it: always safe?


This is an awk question: I wonder what is the exact semantics of the loop iterator for (k in array): I know we don't have much control on the order in which the array elements are scanned, and yet I want to know if it is always safe (i.e. guaranteed by some POSIX specification) to delete an array element in the body of such a loop. I mean, is it guaranteed that the subsequent iterations in the loop will behave properly, without skipping any element nor hitting the deleted one?

A minimal example is the one below, where we omit all names from the input that start with an uppercase "A". It seems to work well on my GNU Awk 4.2.1, but I'm not sure whether it's fully portable and safe on all awk implementations. Any thoughts on this? Thx!

echo -e "Alberto\n Adam\n Payne\n Kristell\n John\n\
   Arjuna\n Albert\n Me\n You\n Toto\n Auntie\n Terribel" | 
awk '{ names[NR] = $1 } 
     END { for (k in names)
             if (substr(names[k], 1, 1) == "A") delete names[k];
           for (k in names) print names[k] }'

Solution

  • Yes and no. It is "safe" to delete an entry in as much as the entry won't exist after it's deleted but it's not safe to assume you won't hit that index after you delete it while the loop is iterating.

    The POSIX spec couldn't say:

    the following code deletes an entire array:
    
    for (index in array)
        delete array[index]
    

    if doing so might skip an index and this:

    for (index in arrayA) {
        if (index in arrayB) {
            print "Both:", index
            delete arrayA[index]
            delete arrayB[index]
        }
    }
    
    for (index in arrayA)
        print "A only:", index
    
    for (index in arrayB)
        print "B only:", index
    

    is an extremely common idiom for finding which sets values are in and that wouldn't work if that approach weren't "safe" in that context.

    BUT that doesn't mean you can assume an array index won't be hit after it's been deleted during the loop because whether an awk figures out all of the array indices that are going to be visited before entering the loop or during execution is implementation dependent. GNU awk, for example, determines all the indices it will visit before the loop is entered so you get this behavior where the array gets 1 element shorter after the delete a[3] but the deleted index 3 is still visited in the loop where it was previously deleted:

    $ gawk 'BEGIN{split("a b c d e",a);
        for (i in a) {print length(a), i, a[i]; delete a[3]} }'
    5 1 a
    4 2 b
    4 3
    4 4 d
    4 5 e
    

    but not all awks do that, e.g. BWK awk/nawk doesn't and neither does MacOS/BSD awk:

    $ awk 'BEGIN{split("a b c d e",a);
        for (i in a) {print length(a), i, a[i]; delete a[3]} }'
    5 2 b
    4 4 d
    4 5 e
    4 1 a
    

    The gawk behavior is equivalent to this in the other awks mentioned above:

    $ awk 'BEGIN{split("a b c d e",a); for (i in a) b[i];
        for (i in b) { print length(a), i, (i in a ? a[i] : x); delete a[3]} }'
    5 2 b
    4 3
    4 4 d
    4 5 e
    4 1 a
    

    I'm using an unassigned variable x above instead of "" to accurately depict the zero-or-null nature of a[3] after deletion but it doesn't actually matter in this case since we're printing it as "" anyway.

    So no matter which awk you use, once the above loop is exited a[3] will be gone, e.g. with GNU awk again:

    $ gawk 'BEGIN{split("a b c d e",a);
        for (i in a) {print length(a), i, a[i]; delete a[3]}
        print "---";
        for (i in a) {print i, a[i]} }'
    5 1 a
    4 2 b
    4 3
    4 4 d
    4 5 e
    ---
    1 a
    2 b
    4 d
    5 e
    

    Note that in this above script a[3] actually gets recreated during that first loop due to accessing a[i] when i is 3 but then the delete a[3] happening for every index is what removes it again. If we only did the delete when i is 1 then we'd see a[3] exists but contains zero-or-null after the loop:

    $ gawk 'BEGIN{split("a b c d e",a);
            for (i in a) {print length(a), i, a[i]; if (i==1) delete a[3]}
            print "---";
            for (i in a) {print i, a[i]} }'
    5 1 a
    4 2 b
    4 3
    5 4 d
    5 5 e
    ---
    1 a
    2 b
    3
    4 d
    5 e
    

    To see why the gawk approach of pre-determining the indices that will be visited before you start looping is better than trying to determine them on the fly as you're looping, consider this code which is attempting to add 3 new elements to the array inside the loop:

    $ cat tst.awk
    BEGIN {
        split("a b c",a)
        for (i in a) {
            j=i+100
            a[j] = "foo" j
            print length(a), i, a[i]
        }
        print "---"
        for (i in a) {
            print i, a[i]
        }
    }
    

    With gawk the output and end result are both predictable and as desired:

    $ gawk -f tst.awk
    4 1 a
    5 2 b
    6 3 c
    ---
    6 1 a
    6 2 b
    6 3 c
    6 101 foo101
    6 102 foo102
    6 103 foo103
    

    while with MacOS/BSD awk (ignore the order, just look at the length of the array and the values of the indices):

    $ awk -f tst.awk
    4 2 b
    5 3 c
    6 102 foo102
    7 103 foo103
    8 202 foo202
    9 203 foo203
    10 302 foo302
    11 1 a
    ---
    11 303 foo303
    11 2 b
    11 3 c
    11 402 foo402
    11 101 foo101
    11 102 foo102
    11 103 foo103
    11 202 foo202
    11 203 foo203
    11 302 foo302
    11 1 a
    

    it's apparently chaotic because it's trying to access the indices being added within the loop while it's looping but with limited success (presumably due to the order of the new indices in the hash table vs the previously visited hash table entries) which is lucky or we'd be stuck in an infinite loop.

    To get useful results from MacOS/BSD awk, etc. you again need to save the predetermined indices in a new array before looping as already demonstrated above:

    $ cat tst.awk
    BEGIN {
        split("a b c",a)
        for (i in a) {
            b[i]
        }
        for (i in b) {
            j=i+100
            a[j] = "foo" j
            print length(a), i, a[i]
        }
        print "---"
        for (i in a) {
            print length(a), i, a[i]
        }
    }
    
    $ awk -f tst.awk
    4 2 b
    5 3 c
    6 1 a
    ---
    6 2 b
    6 3 c
    6 101 foo101
    6 102 foo102
    6 103 foo103
    6 1 a
    

    Oh and wrt I know we don't have much control on the order in which the array elements are scanned - with GNU awk you can control that precisely by setting PROCINFO["sorted_in"], see https://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Scanning. For example:

    $ gawk 'BEGIN{split("10 2 bob alf",a);
        PROCINFO["sorted_in"]="@ind_str_asc"; for (i in a) print i, a[i]}'
    1 10
    2 2
    3 bob
    4 alf
    
    $ gawk 'BEGIN{split("10 2 bob alf",a);
        PROCINFO["sorted_in"]="@val_str_asc"; for (i in a) print i, a[i]}'
    1 10
    2 2
    4 alf
    3 bob
    
    $ gawk 'BEGIN{split("10 2 bob alf",a);
        PROCINFO["sorted_in"]="@val_num_asc"; for (i in a) print i, a[i]}'
    4 alf
    3 bob
    2 2
    1 10