
Generate many random alphanumeric strings


From inside awk, I want to generate, on demand and quickly, a string of X alphanumeric characters that is reasonably random (i.e., random but not cryptographically secure).

In Ruby, I could do this:

ruby -e '
def rand_string(len, min=48, max=123, pattern=/[[:alnum:]]/)
    rtr=""
    while rtr.length<len do
        rtr+=(0..len).map { (min + rand(max-min)).chr }.
            select{|e| e[pattern] }.join
    end                     # falls out when min length achieved 
    rtr[0..len]
end

(0..5).each{|_| puts rand_string(20)}'  

Prints:

7Ntz5NF5juUL7tGmYQhsc
kaOzO1aIxkW5rmJ9CaKtD
49SpdFTibXR1WPWV7li6c
PT862YZQd0dOIaFOIY0d1
vYktRXkdsj38iH3s2WKI
3nQZ7cCVEXvoaOZvm6mTR

For a time comparison, the Ruby version produces 1,000,000 unique strings (no duplicates) in roughly 9 seconds.

Taking that, I tried in awk:

awk -v r=$RANDOM '
# the r value will only be a new seed each invocation -- not each f call
function rand_string(i) {
    s=""
    min=48
    max=123
    srand(r)
    while (length(s)<i) {
        c=sprintf("%c", int(min+rand()*(max-min+1)))
        if (c~/[[:alnum:]]/) s=s c
    }
    return s
}
BEGIN{ for (i=1; i<=5; i++) {print rand_string(20)}}'

That does not work -- same seed, same string result. Prints:

D65CsI55zTsk5otzSoJI
D65CsI55zTsk5otzSoJI
D65CsI55zTsk5otzSoJI
D65CsI55zTsk5otzSoJI
D65CsI55zTsk5otzSoJI
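
The repeated output follows from how srand() works: reseeding with the same value restarts rand() at the same point in its sequence. A minimal check of that behavior is sketched here; the wrapper name reseed_demo is hypothetical and the seed 7 is arbitrary:

```shell
#!/bin/sh
# reseed_demo is a hypothetical helper name; the seed 7 is arbitrary.
reseed_demo() {
    awk 'BEGIN {
        srand(7); a = rand()      # first value after seeding
        srand(7); b = rand()      # reseeding with the same value replays the sequence
        c = rand()                # without reseeding, the sequence moves on
        same_after_reseed = (a == b)
        differs_without_reseed = (b != c)
        print same_after_reseed, differs_without_reseed
    }'
}

reseed_demo
```

On a conforming awk this should print 1 1: the value after reseeding matches the first one, and the next value without reseeding does not.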

Now try reading /dev/urandom with od:

awk '
function rand_string(i) {
    arg=i*4
    cmd="od -A n -t u1 -N " arg " /dev/urandom"  # this is POSIX
    #   -A n   suppress the address column
    #   -t u1  print each byte as an unsigned decimal
    #   -N arg read only i*4 bytes
    s=""
    min=48
    max=123
    while (length(s)<i) {
        while((cmd | getline line)>0) {
            split(line, la)
            for (e in la) {
                if (la[e]<min || la[e]>max) continue
                c=sprintf("%c", la[e])
                if (c~/[[:alnum:]]/) s=s c
            }
        }
        close(cmd)
    }
    return substr(s,1,i)
}
BEGIN {for(i=1;i<=5;i++) print rand_string(20) }'

This works as desired. Prints:

sYY195x6fFQdYMrOn1OS
9mv7KwtgdUu2DgslQByo
LyVvVauEBZU2Ad6kVY9q
WFsJXvw8YWYmySIP87Nz
AMcZY2hKNzBhN1ByX7LW

But now the problem is that the pipe to od -A n -t u1 -N " arg " /dev/urandom is really slow -- unusable for anything beyond a trivial number of strings.
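
Most of that cost is probably process startup, since every call spawns a new od and then closes it. One possible sketch, not benchmarked here, keeps a single od pipe open and pulls one line (16 bytes) from it per getline. The wrapper name rand_strings_od, the -v flag (so od never collapses repeated lines into `*`), and the numeric range checks in place of the regex are assumptions added for illustration, and it still depends on /dev/urandom existing:

```shell
#!/bin/sh
# rand_strings_od COUNT LEN  (hypothetical wrapper; assumes /dev/urandom exists)
rand_strings_od() {
    awk -v count="$1" -v len="$2" '
    function rand_string(n,    s, line, bytes, k, i, b) {
        s = ""
        while (length(s) < n) {
            # Read the next 16 random bytes from the pipe that is already open.
            if ((cmd | getline line) <= 0) return ""
            k = split(line, bytes, " ")
            for (i = 1; i <= k && length(s) < n; i++) {
                b = bytes[i] + 0
                # Keep only ASCII digits, upper-case and lower-case letters.
                if ((b >= 48 && b <= 57) || (b >= 65 && b <= 90) || (b >= 97 && b <= 122))
                    s = s sprintf("%c", b)
            }
        }
        return s
    }
    BEGIN {
        # No -N limit: od keeps streaming and awk reads only what it needs.
        cmd = "od -v -A n -t u1 /dev/urandom"
        for (j = 1; j <= count; j++) print rand_string(len)
        close(cmd)    # one close at the very end instead of one per string
    }'
}

rand_strings_od 5 20
```

Unused bytes at the end of a line are simply discarded, which wastes a little entropy but keeps the loop simple.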

Any idea how I can modify one of those awks so that it:

  1. Runs on most platforms (i.e., default POSIX kit);
  2. Can produce reasonably random strings of X length rapidly.

This question has been asked a few times:

  1. How can I replace a string with a random alphanumeric string 48 characters long using awk, where the answer is to use external tools -- too slow;
  2. Substitute given pattern with a random one with awk, but that produces a random int and does not use srand;
  3. Execute a command (to generate random strings) inside awk, but that again uses a shell pipe (too slow) and is Linux only.

Solution

  • I don't have access to Ruby, but on my (apparently slow!) system the awk script from @dawg's answer takes 24 seconds to run while this one takes 5 seconds:

    $ cat tst.sh
    #!/usr/bin/env bash
    
    time awk -v r=$RANDOM '
    function rand_string(n,         s,i) {
        for ( i=1; i<=n; i++ ) {
            s = s chars[int(1+rand()*numChars)]
        }
        return s
    }
    BEGIN{
        srand(r)      # Use srand ONCE only
        for (i=48; i<=122; i++) {
            char = sprintf("%c", i)
            if ( char ~ /[[:alnum:]]/ ) {
                chars[++numChars] = char
            }
        }
    
        for (i=1; i<=1000000; i++) {print rand_string(20)}
    }' | sort | uniq -c | awk '$1>1'
    

    $ ./tst.sh
    
    real    0m5.078s
    user    0m4.077s
    sys     0m0.045s
    

    So if you want to produce a lot of strings, create an array of the possible characters first and then index that array using rand(), instead of calling sprintf() for every character of every string.

    Since repeatedly appending to a variable like s is slow in terms of memory [re]allocation, you can make the script about 20% faster still by setting OFS="" (e.g., in the BEGIN section) and then assigning each char to $i rather than building up a string:

    function rand_string(n,         i) {
        for ( i=1; i<=n; i++ ) {
            $i = chars[int(1+rand()*numChars)]
        }
        return $0
    }
    

    $ ./tst2.sh
    
    real    0m3.954s
    user    0m3.420s
    sys     0m0.015s
    

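    This works as long as you don't need $0 for anything else and n never shrinks between calls, since fields left over from a longer earlier call would otherwise remain in $0.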
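
For reference, a complete version of the field-based variant might look like the sketch below. The wrapper name rand_strings_fields, its three arguments, and the `$0 = ""` reset are assumptions added for illustration; the ASCII ranges are also spelled out numerically here rather than tested with [[:alnum:]].

```shell
#!/bin/sh
# rand_strings_fields SEED COUNT LEN  (hypothetical wrapper around the awk script)
rand_strings_fields() {
    awk -v seed="$1" -v count="$2" -v len="$3" '
    function rand_string(n,    i) {
        $0 = ""    # drop fields from an earlier call; optional if n never changes
        for (i = 1; i <= n; i++) {
            $i = chars[int(1 + rand() * numChars)]
        }
        return $0  # fields joined with the empty OFS
    }
    BEGIN {
        OFS = ""       # no separator between the one-character fields
        srand(seed)    # seed ONCE only
        # Build the lookup table of digits, upper-case and lower-case letters.
        for (i = 48; i <= 122; i++) {
            if ((i <= 57) || (i >= 65 && i <= 90) || (i >= 97)) {
                chars[++numChars] = sprintf("%c", i)
            }
        }
        for (i = 1; i <= count; i++) print rand_string(len)
    }'
}

# Fixed seed for a repeatable run; pass "$RANDOM" in bash for varying output.
rand_strings_fields 42 5 20
```

The reset of $0 costs a little time per call, so it can be dropped when every call uses the same length, as in the timing test above.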