Search code examples
bashawkpipelevenshtein-distance

Piping output to a bash function with multiple inputs


Here is what I am trying to do: I want to measure the Levensthein distance between two strings, using bash. I found an implementation of the LD here.

Now, suppose that I have some toy data like so:

1    The brown fox jumped    The green fox jumped
0    The red fox jumped    The green fox jumped
1    The gray fox jumped    The green fox jumped

and lets say that this is stored in data.test.

Then I put it through a simple awk command which filters out the rows which start with 1 like so:

awk -F '\t' '{if ($1>0) print $2,t,$3}' data.test

The first output from this simple command will then be:

The brown fox jumped    The green fox jumped

I now want to measure the Levensthein distance between these two sentences, by piping this output directly to this function (lifted from the above link):

function levenshtein {
    if (( $# != 2 )); then
        echo "Usage: $0 word1 word2" >&2
    elif (( ${#1} < ${#2} )); then
        levenshtein "$2" "$1"
    else
        local str1len=${#1}
        local str2len=${#2}
        local d

        for i in $( seq 0 $(( (str1len+1)*(str2len+1) )) ); do
            d[i]=0
        done

        for i in $( seq 0 $str1len );   do
            d[i+0*str1len]=$i
        done

        for j in $( seq 0 $str2len );   do
            d[0+j*(str1len+1)]=$j
        done

        for j in $( seq 1 $str2len ); do
            for i in $( seq 1 $str1len ); do
                [ "${1:i-1:1}" = "${2:j-1:1}" ] && local cost=0 || local cost=1
                del=$(( d[(i-1)+str1len*j]+1 ))
                ins=$(( d[i+str1len*(j-1)]+1 ))
                alt=$(( d[(i-1)+str1len*(j-1)]+cost ))
                d[i+str1len*j]=$( echo -e "$del\n$ins\n$alt" | sort -n | head -1 )
            done
        done
        echo ${d[str1len+str1len*(str2len)]}
    fi
}

I know you can do this, but I am getting stuck by there being two arguments that need passing, and the fact that I am passing sequences.

I have tried using various versions of this suggestion, which advocates grabbing the input as such:

function levenshtein {
    # Grab input.
    declare input1=${1:-$(</dev/stdin)};
    declare input2=${2:-$(</dev/stdin)};
.
.
.
}

This is the part I cannot quite get to work.


Solution

  • You don't need awk at all:

    while IFS=$'\t' read num first second; do
        [[ $num -gt 0 ]] || continue
        levenshtein "$first" "$second"
    done < data.txt
    

    (True, awk is faster at processing a large file than bash, but if you are implementing the Levenshtein algorithm in bash in the first place, speed is probably not a concern.)


    As an aside, a simpler (though minimally tested) implementation that doesn't require so much index arithmetic by using an associative array with "tuples" as keys.

    levenshtein () {
      if (( ${#1} < ${#2} )); then
        levenshtein "$2" "$1"
        return
      fi
    
      local str1len str2len cost m a b i j
      local -A d
    
      str1len=${#1}
      str2len=${#2}
      for ((i=0;i<=strlen1;i++)); do
        d[$i,0]=0
      done
    
      for ((j=0;j<=strlen2;j++)); do
        d[0,$j]=0
      done
    
      for ((j=1; j<=str2len; j++)); do
        for ((i=1; i<=str1len; i++)); do
          a=${1:i-1:1}
          b=${2:j-1:1}
          [ "$a" = "$b" ] && cost=0 || cost=1
          del=$(( $d[$((i-1)),$j] + 1 ))
          ins=$(( $d[$i,$((j-1))] + 1 ))
          alt=$(( $d[$((i-1)),$((j-1))] + cost ))
    
          # Compute the min without forking
          m=$del; ((ins < m)) && m=$ins; ((alt < m)) && m=$alt
    
          d[$i,$j]=$m
        done
      done
      echo ${d[$str1len,$str2len]}
    }