Here is what I am trying to do: I want to measure the Levensthein distance between two strings, using bash. I found an implementation of the LD here.
Now, suppose that I have some toy data like so:
1 The brown fox jumped The green fox jumped
0 The red fox jumped The green fox jumped
1 The gray fox jumped The green fox jumped
and lets say that this is stored in data.test
.
Then I put it through a simple awk
command which filters out the rows which start with 1
like so:
awk -F '\t' '{if ($1>0) print $2,t,$3}' data.test
The first output from this simple command will then be:
The brown fox jumped The green fox jumped
I now want to measure the Levensthein distance between these two sentences, by piping this output directly to this function (lifted from the above link):
function levenshtein {
if (( $# != 2 )); then
echo "Usage: $0 word1 word2" >&2
elif (( ${#1} < ${#2} )); then
levenshtein "$2" "$1"
else
local str1len=${#1}
local str2len=${#2}
local d
for i in $( seq 0 $(( (str1len+1)*(str2len+1) )) ); do
d[i]=0
done
for i in $( seq 0 $str1len ); do
d[i+0*str1len]=$i
done
for j in $( seq 0 $str2len ); do
d[0+j*(str1len+1)]=$j
done
for j in $( seq 1 $str2len ); do
for i in $( seq 1 $str1len ); do
[ "${1:i-1:1}" = "${2:j-1:1}" ] && local cost=0 || local cost=1
del=$(( d[(i-1)+str1len*j]+1 ))
ins=$(( d[i+str1len*(j-1)]+1 ))
alt=$(( d[(i-1)+str1len*(j-1)]+cost ))
d[i+str1len*j]=$( echo -e "$del\n$ins\n$alt" | sort -n | head -1 )
done
done
echo ${d[str1len+str1len*(str2len)]}
fi
}
I know you can do this, but I am getting stuck by there being two arguments that need passing, and the fact that I am passing sequences.
I have tried using various versions of this suggestion, which advocates grabbing the input as such:
function levenshtein {
# Grab input.
declare input1=${1:-$(</dev/stdin)};
declare input2=${2:-$(</dev/stdin)};
.
.
.
}
This is the part I cannot quite get to work.
You don't need awk
at all:
while IFS=$'\t' read num first second; do
[[ $num -gt 0 ]] || continue
levenshtein "$first" "$second"
done < data.txt
(True, awk
is faster at processing a large file than bash
, but if you are implementing the Levenshtein algorithm in bash
in the first place, speed is probably not a concern.)
As an aside, a simpler (though minimally tested) implementation that doesn't require so much index arithmetic by using an associative array with "tuples" as keys.
levenshtein () {
if (( ${#1} < ${#2} )); then
levenshtein "$2" "$1"
return
fi
local str1len str2len cost m a b i j
local -A d
str1len=${#1}
str2len=${#2}
for ((i=0;i<=strlen1;i++)); do
d[$i,0]=0
done
for ((j=0;j<=strlen2;j++)); do
d[0,$j]=0
done
for ((j=1; j<=str2len; j++)); do
for ((i=1; i<=str1len; i++)); do
a=${1:i-1:1}
b=${2:j-1:1}
[ "$a" = "$b" ] && cost=0 || cost=1
del=$(( $d[$((i-1)),$j] + 1 ))
ins=$(( $d[$i,$((j-1))] + 1 ))
alt=$(( $d[$((i-1)),$((j-1))] + cost ))
# Compute the min without forking
m=$del; ((ins < m)) && m=$ins; ((alt < m)) && m=$alt
d[$i,$j]=$m
done
done
echo ${d[$str1len,$str2len]}
}