Search code examples
bashperformancesubsequence

bash subsequence matching speed-up


I am wondering if there is an easy way to check if a string is a subsequence of another string in bash, actually a subsequence with an extra rule. I will explain.

Some subsequences of "apple" are "aple", "al", "pp" and "ale". The subsequences with an extra rule, I want to get are those that start and end with the same letter as the string so only "aple" and "ale" fit my desire.

I have made the following program:

#!/bin/bash
while read line
do
    search=$(echo "$line" | tr -s 'A-Za-z' | sed 's/./\.\*&/g;s/^\.\*//' )
    expr match "$1" "$search" >/dev/null && echo "$line"
done

It is executed as followed:

./program.sh greogdgedlqfe < words.txt

This program works, but is very slow.

It takes every line of the file, modify it to regex expression and then check if they match and then print the original line. So example:

one of the lines has the word google

$search becomes g.*o.*g.*l.*e (repeated letters become squeezed, extra rule )

then we check that expression with the given parameter and if it matches, we print the line: google

This works fine, however when the file words.txt gets too big, this program becomes too slow. How can I speed up my program, possibly by faster matching subsequences.

Edit after possible solution of Kamilcuk

That solution returns quick,quiff,quin,qwerty for the string "qwertyuihgfcvbnhjk" and only quick should be returned, so it is almost correct, but not quite yet.


Solution

  • Try it like so:

    grep -x "$(<<<"$1" tr -s 'A-Za-z' | sed 's/./&*/g;s/\*$//;s/\*//1')" words.txt
    

    Tested against:

    set -- apple  
    cat >words.txt <<EOF
    aple
    al
    pp
    ale
    fdafda
    apppppppple
    apple
    google
    EOF
    

    outputs:

    aple
    ale
    apppppppple
    apple
    

    And for set -- greogdgedlqfe it outputs just google.

    If I understand you correctly, a "subsequent" of apple is everything that mathes ap*l*e.

    Tested on repl