Search code examples
perl

How to find the indices of sub-list patterns in Perl


I have a longer list whose elements are multi-character symbols, for example:

@c = qw(iim v7 v7 iM iv7 iM im iv7 iv7 bviiM im biio iim bviim biiM biim bviM bviM ivm iih v7 v7 v7 iiim iiih vi7 iim v7 v7 iM iv7 iM im iv7 bviiM im biio iim bviim bviim iiio iim v7 v7 v7 vm i7 ivM iiih vi7);

I'd like to find the indices in this list that match sub-lists of the type S1+ S2+ S3+, where the "+" means match one or more times. So, for example, the sub-list pattern (im iv7 bviiM) would match both (im iv7 iv7 bviiM) and (im iv7 bviiM) as highlighted in bold above. The code would provide the indices 6, 7, 8, 9 for the first match and 32, 33, 34 for the second.

On the surface, this doesn't seem like it should be difficult, and I've tried implementing this with a variety of methods, including regexes, but so far it has beaten me. If there's a simple way to do this I'd be grateful for any help.


Solution

  • An interesting problem, because repeated elements need also be matched by items from the given sub-sequence while the order need be maintained.

    use warnings;
    use strict;
    use feature 'say';
    use Data::Dump qw(dd);
    
    my @words = qw(iim v7 v7 iM iv7 iM im iv7 iv7 bviiM im biio iim bviim biiM
        biim bviM bviM ivm iih v7 v7 v7 iiim iiih vi7 iim v7 v7 iM iv7 iM im 
        iv7 bviiM im biio iim bviim bviim iiio iim v7 v7 v7 vm i7 ivM iiih vi7);
    
    my @subseq = qw(im iv7 bviiM);
    
    my (@all_seqs, @mi);
    my $s = 0;
    
    for my $i (0 .. $#words) { 
        if ($words[$i] eq $subseq[$s]) {  # first in @subseq or repeated from @words
            push @mi, $i;
        }   
        elsif (@mi and $s == @subseq-1) { # done, exhausted @subseq
            push @all_seqs, [ @mi ];  
            $s = 0;
            @mi = (); 
        }   
        elsif (@mi and $words[$i] eq $subseq[++$s]) { # next in @subseq
            push @mi, $i;
        }
        elsif (@mi) { # failed to match all from @subseq
            $s = 0;  
            @mi = ();
        }
    }
    dd \@all_seqs;
    

    The @mi is included in all tests after the first one so that they are done only when something has already been matched.

    Prints

    [[6 .. 9], [32, 33, 34]]
    

    Uncomment printing lines to track its operation. This has been tested beyond the basic run above but not well enough.


    Or, concatenate all words into a string and match the sub-sequence, concatenated into a pattern, by regex; then it's easy to take care of possible repetitions. In order to also scoop up indices from the original array in a match I prepend each word by __INDEX__.

    # Same @words and @subseq from above
    
    my $w = join '', map { '__'.$_.'__' . $words[$_] } 0.. $#words;
    
    my $patt = '(' . 
        join('', map { '(?:' . '__[0-9]+__' . quotemeta($_) . ')+' } @subseq) . ')';
    
    my @seqs = $w =~ /$patt/g;
    
    my @seqs_idx = map { [ /__([0-9]+)__/g ]  } @seqs;
    
    dd \@seqs_idx;
    

    Since __IDX__ must not be in @words nor @subseq it should really be checked for. That would hurt efficiency so perhaps use an even more unlikely separator-token built with the index (and if it contains regex-special characters put it through quotemeta in the @subseq-based pattern).