algorithm dynamic-programming string-matching fuzzy-search

Efficiently find a given subsequence in a string, maximizing the number of contiguous characters

Long problem description

Fuzzy string matcher utilities like fzf or CtrlP filter a list of strings for ones which have a given search string as a subsequence. As an example, consider that a user wants to search for a specific photo in a list of files. To find the file

/home/user/photos/2016/pyongyang_photo1.png

it suffices to type ph2016png, because this search string is a subsequence of this file name. (Mind that this is not LCS. The whole search string must be a subsequence of the file name.)

It is trivial to check whether a given search string is a subsequence of another string, but I wonder how to efficiently obtain the best match: In the above example, there are multiple possible matches. One is

/home/user/photos/2016/pyongyang_photo1.png

but the one which the user probably had in mind is

/home/user/photos/2016/pyongyang_photo1.png

To formalize this, I'd define the "best" match as the one that is composed of the the smallest number of substrings. This number is 5 for the first example match and 3 for the second.

I came up with this because it would be interesting to obtain the best match to assign a score to each result, for sorting. I'm not interested in approximate solutions though, my interest in this problem is primarily of academic nature.

tl;dr problem description

Given strings s and t, find among the subsequences of t that are equal to s one that maximizes the number of pairs of elements that are contiguous in t.

What I've tried so far

For discussion, let's call the search query s and the string to test t. The problem's solution is denoted fuzzy(s, t). I'll utilize Python's string slicing notation. The easiest approach is as follows:

Since any solution must use all characters from s in order, an algorithm for solving this problem can start by searching the first occurrence of s[0] in t (with index i) and then use the better of the two solutions

t[:i+1] + fuzzy(s[1:], t[i+1:])    # Use the character
t[:i]   + fuzzy(s,     t[i+1:])    # Skip it and use the next occurence 
                                   # of s[0] in t instead

This is obviously not the best solution to this problem. En contraire, it's the obvious brute force one. (I've played around with simultaneously searching for the last occurrence of s[-1] and using this information in an earlier version of this question, but it turned out that this approach does not work.)

→ My question is: What is the most efficient solution to this problem?

Solution

I would suggest creating a search tree, where each node represents a character position in the haystack that matches one of the needle characters.

The top nodes are siblings and represent the occurrences of the first needle character in the haystack.

The children of a parent node are those nodes that represent the occurrences of the next needle character in the haystack, but only those that are positioned after the position represented by that parent node.

This logically means that some children are shared by several parents, and so this structure is not really a tree, but a directed acyclic graph. Some sibling parents might even have exactly the same children. Other parents might not have children at all: they are a dead-end, unless they are at the bottom of the graph where the leaves represent positions of the last needle character.

Once this graph is set up, a depth-first search in it can easily derive the number of segments that are still needed from a certain node onwards, and then minimise that among alternatives.

I have added some comments in the Python code below. This code might still be improved, but it seems already quite efficient compared to your solution.

def fuzzy_trincot(haystack, needle, returnSegments = False):
    inf = float('inf')

    def getSolutionAt(node, depth, optimalCount = 2):
        if not depth: # reached end of needle
            node['count'] = 0
            return
        minCount = inf # infinity ensures also that incomplete branches are pruned
        child = node['child']
        i = node['i']+1
        # Optimisation: optimalCount gives the theoretical minimum number of  
        # segments needed for any solution. If we find such case, 
        # there is no need to continue the search.
        while child and minCount > optimalCount:
            # If this node was already evaluated, don't lose time recursing again.
            # It works without this condition, but that is less optimal.
            if 'count' not in child:
                getSolutionAt(child, depth-1, 1)
            count = child['count'] + (i < child['i'])
            if count < minCount:
                minCount = count
            child = child['sibling']
        # Store the results we found in this node, so if ever we come here again,
        # we don't need to recurse the same sub-tree again.
        node['count'] = minCount

    # Preprocessing: build tree
    # A node represents a needle character occurrence in the haystack.
    # A node can have these keys:
    #   i:       index in haystack where needle character occurs
    #   child:   node that represents a match, at the right of this index, 
    #            for the next needle character
    #   sibling: node that represents the next match for this needle character
    #   count:   the least number of additional segments needed for matching the 
    #            remaining needle characters (only; so not counting the segments
    #            already taken at the left)
    root = { 'i': -2, 'child': None, 'sibling': None }
    # Take a short-cut for when needle is a substring of haystack
    if haystack.find(needle) != -1:
        root['count'] = 1
    else:
        parent = root
        leftMostIndex = 0
        rightMostIndex = len(haystack)-len(needle)
        for j, c in enumerate(needle):
            sibling = None
            child = None
            # Use of leftMostIndex is an optimisation; it works without this argument
            i = haystack.find(c, leftMostIndex)
            # Use of rightMostIndex is an optimisation; it works without this test
            while 0 <= i <= rightMostIndex:
                node = { 'i': i, 'child': None, 'sibling': None }
                while parent and parent['i'] < i:
                    parent['child'] = node
                    parent = parent['sibling']
                if sibling: # not first child
                    sibling['sibling'] = node
                else: # first child
                    child = node
                    leftMostIndex = i+1
                sibling = node
                i = haystack.find(c, i+1)
            if not child: return False
            parent = child
            rightMostIndex += 1
        getSolutionAt(root, len(needle))

    count = root['count']
    if not returnSegments:
        return count

    # Use the `returnSegments` option when you need the character content 
    # of the segments instead of only the count. It runs in linear time.

    if count == 1: # Deal with short-cut case 
        return [needle]
    segments = []
    node = root['child']
    i = -2
    start = 0
    for end, c in enumerate(needle):
        i += 1
        # Find best child among siblings
        while (node['count'] > count - (i < node['i'])):
            node = node['sibling']
        if count > node['count']:
            count = node['count']
            if end:
                segments.append(needle[start:end])
                start = end
        i = node['i']
        node = node['child']
    segments.append(needle[start:])
    return segments

The function can be called with an optional third argument:

haystack = "/home/user/photos/2016/pyongyang_photo1.png"
needle = "ph2016png"

print (fuzzy_trincot(haystack, needle))

print (fuzzy_trincot(haystack, needle, True))

Outputs:

3
['ph', '2016', 'png']

As the function is optimised to return only the count, the second call will add a bit to the execution time.