Search code examples
cstringsubstringstring-matching

Given an array of long strings, how do I efficiently check given pairs of substrings (of the given strings) if they differ by at most one character?


Any given pair of substrings are of the same length. Many pairs of substrings have to be checked, so a naive compare is not efficient enough, but I can't really think of any preprocessing of the array of strings that would help speed up the compare process. Thanks in advance!

An example is provided for clarification:

an array of long strings:

str = {"aaaaa", "aaabbcc", "abcdefgh"...}

pairs of substrings to be checked:

pairs = {(str[0][0..1],str[1][1..2]), (str[0][1..4],str[2][3..6]), (str[1][2..4], str[2][0..2])...}

pairs of substrings to be checked (substituted in):

pairs = {("aa","aa"), ("aaaa","defg"), ("abb","abc")...}

the final result:

result = {true, false, true}

A naive compare would result in a runtime of O(|pairs|*max(|str[i]|)), and I would like to improve that.


Solution

  • (Cross-posting my answer from Quora here).

    IMO, the question was not stated very clearly, but I think it seems to be asking the following: we are given a set of strings S[1], S[2], …, S[N], and a set of queries which each take the form (i1, j1, i2, j2, L). The answer to such a query is “yes” if the strings of length L starting at position j1 of S[i1] and at position j2 of S[i2] differ by a change in at most one character, otherwise “no”. The sum of the L-values across all such queries may be much larger than the total length of the strings.

    In that case, we can devise an efficient algorithm using the following observation: if S and T are strings of length L, then the statement “S and T differ by a change in at most one character” is equivalent to “LCP(S, T) + LCP(R(S), R(T)) >= L-1” where R denotes reversal of a string, and LCP is the length of the longest common prefix of two strings.

    Thus, in order to answer queries efficiently, we only need to preprocess the strings S[1], …, S[N] and R(S[1]), …, R(S[N]) so that longest-common-prefix queries are fast. This can be done by concatenating S[1], …, S[N] to give one string S, and building the suffix array and longest-common-prefix array of S, then doing the same for the reverse of S. Determining the LCP of two substrings of the original strings is then equivalent to determining the LCP of two substrings of S (*) which is equivalent to a range-minimum-query in the LCP array, which can be answered efficiently by preprocessing. A similar statement holds for the reverses of the original strings and the reverse of S.

    (*) Technically, the LCP in the concatenated string S can extend beyond the boundaries of the original strings. However, this would only happen if the query substrings were in fact identical, so it would just mean we would answer “yes” in cases where the answer is “yes” anyway.