Search code examples
algorithmprefix

Detect largest suffix of some file that is prefix of another file


I have two files - let's call them file0 and file1.

What I would like to get is a fast algorithm for the following problem (it is clear to me how to write a rather slow algorithm that solves it):

Detect the largest suffix of file0 that is a prefix of file1, that means a memory block B (or more precisely: the number of bytes of such a memory block) of maximum length so that

  • file0 consists of some memory block A, followed by B
  • file1 constist of memory block B, followed by some memory block C

Note that the blocks A, B and C can also have a length of zero bytes.

Edit (to answer drysdam's remark): the obvious rather slow algorithm that I think of (Pseudocode): let the length of the files be bounded by m, n with wlog m<=n.

for each length from m to 0
    compare the m last bytes of file0 with the first m bytes of file1
    if they are equal
        return length

This is obviously an O(m*min(m, n)) algorithm. If the files are about the same size this is O(n^2).

The files that I have to handle currently are of size of 10 up to a few hundread megabytes. But in extreme cases they can also be of size of a few gigabytes - just big enough not to fit into the 32 bit adress space of x86 anymore.


Solution

  • Depending on how much memory you have available, you may want to consider building a suffix tree for the first file. Once you have this, you can query the prefix of the second file that maximally overlaps with a suffix of the second file by just walking the suffix tree down from the root along the edges matching the letters of the prefix of the second file. Since suffix trees can be built in linear time, the runtime for this algorithm is O(|A| + |B|) using your terminology, since it takes O(|A| + |B|) time to build the suffix tree and O(|B|) time to walk the suffix tree to find the block B.