Search code examples
stringalgorithmhamming-distance

Get all pairs of strings (DNA) seperated by a distance of Hamming = 1


I have an array of DNA sequences, like:

AA
TA
AC
CC

and I search a faster way to calculate the hamming distance between all sequences pairs (maybe by sorting...), then naive approach (O(N^2))

For motif1 in array
   For motif2 in array
      calculate Hamming_Distance(motif1 , motif2)
   end
end

I need the sequence of pairs that have an Hamming distance = 1


Solution

  • If your n >> k, then you can try following

    Your original complexity is O(nnk), where k is length of the sequence (as Hamming distance comparison requires k steps). Let's try to improve on that.

    1. Create hashmap with all strings in it (complexity O(n*k) because of hashing)
    2. For each string in your input, create all strings which are 1 away from it and see if they are contained in hashmap - if yes, you have found a pair (complexity O(nkk) because you need to hash O(k) each of k variations for each of n strings)

    With that, you have replaced O(nnk) with O(nkk), which should be beneficial if n >> k.

    For k >> n you probably don't really care about n^2 part, so use trivial algorithm.

    For k near n, you can try what I have suggested in comment

    1. Create pseudo-hash for each sequence, by summing all letters with 0,1,2,3 (complexity O(n*k))
    2. Sort them (complexity O(n*logn) if you use out of the box sorting, or O(n) with radix/bucket sort)
    3. Compare pairs going through sorted sequence, looking only at things which are max 3 away from each other (complexity depending on your case, will be O(nnk) in most pathological case, but with real world data it can be closer to O(nkf(n)) where f(n) would be very very small)