
How to find indices of identical sub-sequences in two strings in Ruby?


Each instance of the class DNA wraps a string such as 'GCCCAC'. From such a string you can build arrays of k-mers, that is, substrings of length k taken at every starting position. This six-character string has 1-mers, 2-mers, 3-mers, 4-mers, 5-mers and one 6-mer:

  • 6 1-mers: ["G", "C", "C", "C", "A", "C"]
  • 5 2-mers: ["GC", "CC", "CC", "CA", "AC"]
  • 4 3-mers: ["GCC", "CCC", "CCA", "CAC"]
  • 3 4-mers: ["GCCC", "CCCA", "CCAC"]
  • 2 5-mers: ["GCCCA", "CCCAC"]
  • 1 6-mer: ["GCCCAC"]

The pattern should be evident. See the Wiki for details.
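To make the pattern concrete: a string of length n has n - k + 1 k-mers for 1 <= k <= n, one per starting position. The following is a minimal sketch that reproduces the list above by slicing the string directly. The helper name kmers_of is hypothetical and is not part of the DNA class asked about here.

```ruby
# Hypothetical standalone helper (an assumption for illustration only):
# returns every substring of length k, ordered by starting position.
def kmers_of(sequence, k)
  # No k-mers exist when k is out of range for this string.
  return [] if k < 1 || k > sequence.length

  # Valid starting positions are 0 .. n - k, giving n - k + 1 substrings.
  (0..sequence.length - k).map { |i| sequence[i, k] }
end

kmers_of('GCCCAC', 3)
#=> ["GCC", "CCC", "CCA", "CAC"]

# Print one line per k, mirroring the bullet list.
(1..6).each do |k|
  list = kmers_of('GCCCAC', k)
  puts "#{list.size} #{k}-mers: #{list.inspect}"
end
```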

The problem is to write the method shared_kmers(k, dna2) of the DNA class. It should return an array of all pairs [i, j] such that the k-mer starting at position i in the receiver (the DNA object that receives the message) equals the k-mer starting at position j in dna2.

dna1 = DNA.new('GCCCAC')
dna2 = DNA.new('CCACGC')

dna1.shared_kmers(2, dna2)
#=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]]

dna2.shared_kmers(2, dna1)
#=> [[0, 1], [0, 2], [1, 3], [2, 4], [4, 0]]

dna1.shared_kmers(3, dna2)
#=> [[2, 0], [3, 1]]

dna1.shared_kmers(4, dna2)
#=> [[2, 0]]

dna1.shared_kmers(5, dna2)
#=> []

Solution

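    class DNA
      attr_accessor :sequencing

      def initialize(sequencing)
        @sequencing = sequencing
      end

      # All substrings of length k, ordered by starting position.
      def kmers(k)
        @sequencing.each_char.each_cons(k).map(&:join)
      end

      # Pairs [i, j] where the k-mer at i in self equals the k-mer at j in dna.
      def shared_kmers(k, dna)
        # Build the other sequence's k-mers once rather than on every outer pass.
        other_kmers = dna.kmers(k)

        kmers(k).each_with_object([]).with_index do |(kmer, result), index|
          other_kmers.each_with_index do |other_kmer, other_kmer_index|
            result << [index, other_kmer_index] if kmer == other_kmer
          end
        end
      end
    end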
    
    dna1 = DNA.new('GCCCAC')
    dna2 = DNA.new('CCACGC')
    
    dna1.kmers(2)
    #=> ["GC", "CC", "CC", "CA", "AC"]
    
    dna2.kmers(2)
    #=> ["CC", "CA", "AC", "CG", "GC"]
    
    dna1.shared_kmers(2, dna2)
    #=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]]
    
    dna2.shared_kmers(2, dna1)
    #=> [[0, 1], [0, 2], [1, 3], [2, 4], [4, 0]]
    
    dna1.shared_kmers(3, dna2)
    #=> [[2, 0], [3, 1]]
    
    dna1.shared_kmers(4, dna2)
    #=> [[2, 0]]
    
    dna1.shared_kmers(5, dna2)
    #=> []
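The nested loops above compare every k-mer of one sequence with every k-mer of the other. For longer sequences, one possible alternative is to index the other sequence's k-mers in a Hash first and then look each k-mer up. This is only a sketch of that idea, not the method from the answer above; the name shared_kmers_indexed is hypothetical, and the class is repeated so the snippet runs on its own.

```ruby
class DNA
  attr_reader :sequencing

  def initialize(sequencing)
    @sequencing = sequencing
  end

  def kmers(k)
    @sequencing.each_char.each_cons(k).map(&:join)
  end

  # Hypothetical variant (an assumption, not the original answer):
  # same result as shared_kmers, using a k-mer => positions lookup table.
  def shared_kmers_indexed(k, dna)
    # Map each k-mer of the other sequence to the list of positions where it starts.
    positions = Hash.new { |hash, key| hash[key] = [] }
    dna.kmers(k).each_with_index { |kmer, j| positions[kmer] << j }

    # For each k-mer here, emit one [i, j] pair per matching position j.
    # fetch with a default avoids adding empty entries for k-mers that do not match.
    kmers(k).each_with_index.flat_map do |kmer, i|
      positions.fetch(kmer, []).map { |j| [i, j] }
    end
  end
end

dna1 = DNA.new('GCCCAC')
dna2 = DNA.new('CCACGC')

dna1.shared_kmers_indexed(2, dna2)
#=> [[0, 4], [1, 0], [2, 0], [3, 1], [4, 2]]

dna2.shared_kmers_indexed(2, dna1)
#=> [[0, 1], [0, 2], [1, 3], [2, 4], [4, 0]]
```

Pairs come out ordered by i and then by j, which matches the output of the nested-loop version for the examples in the question.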