Search code examples
pythonrubypattern-matchingstring-matching

How to detect identical part(s) inside string?


I try to break down the decoding algorithm wanted question into smaller questions. This is Part I.

Question:

  • two strings: s1 and s2
  • part of s1 is identical to part of s2
  • space is separator
  • how to extract the identical part(s)?

example 1:

s1 = "12 November 2010 - 1 visitor"
s2 = "6 July 2010 - 100 visitors"

the identical parts are "2010", "-", "1" and "visitor"

example 2:

s1 = "Welcome, John!"
s2 = "Welcome, Peter!"

the identical parts are "Welcome," and "!"

example 3: (to clarify the "!" example)

s1 = "Welcome, Sam!"
s2 = "Welcome, Tom!"

the identical parts are "Welcome," and "m!"

Python and Ruby preferred. Thanks


Solution

  • EDIT: Updated this example to work with all the examples, including #1:

    def scan(s1, s2):
        # Find the longest match where s1 starts with s2
        # Returns None if no matches
        l = len(s1)
        while 1:
            if not l:
                return None
            elif s1[:l] == s2[:l]:
                return s1[:l]
            else:
                l -= 1
    
    def contains(s1, s2):
        D = {} # Remove duplicates using a dict
        L1 = s1.split(' ')
        L2 = s2.split(' ')
    
        # Don't add results which have already 
        # been processed to satisfy example #1!
        DProcessed = {}
    
        for x in L1:
            yy = 0
            for y in L2:
                if yy in DProcessed:
                    yy += 1
                    continue
    
                # Scan from the start to the end of the words
                a = scan(x, y)
                if a: 
                    DProcessed[yy] = None
                    D[a] = None
                    break
    
                # Scan from the end to the start of the words
                a = scan(x[::-1], y[::-1])
                if a: 
                    DProcessed[yy] = None
                    D[a[::-1]] = None
                    break
                yy += 1
    
        return list(D.keys())
    
    print contains("12 November 2010 - 1 visitor",
                   "6 July 2010 - 100 visitors")
    print contains("Welcome, John!",
                   "Welcome, Peter!")
    print contains("Welcome, Sam!",
                   "Welcome, Tom!")
    

    Which outputs:

    ['1', 'visitor', '-', '2010']
    ['Welcome,', '!']
    ['Welcome,', 'm!']