python regex text-search

Finding Standardized Text Pattern In String


We are searching a very large set of strings for standard number patterns in order to locate drawing sheet numbers. For example, valid sheet numbers include A-101, A101, C-101, C102, E-101, A1, C1, A-100-A, etc.

A sheet number may be embedded in a longer string, such as "The sheet number is A-101 first floor plan".

Sheet numbers always follow a small set of character-type patterns (letters, numbers, and separators such as -, space, or _). If we convert every valid sheet number to a pattern indicating the type of each character (A-101 = ASNNN, A101 = ANNN, A1 = AN, etc.), there are only ~100 valid patterns.

Our plan is to convert each character in the string to its character type and then search for a valid pattern. So the question is: what is the best way to search through "AAASAAAAASAAAAAASAASASNNNSAAAAASAAAAASAAAA" to find one of the 100 valid character-type patterns? We considered running 100 separate text searches, one per pattern, but it seems there could be a better way: find a candidate pattern first, then check whether it is one of the 100 valid patterns.
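The plan described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names (char_type, encode, find_sheet_numbers) are hypothetical, the three patterns stand in for the full list of ~100, and lowercase letters are treated as type A because the encoded example string does so. Since every character maps to exactly one type code, match positions in the encoded string line up with positions in the original text.

```python
import re

def char_type(ch):
    """Map one character to its type code (hypothetical helper)."""
    if ch.isalpha():
        return 'A'
    if ch.isdigit():
        return 'N'
    if ch in ' _-':
        return 'S'
    return 'X'  # assumption: any other character never belongs to a sheet number

def encode(text):
    """Convert a string to its character-type string, one code per character."""
    return ''.join(char_type(ch) for ch in text)

# Stand-in for the ~100 valid patterns (assumption: only three shown here)
VALID_PATTERNS = ['ASNNN', 'ANNN', 'AN']

# Longest patterns first, so the alternation tries the most specific one first.
alternation = '|'.join(sorted(VALID_PATTERNS, key=len, reverse=True))
# The lookarounds reject candidates glued to a neighbouring letter or digit.
TYPE_RE = re.compile(rf'(?<![AN])(?:{alternation})(?![AN])')

def find_sheet_numbers(text):
    """Search the encoded string once and slice the original text at each match."""
    encoded = encode(text)
    return [text[m.start():m.end()] for m in TYPE_RE.finditer(encoded)]

print(encode("The sheet number is A-101 first floor plan"))
# AAASAAAAASAAAAAASAASASNNNSAAAAASAAAAASAAAA
print(find_sheet_numbers("The sheet number is A-101 first floor plan"))
# ['A-101']
```

A single alternation lets the regex engine scan the encoded string once instead of once per pattern.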


Solution

    Is this what you want?

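    import re
    
    # Regex character class for each character-type code
    pattern_dict = {
        'S': r'[ _-]',  # separator
        'A': r'[A-Z]',  # uppercase letter
        'N': r'[0-9]',  # digit
    }
    
    # Valid character-type patterns (extend with the full list)
    patterns = [
        'ASNNN',
        'ANNN',
        'AN',
    ]
    
    text = "A-1 A2 B-345 C678 D900 E80"
    
    for pattern in patterns:
        # Expand each type code into its character class, e.g. 'AN' -> '[A-Z][0-9]'
        converted = ''.join(pattern_dict[c] for c in pattern)
        # \b anchors the match at word boundaries, so 'E80' does not match 'AN'
        print(pattern, re.findall(rf'\b{converted}\b', text))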
    

    Output:

    ASNNN ['B-345']
    ANNN ['C678', 'D900']
    AN ['A2']
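The loop above scans the text once per pattern. With ~100 patterns, one possible variation is to join them into a single compiled regex and scan once. The sketch below assumes the same pattern_dict and the same three sample patterns; the to_regex helper and the use of named groups to report which pattern matched are illustrative choices, not part of the original answer.

```python
import re

pattern_dict = {
    'S': r'[ _-]',
    'A': r'[A-Z]',
    'N': r'[0-9]',
}

patterns = ['ASNNN', 'ANNN', 'AN']  # stand-in for the full list

def to_regex(pattern):
    """Expand a type pattern such as 'AN' into '[A-Z][0-9]' (hypothetical helper)."""
    return ''.join(pattern_dict[c] for c in pattern)

# One named group per pattern; the group name is the type pattern itself.
# Longest patterns come first so the most specific alternative is tried first.
alternatives = '|'.join(
    f'(?P<{p}>{to_regex(p)})'
    for p in sorted(patterns, key=len, reverse=True)
)
combined = re.compile(r'\b(?:' + alternatives + r')\b')

text = "A-1 A2 B-345 C678 D900 E80"

# m.lastgroup is the name of the group that matched, i.e. the type pattern.
matches = [(m.lastgroup, m.group()) for m in combined.finditer(text)]
print(matches)
# [('AN', 'A2'), ('ASNNN', 'B-345'), ('ANNN', 'C678'), ('ANNN', 'D900')]
```

Named groups work here because the type patterns consist only of letters, which makes them valid group names.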
    

    Explanation

    • rf'some\b {string}': A combination of a raw string and an f-string.
    • r'some\b': A raw string. It prevents Python from processing backslash escapes, so it is the same as 'some\\b'.
    • f'{string}': A formatted string literal (f-string), supported in Python 3.6+. It is similar to '{}'.format(string).
    • So you can replace rf'\b{converted}\b' with '\\b' + converted + '\\b'.
    • \b in regex: Matches a word boundary.
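The points above can be checked with a short snippet. The variable names and sample strings are made up for illustration. One caveat worth hedging: the regex engine counts the underscore as a word character, so \b does not see a boundary between "_" and a letter, which may matter when "_" is used as a separator.

```python
import re

converted = r'[A-Z][0-9]'

# Three equivalent ways to build the same regex source string
raw_f = rf'\b{converted}\b'
concatenated = '\\b' + converted + '\\b'
formatted = r'\b{}\b'.format(converted)
assert raw_f == concatenated == formatted

# Without the r prefix, '\b' is a single backspace character,
# not the two characters backslash + b that the regex engine needs.
print(len('\b'), len(r'\b'))  # 1 2

# \b rejects matches glued to other word characters
print(re.findall(raw_f, "A2 E80 xA2"))  # ['A2']

# '_' is a word character, so there is no boundary between '_' and 'A'
print(re.findall(raw_f, "_A2"))  # []
```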