python regex speech-to-text rasa

How can I write Python Regex that will take 4 numbers followed by Phonetic Alphabet values? Example: 1 2 3 4 Alpha Bravo -> 1234AB


I am using the following script so that Rasa framework will detect a Dutch postcode when it is passed by a user:

https://medium.com/@naoko.reeves/rasa-regex-entity-extraction-317f047b28b6

The format of a Dutch postcode is 1234 AB. Matching this works well with a regex like:

 [1-9][0-9]{3}[\s]?[a-z]{2}

However, I am now trying to implement Speech-To-Text functionality (Azure Cognitive Services), which does not recognise single letters reliably, e.g. 'B' is picked up as 'Bee'.

I am now trying to alter the regex so that the user can say '1 2 3 4 Alpha Bravo' and the regex extractor will pick out '1 2 3 4 A B'.

I have tried using a word boundary, like the following:

[1-9]*[\s]?[0-9]*[\s]?[0-9]*[\s]?[0-9]*[\s]?\b[a-zA-Z]

and

[1-9]\s[0-9\s]{5}\s?\b[a-zA-Z]

The former is far too lenient and if the user says 'Hello There', it will trigger the regex extractor and pass 'HT' to the postcode behaviour.

The latter is more strict but I can only get '1 2 3 4 Alpha Bravo' to match as '1 2 3 4 A'.
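Both behaviours can be reproduced with a short script. This is only a sketch: the variable names `lenient` and `strict` are illustrative, and `re.findall` is used for the first pattern to show every place it fires in the input.

```python
import re

# First attempt: every digit and space is optional, so a bare letter at a
# word boundary is enough to produce a match.
lenient = r"[1-9]*[\s]?[0-9]*[\s]?[0-9]*[\s]?[0-9]*[\s]?\b[a-zA-Z]"

# Second attempt: requires the digits, but only consumes one letter.
strict = r"[1-9]\s[0-9\s]{5}\s?\b[a-zA-Z]"

lenient_hits = re.findall(lenient, "Hello There")
strict_hit = re.search(strict, "1 2 3 4 Alpha Bravo").group()

print(lenient_hits)  # ['H', ' T']
print(strict_hit)    # 1 2 3 4 A
```

The first pattern fires once per word start, which is where 'HT' comes from; the second stops after the first letter because nothing in it consumes the rest of 'Alpha' or the second word.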

I'd really appreciate any suggestions on how to solve this problem. If this is not easily achievable with a regex, I believe that altering the following function from the Medium article linked above would get the results I'm after. Unfortunately, I'm no Python/Regex expert :).

 def match_regex(self, message):
    extracted = []
    for d in self.regex_feature:
        match = re.search(pattern=d['pattern'], string=message)
        if match:
            entity = {
                "start": match.pos,
                "end": match.endpos,
                "value": match.group(),
                "confidence": 1.0,
                "entity": d['name'],
            }
            extracted.append(entity)
    extracted = self.add_extractor_name(extracted)
    return extracted

I hope this is clear enough.

Thanks!

Jake


Solution

  • You can use 3 groups matching optional spaces between the digits and between the uppercase chars A-Z.

    ([1-9](?:\s*[0-9]){3})\s?([A-Z])[a-z]*\s*([A-Z])[a-z]*
    

    The pattern matches

    • ([1-9](?:\s*[0-9]){3}) Match 4 digits (the first one 1-9) with optional whitespace chars between them
    • \s? Match an optional whitespace char
    • ([A-Z])[a-z]*\s* Match an uppercase char A-Z followed by optional lowercase chars and optional whitespace
    • ([A-Z])[a-z]* Match an uppercase char A-Z followed by optional lowercase chars
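As a minimal sketch of how the three groups can be combined into the compact form from the question title, the helper below (the name `to_postcode` is made up for illustration) strips the whitespace from the digit group and appends the two captured letters. It assumes the speech-to-text output capitalises the phonetic words, as in the example input.

```python
import re

# The three-group pattern from above.
pattern = re.compile(r"([1-9](?:\s*[0-9]){3})\s?([A-Z])[a-z]*\s*([A-Z])[a-z]*")

def to_postcode(text):
    """Return the postcode as e.g. '1234AB', or None if nothing matches."""
    m = pattern.search(text)
    if m is None:
        return None
    digits, first, second = m.groups()
    # digits may contain spaces ("1 2 3 4"); split/join removes them.
    return "".join(digits.split()) + first + second

print(to_postcode("1 2 3 4 Alpha Bravo"))  # 1234AB
print(to_postcode("1234 AB"))              # 1234AB
print(to_postcode("Hello There"))          # None
```

Because the digits are mandatory, input such as 'Hello There' no longer triggers a match.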


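    A slightly stricter option is to match the uppercase char A-Z followed only by upper- or lowercase repetitions of that same char, using an optionally repeated backreference. Note that this variant no longer matches full phonetic words such as "Alpha Bravo", as the first two results in the output below show.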

    \b([1-9](?:\s*[0-9]){3})\s?([A-Z])(?i:\2*)\s*([A-Z])(?i:\3*)\b
    


    import re
    
    pattern = r"\b([1-9](?:\s*[0-9]){3})\s?([A-Z])(?i:\2*)\s*([A-Z])(?i:\3*)\b"
    strings = [
        "1 2 3 4 Alpha Bravo",
        "1234 Alpha Bravo",
        "1234A Bbbbbbbc",
        "1234Aaa Bbb",
        "1234Aa Bbb",
        "1234A BbbbbBbb"
    ]
    
    for s in strings:
        print(re.findall(pattern, s))
    

    Output

    []
    []
    []
    [('1234', 'A', 'B')]
    [('1234', 'A', 'B')]
    [('1234', 'A', 'B')]
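To connect this back to the `match_regex` function in the question, here is a hedged sketch of a standalone variant. The function name `extract_postcode`, the default entity name `"postcode"`, and the choice of the first (more lenient) pattern with a leading `\b` are assumptions for illustration, not part of the Rasa API or the linked article. It uses `match.start()` and `match.end()` for the entity span, because `match.pos` and `match.endpos` report the bounds of the search (by default the whole string) rather than where the match was found.

```python
import re

# First pattern from the answer, with a leading word boundary added.
POSTCODE = re.compile(
    r"\b([1-9](?:\s*[0-9]){3})\s?([A-Z])[a-z]*\s*([A-Z])[a-z]*"
)

def extract_postcode(message, name="postcode"):
    """Return a list of Rasa-style entity dicts for a spoken postcode."""
    extracted = []
    match = POSTCODE.search(message)
    if match:
        digits, first, second = match.groups()
        extracted.append({
            "start": match.start(),  # index where the match begins
            "end": match.end(),      # index just past the match
            # Normalised value, e.g. "1234AB"
            "value": "".join(digits.split()) + first + second,
            "confidence": 1.0,
            "entity": name,
        })
    return extracted

message = "my postcode is 1 2 3 4 Alpha Bravo"
entities = extract_postcode(message)
print(entities[0]["value"])  # 1234AB
```

Inside the extractor class from the article, the same body could replace the loop in `match_regex`, with the pattern taken from `d['pattern']` and the entity name from `d['name']`.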