Using regex to find abbreviations


I am trying to create a regular expression that will identify possible abbreviations within a given string in Python. I am fairly new to regex and am having difficulty creating the expression, though I believe it should be fairly simple. The expression should pick up words that contain two or more capitalised letters. It should also pick up words that contain a dash and report the whole word (both the part before and the part after the dash). If digits are present, they should be reported with the word as well.

As such, it should pick up:

ABC, AbC, ABc, A-ABC, a-ABC, ABC-a, ABC123, ABC-123, 123-ABC.

I have already made the following expression: r'\b(?:[a-z]*[A-Z\-][a-z\d[^\]*]*){2,}'.

However, this also picks up these unwanted words:

A-bc, a-b-c

I believe the problem is that it looks for either multiple capitalised letters or dashes. I want it to return only words that have at least two capitalised letters. I understand that it will also "mistakenly" match words such as "Abc-Abc", but I don't believe there is a way to avoid that.


Solution

  • If lookaheads are supported and you don't want to match a double hyphen (--), you might use:

    \b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b
    

    Explanation

    • \b A word boundary
    • (?= Positive lookahead, assert that from the current location to the right is
      • (?:[a-z\d-]*[A-Z]){2} Match 2 times: optionally repeat the allowed characters (lowercase letters, digits, hyphen), then an uppercase char A-Z
    • ) Close the lookahead
    • [A-Za-z\d]+ match 1+ times the allowed characters without the hyphen
    • (?:-[A-Za-z\d]+)* Optionally repeat - and 1+ times the allowed characters
    • \b A word boundary

    See a regex101 demo.
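As a rough check in Python, the pattern above can be run against the examples from the question with the standard `re` module. This is only a sketch: the names `ABBREVIATION`, `find_abbreviations`, `wanted` and `unwanted` are made up for illustration and do not appear in the original answer.

```python
import re

# Pattern from the answer: the lookahead requires two uppercase letters
# within the run of letters, digits and hyphens that starts here, and the
# rest of the pattern then consumes the hyphen-joined word itself.
ABBREVIATION = re.compile(
    r"\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b"
)


def find_abbreviations(text):
    """Return all candidate abbreviations in text (hypothetical helper)."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return ABBREVIATION.findall(text)


wanted = "ABC, AbC, ABc, A-ABC, a-ABC, ABC-a, ABC123, ABC-123, 123-ABC."
unwanted = "A-bc, a-b-c"

print(find_abbreviations(wanted))
# ['ABC', 'AbC', 'ABc', 'A-ABC', 'a-ABC', 'ABC-a', 'ABC123', 'ABC-123', '123-ABC']
print(find_abbreviations(unwanted))
# []
```

Because every group in the pattern is non-capturing, `findall` returns the full matched words rather than group contents.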

    To also avoid matching when hyphens surround the characters, you can use negative lookarounds that assert there is no hyphen directly to the left or right.

    \b(?<!-)(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b(?!-)
    

    See another regex demo.
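The difference between the two patterns shows up on words with a leading or trailing hyphen. The comparison below is a minimal sketch using Python's `re` module; the names `LOOSE`, `STRICT` and `sample` are illustrative assumptions, not part of the answer.

```python
import re

# First pattern from the answer: word boundaries only.
LOOSE = re.compile(
    r"\b(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b"
)

# Second pattern: additionally asserts that no hyphen sits directly
# before the match (?<!-) or directly after it (?!-).
STRICT = re.compile(
    r"\b(?<!-)(?=(?:[a-z\d-]*[A-Z]){2})[A-Za-z\d]+(?:-[A-Za-z\d]+)*\b(?!-)"
)

# Made-up sample: one word with a leading hyphen, one with a trailing
# hyphen, and one regular hyphenated abbreviation.
sample = "-ABC ABC- A-ABC"

print(LOOSE.findall(sample))   # ['ABC', 'ABC', 'A-ABC']
print(STRICT.findall(sample))  # ['A-ABC']
```

With the first pattern, the stray hyphen is simply left out of the match; with the lookarounds, the whole word is skipped instead.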