
How can I use regex to abbreviate words that all start with a capital letter?


I want to write a Python script that abbreviates words in a string. For example, "I studied in King Saud University, which is in Riyadh, the capital of Saudi Arabia" becomes "I studied in KSU, which is in Riyadh, the capital of SA".

I tried using a lambda with re.sub to scan the whole string, but I couldn't remove the rest of each word after matching its first letter.

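import re
# Named `text` rather than `str` to avoid shadowing the built-in type.
text = "I studied in King Saud University, which is in Riyadh, the capital of Saudi Arabia"

# Matches only the first letter of each capitalized word; the lambda returns it
# unchanged, so the string is not modified.
result = re.sub(r"\b[A-Z]", lambda x: x.group(), text)

print(result)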

Solution

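  • The pattern has to consume each whole run of two or more words that start with an uppercase letter, not just the first letter of each word, so that the replacement callback receives the full run.

    You can use something like: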

    result = re.sub(r"\b[A-Z]\w*(?:\s+[A-Z]\w*)+", lambda x: "".join(c[0] for c in x.group().split()), text)
    

    Full example:

    import re
    text = "I studied in King Saud University, which is in Riyadh, the capital of Saudi Arabia"
    result = re.sub(r"\b[A-Z]\w*(?:\s+[A-Z]\w*)+", lambda x: "".join(c[0] for c in x.group().split()), text)
    print(result)
    # => I studied in KSU, which is in Riyadh, the capital of SA
    

    Pattern details:

    • \b - a word boundary
    • [A-Z] - an uppercase ASCII letter
    • \w* - zero or more word characters (letters, digits, and underscores)
    • (?:\s+[A-Z]\w*)+ - one or more occurrences of
      • \s+ - one or more whitespaces
      • [A-Z]\w* - an uppercase ASCII letter and then zero or more word chars.
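
To see what the breakdown above means in practice, it can help to list the matches before substituting anything. The sketch below compares the question's pattern with the answer's pattern using re.findall; the variable names first_letters and runs are only illustrative.

```python
import re

text = "I studied in King Saud University, which is in Riyadh, the capital of Saudi Arabia"

# The question's pattern matches only the first letter of every capitalized word.
first_letters = re.findall(r"\b[A-Z]", text)
print(first_letters)
# => ['I', 'K', 'S', 'U', 'R', 'S', 'A']

# The answer's pattern consumes whole runs of two or more capitalized words.
runs = re.findall(r"\b[A-Z]\w*(?:\s+[A-Z]\w*)+", text)
print(runs)
# => ['King Saud University', 'Saudi Arabia']
```

Because the group is non-capturing, re.findall returns the full matched runs, which are exactly the strings the replacement lambda receives.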

    The "".join(c[0] for c in x.group().split()) part splits the matched run on whitespace, takes the first character of each chunk, and joins those characters into a single string.
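
As a minimal sketch, the same substitution can be wrapped in a reusable function with a precompiled pattern. The names CAPITALIZED_RUN and abbreviate are hypothetical, not part of the original answer. The second call shows a limitation worth knowing: a capitalized word at the start of a sentence is absorbed into the run that follows it.

```python
import re

# Runs of two or more whitespace-separated words that each start with A-Z.
CAPITALIZED_RUN = re.compile(r"\b[A-Z]\w*(?:\s+[A-Z]\w*)+")

def abbreviate(text):
    """Replace each run of capitalized words with its initials."""
    return CAPITALIZED_RUN.sub(
        lambda m: "".join(word[0] for word in m.group().split()),
        text,
    )

print(abbreviate("She moved from New York City to Riyadh"))
# => She moved from NYC to Riyadh

# Caveat: "In" is capitalized too, so it becomes part of the run.
print(abbreviate("In Saudi Arabia it is hot"))
# => ISA it is hot
```

Single capitalized words such as "She" and "Riyadh" are left alone because the pattern requires at least two words in a run.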

    To support all Unicode uppercase letters, I'd advise using the PyPI regex module (pip install regex):

    import regex
    #...
    result = regex.sub(r"\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*)+", lambda x: "".join(c[0] for c in x.group().split()), text)
    

    where \p{Lu} matches any Unicode uppercase letter and \p{L} matches any Unicode letter.
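
If installing a third-party module is not an option, a rough standard-library alternative is to find words with re and test the first character with str.isupper(), which is Unicode-aware. This is a different technique from the \p{Lu} pattern above, not an equivalent of it: it uses \w+ for words (so digits and underscores count as word characters), and the function name abbreviate_unicode is hypothetical.

```python
import re

def abbreviate_unicode(text):
    """Abbreviate runs of 2+ capitalized words using only the standard library."""
    words = list(re.finditer(r"\w+", text))
    out = []
    pos = 0  # index in text up to which output has already been emitted
    i = 0
    while i < len(words):
        j = i
        if words[i].group()[0].isupper():
            # Extend the run while the next word is capitalized and is
            # separated from the previous one by whitespace only.
            j = i + 1
            while (j < len(words)
                   and words[j].group()[0].isupper()
                   and text[words[j - 1].end():words[j].start()].isspace()):
                j += 1
        if j - i >= 2:
            # Copy the text before the run, then emit the initials.
            out.append(text[pos:words[i].start()])
            out.append("".join(w.group()[0] for w in words[i:j]))
            pos = words[j - 1].end()
            i = j
        else:
            i += 1
    out.append(text[pos:])
    return "".join(out)

print(abbreviate_unicode(
    "I studied in King Saud University, which is in Riyadh, the capital of Saudi Arabia"
))
# => I studied in KSU, which is in Riyadh, the capital of SA
```

The regex module version is shorter and states the intent directly in the pattern, so this sketch is mainly useful when dependencies must stay minimal.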