How can I use regex to abbreviate words that all start with a capital letter?

I want to write a Python script that abbreviates runs of capitalized words in a string. For example, "I studied in King Saud University, which is in Riyadh, the capital of Saudi Arabia" becomes "I studied in KSU, which is in Riyadh, the capital of SA".

I tried using re.sub with a lambda to scan the whole string, but I couldn't remove the rest of each word after finding its first letter.

import re
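text = "I studied in King Saud University, which is in Riyadh, the capital of Saudi Arabia"  # renamed from str to avoid shadowing the built-in

# Attempt: matches only the first capital letter of each word and returns it unchanged
result = re.sub(r"\b[A-Z]", lambda x: x.group(), text)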

print(result)

Solution

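Your pattern matches only a single capital letter, and the lambda puts that same letter back, so the string never changes. You need to actually consume the whole run of two or more words that each start with an uppercase letter.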
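To see why the original attempt leaves the string untouched, here is a small check using the pattern and sample sentence from the question (the variable names `single_letters` and `unchanged` are only illustrative):

```python
import re

text = "I studied in King Saud University, which is in Riyadh, the capital of Saudi Arabia"

# The original pattern matches one capital letter at a time, never the rest of the word.
single_letters = re.findall(r"\b[A-Z]", text)
print(single_letters)

# The callback returns that same letter, so the substitution is a no-op.
unchanged = re.sub(r"\b[A-Z]", lambda m: m.group(), text)
print(unchanged == text)
```

Since the rest of each word is never part of the match, re.sub has nothing to remove.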

You can use something like

result = re.sub(r"\b[A-Z]\w*(?:\s+[A-Z]\w*)+", lambda x: "".join(c[0] for c in x.group().split()), text)

See the Python demo:

import re
text = "I studied in King Saud University, which is in Riyadh, the capital of Saudi Arabia"
result = re.sub(r"\b[A-Z]\w*(?:\s+[A-Z]\w*)+", lambda x: "".join(c[0] for c in x.group().split()), text)
print(result)
# => I studied in KSU, which is in Riyadh, the capital of SA

Pattern details:

\b - a word boundary
[A-Z] - an uppercase ASCII letter
\w* - zero or more word chars
(?:\s+[A-Z]\w*)+ - one or more occurrences of
- \s+ - one or more whitespaces
- [A-Z]\w* - an uppercase ASCII letter and then zero or more word chars.
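As a quick illustration of the breakdown above, re.findall shows which spans the pattern consumes in the sample sentence (the names `pattern` and `matches` are just for this sketch):

```python
import re

text = "I studied in King Saud University, which is in Riyadh, the capital of Saudi Arabia"

# Two or more whitespace-separated words, each starting with an ASCII capital.
pattern = re.compile(r"\b[A-Z]\w*(?:\s+[A-Z]\w*)+")

# The only group is non-capturing, so findall returns the full matches.
matches = pattern.findall(text)
print(matches)
```

Single capitalized words such as "I" and "Riyadh" are left alone because the trailing group must match at least once, and the comma after "University" ends that run.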

The "".join(c[0] for c in x.group().split()) part takes the first character of each whitespace-separated chunk in the match value and joins them into a single string.
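If the lambda is hard to read, the same callback can be written as a named function. This is only a sketch: the function name `initials` and the sample string are made up for illustration.

```python
import re

def initials(match):
    # split() with no arguments splits on any run of whitespace,
    # so double spaces and newlines between words are handled too.
    words = match.group().split()
    return "".join(word[0] for word in words)

sample = "flew to New  York\nCity"
result = re.sub(r"\b[A-Z]\w*(?:\s+[A-Z]\w*)+", initials, sample)
print(result)
```

re.sub calls `initials` once per match and replaces the whole matched run, whitespace included, with the returned string.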

To support all Unicode uppercase letters, I'd advise using the PyPI regex module:

import regex
#...
result = regex.sub(r"\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*)+", lambda x: "".join(c[0] for c in x.group().split()), text)

where \p{Lu} matches any Unicode uppercase letter and \p{L} matches any Unicode letter.
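If installing a third-party package is not an option, a rough standard-library workaround is to build the uppercase character class yourself from unicodedata. This is a different technique from the regex-module pattern above, not an equivalent: it keeps \w* for the rest of each word (so digits and underscores are allowed, unlike \p{L}*), and the names `UPPER`, `pattern` and the sample sentence are assumptions for illustration.

```python
import re
import sys
import unicodedata

# Collect every code point whose Unicode category is Lu (uppercase letter).
# Assumption: scanning the full range once at import time is acceptable.
UPPER = "".join(
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Lu"
)

# Same shape as the ASCII pattern, with [A-Z] swapped for the generated class.
pattern = re.compile(rf"\b[{UPPER}]\w*(?:\s+[{UPPER}]\w*)+")

# Russian for "I studied at Moscow State University".
text = "Я учился в Московском Государственном Университете"
result = pattern.sub(lambda m: "".join(w[0] for w in m.group().split()), text)
print(result)
```

The generated class holds well over a thousand characters, so compile the pattern once and reuse it rather than rebuilding it per call.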