Tags: python, python-3.x, regex, regex-group

Regex to add comma after abbreviations


I want to add a comma, followed by a space, after abbreviations. An abbreviation is defined as one or more letters followed by a dot, with that unit repeated two or more times. For example, these are considered abbreviations: A.b.C. a.b. ab.cd. ab.cde. ab.cd.ef.gh. These are not abbreviations: a.b and A. B

I don't want to add a comma:

  • if the last dot of the abbreviation is the end of the given text,
  • if after the abbreviation there is optional space and a capital letter, or
  • if after the abbreviation there is optional space and another punctuation symbol.

Given the following test sentence:

test_str = """This is an example e.g. sentence and this is with i.e. text and two abbreviations S.T.R. and K.LM.NO.P. as example with acronym.
            but in here it shouldn't catch it because after that there is space and dot g.k. . Also here it shouldn't detect because the next sentence starts with capital A.BC.D.
            And this is a normal sentence. Followed by another normal sentence. This contains only one letter A. and is not abbreviation.
            This shouldn't match i.e., since it contains already a comma. I like to read books such as e.g. book 1 or i.e. book2.
            A.B.C. is an abbreviation that should match. A.B.! is an abbreviation that shouldn't match because it has ! after the abbreviation. 
            A.B.? is an abbreviation that shouldn't match because it has ? after the abbreviation. 
            A.B. ; is an abbreviation that shouldn't match because it has a space and ; after the abbreviation.
            a.b.c.d. is an abbreviation that should match.
            a.b.c., is an abbreviation that shouldn't match because it already has a comma. A.B is not an abbreviation because it contains only one dot.
            Another abbreviation that should not match j.j.L.o.U.h."""

I want the output to be the following:

output_text = """This is an example e.g., sentence and this is with i.e., text and two abbreviations S.T.R., and K.LM.NO.P., as example with acronym.
            but in here it shouldn't catch it because after that there is space and dot g.k. . Also here it shouldn't detect because the next sentence starts with capital A.BC.D.
            And this is a normal sentence. Followed by another normal sentence. This contains only one letter A. and is not abbreviation.
            This shouldn't match i.e., since it contains already a comma. I like to read books such as e.g., book 1 or i.e., book2.
            A.B.C., is an abbreviation that should match. A.B.! is an abbreviation that shouldn't match because it has ! after the abbreviation. 
            A.B.? is an abbreviation that shouldn't match because it has ? after the abbreviation. 
            A.B. ; is an abbreviation that shouldn't match because it has a space and ; after the abbreviation.
            a.b.c.d., is an abbreviation that should match.
            a.b.c., is an abbreviation that shouldn't match because it already has a comma. A.B is not an abbreviation because it contains only one dot.
            Another abbreviation that should not match j.j.L.o.U.h."""

What I use right now is the following:

regex = r"(\b(?:[A-Za-z]\.){2,}(?!\s*[,.;?!-]))"

but it produces the following output:

This is an example e.g., sentence and this is with i.e., text and two abbreviations S.T.R., and K.LM.NO.P. as example with acronym. but in here it shouldn't catch it because after that there is space and dot g.k. . Also here it shouldn't detect because the next sentence starts with capital A.BC.D. And this is a normal sentence. Followed by another normal sentence. This contains only one letter A. and is not abbreviation. This shouldn't match i.e., since it contains already a comma. I like to read books such as e.g., book 1 or i.e., book2. A.B.C., is an abbreviation that should match. A.B.! is an abbreviation that shouldn't match because it has ! after the abbreviation. A.B.? is an abbreviation that shouldn't match because it has ? after the abbreviation. A.B. ; is an abbreviation that shouldn't match because it has a space and ; after the abbreviation. a.b.c.d., is an abbreviation that should match. a.b., c., is an abbreviation that shouldn't match because it already has a comma. A.B is not an abbreviation because it contains only one dot.
Another abbreviation that should not match j.j.L.o.U.h.,

My regex fails in three cases. The results should be K.LM.NO.P., (it is an abbreviation and should get a comma, but it is not detected), a.b.c., (it is already followed by a punctuation symbol and should stay unchanged, but a comma is inserted after a.b.) and j.j.L.o.U.h. (its last dot is the end of the given text, so it should stay unchanged, but a comma is appended).
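The three failures can be reproduced in isolation with a short sketch. The question does not show the substitution call, so using re.sub with the replacement r"\1," is an assumption, as is the helper name attempt; the sample strings are fragments cut from the test sentence.

```python
import re

# Pattern from the question, unchanged.
regex = r"(\b(?:[A-Za-z]\.){2,}(?!\s*[,.;?!-]))"

def attempt(text):
    # Assumed replacement: append a comma to whatever the pattern captured.
    return re.sub(regex, r"\1,", text)

samples = [
    "K.LM.NO.P. as example",          # multi-letter segments: not detected
    "a.b.c., is an abbreviation",     # backtracks to a.b. and adds a comma
    "should not match j.j.L.o.U.h.",  # end of text: comma is still added
]
for s in samples:
    print(repr(attempt(s)))
```

The likely causes are visible in the pattern itself: [A-Za-z] allows only one letter per segment, so K.LM.NO.P. never reaches two repetitions; on a.b.c., the quantifier backtracks to a.b., where the next character is c and the negative lookahead passes; and at the end of the text nothing follows, so the negative lookahead also passes.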

Is there a way to achieve this? Any help is greatly appreciated!


Solution

  • You can match using this regex:

    (?<=\.[a-zA-Z])\.(?=\s[a-z])
    

    and replace each match with the two-character string ., (a dot followed by a comma).


    RegEx Details:

    • (?<=\.[a-zA-Z]): Assert that we have a dot and a letter before matching a dot
    • \.: Match a dot
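    • (?=\s[a-z]): Assert that after matching a dot we have one whitespace character and a lowercase letter, which rules out a following capital letter, a punctuation symbol, or the end of the text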
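In Python, the match-and-replace step described above could be applied with re.sub, roughly as follows. This is a minimal sketch; the names ABBREVIATION_END and add_commas are hypothetical and not part of the answer.

```python
import re

# Regex from the answer: a dot preceded by ".<letter>" and followed by
# one whitespace character and a lowercase letter.
ABBREVIATION_END = re.compile(r"(?<=\.[a-zA-Z])\.(?=\s[a-z])")

def add_commas(text: str) -> str:
    """Replace each matched dot with '.,' (hypothetical helper name)."""
    return ABBREVIATION_END.sub(".,", text)

print(add_commas("such as e.g. book 1 or i.e. book2."))
# prints: such as e.g., book 1 or i.e., book2.
```

Two limits of this pattern are worth keeping in mind. The lookbehind accepts only a single letter before the final dot, so an abbreviation whose last segment has several letters, such as ab.cd., is left unchanged. The lookahead also requires exactly one whitespace character, so an abbreviation followed directly by a lowercase letter gets no comma. Neither case occurs in the test sentence from the question.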