
Split a string by punctuation marks (.!?;:) while excluding abbreviations


I'd like to create a function that splits a string containing multiple sentences at terminal punctuation marks, but does not split after abbreviations such as "Univ." and "Dept.". The test cases below show the intended behavior. I have seen the post "Split string with "." (dot) while handling abbreviations", but its answer removes the non-terminal dots (U.S.A. becomes USA), and I want to keep the dots in place.

This is my function:

import re


def split_string_by_punctuation(line: str) -> list[str]:
    """
    Splits a given string into a list of strings using terminal punctuation marks (., !, ?, ; or :) as delimiters.

    This function utilizes regular expression patterns to ensure that abbreviations, honorifics,
    and certain special cases are not considered as sentence delimiters.

    Args:
        line (str): The input string to be split into sentences.

    Returns:
        list: A list of strings representing the sentences obtained after splitting the input string.

    Notes:
        - Negative lookbehind is used to exclude abbreviations (e.g., "e.g.", "i.e.", "U.S.A."),
          which might have a period but are not the end of a sentence.
        - Negative lookbehind is also used to exclude honorifics (e.g., "Mr.", "Mrs.", "Dr.")
          that might have a period but are not the end of a sentence.
        - Negative lookbehind is also used to exclude some abbreviations (e.g., "Dept.", "Univ.", "et al.")
          that might have a period but are not the end of a sentence.
        - Positive lookbehind is used to match a whitespace character following a terminal
          punctuation mark (., !, ?, ; or :).
    """
    punct_regex = re.compile(r"(?<=[.!?;:])(?:(?<!Prof\.)|(?<!Dept\.)|(?<!Univ\.)|(?<!et\sal\.))(?<!\w\.\w.)(?<![A-Z][a-z]\.)\s")


    return re.split(punct_regex, line)

And these are my test cases:

class TestSplitStringByPunctuation(object):
    def test_split_string_by_punctuation_1(self):
        # Test case 1
        text1 = "I am studying at Univ. of California, Dept. of Computer Science. The research team includes " \
                "Prof. Smith, Dr. Johnson, and Ms. Adams et al. so we are working on a new project."
        result1 = split_string_by_punctuation(text1)
        assert result1 == ['I am studying at Univ. of California, Dept. of Computer Science.',
                           'The research team includes Prof. Smith, Dr. Johnson, and Ms. Adams et al. '
                           'so we are working on a new project.'], "Test case 1 failed"

    def test_split_string_by_punctuation_2(self):
        # Test case 2
        text2 = "This is a city in U.S.A.. This is i.e. one! What about this e.g. one? " \
                "Finally, here's the last one:"
        result2 = split_string_by_punctuation(text2)
        assert result2 == ['This is a city in U.S.A..', 'This is i.e. one!', 'What about this e.g. one?',
                           "Finally, here's the last one:"], "Test case 2 failed"

    def test_split_string_by_punctuation_3(self):
        # Test case 3
        text3 = "This sentence contains no punctuation marks from Mr. Zhong, Dr. Lu and Mrs. Han It should return as a single element list"
        result3 = split_string_by_punctuation(text3)
        assert result3 == [
            'This sentence contains no punctuation marks from Mr. Zhong, Dr. Lu and Mrs. Han It should return '
            'as a single element list'], "Test case 3 failed"

The tests fail. For example, test case 1 returns ['I am studying at Univ.', 'of California, Dept.', 'of Computer Science.', 'The research team includes Prof.', 'Smith, Dr. Johnson, and Ms. Adams et al.', 'so we are working on a new project.'], so the string is wrongly split after "Univ.", "Dept.", "Prof." and "et al.".
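
A likely cause of these splits, shown here as a minimal sketch rather than a full fix: the four abbreviation lookbehinds are joined with |, so the group succeeds as soon as any one of them succeeds. Since the text before a position cannot end in "Prof." and "Dept." at the same time, at least one branch always succeeds and nothing is excluded. Writing the lookbehinds one after another requires all of them to hold. The names either, neither and sample are illustrative only, and the patterns are cut down to two abbreviations.

```python
import re

# Alternation of negative lookbehinds: one branch always succeeds,
# so this behaves like a plain \s and excludes nothing.
either = re.compile(r"(?:(?<!Prof\.)|(?<!Dept\.))\s")

# Chained negative lookbehinds: every one of them must succeed.
neither = re.compile(r"(?<!Prof\.)(?<!Dept\.)\s")

sample = "Prof. Smith of Dept. X"

either_parts = either.split(sample)    # splits at every space
neither_parts = neither.split(sample)  # keeps "Prof. Smith" and "Dept. X" together

print(either_parts)
print(neither_parts)
```

This only illustrates the lookbehind logic; it still relies on a fixed list of abbreviations.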


Solution

  • I would suggest using findall to capture sentences instead of split to identify sentence breaks.

    Some other remarks:

    • There is no benefit in passing a compiled regex object to re.split (or any other module-level re function): the compiled object just goes through an extra function call. Call the method on the regex object instead, like punct_regex.split(line). And as this regex is built inside the function for a single call, you might as well skip re.compile altogether: the module-level functions compile the pattern on demand and cache recently used patterns.

    • Listing all possible abbreviations is a never-ending task! Unless you are sure you have caught them all, I would suggest a heuristic: if a point is not followed by white space and a capital letter, the preceding word is an abbreviation. If a word starts with a capital letter, has at most four letters, and is followed by a point, it is also treated as an abbreviation. In all other cases the point is interpreted as ending a sentence.

    • There were some errors in your test cases.
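
To illustrate the remark about re.compile in the list above, here is a small sketch. The pattern is a simplified stand-in (not the one from the question), and line, via_method and via_module are made-up names. It shows that calling split on the compiled object and calling re.split with the pattern string give the same result.

```python
import re

# Simplified, hypothetical pattern: whitespace preceded by terminal punctuation.
pattern_text = r"(?<=[.!?;:])\s"
line = "First one. Second one! Third one?"

# Option 1: compile once, then call the method on the compiled object.
punct_regex = re.compile(pattern_text)
via_method = punct_regex.split(line)

# Option 2: skip re.compile and let the module-level function handle it.
via_module = re.split(pattern_text, line)

print(via_method)
print(via_module)
```

Compiling up front mainly pays off when the same pattern object is reused many times, for instance when it is defined once at module level.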

    After fixing the test cases, this function passed the tests:

    import re

    def split_string_by_punctuation(line):
        punct_regex = r"(?=\S)(?:[A-Z][a-z]{0,3}\.|[^.?!;:]|\.(?!\s+[A-Z]))*.?"
        return re.findall(punct_regex, line)
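
As a quick usage sketch, the function above can be run on the text of test case 2 from the question. The function is repeated so the snippet runs on its own; text and sentences are illustrative names.

```python
import re

def split_string_by_punctuation(line):
    punct_regex = r"(?=\S)(?:[A-Z][a-z]{0,3}\.|[^.?!;:]|\.(?!\s+[A-Z]))*.?"
    return re.findall(punct_regex, line)

# Input taken from test case 2 in the question.
text = ("This is a city in U.S.A.. This is i.e. one! What about this e.g. one? "
        "Finally, here's the last one:")

sentences = split_string_by_punctuation(text)
for sentence in sentences:
    print(repr(sentence))
```

Because findall collects the sentences themselves, the separating white space is simply skipped and does not appear in the results.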
    

    Explanation:

    • (?=\S): assert that the first character of any match is not white space
    • (?: | | )*: a non-capturing group with three alternate patterns. This can repeat 0 or more times.
    • [A-Z][a-z]{0,3}\.: one of the alternatives: a capital followed by at most three lower case letters and then a point.
    • [^.?!;:]: one of the alternatives: a character that is not one of .?!;:.
    • \.(?!\s+[A-Z]): a point that is not followed by white space and a capital letter.
    • .?: any character -- if there is still one. If there is one, we know it is one of .?!;: (otherwise the second alternative above would still have been used). If not, we are at the end of the input.
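
The alternatives listed above can also be tried in isolation. This is a sketch with made-up sample strings; abbrev, inner_point and full_pattern are illustrative names. The last call shows a consequence of the heuristic: a short capitalized word at the end of a sentence (here "Bob.") is taken for an abbreviation, so no split happens there.

```python
import re

abbrev = re.compile(r"[A-Z][a-z]{0,3}\.")    # first alternative
inner_point = re.compile(r"\.(?!\s+[A-Z])")  # third alternative
full_pattern = r"(?=\S)(?:[A-Z][a-z]{0,3}\.|[^.?!;:]|\.(?!\s+[A-Z]))*.?"

# Capitalized words of at most four letters followed by a point.
abbrev_hits = abbrev.findall("Dr. Lu met Prof. Smith at Univ. today.")

# Indexes of points that are NOT followed by white space and a capital.
point_indexes = [m.start() for m in inner_point.finditer("e.g. one. Next")]

# "Bob." looks like an abbreviation to the first alternative.
short_name = re.findall(full_pattern, "I saw Bob. He left.")

print(abbrev_hits)
print(point_indexes)
print(short_name)
```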

    NB: a non-capturing group still matches text, it just cannot be referenced with a back reference. The word "capture" refers to creating a group for it, not to "matching".
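
The difference matters for findall in particular: when a pattern contains a capturing group, findall returns what the group captured instead of the whole match. A minimal sketch with a made-up pattern and input:

```python
import re

text = "One. Two!"

# Non-capturing group: findall returns the full matches.
non_capturing = re.findall(r"(?:\w)+[.!]", text)

# Capturing group: findall returns only the group's last capture per match.
capturing = re.findall(r"(\w)+[.!]", text)

print(non_capturing)
print(capturing)
```

With a capturing group in the sentence regex, the function would return fragments like these rather than whole sentences.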