pseudocode text-parsing sentence

Best way to split text into sentences avoiding acronym clashes


Given the following phrase

Ms. Mary got to know her husband Mr. Dave in her trip to U.S.A. and it was cool. Did you know Dave worked for Microsoft? Well he did. He was even part of Internet Explorer devs.

What is the best "pseudo-code" way to split it into sentences? Python or any other similar language is also fine because of its pseudo-code resemblance.

What I've thought of is replacing every occurrence of " a-zA-Z." (notice the space), ".a-zA-Z" and ".a-zA-Z." with its equivalent without the dot, so for example

" a."
" b."
" c."
" d."
" e."
" f."
...

and

".a."
".b."
".c."
".d."
".e."
".f."
...

and

" ab."
" ac."
" ad."
...
" ba."
" bc."
" bd."
...

The phrase should be nicely converted to the following

Ms Mary got to know her husband Mr Dave in her trip to USA and it was cool. Did you know Dave worked for Microsoft? Well he did. He was even part of Internet Explorer devs.

...or is my logic flawed somewhere?

To head off "what's your question?" comments: I need to know the best way to split the example text into correct sentences while avoiding clashes with acronyms.

The answer can be in pseudo-code, Python, or another language that reads like pseudo-code. I want it to be language-agnostic so that anyone can implement it, regardless of the language they use.


Solution

  • All acronyms in the example are of the pattern Uppercase . or Uppercase lowercase .; none of the other -- regular -- occurrences of the full stop match this particular pattern.

    So a simple RegEx can be used to remove those full stops. What's left after that can be split on the regular punctuation marks .!?. In JavaScript:

    str2 = str.replace(/([A-Z][a-z]?)\./g, '$1');
    

    or, using a GREP flavor that understands shorthand character classes such as \u (uppercase) and \l (lowercase):

    str2 = str.replace(/(\u\l?)\./g, '$1');
    

    This results directly in the output as shown.
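
    Since the question also allows Python, here is a minimal sketch of the same two steps (strip the acronym full stops, then split on .!?). The names `split_sentences` and `SAMPLE` are made up for this illustration, and splitting on whitespace that follows a punctuation mark is an assumption about what "split on the regular punctuation marks" means in practice:

```python
import re

SAMPLE = (
    "Ms. Mary got to know her husband Mr. Dave in her trip to U.S.A. and it "
    "was cool. Did you know Dave worked for Microsoft? Well he did. He was "
    "even part of Internet Explorer devs."
)


def split_sentences(text):
    """Strip acronym full stops, then split on sentence punctuation."""
    # Step 1: drop the full stop after "X" or "Xy" (uppercase letter,
    # optionally followed by one lowercase letter), as in the regex above.
    stripped = re.sub(r"([A-Z][a-z]?)\.", r"\1", text)
    # Step 2 (assumption): a sentence ends at '.', '!' or '?' followed by
    # whitespace; the lookbehind keeps the mark attached to its sentence.
    return [s for s in re.split(r"(?<=[.!?])\s+", stripped) if s]


for sentence in split_sentences(SAMPLE):
    print(sentence)
```

    On the sample text this prints four sentences, the first one reading "Ms Mary got to know her husband Mr Dave in her trip to USA and it was cool."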

    Using a RegEx is straightforward (and easily expanded!), but the same pattern can be tested in other languages as well. In C, you can copy input to output and test only when seeing the . character:

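    #include <ctype.h>
    #include <stdio.h>

    int main (void)
    {
        char input[] = "Ms. Mary got to know her husband Mr. Dave in her trip to "
           "U.S.A. and it was cool. Did you know Dave worked for Microsoft? Well "
           "he did. He was even part of Internet Explorer devs.";
        char output[256], *readptr, *writeptr;

        printf ("in: %s\n", input);

        readptr = input;
        writeptr = output;
        while (*readptr)
        {
            if (*readptr == '.')
            {
                /* Drop a full stop that follows "X" or "Xy" (uppercase letter,
                   optionally followed by one lowercase letter). The casts keep
                   isupper/islower well defined for chars with negative values. */
                if ((readptr > input && isupper((unsigned char)readptr[-1])) ||
                    (readptr > input+1 && isupper((unsigned char)readptr[-2]) &&
                     islower((unsigned char)readptr[-1])))
                {
                    readptr++;
                    continue;
                }
            }
            /* Every other character is copied unchanged. */
            *writeptr = *readptr;
            readptr++;
            writeptr++;
        }

        /* Terminate the output string. */
        *writeptr = 0;
        printf ("out: %s\n", output);

        return 0;
    }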
    

    These solutions remove full stops from the source text. If you want to keep them, you can replace them with a placeholder (for example, a character that does not normally occur in the source text), or do the reverse: when splitting on sentences, test to see whether or not a full stop is a valid breaking point.
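
    The placeholder variant could look roughly like the Python sketch below. `PLACEHOLDER` and `split_keeping_stops` are hypothetical names, and the choice of the NUL character relies on the assumption that it never occurs in the source text:

```python
import re

# Assumption: this character never appears in the text being split.
PLACEHOLDER = "\x00"


def split_keeping_stops(text):
    """Split into sentences while preserving the full stops of acronyms."""
    # Hide the full stop after "X" or "Xy" behind the placeholder.
    protected = re.sub(
        r"([A-Z][a-z]?)\.",
        lambda m: m.group(1) + PLACEHOLDER,
        text,
    )
    # Split after '.', '!' or '?' when whitespace follows.
    parts = re.split(r"(?<=[.!?])\s+", protected)
    # Put the original full stops back in every sentence.
    return [p.replace(PLACEHOLDER, ".") for p in parts if p]


text = (
    "Ms. Mary got to know her husband Mr. Dave in her trip to U.S.A. and it "
    "was cool. Did you know Dave worked for Microsoft? Well he did. He was "
    "even part of Internet Explorer devs."
)
for sentence in split_keeping_stops(text):
    print(sentence)
```

    The first sentence comes back with "Ms.", "Mr." and "U.S.A." intact, because the placeholder is only swapped back after the split has happened.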


    Afterthought: it does work on the original sample sentence... but it does not on the one in the comments:

    I made a trip to the U.S.A. It was cool.I liked it very much.
    

    where you get the output

    I made a trip to the USA It was cool.I liked it very much.
    

    This requires checking for more possible scenarios:

    1. common abbreviations, such as Ms. and Mr.: \u\l\.
    2. in-sentence acronyms; "U.S.A." followed by a lowercase: (\u\.)+ (?=\l), where the full stop needs removing;
    3. end-of-sentence acronyms; "U.S.A." followed by an uppercase: (\u\.)+ (?=\u), where the last full stop should remain.
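
    The three cases above can be sketched in Python as follows. This is only one possible reading of the rules, not a complete sentence splitter: `normalize` and `split_sentences_v2` are hypothetical names, treating the end of the text like "followed by an uppercase" is an added assumption, and rule 1 would still misfire on a sentence that ends in a two-letter capitalized word such as "It.".

```python
import re


def _fix_acronym(match):
    """Rules 2 and 3: drop the stops inside a dotted acronym such as U.S.A."""
    letters = match.group(0).replace(".", "")
    following = match.string[match.end():].lstrip()
    # Keep one final stop when an uppercase letter (or, by assumption,
    # the end of the text) follows, i.e. when the acronym ends a sentence.
    ends_sentence = not following or following[0].isupper()
    return letters + ("." if ends_sentence else "")


def normalize(text):
    # Rules 2 and 3: one or more "uppercase letter + full stop" in a row.
    text = re.sub(r"\b(?:[A-Z]\.)+", _fix_acronym, text)
    # Rule 1: two-letter abbreviations such as "Ms." and "Mr.".
    return re.sub(r"\b([A-Z][a-z])\.", r"\1", text)


def split_sentences_v2(text):
    # Take runs of text up to and including the next '.', '!' or '?',
    # so "cool.I liked" is split even without a space after the stop.
    pieces = re.findall(r"[^.!?]+[.!?]?", normalize(text))
    return [p.strip() for p in pieces if p.strip()]


print(split_sentences_v2(
    "I made a trip to the U.S.A. It was cool.I liked it very much."
))
# prints ['I made a trip to the USA.', 'It was cool.', 'I liked it very much.']
```

    The uppercase-versus-lowercase test remains a heuristic: an acronym followed by a proper noun in the middle of a sentence (for example "U.S.A. Dave") would be treated as a sentence end.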