c# regex split prefix

Regex Split a Paragraph into Sentences but Skip Prefix Titles


I need to split the following paragraph into sentences, but without splitting at prefix titles such as Mr., Mrs., and Ms.

string text = "Joffrey died on March 25, 1988 of AIDS at the age of 57 in New York City, New York. He is buried at Cathedral of Saint John the Divine. Mr.Joffrey was inducted into the National Museum of Dance's Mr. & Mrs. Cornelius Vanderbilt Whitney Hall of Fame in 2000.";

A normal regex such as @"(?<=[\.!\?])\s+" splits the sentences successfully, but it also splits after titles such as Mr. and Mrs. (for example, inside "Mr. & Mrs. Cornelius Vanderbilt Whitney"), which is what I want to avoid.

A regex that handles this case would be very helpful :)


Solution

  • This is simple enough using negative lookbehinds:

    Split on the following regex:

    (?<!Mr?s?)\.\s*

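    This matches periods that are not immediately preceded by Mr, Ms, or Mrs (as written, Mr?s? also matches a bare M), together with any whitespace that follows. Because the period and the whitespace are part of the match, the split removes them from the resulting sentences. Note that it splits on periods only, not on ! or ?.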

    If you want to ignore initials as well, you can use this:

    (?<!Mr?s?|\b[A-Z])\.\s*

    This additionally skips any period preceded by a single uppercase letter, such as the initial in "John F. Kennedy".
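
As a rough illustration, the first pattern above could be applied in C# with Regex.Split as sketched below. The class and method names (SentenceSplitter, SplitWithAnswerPattern, SplitKeepingPunctuation) are made up for this example. The second pattern is not from the answer: it is an assumed variant that merges the lookbehind idea with the regex from the question, so that !, ? and the terminal punctuation are kept.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SentenceSplitter
{
    // Pattern from the answer: split on a period (plus trailing whitespace)
    // unless the period directly follows M, Mr, Ms, or Mrs.
    private const string AnswerPattern = @"(?<!Mr?s?)\.\s*";

    // Hypothetical variant (an assumption, not from the answer): split on
    // whitespace that follows '.', '!' or '?', unless that punctuation is
    // the period of the title Mr., Mrs. or Ms.
    private const string KeepPunctuationPattern =
        @"(?<=[.!?])(?<!\b(?:Mr|Mrs|Ms)\.)\s+";

    public static string[] SplitWithAnswerPattern(string text) =>
        Regex.Split(text, AnswerPattern)
             // The final period of the text produces an empty trailing element.
             .Where(s => s.Length > 0)
             .ToArray();

    public static string[] SplitKeepingPunctuation(string text) =>
        Regex.Split(text, KeepPunctuationPattern);
}
```

Calling either method on the sample paragraph from the question should give three sentences, with "Mr.Joffrey" and "Mr. & Mrs. Cornelius Vanderbilt Whitney" left intact. The first method drops each sentence's closing period, while the variant keeps it. Both rely on variable-length lookbehind, which .NET supports but some other regex engines do not.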