c# regex text split

Regex split on non-alphanumeric characters, keeping apostrophe contractions as whole words


I am trying to split a string using Regex in C#. I want to split on all non-alphanumeric characters, but a word containing an apostrophe should be kept whole when the apostrophe is part of a contraction such as 'd, 's, or 't.
An example should clarify what I would like to achieve. Given a sentence such as:

"Steve's dog is mine 'not yours' I know you'd like'it"

I would like to obtain the following tokens:

steve's, dog, is, mine, not, yours, i, know, you'd, like, it

At the moment I am using:

Regex.Split(str.ToLower(), @"[^a-zA-Z0-9_']").Where(s => s != String.Empty).ToArray<string>();

It returns:

steve's, dog, is, mine, 'not, yours', i, know, you'd, like'it

Solution

  • Here is a half-regex-half-LINQ solution:

    string s = "Steve's dog is mine 'not yours' I know you'd like'it";
    string[] result = Regex.Matches(s, "\\w+('(s|d|t|ve|m))?")
        .Cast<Match>().Select(x => x.Value).ToArray();
    

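    Instead of matching the separators you want to split by, I match the tokens you want to keep: a run of word characters, optionally followed by an apostrophe and one of the suffixes s, d, t, ve, or m. Then I select each match's Value and collect the values into an array. Note that this does not lowercase the input; call ToLower() on s first if you want lowercase tokens as in your expected output.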
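
The match-based approach above can be wrapped into a small helper. This is only a sketch under a few assumptions that are not part of the original answer: the class name `ContractionTokenizer` and method `Tokenize` are hypothetical, the input is lowercased to match the tokens requested in the question, and a trailing `\b` is added so that a suffix letter only counts when it ends the word (otherwise something like `like'top` would yield `like't`).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; the name is an assumption made for illustration.
public static class ContractionTokenizer
{
    // A run of word characters, optionally followed by an apostrophe and a
    // known contraction suffix. The \b after the suffix is an added
    // assumption: the suffix must end the word to be kept.
    private static readonly Regex Token =
        new Regex(@"\w+(?:'(?:s|d|t|ve|m)\b)?");

    public static string[] Tokenize(string text)
    {
        // Lowercase first, as the question does with ToLower().
        return Token.Matches(text.ToLowerInvariant())
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }

    public static void Main()
    {
        string s = "Steve's dog is mine 'not yours' I know you'd like'it";
        Console.WriteLine(string.Join(", ", Tokenize(s)));
        // prints: steve's, dog, is, mine, not, yours, i, know, you'd, like, it
    }
}
```

The suffix list is the one from the answer; contractions such as 're or 'll would need to be added to the alternation if they should be kept whole as well.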