I am trying to split a string using Regex in C#. I want to split on all non-alphanumeric characters, but I would like to treat a word containing an apostrophe as a single token when the apostrophe belongs to a contraction such as 'd, 's, or 't.
An example should clarify what I would like to achieve. Given a sentence such as:
"Steve's dog is mine 'not yours' I know you'd like'it"
I would like to obtain the following tokens:
steve's, dog, is, mine, not, yours, i, know, you'd, like, it
At the moment I am using:
Regex.Split(str.ToLower(), @"[^a-zA-Z0-9_']").Where(s => s != String.Empty).ToArray<string>();
It returns:
steve's, dog, is, mine, 'not, yours', i, know, you'd, like'it
Here is a half-regex-half-LINQ solution:
string s = "Steve's dog is mine 'not yours' I know you'd like'it";
string[] result = Regex.Matches(s, "\\w+('(s|d|t|ve|m))?")
.Cast<Match>().Select(x => x.Value).ToArray();
I match the tokens you want to keep, instead of the separators you want to split on. Then I Select the Value of each match and turn the results into an array.
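For reference, here is a minimal, self-contained sketch of the same match-based approach that also lower-cases the input so the tokens match the ones listed in the question. The class name ContractionTokenizer and the method name Tokenize are made up for illustration. The trailing \b is my own addition, not part of the pattern above: it is meant to keep a suffix from being accepted when it runs straight into more letters (for example, an input like "can'tell" would otherwise yield "can't" and "ell").

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class; the names are illustrative only.
public static class ContractionTokenizer
{
    // One or more word characters, optionally followed by an apostrophe and
    // one of the listed contraction suffixes. The \b (an assumption added
    // here) requires the suffix to end at a word boundary.
    private static readonly Regex TokenPattern =
        new Regex(@"\w+(?:'(?:s|d|t|ve|m)\b)?");

    public static string[] Tokenize(string input)
    {
        // Lower-case first so the tokens match the desired output.
        return TokenPattern.Matches(input.ToLowerInvariant())
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }

    public static void Main()
    {
        string s = "Steve's dog is mine 'not yours' I know you'd like'it";
        // Prints: steve's, dog, is, mine, not, yours, i, know, you'd, like, it
        Console.WriteLine(string.Join(", ", Tokenize(s)));
    }
}
```

Note that \w in .NET also matches underscores and non-ASCII letters and digits, so if you need strictly ASCII alphanumerics you would probably want [a-z0-9]+ in place of \w+. The suffix list (s, d, t, ve, m) is only a sample; extend it with whichever contractions you need, such as ll or re.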