I need to split a string into substrings by a set of whole words.
Input:
word wo wordword
Output (split by word):
str1: word
str2: [space]wo wordword
Output (split by wo):
str1: word[space]
str2: wo
str3: [space]wordword
The method signature for the desired method should look like this:
public List<string> GetPhrases(string text, List<string> splitters);
Considerations:
whole word matches only
whitespace (including newlines) should be preserved
splitters list contains distinct words only
a splitter does not contain whitespace
matches should be case insensitive
With this method, I'll be able to highlight whole-word matches in a UI window, using a different highlight for each matched word. However, I can't wrap my head around doing this with regex.
Currently, I have a non-regex solution, but it's not great:
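    // Splitting on \s+ throws away the original whitespace (newlines, tabs, runs of spaces).
    var words = Regex.Split(text, @"\s+").Where(s => s != string.Empty).ToList();
    var str = "";
    var list = new List<string>();
    foreach (var word in words)
    {
        if (!splitters.Contains(word))
        {
            // Non-splitter words are re-joined with a single space.
            // Note: IndexOf returns the first occurrence, so repeated words are not handled reliably.
            if (words.IndexOf(word) != words.Count - 1)
                str += word + " ";
            else
                str += word;
        }
        else
        {
            // Flush the accumulated text, then emit the splitter as its own item.
            if (!string.IsNullOrWhiteSpace(str))
                list.Add(str);
            list.Add(word);
            str = "";
        }
    }
    // Flush whatever is left after the last splitter.
    if (!string.IsNullOrWhiteSpace(str))
        list.Add(str);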
The problem is I'm not maintaining any whitespace like newlines and instead replacing them with a space.
If your splitter words consist only of letters, digits, or underscores, you may use
    var results = Regex.Split(text, $@"\b({string.Join("|", splitters)})\b", RegexOptions.IgnoreCase)
        .Where(s => !string.IsNullOrEmpty(s))
        .ToList();
Here, the \b(word1|word2)\b pattern will match splitter words as whole words, and Regex.Split will split the string into the matching and non-matching chunks because of the capturing group ((...)) around the splitter words in the pattern.
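To make the role of the capturing group concrete, here is a minimal sketch that splits the same input with and without the parentheses. The class and method names (SplitGroupDemo, WithoutGroup, WithGroup) are made up for illustration, and the splitter wo is hard-coded.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, used only to illustrate Regex.Split behavior.
public static class SplitGroupDemo
{
    // No capturing group: the matched splitter is discarded from the result.
    public static string[] WithoutGroup(string input) =>
        Regex.Split(input, @"\bwo\b");

    // Capturing group: the matched splitter is returned as its own element.
    public static string[] WithGroup(string input) =>
        Regex.Split(input, @"\b(wo)\b");
}
```

For the input word wo wordword, WithoutGroup returns two items ("word " and " wordword"), while WithGroup returns three ("word ", "wo", " wordword"), which is the output asked for in the question.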
The .Where(...) filter removes the empty strings that usually appear when matches are consecutive or when a match occurs at the start or end of the string.
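Putting the pieces together, the requested GetPhrases method could be sketched roughly as follows. This is one possible shape, not a definitive implementation: the PhraseSplitter class name, the guard clauses for empty input, and the Regex.Escape call are assumptions added here (escaping is only a precaution, since \b still assumes each splitter starts and ends with a word character), and RegexOptions.IgnoreCase is used to satisfy the case-insensitivity requirement from the question.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical container class; only the GetPhrases signature comes from the question.
public static class PhraseSplitter
{
    public static List<string> GetPhrases(string text, List<string> splitters)
    {
        // Assumed edge-case handling: nothing to split, or nothing to split by.
        if (string.IsNullOrEmpty(text))
            return new List<string>();
        if (splitters == null || splitters.Count == 0)
            return new List<string> { text };

        // Escape each splitter so regex metacharacters are treated literally.
        string alternation = string.Join("|", splitters.Select(s => Regex.Escape(s)));

        // \b...\b restricts matches to whole words; the capturing group
        // makes Regex.Split keep the matched splitters in the output.
        string pattern = $@"\b({alternation})\b";

        return Regex.Split(text, pattern, RegexOptions.IgnoreCase)
                    .Where(chunk => !string.IsNullOrEmpty(chunk))
                    .ToList();
    }
}
```

Called with the text word wo wordword and the single splitter word, this returns "word" and " wo wordword"; with the splitter wo it returns "word ", "wo", and " wordword". Because the text is never tokenized on whitespace, newlines and repeated spaces stay inside the returned chunks.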
See the regex demo: