c# regex string match string-matching

.NET: Return a list of strings given a list of whole words to match


I need to split a string into substrings by a set of whole words.

Input: word wo wordword

Output (split by word):

str1: word

str2: [space]wo wordword

Output (split by wo):

str1: word[space]

str2: wo

str3: [space]wordword

The method signature for the desired method should look like this:

public List<string> GetPhrases(string text, List<string> splitters);

Considerations:

  • whole word matches only

  • whitespace should be preserved

  • splitters list contains distinct words only

  • a splitter does not contain whitespace

  • matches should be case insensitive

With this method, I'll be able to highlight whole-word matches in a UI window, with a different highlight for each matched word. However, I can't wrap my head around doing this with regex.

Currently, I have a non-regex solution, but it's not great:

var words = Regex.Split(text, @"\s+").Where(s => s != string.Empty).ToList();
var str = "";
var list = new List<string>();

foreach (var word in words)
{
    if (!splitters.Contains(word))
    {
        if(words.IndexOf(word) != words.Count - 1)
            str += word + " ";
        else
            str += word;
    }
    else
    {
        if(!string.IsNullOrWhiteSpace(str))
            list.Add(str);

        list.Add(word);
        str = "";
    }
}

if(!string.IsNullOrWhiteSpace(str))
    list.Add(str);

The problem is that whitespace such as newlines is not preserved: every run of whitespace is replaced with a single space.
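To make the whitespace loss concrete, here is a minimal sketch (the class and method names are hypothetical, not part of the original code). Regex.Split drops the separators when the pattern has no capturing group and returns them as separate items when it has one:

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceSplitDemo
{
    // No capturing group: the whitespace runs are discarded,
    // so the original spacing and newlines cannot be restored.
    public static string[] SplitDroppingWhitespace(string text)
    {
        return Regex.Split(text, @"\s+");
    }

    // Capturing group: each whitespace run is returned as its own item,
    // so concatenating the items reproduces the input exactly.
    public static string[] SplitKeepingWhitespace(string text)
    {
        return Regex.Split(text, @"(\s+)");
    }
}
```

For the input "one\ntwo  three", the first method yields one, two, three, while the second also yields the "\n" and the two spaces between them.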


Solution

  • If your splitter words consist only of alphanumeric or underscore characters, you may use:

    var results = Regex.Split(text, $@"\b({string.Join("|", splitters)})\b", RegexOptions.IgnoreCase)
                       .Where(x => !string.IsNullOrEmpty(x))
                       .ToList();

    Here, the \b(word1|word2)\b pattern matches the splitter words as whole words. Regex.Split returns both the matching and the non-matching chunks because of the capturing group ((...)) around the splitter words in the pattern. RegexOptions.IgnoreCase makes the match case insensitive, as the question requires.

    The .Where(x => !string.IsNullOrEmpty(x)) call filters out the empty strings that usually appear when a match is at the start or end of the string, or directly next to another match.

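As a minimal sketch, the pattern above could be wrapped into the GetPhrases method from the question roughly as follows. The PhraseSplitter class name, the null and empty checks, and the Regex.Escape call are assumptions added for illustration. RegexOptions.IgnoreCase is used to meet the question's case-insensitivity requirement.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PhraseSplitter
{
    // Splits text on whole-word, case-insensitive occurrences of the splitters.
    // Both the splitters and the text between them (whitespace included) are returned.
    public static List<string> GetPhrases(string text, List<string> splitters)
    {
        if (string.IsNullOrEmpty(text))
            return new List<string>();
        if (splitters == null || splitters.Count == 0)
            return new List<string> { text };

        // Assumption: Regex.Escape is a precaution. It leaves plain words
        // unchanged and stops a stray metacharacter from altering the pattern.
        var alternation = string.Join("|", splitters.Select(Regex.Escape));
        var pattern = $@"\b({alternation})\b";

        // The capturing group makes Regex.Split keep the matched splitters.
        return Regex.Split(text, pattern, RegexOptions.IgnoreCase)
                    .Where(chunk => chunk.Length > 0)
                    .ToList();
    }
}
```

With the input from the question, GetPhrases("word wo wordword", new List<string> { "wo" }) should return "word ", "wo" and " wordword", and splitting by "word" should return "word" and " wo wordword". Note that \b only gives whole-word behaviour when each splitter starts and ends with a word character.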