c# linq stop-words

Remove words if all of them are in a stop words list


I have an array of strings, each containing one or more words. When a string holds a single word, removing it is easy, but I can't figure out how to remove a multi-word string only when all of its words are in the stop words list. I would prefer to solve this with LINQ.

Imagine I have this array of strings:

then use 
then he
the image
and the
should be in
should be written

I want to get only:

then use 
the image
should be written

So lines whose words are all in the stop words list should be removed, while lines that mix stop words with other words should be kept.

My stop words array: string[] stopWords = {"a", "an", "x", "y", "z", "this", "the", "me", "you", "our", "we", "I", "them", "then", "ours", "more", "will", "he", "she", "should", "be", "at", "on", "in", "has", "have", "and"};

Thank you,


Solution

  • One way to solve this problem would be to do the following:

    string[] stopWords = { "a", "an", "x", "y", "z", "this", "the", "me", "you", "our", "we", "I", "them", "ours", "more", "will", "he", "she", "should", "be", "at", "on", "in", "has", "have", "and" };
    
    string input = """"
                then use 
                then he
                the image
                and the
                should be in
                should be written
                """";
    
    var array = input.Split(Environment.NewLine.ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
    
    var filteredArray = array.Where(x => x.Split(' ', StringSplitOptions.RemoveEmptyEntries).Any(y => !stopWords.Contains(y))).ToList();
    var result = string.Join(Environment.NewLine, filteredArray);
    
    Console.WriteLine(result);
    

    The first two statements just set up the data.

    The third line converts the string into an array of lines by splitting on the newline characters. (Environment.NewLine is "\r\n" on Windows and "\n" on Linux and macOS, so the split follows the line ending of the platform the code runs on.)

    The fourth line processes each line: it splits the line on spaces to get the individual words, then checks whether any word is missing from the stopWords list. If at least one word is not a stop word, the Where condition is satisfied and the whole line is kept in filteredArray.
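
    The same condition can be read the other way round: a line is dropped when all of its words are stop words. Here is a minimal sketch of that equivalence, assuming a shortened stop-word list and a hypothetical helper named IsAllStopWords (neither comes from the code above):

    ```csharp
    using System;
    using System.Linq;

    // Shortened list, for illustration only.
    string[] stopWords = { "the", "and", "should", "be", "in" };

    // Hypothetical helper: true when every word on the line is a stop word.
    bool IsAllStopWords(string line) =>
        line.Split(' ', StringSplitOptions.RemoveEmptyEntries)
            .All(word => stopWords.Contains(word));

    // Keeping a line when Any word is NOT a stop word is the same as
    // keeping it when NOT All of its words are stop words.
    Console.WriteLine(IsAllStopWords("and the"));           // True:  the line would be removed
    Console.WriteLine(IsAllStopWords("should be written")); // False: the line would be kept
    ```

    Written this way, the filter becomes array.Where(line => !IsAllStopWords(line)), which may read closer to the wording of the question.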

    The fifth line joins the remaining lines into the final result string.

    The result should look like this:

    then use
    then he
    the image
    should be written
    

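    Note that the stopWords list used in this code has the word them but not then, so the second result line (then he) is not removed. Once then is in the list, as in the array shown in your question, that line is dropped too and the output matches the three lines you expect.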
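
    As a variation on the approach above, here is a minimal sketch that uses the stop words array from the question (which includes then), looks words up in a HashSet<string>, and skips the empty entries that trailing spaces would otherwise produce. The case-insensitive comparer, the Trim call, and the variable names are assumptions for illustration, not part of the original code:

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Stop words as listed in the question (includes "then").
    // OrdinalIgnoreCase is an assumption: it lets "The" match "the".
    var stopWords = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    {
        "a", "an", "x", "y", "z", "this", "the", "me", "you", "our", "we", "I",
        "them", "then", "ours", "more", "will", "he", "she", "should", "be",
        "at", "on", "in", "has", "have", "and"
    };

    string[] lines =
    {
        "then use ", "then he", "the image", "and the", "should be in", "should be written"
    };

    // Keep a line unless every one of its words is a stop word.
    var kept = lines
        .Where(line => !line
            .Split(' ', StringSplitOptions.RemoveEmptyEntries)
            .All(word => stopWords.Contains(word)))
        .Select(line => line.Trim()) // drop the trailing space on "then use "
        .ToList();

    Console.WriteLine(string.Join(Environment.NewLine, kept));
    // Prints:
    // then use
    // the image
    // should be written
    ```

    A HashSet<string> keeps each lookup at roughly constant time, which matters little for a list this small but helps if the stop words list grows.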