Tags: c#, wpf, dictionary, tolower

Excluding words from dictionary


I am reading through documents and splitting them into words so that each word ends up in a dictionary. How can I exclude some words (such as "the", "a", "an")?

This is my function:

private void Splitter(string[] file)
{
    try
    {
        tempDict = file
            .SelectMany(i => File.ReadAllLines(i)
            .SelectMany(line => line.Split(new[] { ' ', ',', '.', '?', '!', }, StringSplitOptions.RemoveEmptyEntries))
            .AsParallel()
            .Distinct())
            .GroupBy(word => word)
            .ToDictionary(g => g.Key, g => g.Count());
    }
    catch (Exception ex)
    {
        Ex(ex);
    }
}

Also, where is the right place in this code to add a .ToLower() call so that all the words from the files are lowercased? I was thinking of something like this before the (tempDict = file...) statement:

file.ToList().ConvertAll(d => d.ToLower());

Solution

  • Do you want to filter out stop words?

     HashSet<String> StopWords = new HashSet<String> { 
       "a", "an", "the" 
     }; 
    
     ...
    
     tempDict = file
       .SelectMany(i => File.ReadAllLines(i)
         .SelectMany(line => line.Split(new[] { ' ', ',', '.', '?', '!', }, StringSplitOptions.RemoveEmptyEntries))
         .AsParallel()
         .Select(word => word.ToLower()) // <- To lower case
         .Where(word => !StopWords.Contains(word)) // <- No stop words
         .Distinct()) // <- One entry per word per file, as in the question
       .GroupBy(word => word)
       .ToDictionary(g => g.Key, g => g.Count()); // <- Count = number of files containing the word
    

    However, this code is only a partial solution. Proper names such as Berlin will be converted to lower case (berlin), acronyms such as KISS (Keep It Simple, Stupid) will become just kiss, and some numbers will come out wrong (for example, 3.14 is split at the '.' into 3 and 14).
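
    One possible way to soften the lower-casing problem is to compare words case-insensitively instead of calling .ToLower(). The sketch below is an illustration under assumptions, not part of the original answer: WordCounter and CountWords are hypothetical names, the input is a sequence of in-memory lines rather than file paths, and every occurrence is counted (there is no .Distinct()). Each key keeps the spelling of the word's first occurrence, so Berlin stays Berlin if that form is seen first. It does not tell KISS apart from kiss, and it does not fix the number splitting.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCounter
{
    // Separators copied from the Split call in the question.
    private static readonly char[] Separators = { ' ', ',', '.', '?', '!' };

    // Hypothetical helper: counts word occurrences in the given lines,
    // ignoring case and skipping stop words.
    public static Dictionary<string, int> CountWords(
        IEnumerable<string> lines, IEnumerable<string> stopWords)
    {
        // Case-insensitive set, so "The" and "the" are both excluded.
        var stop = new HashSet<string>(stopWords, StringComparer.OrdinalIgnoreCase);

        return lines
            .SelectMany(line => line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
            .Where(word => !stop.Contains(word))
            // Group case-insensitively; each key keeps the spelling of its first occurrence.
            .GroupBy(word => word, StringComparer.OrdinalIgnoreCase)
            // The resulting dictionary also looks keys up case-insensitively.
            .ToDictionary(g => g.Key, g => g.Count(), StringComparer.OrdinalIgnoreCase);
    }
}
```

    For file input, the lines could come from file.SelectMany(path => File.ReadAllLines(path)), with the stop words passed as new[] { "a", "an", "the" }.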