c# string word-count distinct-values

How to count word frequency after removing non-letter characters from a string?


I have a string:

var text = @"
I have a long string with a load of words,
and it includes new lines and non-letter characters.
I want to remove all of them and split this text to have one word per line, then I can count how many of each word exist.";

What is the best way to go about removing all non-letter characters, then splitting each word onto a new line so I can store and count how many of each word there are?

var words = text.Split(' ');

foreach(var word in words)
{
    word.Trim(',','.','-');
}

I have tried various things, such as using text.Replace to turn the unwanted characters into whitespace and then splitting. I have also tried Regex (which I would rather not use).

I have also tried using the StringBuilder class to copy characters from the text, appending a character only if it is a letter (a-z / A-Z).

I also tried calling sb.Replace or sb.Remove on the characters I don't want before storing the words in a Dictionary, but I still end up with characters I don't want.

Whatever I try, I seem to end up with at least one unwanted character, and I can't quite figure out why it isn't working.

Thanks!


Solution

  • Using an extension method, with neither Regex nor LINQ

    using System;
    using System.Collections.Generic;
    using System.Text;

    static public class StringHelper
    {
      // Counts how often each word (a run of letters) occurs in the text.
      static public Dictionary<string, int> CountDistinctWords(this string text)
      {
        var words = new Dictionary<string, int>();
        var builder = new StringBuilder();
        // Stores the word accumulated so far, if any, in the dictionary.
        Action processBuilder = () =>
        {
          var word = builder.ToString();
          if ( string.IsNullOrEmpty(word) ) return;
          if ( !words.ContainsKey(word) )
            words.Add(word, 1);
          else
            words[word]++;
        };
        for ( int index = 0; index < text.Length; index++ )
        {
          char charCurrent = text[index];
          if ( char.IsLetter(charCurrent) )
            builder.Append(charCurrent);
          else if ( !char.IsNumber(charCurrent) )
          {
            // Any other non-digit character (space, newline, punctuation)
            // ends the current word. Digits are skipped without ending it.
            processBuilder();
            builder.Clear();
          }
        }
        // Flush the last word if the text ends with a letter.
        processBuilder();
        return words;
      }
    }
    

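    It scans the text character by character, keeps only the letters, and builds a dictionary that maps each distinct word to its number of occurrences. Words are compared case-sensitively, and digits are skipped without splitting the word they appear in.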
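If "The" and "the" should count as the same word, one possible variation is to give the dictionary a case-insensitive comparer. The following is only a sketch under that assumption: the names WordCounter and CountWordsIgnoreCase are hypothetical, and unlike the method above it treats digits as word separators.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of the answer above.
static class WordCounter
{
    public static Dictionary<string, int> CountWordsIgnoreCase(string text)
    {
        // The comparer makes "The" and "the" share one key;
        // the key keeps the spelling that was seen first.
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        var builder = new StringBuilder();

        // Run one step past the end so the last word is flushed too.
        for (int i = 0; i <= text.Length; i++)
        {
            char c = i < text.Length ? text[i] : ' ';
            if (char.IsLetter(c))
            {
                builder.Append(c);
                continue;
            }
            // Assumption: every non-letter, digits included, ends the word.
            if (builder.Length == 0) continue;
            string word = builder.ToString();
            counts.TryGetValue(word, out int n); // n stays 0 for a new word
            counts[word] = n + 1;
            builder.Clear();
        }
        return counts;
    }
}
```

Calling WordCounter.CountWordsIgnoreCase(text) returns the same kind of dictionary as CountDistinctWords, so the test loop shown for that method applies unchanged.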

    Test

    var result = text.CountDistinctWords();
    Console.WriteLine($"Found {result.Count} distinct words:");
    Console.WriteLine();
    foreach ( var item in result )
      Console.WriteLine($"{item.Key}: {item.Value}");
    

    Result on your sample

    Found 36 distinct words:
    
    I: 3
    have: 2
    a: 2
    long: 1
    string: 1
    with: 1
    load: 1
    of: 3
    words: 1
    and: 3
    it: 1
    includes: 1
    new: 1
    lines: 1
    non: 1
    letter: 1
    characters: 1
    want: 1
    to: 2
    remove: 1
    all: 1
    them: 1
    split: 1
    this: 1
    text: 1
    one: 1
    word: 2
    per: 1
    line: 1
    then: 1
    can: 1
    count: 1
    how: 1
    many: 1
    each: 1
    exist: 1
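
As for why the Split/Trim attempt in the question leaves unwanted characters: strings in .NET are immutable, so word.Trim(...) returns a new string and the loop discards it. In addition, Split(' ') does not split on line breaks. A minimal sketch of that approach with both points addressed is below; the names SplitTrimSketch and SplitWords are hypothetical. Note that Trim only strips characters at the start and end of a token, so "non-letter" still comes out as a single token, which is one reason the character scan above is more robust.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating the Split/Trim approach from the question.
static class SplitTrimSketch
{
    public static List<string> SplitWords(string text)
    {
        // Split on spaces, tabs and both line-break characters.
        var separators = new[] { ' ', '\t', '\r', '\n' };
        var result = new List<string>();
        foreach (var raw in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // Trim returns a new string; its result has to be kept.
            var word = raw.Trim(',', '.', '-');
            if (word.Length > 0)
                result.Add(word);
        }
        return result;
    }
}
```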