I have a string:
var text = @"
I have a long string with a load of words,
and it includes new lines and non-letter characters.
I want to remove all of them and split this text to have one word per line, then I can count how many of each word exist.";
What is the best way to go about removing all non-letter characters, then splitting each word onto a new line so I can store and count how many of each word there are?
var words = text.Split(' ');
foreach(var word in words)
I have tried various things, such as using text.Replace(characters) to swap the unwanted characters for whitespace and then splitting. I have also tried Regex (which I would rather not use).
I have also tried using the StringBuilder class to copy characters from the text (string), appending a character only if it is a letter a-z / A-Z.
I also tried calling sb.Replace or sb.Remove on the characters I don't want before storing the words in a Dictionary, but I still end up with characters I don't want.
Whatever I try, at least one unwanted character remains, and I can't figure out why it isn't working.
Using an extension method, without Regex or LINQ:
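using System;
using System.Collections.Generic;
using System.Text;
static public class StringHelper
{
  static public Dictionary<string, int> CountDistinctWords(this string text)
  {
    string str = text.Replace(Environment.NewLine, " ");
    var words = new Dictionary<string, int>();
    var builder = new StringBuilder();
    char charCurrent;
    Action processBuilder = () => // stores the pending word, then resets the builder
    {
      var word = builder.ToString();
      if ( !string.IsNullOrEmpty(word) )
        if ( !words.ContainsKey(word) )
          words.Add(word, 1);
        else
          words[word]++;
      builder.Clear();
    };
    for ( int index = 0; index < str.Length; index++ )
    {
      charCurrent = str[index];
      if ( char.IsLetter(charCurrent) )
        builder.Append(charCurrent);
      else if ( !char.IsNumber(charCurrent) )
        charCurrent = ' '; // any non-letter, non-digit character ends the word
      if ( char.IsWhiteSpace(charCurrent) )
        processBuilder();
    }
    processBuilder(); // flush the last word
    return words;
  }
}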
It scans every character, keeps only the letters, and builds a dictionary that maps each word to its number of occurrences.
var result = text.CountDistinctWords();
Console.WriteLine($"Found {result.Count} distinct words:");
foreach ( var item in result )
  Console.WriteLine($"{item.Key}: {item.Value}");
Result on your sample:
Found 36 distinct words:
I: 3
have: 2
a: 2
long: 1
string: 1
with: 1
load: 1
of: 3
words: 1
and: 3
it: 1
includes: 1
new: 1
lines: 1
non: 1
letter: 1
characters: 1
want: 1
to: 2
remove: 1
all: 1
them: 1
split: 1
this: 1
text: 1
one: 1
word: 2
per: 1
line: 1
then: 1
can: 1
count: 1
how: 1
many: 1
each: 1
exist: 1
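As a side note on why text.Split(' ') alone leaves unwanted characters: it treats only the space character as a separator, so punctuation and line breaks ('\r', '\n') stay attached to the neighbouring words. A shorter variation on the same idea, still without Regex or LINQ, could blank out every non-letter first and then split. This is only a sketch: the class name WordCountSketch and the method CountWords are hypothetical names chosen for illustration, and the out-variable syntax assumes C# 7 or later.

```csharp
using System;
using System.Collections.Generic;

public static class WordCountSketch
{
    // Hypothetical helper, not part of the answer above: counts words by
    // replacing every non-letter with a space and then splitting on spaces.
    public static Dictionary<string, int> CountWords(string text)
    {
        // Punctuation, digits, '\r' and '\n' all become spaces here.
        char[] chars = text.ToCharArray();
        for (int i = 0; i < chars.Length; i++)
        {
            if (!char.IsLetter(chars[i]))
                chars[i] = ' ';
        }

        // RemoveEmptyEntries discards the empty strings produced by runs of spaces.
        string[] tokens = new string(chars)
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        var counts = new Dictionary<string, int>();
        foreach (string token in tokens)
        {
            // current is 0 when the word has not been seen yet.
            counts.TryGetValue(token, out int current);
            counts[token] = current + 1;
        }
        return counts;
    }
}
```

Calling WordCountSketch.CountWords(text) returns the same kind of Dictionary<string, int> as the extension method. One difference to be aware of: this sketch treats digits as separators, whereas the extension method does not turn digits into spaces. Both are case-sensitive, so "Word" and "word" are counted separately.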