I need to get the total number of words on a web page. This method returns 336, but when I check manually with wordcounter.net, the page has about 1192 words. How can I get just the word count of the article?
int kelimeSayisi()
{
    Uri url = new Uri("https://www.fitekran.com/hamilelik-ve-spor-hamileyken-hangi-spor-nasil-yapilir/");
    WebClient client = new WebClient();
    client.Encoding = System.Text.Encoding.UTF8;
    string html = client.DownloadString(url);
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    var kelime = doc.DocumentNode.SelectNodes("//text()").Count;
    return kelime;
}
As HereticMonkey mentioned in a comment, you're only retrieving the total number of text nodes, so you need to count the words inside each node's InnerText. There are also a couple of other things you'll most likely want to do: only select text nodes inside the body element, and skip text nodes whose parent is a script element.
I've written a modified version of your code that does that and counts the words by splitting on the space character and only treating strings that start with a letter as words:
// Requires: using System; using System.Linq; using System.Net;
// plus the HtmlAgilityPack NuGet package.
int kelimeSayisi()
{
    Uri url = new Uri("https://www.fitekran.com/hamilelik-ve-spor-hamileyken-hangi-spor-nasil-yapilir/");
    WebClient client = new WebClient();
    client.Encoding = System.Text.Encoding.UTF8;
    string html = client.DownloadString(url);
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    char[] delimiter = new char[] {' '};
    int kelime = 0; // running word total

    // Visit every text node under <body>, skipping text whose parent is a <script>
    foreach (string text in doc.DocumentNode
        .SelectNodes("//body//text()[not(parent::script)]")
        .Select(node => node.InnerText))
    {
        // Split on spaces and keep only tokens that start with a letter
        var words = text.Split(delimiter, StringSplitOptions.RemoveEmptyEntries)
            .Where(s => Char.IsLetter(s[0]));
        int wordCount = words.Count();
        if (wordCount > 0)
        {
            // Print what was counted so the result can be reviewed
            Console.WriteLine(String.Join(" ", words));
            kelime += wordCount;
        }
    }
    return kelime;
}
That returns a total word count of 1487 and also writes everything that's being treated as a word to the console, so you can review what's being included. The count is higher than the wordcounter.net figure, possibly because wordcounter.net excludes a few things like headers and footers.
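One thing that may skew the count: splitting only on the space character treats words separated by a newline or tab as a single token. Below is a minimal sketch of a stricter splitter, not a definitive implementation. The `WordCounter` class and `CountWords` method are hypothetical names introduced for illustration; they are not part of HtmlAgilityPack. It uses only the base class library and keeps the same "starts with a letter" rule as above.

```csharp
using System;
using System.Linq;

static class WordCounter
{
    // Hypothetical helper: counts the tokens in a string that begin with a
    // letter, treating any whitespace (space, tab, newline) as a delimiter.
    public static int CountWords(string text)
    {
        if (string.IsNullOrWhiteSpace(text))
            return 0;

        // Passing a null separator array makes Split use all whitespace characters
        return text
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Count(token => char.IsLetter(token[0]));
    }

    static void Main()
    {
        // "2019" is skipped because it does not start with a letter
        Console.WriteLine(CountWords("Hamilelik ve\nspor\t2019 yılında")); // prints 4
    }
}
```

Inside the loop above, this could replace the Split/Where pair, for example `kelime += WordCounter.CountWords(text);`. Passing the text through `HtmlAgilityPack.HtmlEntity.DeEntitize` first should also decode entities such as `&nbsp;` before counting. The total will probably differ from 1487 once tokens are split on newlines too, so treat the result as another estimate to review rather than an exact match for wordcounter.net.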