I read words from a file and print the 30 most frequent ones, but some words are printed
twice, as you can see in the output.
#include <algorithm>
#include <cctype>
#include <fstream>
#include <iostream>
#include <iterator>
#include <map>
#include <string>
#include <vector>
using namespace std;

int main(){
    fstream fs, output;
    fs.open("/Users/brah79/Downloads/skola/c++/inlämningsuppgifter/labb4/L4_wc/hitchhikersguide.txt");
    output.open("/Users/brah79/Downloads/skola/c++/inlämningsuppgifter/labb4/labb4/output.txt");
    if(!fs.is_open() || !output.is_open()){
        cout << "could not open file" << endl;
    }

    // Count every word after stripping non-alphabetic characters.
    map <string, int> mp;
    string word;
    while(fs >> word){
        for(int i = 0; i < word.length(); i++){
            if(!isalpha(word[i])){
                word.erase(i--, 1);
            }
        }
        if(word.empty()){
            continue;
        }
        mp[word]++;
    }

    // Copy into (count, word) pairs so sorting orders by frequency.
    vector<pair<int, string>> v;
    v.reserve(mp.size());
    for (const auto& p : mp){
        v.emplace_back(p.second, p.first);
    }
    sort(v.rbegin(), v.rend());

    cout << "These are the 30 most frequent words: " << endl;
    for(int i = 0; i < 30; i++){
        cout << v[i].second << " : " << v[i].first << " times" << endl;
    }
    output << "These are the 30 most frequent words: " << endl;
    for(int i = 0; i < 30; i++){
        output << v[i].second << " : " << v[i].first << " times" << endl;
    }
    return 0;
}
output:
the : 2230 times !!!
of : 1254 times
to : 1177 times
a : 1121 times
and : 1109 times
said : 680 times
it : 665 times
was : 605 times
in : 590 times
he : 546 times
that : 520 times
you : 495 times
I : 428 times
on : 349 times
Arthur : 332 times
his : 324 times
Ford : 314 times
The : 307 times !!!
at : 306 times
for : 284 times
is : 281 times
with : 273 times
had : 252 times
He : 242 times
this : 220 times
as : 207 times
Zaphod : 206 times
be : 188 times
all : 186 times
him : 182 times
"the" is printed twice. Also "could not open file" is printed at the top even
though the file was opened and its content is stored in the map.
Because you've written your program in a case-sensitive manner.
In particular, "The" and "the" are considered different words, so each gets its own entry in the map and its own frequency. For example, "the" is counted 2230 times while "The" is counted 307 times.