
Why are some words printed twice when counting word frequencies?


I read words from a file and print the 30 most frequent ones, but some words are printed twice, as you can see in the output.

#include <iostream>
#include <vector>
#include <map>
#include <iterator>
#include <fstream>
#include <string>     // std::string
#include <algorithm>  // std::sort
#include <cctype>     // isalpha
using namespace std;

int main(){
  
  fstream fs, output;
  fs.open("/Users/brah79/Downloads/skola/c++/inlämningsuppgifter/labb4/L4_wc/hitchhikersguide.txt");
  output.open("/Users/brah79/Downloads/skola/c++/inlämningsuppgifter/labb4/labb4/output.txt");
  if(!fs.is_open() || !output.is_open()){
    cout << "could not open file" << endl; 
  }

  map <string, int> mp; 
  string word; 
  while(fs >> word){
    // strip every non-letter character from the word
    for(int i = 0; i < word.length(); i++){
      if(!isalpha(word[i])){
        word.erase(i--, 1);
      }
    }
    if(word.empty()){
      continue;
    }

    mp[word]++;
  }
  vector<pair<int, string>> v;
  v.reserve(mp.size());

  for (const auto& p : mp){
    v.emplace_back(p.second, p.first);
  }

  sort(v.rbegin(), v.rend()); 

  cout << "These are the 30 most frequent words: " << endl;
  for(int i = 0; i < 30; i++){
      cout << v[i].second << " : " << v[i].first << " times" << endl;
  }

  // write the same list to the output file
  output << "These are the 30 most frequent words: " << endl;
  for(int i = 0; i < 30; i++){
      output << v[i].second << " : " << v[i].first << " times" << endl;
  }
 

  return 0; 
}

Output:

the : 2230 times !!!
of : 1254 times
to : 1177 times
a : 1121 times
and : 1109 times
said : 680 times
it : 665 times
was : 605 times
in : 590 times
he : 546 times
that : 520 times
you : 495 times
I : 428 times
on : 349 times
Arthur : 332 times
his : 324 times
Ford : 314 times
The : 307 times !!!
at : 306 times
for : 284 times
is : 281 times
with : 273 times
had : 252 times
He : 242 times
this : 220 times
as : 207 times
Zaphod : 206 times
be : 188 times
all : 186 times
him : 182 times

"the" is printed twice. Also, "could not open file" is printed at the top even though the file was opened and its contents are stored in the map.


Solution

  • Because you've written your program in a case-sensitive manner.

    In particular, The and the are treated as different words and so are counted separately. For example, the occurs 2230 times while The occurs 307 times.
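
One way to merge the two counts is to fold every word to lower case before using it as a map key. The following is a minimal sketch, not the only possible fix. The helper names normalize, count_words and top_n are hypothetical (they do not appear in the question), and the sketch assumes plain ASCII text, so bytes of non-ASCII letters are simply dropped.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>   // also provides std::istream; handy for trying this on a string
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: keep only letters and fold them to lower case,
// so "The", "the" and "the," all map to the same key.
std::string normalize(const std::string& raw) {
    std::string out;
    for (unsigned char ch : raw) {   // unsigned char is the safe argument type for <cctype>
        if (std::isalpha(ch)) {
            out += static_cast<char>(std::tolower(ch));
        }
    }
    return out;
}

// Hypothetical helper: count normalized words read from any input stream.
std::map<std::string, int> count_words(std::istream& in) {
    std::map<std::string, int> counts;
    std::string word;
    while (in >> word) {
        const std::string key = normalize(word);
        if (!key.empty()) {          // skip tokens with no letters, e.g. "42"
            ++counts[key];
        }
    }
    return counts;
}

// Hypothetical helper: the n most frequent words, highest count first.
// Returns fewer than n entries when the text has fewer distinct words.
std::vector<std::pair<int, std::string>> top_n(
        const std::map<std::string, int>& counts, std::size_t n) {
    std::vector<std::pair<int, std::string>> v;
    v.reserve(counts.size());
    for (const auto& p : counts) {
        v.emplace_back(p.second, p.first);
    }
    std::sort(v.rbegin(), v.rend()); // same descending sort as in the question
    if (v.size() > n) {
        v.resize(n);
    }
    return v;
}
```

In the question's main, this would amount to calling count_words(fs) and then looping over top_n(mp, 30) instead of indexing v[i] up to a fixed 30, which also avoids reading past the end of the vector for short texts.

The "could not open file" message is most likely a separate issue. A std::fstream opened with the default mode (in | out) does not create a missing file, so if output.txt did not exist yet (an assumption about your setup), output.open(...) fails even though fs opened fine. Using std::ofstream for the output file would create it. Note also that the program carries on after printing the message because there is no return in that branch.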