Stop Words in C++


The following C++ program takes two text files, stop_words.txt and story.txt, and removes every occurrence of the stop words from story.txt. For example:

Monkey is a common name that may refer to groups or species of mammals, in part, the simians of infraorder L. The term is applied descriptively to groups of primates, such as families of new world monkeys and old world monkeys. Many monkey species are tree-dwelling (arboreal), although there are species that live primarily on the ground, such as baboons. Most species are also active during the day (diurnal). Monkeys are generally considered to be intelligent, especially the old world monkeys of Catarrhini.

The text above is story.txt, and the stop_words.txt file is given below:

is
are 
be

When I run my code, it doesn't delete all the stop words and keeps some of them. The code also creates a file called stop_words_count.txt, which should list the number of occurrences of each stop word, like so:

is 2
are 4
be 1

But my output file shows the following:

is 1
are 4
be 1

I would be very grateful for some help regarding this code! I have posted it below for your reference.


#include <iostream>
#include <string>
#include <fstream>
using namespace std;

const int MAX_NUM_STOPWORDS = 100;

struct Stop_word
{
  string word;  // stop word
  int count;    // removal count
};


int stops[100];
string ReadLineFromStory(string story_filename )
{
  string x = "";
  string b;
  ifstream fin;
  fin.open(story_filename);
  while(getline(fin, b))
  {
    x += b;

  }
  return x;
}

void ReadStopWordFromFile(string stop_word_filename, Stop_word words[], int &num_words)
{
  ifstream fin;
  fin.open(stop_word_filename);
  string a;
  int i = 0;
  if (fin.fail())
  {
    cout << "Failed to open "<< stop_word_filename << endl;
    exit(1);
  }
  words[num_words].count = 0;
  while (fin >> words[num_words].word)
  {
    
    ++num_words;
  }


  fin.close();
}

void WriteStopWordCountToFile(string wordcount_filename, Stop_word words[], int num_words)
{
  ofstream fout;
  fout.open(wordcount_filename);
  for (int i = 0; i < 1; i++)
  {
    fout << words[i].word << " "<< stops[i] + 1 << endl;
  }
  for (int i = 1; i < num_words; i++)
  {
    fout << words[i].word << " "<< stops[i] << endl;
  }

  fout.close();
}

int RemoveWordFromLine(string &line, string word)
{
  int length = line.length();
    int counter = 0;
    int wl = word.length();
    for(int i=0; i < length; i++)
    {
        int x = 0;
        if(line[i] == word[0] && (i==0 || (i != 0 && line[i-1]==' ')))
        {
            for(int j = 1 ; j < wl; j++)
                if (line[i+j] != word[j])
                {
                    x = 1;
                    break;
                }
            if(x == 0 && (i + wl == length || (i + wl != length && line[i+wl] == ' ')))
            {
                for(int k = i + wl; k < length; k++)
                    line[k -wl] =line[k];
                length -= wl;

                counter++;
            }
        }

  }
  line[length] = 0;
  char newl[1000] = {0};
  for(int i = 0; i < length; i++)
    newl[i] = line[i];
  line.assign(newl);
  return counter;
}


int RemoveAllStopwordsFromLine(string &line, Stop_word words[], int num_words)
{
  int counter[100];
  int final = 0;
    for(int i = 1; i <= num_words; i++)
  {
    counter[i] = RemoveWordFromLine(line, words[i].word);
    final += counter[i];
    stops[i] = counter[i];

  }
    return final;

}


int main()
{

  Stop_word stopwords[MAX_NUM_STOPWORDS];     // an array of struct Stop_word
  int num_words = 0, total = 0;
  // read in two filenames from user input
  string a, b, c;
  cin >> a >> b;

  // read stop words from stopword file and
  // store them in an array of struct Stop_word
  ReadStopWordFromFile(a, stopwords, num_words);

  // open text file
  c = ReadLineFromStory(b);


  // open cleaned text file
  ofstream fout;
  fout.open("story_cleaned.txt");


  // read in each line from text file, remove stop words,
  // and write to output cleaned text file

  total = RemoveAllStopwordsFromLine(c, stopwords, num_words) + 1 ;

  fout << c;

  // close text file and cleaned text file

  fout.close();

  // write removal count of stop words to files

  WriteStopWordCountToFile("stop_words_count.txt", stopwords, num_words);

  // output to screen total number of words removed
  cout << "Number of stop words removed = " << total << endl;

  return 0;
}

Solution

  • There is one major bug in your code.

    In the function RemoveAllStopwordsFromLine you are using the wrong array indices. In C++ the first element of an array has index 0 and the last valid index is the size minus 1, so the loop must start at 0 and compare with "<" instead of "<=". Your loop is:

    for (int i = 1; i <= num_words; i++)
    

    So the first stop word, "is", is never checked or counted, and the loop also reads words[num_words], one element past the last stop word that was read.

    Please change it to

    for (int i = 0; i < num_words; i++)
    

    You then also need to remove your workaround in the function WriteStopWordCountToFile. You made a special case for element 0, which is wrong.

    Please remove

        for (int i = 0; i < 1; i++)
        {
            fout << words[i].word << " " << stops[i] + 1 << endl;
        }
    

    and start the next for loop at 0. Also remove the "+ 1" when calculating the total in main.

    Because you are using C-style arrays, magic numbers and very complex code, I will show you a modern C++ solution. It needs C++17, because the std::vector is declared without template arguments.

    C++ has many useful algorithms, and some are specifically designed to address your requirements. So, please use them. Try to get away from C and migrate to C++.

    #include <string>
    #include <iostream>
    #include <fstream>
    #include <vector>
    #include <iterator>
    #include <algorithm>
    #include <regex>
    #include <sstream>
    
    
    // The filenames. Whatever you want
    const std::string storyFileName{ "r:\\story.txt" };
    const std::string stopWordFileName{ "r:\\stop_words.txt" };
    const std::string stopWordsCountFilename{ "r:\\stop_words_count.txt" };
    const std::string storyCleanedFileName{ "r:\\story_cleaned.txt" };
    
    
    
    // Because of the simplicity of the task, put everything in main
    int main() {
    
        // Open all 4 needed files
        std::ifstream storyFile(storyFileName);
        std::ifstream stopWordFile(stopWordFileName);
        std::ofstream stopWordsCountFile(stopWordsCountFilename);
        std::ofstream storyCleanedFile(storyCleanedFileName);
    
        // Check, if the files could be opened
        if (storyFile && stopWordFile && stopWordsCountFile && storyCleanedFile) {
    
            // 1. Read the complete sourcefile with the story into a std::string
            std::string story( std::istreambuf_iterator<char>(storyFile), {} );
    
            // 2. Read all "stop words" into a std::vector of std::strings
            std::vector stopWords(std::istream_iterator<std::string>(stopWordFile), {});
    
            // 3. Count the occurrences of the "stop words" and write them into the destination file
            std::for_each(stopWords.begin(), stopWords.end(), [&story,&stopWordsCountFile](const std::string& sw) {
                std::regex re{"\\b" + sw + "\\b"};          // One of the "stop words", matched as a whole word only
                stopWordsCountFile << sw << " --> " <<      // Write count to output
                    std::distance(std::sregex_iterator(story.begin(), story.end(), re), {}) << "\n";});
    
            // 4. Replace "stop words" in story and write new story into file
            std::string wordsToReplace;             // Build an alternation of all stop words, e.g. is|are|be
            for (const std::string& sw : stopWords)
                wordsToReplace += (wordsToReplace.empty() ? "" : "|") + sw;
    
            // Remove whole words only, each together with one optional following white space
            if (!wordsToReplace.empty())
                story = std::regex_replace(story, std::regex("\\b(?:" + wordsToReplace + ")\\b\\s?"), "");
            storyCleanedFile << story;
        }
        else {
            // In case that any of the files could not be opened.
            std::cerr << "\n*** Error: Could not open one of the files\n";
        }
        return 0;
    }
    

    Please try to study and understand this code. This is a very simple solution.
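
For comparison, the same task can also be sketched without regular expressions, by splitting the story on whitespace and dropping every token that is a stop word. The function name RemoveStopWords and its parameters are made up for this sketch. It assumes that stop words appear in the story as separate whitespace-delimited tokens, so a stop word directly followed by punctuation (for example "is,") would not be removed, and all runs of whitespace, including line breaks, collapse to single spaces.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: returns the story without its stop words and
// adds the number of removals per stop word to `counts`.
std::string RemoveStopWords(const std::string& story,
                            const std::set<std::string>& stopWords,
                            std::map<std::string, int>& counts)
{
    std::istringstream in(story);
    std::string cleaned;
    std::string token;

    while (in >> token)                    // operator>> splits on whitespace
    {
        if (stopWords.count(token) != 0)   // whole-token comparison only
        {
            ++counts[token];
            continue;                      // skip the stop word
        }
        if (!cleaned.empty())
            cleaned += ' ';
        cleaned += token;
    }
    return cleaned;
}
```

Because whole tokens are compared, a stop word can never match in the middle of a longer word. Stop words that never occur in the story simply get no entry in counts.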