Search code examples
c++textmarkov-chainsmarkov

Text file as input in C++ program will not work unless the text is copy and pasted


I have a very strange bug in my code that is a little hard to explain. Let me begin with what the program does: basically, the C++ program takes input text (from a file named "input.txt" in the same directory) and uses Markov Chains to generate some artificial output text that resembles the style of the input text and prints it to the terminal.

It works when I copy and paste the text of 'Alice in Wonderland' (http://paulo-jorente.de/text/alice_oz.txt) directly into "input.txt", but if I add any words or characters to the beginning or end of the contents of the text file, then the code stops running (or runs infinitely). However, this does not happen if I add text anywhere in the middle of the contents of the text file.

If you would to test it yourself, try running the code with Alice in Wonderland copied into "input.txt". Then after it runs successfully, go to input.txt and type some random characters or words after the last of the text from 'Alice' ("...home again!") and try to run it again; it will fail.

Here is the code:

#include <ctime>
#include <iostream>
#include <algorithm>
#include <fstream>
#include <string>
#include <vector>
#include <map>
using namespace std;

class markovTweet{
    string fileText;
    map<string, vector<string> > dictionary;

public:

    void create(unsigned int keyLength, unsigned int words) {
        ifstream f("input.txt");
        if(f.good()){
          fileText.assign((istreambuf_iterator<char>(f)), istreambuf_iterator<char>());
        }else{
          cout << "File cannot be read. Ensure there is a file called input.txt in this directory." << "\n" << endl;
          return;
        }
        if(fileText.length() < 1){
          return;
        }
        cout << "\n" << "file imported" << "\n";
        createDictionary(keyLength);
        cout << "\n" << "createDictionary" << "\n" << "\n";
        createText(words - keyLength);
        cout << "\n" << "text created, done" << endl;
    }

private:

    void createText(int w) {
        string key, first, second;
        size_t next;
        map<string, vector<string> >::iterator it = dictionary.begin();
        advance( it, rand() % dictionary.size() );
        key = (*it).first;
        cout << key;
        while(true) {
            vector<string> d = dictionary[key];
            if(d.size() < 1) break;
            second = d[rand() % d.size()];
            if(second.length() < 1) break;
            cout << " " << second;
            if(--w < 0) break;
            next = key.find_first_of( 32, 0 );
            first = key.substr( next + 1 );
            key = first + " " + second;
        }
        cout << "\n";
    }

    void createDictionary(unsigned int kl) {
        string w1, key;
        size_t wc = 0, pos, next;
        next = fileText.find_first_not_of( 32, 0 );
        if(next == string::npos) return;
        while(wc < kl) {
            pos = fileText.find_first_of(' ', next);
            w1 = fileText.substr(next, pos - next);
            key += w1 + " ";
            next = fileText.find_first_not_of(32, pos + 1);
            if(next == string::npos) return;
            wc++;
        }
        key = key.substr(0, key.size() - 1);
        while(true) {
            next = fileText.find_first_not_of(32, pos + 1);
            if(next == string::npos) return;
            pos = fileText.find_first_of(32, next);
            w1 = fileText.substr(next, pos - next);
            if(w1.size() < 1) break;
            if(find( dictionary[key].begin(), dictionary[key].end(), w1) == dictionary[key].end() ) 
                dictionary[key].push_back(w1);
            key = key.substr(key.find_first_of(32) + 1) + " " + w1;
        }
    }
};

int main() {  
    markovTweet t;
    cout << "\n" << "Artificially generated tweet using Markov Chains based off of input.txt: " << "\n" << "\n";
    //lower first number is more random sounding text, second number is how long output is.
    t.create(4, 30);
    return 0;
}

This is a very strange bug and any help that you can offer is much appreciated! Thanks!


Solution

  • This might be something to think about regarding std::map's time complexity for its operator[]().

    Using operator[] : “[]” can also be used to insert elements in map. Similar to above functions and returns the pointer to the newly constructed element. Difference is that this operator always constructs a new element i.e even if a value is not mapped to key, default constructor is called and assigns a “null” or “empty” value to the key. Size of map is always increased by 1. Time complexity : log(n) where n is size of map


    courtesy from: geeksforgeeks



    In your class's createDictionary() function try adding this line of code in the 2nd while loop:

    {
        //...code 
        if (find(dictionary[key].begin(), dictionary[key].end(), w1) == dictionary[key].end()) {
              dictionary[key].push_back(w1);
              std::cout << dictionary.size() << std::endl;
        //code...
    }
    

    When I copied the text from the file it was generating 62037 entries into your dictionary or hashmap. It takes roughly 20 - 30 seconds to run and finish.

    When I added the text " Good Bye! " to the end of the file, saved it and ran the program/debugger it generated 62039 entries. Again it took about 20-30 seconds to run.

    Then I added the text "Hello World " to the beginning of the file, saved it and ran the program/debugger and it generated 62041 entries. Again it took about 20-30 seconds to run.

    However, there were a couple of times during this process, that it generated that many entries into your map, but the code was still going through the loop... The one time it was around 620xx - 640xx. I don't know what was causing it to generate that many keys... but like I said, there were a couple of times that it quit printing the values, but was still iterating through the same while loop, yet the size of the map wasn't increasing...

    This happened the first time that I entered the text at the beginning of the file after trying it with the appended text at the end. This is when I decided to print out the size of your map and noticed that I was getting this infinite loop... Then I stopped the debugger went back to the text file and kept the inserted text at the beginning, but deleted the appended text at the end making sure to leave a single space at the end of the text.

    This time when I ran the program/debugger, It worked correctly and it generated 62039 entries. Again it took about 20-30 seconds to run. After, the first successful run with the inserted text at the beginning is when I added the text at the end, and it ran fine. I then even tried to have "Hello World!" followed by a newline by using enter into the text file and having "Good Bye!" preceded by one as well and it still worked fine.


    Yes, there is something causing a bug, but I don't know exactly what is causing it. However, I believe that I have traced it to be within this while loop and the conditional branching for exiting... It should have broken out of this loop and went into the createText function but it never broke out, the condition you have for:

    if (next == std::string::npos) return
    

    and

    if (w1.size() < 1) break;
    

    somehow were not being met.


    The time complexity is okay, however, it's not the best but it's also not the worst as there are approximately 62-63k entries running in O(log n) time. This also doesn't include counting the space complexity which does need to be taken into consideration.


    It could be that during one run you might be getting stack-overflow which is causing the infinite loop and the next time you run it, it might not. I don't think it has anything to do with adding in text into the text file directly except that it will increase the size of your map in O(log N) time and increase the space complexity as well.

    Regardless of what you add into this text file and after saving it, the way your program or algorithms are written, it is pulling all of the contents of that file as pointer indices by char type through the iterator classes and storing it into a single string, fileText. After this string is constructed there are approximately 336940 characters in your class's member string.


    Hopefully, this information can guide you in narrowing down where the bug is in your program and determining what is actually causing it. It truly is hard to narrow down this culprit.