Search code examples
c++arrayssortingdata-structuresdynamic-arrays

Getting input from text file and storing into array but text file contains more than 20.000 strings


Getting inputs from a text file and storing it into an array but text file contains more than 20.000 strings. I'm trying to read strings from the text file and store them into a huge-sized array. How can I do that?

I can not use vectors. Is it possible to do it without using a hash table?

Afterward, I will try to find the most frequently used words using sorting.


Solution

  • You requirement is to NOT use any standard container like for example a std::vector or a std::unordered_map.

    In this case we need to create a dynamic container by ourself. That is not complicated. And we can use this even for storing strings. So, I will even not use std::string in my example.

    I created some demo for you with ~700 lines of code.

    We will first define the term "capacity". This is the number of elements that could be stored in the container. It is the currently available space. It has nothing to do, how many elements are really stored in the container.

    But there is one and the most important functionality of a dynamic container. It must be able to grow. And this is always necessary, if we want to store add more elements to the container, as its capacity.

    So, if we want to add an additional element at the end of the container, and if the number of elements is >= its capacity, then we need to reallocate bigger memory and then copy all the old elements to the new memory space. For such events, we will usually double the capacity. This should prevent frequent reallocations and copying activities.

    Let me show you one example for a push_back function, which could be implemented like this:

    template <typename T>
    void DynamicArray<T>::push_back(const T& d) {               // Add a new element at the end
        if (numberOfElements >= capacity) {                     // Check, if capacity of this dynamic array is big enough
            capacity *= 2;                                      // Obviously not, we will double the capacity
            T* temp = new T[capacity];                          // Allocate new and more memory
            for (unsigned int k = 0; k < numberOfElements; ++k)
                temp[k] = data[k];                              // Copy data from old memory to new memory
            delete[] data;                                      // Release old memory
            data = temp;                                        // And assign newly allocated memory to old pointer
        }
        data[numberOfElements++] = d;                           // And finally, store the given data at the end of the container
    }
    

    This is a basic approach. I use templates in order to be able to store any type in the dynamic array.

    You could get rid of the templates, by deleting all template stuff and replacing "T" with your intended data type.

    But, I would not do that. See, how easy we can create a "String" class. We just typedef a dynamic array for chars as "String".

    using String = DynamicArray<char>;
    

    will give us basic string functionality. And if we want to have a dynamic array of strings later, we can write:

    using StringArray = DynamicArray<String>;
    

    and this gives us a DynamicArray<DynamicArray<char>>. Cool.

    For this special classes we can overwrite some operators, which will make the handling and our life even more simple.

    Please look in the provided code


    And, to be able to use the container in the typical C++ environment, we can add full iterator capability. That makes life even more simple.

    This needs really some typing effort, but is not complicated. And, it will make life really simpler.


    You also wanted to create a hash map. For counting words.

    For that we will create a key/value pair. The key is the String that we defined above and the value will be the frequency counter.

    We implement a hash function which should be carefully selected to work with strings, has a high entropy and give good results for the bucket size of the hash map.

    The hash map itself is a dynamic container. We will also add iterator functionality to it.


    For all this I drafted some 700 lines of code for you. You can take this as an example for your further studies.

    It can also be easily enhanced with additional functionality.

    But caveat: I did only some basic tests and I even used raw pointers for owned memory. This can be done in a schoolproject to learn some dynamic memory management, but not in reality.

    Additionlly. You can replace all this code, by simply using std::string, std::vector and std::unordered_map. Nobody would use such code and reinvent the wheel.

    But it may give you some ideas on how to implement similar things.

    Because Stackoverlof limits the answer size to 32000 characters, I will put the main part on github.

    Please click here.

    I will just show you main so that you can see how easy the solution can be used:

    int main() {
    
        // Open file and check, if it could be opened
        std::ifstream ifs{ "r:\\test.txt" };
        if (ifs) {
    
            // Define a dynamic array for strings
            StringArray stringArray{};
    
            // Use overwritten extraction operator and read all strings from the file to the dynamic array
            ifs >> stringArray;
    
            // Create a dynamic hash map
            HashMap hm{};
    
            // Now count the frequency of words
            for (const String& s : stringArray)
                hm[s]++;
    
            // Put the resulting key/value pairs into a dynamic array
            DynamicArray<Item> items(hm.begin(), hm.end());
    
            // Sort in descending order by the frequency
            std::sort(items.begin(), items.end(), [](const Item& i1, const Item& i2) { return i1.count > i2.count; });
    
            // SHow resulton screen
            for (const auto& [string, count] : items) 
                std::cout << std::left << std::setw(20) << string << '\t' << count << '\n';
        }
        else std::cerr << "\n\nError: Could not open source file\n\n";
    }