Search code examples
c++vectorbinary-searchinsert-update

how to binary search a vector of structs and insert at appropriate index


What I wanted to do:

I want to have a sorted table of texts(equal sized) which shows the number of occurrences of each text. The table is sorted as the texts are inserted into the table. While words are being inserted into the table, they are checked if they already exist in which case their ref_count is incremented. If they are new their ref_count is set to 1 and are inserted into the vector at the right index making sure that the table is still sorted.

What I did:

I created a struct and a vector of the structs defined as follows. Then I used the shown binary search to identify the appropriate index to use the std::insert() function.

My Problem: My binary search implementation is not returning the correct index position.

#define WORD_LENGTH 6

typedef struct RC_table {
    char word[WORD_LENGTH+1];//+1 for ‘\0’
    unsigned int ref_count;
}RC_table;

std::vector<RC_table>RC;

void update_RC(char *word_to_insert)
{
    int index = 0; 
    bool found=binary_search_RC(word_to_insert, &index);
    if (found == TRUE) {
        //increment reference count
        RC[index].ref_count++;
    }
    else {
        //insert new word at index 
        RC_table new_entry;
        memcpy(new_entry.word, word_to_insert, WORD_LENGTH);
        new_entry.word[WORD_LENGTH] = '\0';
        new_entry.ref_count = 1;

        if(index==0)
            RC.insert(RC.begin(),new_entry);
        else if(index==RC.size()-1)
            RC.insert(RC.end(),new_entry);
        else {
            RC.insert(RC.begin() + index +1, new_entry);
        }
    }
}
bool binary_search_RC(char *word, int *index) {
    int left = 0;
    int right = RC.size() -1;
    int middle = (left + right) / 2;
    bool found = false;
    while (left<=right) {
        middle = (left + right) / 2;
        *index = middle;
        if (strncmp(word, RC[middle].word, WORD_LENGTH) < 0) {
            right = middle - 1;
        }

        else if(strncmp(word, RC[middle].word, WORD_LENGTH) > 0) {
            left = middle + 1;
        }
        else {
            found = true;
            break;
        }
    }
    *index = middle;
    return found;
}

EDIT : I tried to use lower_bound(). it still does not give the expected output(i.e sorted table).

typedef struct RC_table {
    char word[WORD_LENGTH+1];//+1 for ‘\0’
    unsigned int ref_count;
    bool operator<(const RC_table&r){
         return word<r.word;
    }
}RC_table;

Insert to table using:

auto itr=lower_bound(RC.begin(),RC.end(),new_enry);
RC.insert(itr,new_entry);

Solution

  • The biggest problem with your algorithm is that you're trying to fencepost both sides of your partitioning using elements you consider searchable. That's not how you do this; the partitioning can (and will) eventually get stuck with repetitive integer division of 1/2 in worst-case no-find scenarios. The math works out considerably easier when you do this instead:

    • The left end of the partition refers to the first element, and is considered as suspect.
    • The right end of the partition refers to the one-past element. and is NOT considered as suspect.

    The result is a much simpler algorithm that is easier to understand, and easier to maintain.

    bool binary_search_RC(const char *word, int *index)
    {
        *index = 0;
    
        int left = 0;
        int right = static_cast<int>(RC.size());
        bool found = false;
        while (!found && left < right)
        {
            int middle = *index = left + (right-left) / 2;
            int res = strncmp(word, RC[middle].word, WORD_LENGTH);
    
            if (res < 0)
                right = middle;
    
            else if (res > 0)
                *index = left = middle+1;
    
            else
                found = true;
        }
        return found;
    }
    

    Put into practice, a simple little test harness that generates random three-character strings from a simple three letter alphabet. That should present plenty of unique insertions, and eventually plenty of finds. At the end, we'll print the entire table, which, if this works, had damn well better be sorted.

    Code

    #include <iostream>
    #include <vector>
    #include <random>
    
    #define WORD_LENGTH 6
    
    typedef struct RC_table {
        char word[WORD_LENGTH+1];//+1 for ‘\0’
        unsigned int ref_count;
    } RC_table;
    
    std::vector<RC_table>RC;
    
    bool binary_search_RC(const char *word, int *index)
    {
        *index = 0;
    
        int left = 0;
        int right = static_cast<int>(RC.size());
        bool found = false;
        while (!found && left < right)
        {
            int middle = *index = left + (right-left) / 2;
            int res = strncmp(word, RC[middle].word, WORD_LENGTH);
    
            if (res < 0)
                right = middle;
    
            else if (res > 0)
                *index = left = middle+1;
    
            else
                found = true;
        }
        return found;
    }
    
    void update_RC(const char *word_to_insert)
    {
        int index = 0;
        bool found = binary_search_RC(word_to_insert, &index);
    
        if (found)
        {
            ++RC[index].ref_count;
            std::cout << "Found entry for " << word_to_insert;
            std::cout << " (" << RC[index].ref_count << ")\n";
        }
        else {
            std::cout << "Adding entry for " << word_to_insert << '\n';
    
            //insert new word at index
            RC_table new_entry;
            strncpy(new_entry.word, word_to_insert, WORD_LENGTH);
            new_entry.word[WORD_LENGTH] = 0;
            new_entry.ref_count = 1;
    
            if(index==0)
                RC.insert(RC.begin(),new_entry);
    
            else if(index == RC.size())
                RC.insert(RC.end(),new_entry);
    
            else
                RC.insert(RC.begin() + index, new_entry);
        }
    }
    
    
    
    int main()
    {
        // generate some random values and start adding them. a few dozen
        //  with a severely limited alphabet should suffice.
        const char alphabet[] = "abc";
        std::mt19937 rng{ std::random_device{}() };
        std::uniform_int_distribution<std::size_t> dist(0, sizeof alphabet - 2);
    
        for (int i=0; i<50; ++i)
        {
            char word[WORD_LENGTH+1] = {};
            for (int j=0; j<3; ++j)
                word[j] = alphabet[ dist(rng) ];
            update_RC(word);
        }
    
        // print the table
        for (auto const& x : RC)
            std::cout << x.word << " : " << x.ref_count << '\n';
    }
    

    Output (varies, obviously)

    Adding entry for cab
    Adding entry for cac
    Adding entry for bcc
    Adding entry for bbb
    Adding entry for cbc
    Adding entry for abb
    Found entry for cab (2)
    Adding entry for aba
    Adding entry for cca
    Adding entry for acc
    Found entry for aba (2)
    Found entry for bcc (2)
    Adding entry for cbb
    Found entry for cac (2)
    Found entry for cac (3)
    Adding entry for aaa
    Found entry for acc (2)
    Adding entry for bbc
    Adding entry for baa
    Adding entry for acb
    Found entry for aaa (2)
    Found entry for cca (2)
    Found entry for baa (2)
    Found entry for cbb (2)
    Adding entry for aac
    Found entry for cac (4)
    Adding entry for aca
    Adding entry for ccc
    Found entry for bbc (2)
    Adding entry for bba
    Adding entry for bac
    Adding entry for aab
    Found entry for bac (2)
    Found entry for aca (2)
    Found entry for bcc (3)
    Adding entry for caa
    Found entry for aaa (3)
    Found entry for bbc (3)
    Found entry for caa (2)
    Found entry for abb (2)
    Found entry for baa (3)
    Found entry for acc (3)
    Found entry for bba (2)
    Found entry for bbb (2)
    Found entry for cbc (2)
    Found entry for aaa (4)
    Found entry for baa (4)
    Adding entry for cba
    Found entry for bac (3)
    Found entry for bbc (4)
    aaa : 4
    aab : 1
    aac : 1
    aba : 2
    abb : 2
    aca : 2
    acb : 1
    acc : 3
    baa : 4
    bac : 3
    bba : 2
    bbb : 2
    bbc : 4
    bcc : 3
    caa : 2
    cab : 2
    cac : 4
    cba : 1
    cbb : 2
    cbc : 2
    cca : 2
    ccc : 1
    

    I didn't bother doing the math, but add up those reference counts and you should find they sum to 50, the number of insertions we performed.

    Best of luck.