Search code examples
c++vectorinner-join

How to join two vector<vector<string>> in c++ based on a common column


So what I want to do is essentially do an inner join of two tables in c++. However, the only guide I could is this one from geeksforgeeks: https://www.geeksforgeeks.org/joining-tables-using-multimaps/

An example below is as follows:

// First Column Name: Numbers
// Second Column Name: Alphabets
vector<vector<string>> v1 = { vector<string> { "Numbers", "Alphabets" }
                              vector<string> { "1", "a" },
                              vector<string> { "2", "b" },
                              vector<string> { "3", "a" } }

// First Column Name: Fruits
// Second Column Name: Alphabets
vector<vector<string>> v2 = { vector<string> { "Fruits", "Alphabets" }
                              vector<string> { "apple", "a" },
                              vector<string> { "pear", "a" },
                              vector<string> { "peach", "c" },
                              vector<string> { "orange", "b" } }

So if I want to do a inner join via alphabets, (like the solution in the link, I can pass two arguments which is the index of the columns I want to compare) and get the following output:

vector<vector<string>> result = { vector<string> { "Numbers", "Alphabets", "Fruits" }
                                  vector<string> { "1", "a", "apple" },
                                  vector<string> { "1", "a", "pear" },
                                  vector<string> { "3", "a", "apple" } 
                                  vector<string> { "3", "a", "pear" }, 
                                  vector<string> { "2", "b", "orange" }, }

As you can see, I currently also need the column names in the final table, in the sense that I need to know that column 1 refers to Numbers, column 2 is Alphabets, and column 3 is Fruits, as I would need to extract out the data in a specific column later in my code, kind of like a SQL "get Numbers from result" which I can easily do with a for loop.

How do I do so? I've been stuck on this since I'm sure when building the new table, I would need to omit the first row of headers for both table when doing the comparison etc, which makes things complicated if there a strings of the same value.

EDIT: My data does not necessarily have to be in that format of vector<vector>. How I do it is I have an API that gives me a vector once I pass it my column Name, so I can easily store this in an unordered_map as well. I modeled into a "table-like" structure after that solution.

My original code looked something like this:

joinTwoTable(vector<vector<string>>& v1, int columnIndex1, 
             vector<vector<string>>& v2, int columnIndex2) 
{
    vector<vector<string>> result;
    vector<string> header;
    
    // Add in the header_row
    for (string s : v1[0]) {
        current.push_back(s);
    }
    
    for (string s2 : v2[0]) {
        // Skip the common column again
        if (s2 == v2[0][columnIndex2]) {
            continue;
        }
        current.push_back(s2);
    }
    
    for (vector<string> v : v1) {
        for (vector<string> v2 : v2) {
            if (v[columnIndex1] == v2[columnIndex2]) {
                vector<string> current;
                
                for (string s : v) {
                    current.push_back(s);
                }
                
                for (string s2 : v2) {
                    // Skip the common column again
                    if (s2 == v2[columnIndex2]) {
                        continue;
                    }
                    current.push_back(s2);
                }
                result.push_back(current);
            }
        }
    }
} 

But the issue is also because here I can only inner join on one column, but what if I need to inner join on every common column names?

Thank you in advance!


Solution

  • In your code, you are actually computing for the column header names two times. First you are pushing them manually, then in the main loop you have for (vector<string> v : v1) which actually will start from the first row, which is again those column names.

    Instead you can use a normal for loop and start from the second row.

    Another problem is here

    for (string s2 : v2) {
                        // Skip the common column again
                        if (s2 == v2[columnIndex2]) {
                            continue;
                        }
                        current.push_back(s2);
                    }
    

    Here you are assuming that no other row item has the same string value as the row item of common column, which is not a good assumption at all. This worked in case of headings as column names have to be unique in SQL.

    Here is my try

    joinTwoTable(vector<vector<string>>& v1, int columnIndex1, vector<vector<string>>& v2, int columnIndex2) 
    {
        vector<vector<string>> result;
        vector<string> columns;
    
        string commonColumn = v1[0][columnIndex1];
    
        for(string s: v1[0]) {
            columns.push_back(s);
        }
        for(string s: v2[0]) {
            if(s == commonColumn) continue;
            columns.push_back(s);
        }
    
        // push headers
        result.push_back(columns);
    
        for(int i=1; i!=v1.size(); i++) {
    
            string joiner = v1[i][columnIndex1];
    
            for(int j=1; j!=v2.size(); j++) {
                if(v2[j][columnIndex2] == joiner) {
    
                    // push a new joined row
                    vector<string> temp;
    
                    for(string s1: v1[i]) {
                        temp.push_back(s1);
                    }
                    for(auto k=0; k!=v2[j].size(); k++) {
                        if(k==columnIndex2) continue;
                        temp.push_back(v2[j][k]);
                    }
    
                    result.push_back(temp);
                }
            }
        }
    
        return result;
    }