python c++algorithm optimization string-matching

C++ function returns extremely slowly, far slower than functionally equivalent Python code

I have a function that is used in a script that I am writing to remove redundant blocking keywords from a list. Basically, with the input (in any order) of:

{"apple", "bapple", "banana", "cherry", "bananaman", "sweetherrypie", "sweet", "b"}

It should output a shrunk array of strings (in any order) of:

{"apple", "cherry", "sweet", "b"}

This is done to reduce the number of keywords a matching algorithm needs to pass through as a personal learning project (If a string contained "banana", it would also contain "b" so "banana" is unnecessary, and can be discarded

It needed to be extremely performant, as it had to run on lists of over 10,000,000 strings.

Initially, I wrote some Python code to do this:

def deduplicate_keywords(keywords):
    def inner_loop(unique_keywords, keyword):
        for unique_keyword in unique_keywords:
            # Check if the current keyword includes any of the unique keywords
            if unique_keyword in keyword:
                return  # Match so it is not added to unique keywords
        # If the keyword does not include any of the shorter keywords, add it to unique_keywords
        unique_keywords.append(keyword)
        return

    # Remove duplicates and sort the keywords by length in ascending order
    keywords = sorted(set(keywords), key=len)
    unique_keywords = []

    for keyword in keywords:
        inner_loop(unique_keywords, keyword)
    return unique_keywords

And a performance benchmark to measure its performance:

class TestDeduplicateKeywords(unittest.TestCase):
    def test_performance_intensive_case(self):
        keywords = [
            "abcdef",
            "def",
            "ghidef",
            "jkl",
            "mnop",
            "qrst",
            "uvwxyz",
            "ghijkl",
            "abcdefgh",
            "ijklmnop",
            "mnopqrstuv",
            "rstuvwx",
            "mnopqrstuvwx",
            "uvwx",
        ]

        # Extending the list to make it very long and complex
        base_keywords = keywords[:]
        for i in range(200000):
            base_keywords.extend([f"{k}{i}" for k in keywords])
            base_keywords.extend([f"{i}{k}" for k in keywords])
            base_keywords.extend([f"{i}{k}{i}" for k in keywords])

        keywords = base_keywords

        old_function_time = 0
        new_function_time = 0

        for repeat in range(100): 
            # This loop (not present in the C++ code) is used to loop it multiple times to find an average
            start_time = time.time()
            result = deduplicate_keywords(keywords)
            end_time = time.time()
            old_function_time = (
                repeat * old_function_time + (end_time - start_time)
            ) / (repeat + 1)
            start_time = time.time()
            result = deduplicate_keywords_new(keywords) 
            # Above is a function that can be edited to check whether changes speed up the code
            end_time = time.time()
            new_function_time = (
                repeat * new_function_time + (end_time - start_time)
            ) / (repeat + 1)

        # As the list is extended with numbers, the result will be unique non-redundant keywords
        expected_result = ["def", "jkl", "mnop", "qrst", "uvwx"]

        self.assertEqual(set(result), set(expected_result))
        print(
            f"Test performance_intensive_case for old function took {old_function_time} seconds\n"
            f"Test performance_intensive_case for new function took {new_function_time} seconds\n"
        )

        if old_function_time < new_function_time:
            print(f"Old is faster than new")
        else:
            print(f"New is faster than old")


if __name__ == "__main__":
    unittest.main()

This was too slow for my needs at 5.677098312377932 seconds on my weak laptop (when ran 100 times to find an average time) given by the console log:

.Test performance_intensive_case for old function took 5.677098312377932 seconds

Therefore, I tried writing the same thing in C++ to see whether I would net a performance gain. It was so much slower, so I added some timing code to the function itself to see where the slowdown was. Here is the function:

#include <iostream>
#include <vector>
#include <string>
#include <set>
#include <algorithm>
#include <chrono>
#include <cassert>

std::vector<std::string> deduplicate_keywords(const std::vector<std::string> &keywords)
{
    auto t_start = std::chrono::steady_clock::now();
    auto inner_loop = [](std::vector<std::string> &unique_keywords, const std::string &keyword)
    {
        for (const auto &unique_keyword : unique_keywords)
        {
            if (keyword.find(unique_keyword) != std::string::npos)
            {
                return; // Match so it is not added to unique keywords
            }
        }
        // Otherwise, add it to unique_keywords
        unique_keywords.push_back(keyword);
    };

    // Remove duplicates using a set
    auto t1 = std::chrono::steady_clock::now();
    std::cout << "started old function!" << std::endl;
    std::set<std::string> unique_set(keywords.begin(), keywords.end());
    auto t2 = std::chrono::steady_clock::now();
    std::cout << "removed duplicates in " << std::chrono::duration<double>(t2 - t1).count() << " seconds" << std::endl;

    // Convert set back to vector and sort by length
    auto t3 = std::chrono::steady_clock::now();
    std::vector<std::string> sorted_keywords(unique_set.begin(), unique_set.end());
    std::sort(sorted_keywords.begin(), sorted_keywords.end(), [](const std::string &a, const std::string &b)
              { return a.length() < b.length(); });
    auto t4 = std::chrono::steady_clock::now();
    std::cout << "Sorted by length in " << std::chrono::duration<double>(t4 - t3).count() << " seconds" << std::endl;

    // Filter out redundant keywords
    auto t5 = std::chrono::steady_clock::now();
    std::vector<std::string> unique_keywords;
    for (const auto &keyword : sorted_keywords)
    {
        inner_loop(unique_keywords, keyword);
    }
    auto t6 = std::chrono::steady_clock::now();
    std::cout << "Filtered in " << std::chrono::duration<double>(t6 - t5).count() << " seconds" << std::endl;

    auto t_end = std::chrono::steady_clock::now();
    auto total_time = std::chrono::duration<double>(t_end - t_start).count();
    std::cout << "Total function time: " << total_time << " seconds" << std::endl;

    return unique_keywords;
}

Along with the performance benchmark:

using Time = std::chrono::steady_clock;
void test_performance_intensive_case()
{
    std::cout << "Keywords generation start." << std::endl;
    std::vector<std::string> keywords = {
        "abcdef", "def", "ghidef", "jkl", "mnop", "qrst", "uvwxyz",
        "ghijkl", "abcdefgh", "ijklmnop", "mnopqrstuv", "rstuvwx",
        "mnopqrstuvwx", "uvwx"};

    // Extending the list to make it very long and complex
    std::vector<std::string> base_keywords = keywords;
    for (int i = 0; i < 200000; ++i)
    {
        for (const auto &k : keywords)
        {
            base_keywords.push_back(k + std::to_string(i));
            base_keywords.push_back(std::to_string(i) + k);
            base_keywords.push_back(std::to_string(i) + k + std::to_string(i));
        }
    }

    auto keywords_extended = base_keywords;
    std::cout << "Keywords generated!" << std::endl;
    std::set<std::string> expected_result = {"def", "jkl", "mnop", "qrst", "uvwx"};

    // One time variable declaration for "for loop"
    // The total times are use to confirm that timing itself had no negative impact on the times
    double old_function_time = 0;
    double new_function_time = 0;
    double total_test_time = 0;
    double total_test_time_new = 0;
    double average_old_time = 0;
    double average_new_time = 0;
    Time::time_point start_time_total;
    Time::time_point start_time;
    Time::time_point end_time;
    Time::time_point end_time_total;
    Time::time_point start_time_total_new;
    Time::time_point start_time_new;
    Time::time_point end_time_new;
    Time::time_point end_time_total_new;
    std::cout << "vars declared and loop of code about to start" << std::endl;
    for (int repeat_no = 0; repeat_no < 5; ++repeat_no) // Loop to calculate average.
    {
        start_time_total = Time::now();
        start_time = Time::now();
        std::vector<std::string> result_old = deduplicate_keywords(keywords_extended);
        end_time = Time::now();
        old_function_time = std::chrono::duration<double>(end_time - start_time).count();
        end_time_total = Time::now();
        total_test_time = std::chrono::duration<double>(end_time_total - start_time_total).count();
        assert(std::set<std::string>(result_old.begin(), result_old.end()) == expected_result);
        average_old_time = (repeat_no * average_old_time + old_function_time) / (repeat_no + 1);
        std::cout << "Current average old time: " << average_old_time << " seconds" << std::endl;
        std::cout << "Old function time: " << old_function_time << " seconds" << std::endl;
        std::cout << "Total test case time: " << total_test_time << " seconds" << std::endl;

        start_time_total_new = Time::now();
        start_time_new = Time::now();
        std::vector<std::string> result_new = deduplicate_keywords_new(keywords_extended); // same as in the python code, useful for comparison
        end_time_new = Time::now();
        new_function_time = std::chrono::duration<double>(end_time_new - start_time_new).count();
        end_time_total_new = Time::now();
        total_test_time_new = std::chrono::duration<double>(end_time_total_new - start_time_total_new).count();
        assert(std::set<std::string>(result_new.begin(), result_new.end()) == expected_result);
        average_new_time = (repeat_no * average_new_time + new_function_time) / (repeat_no + 1);
        std::cout << "Current average new time: " << average_new_time << " seconds" << std::endl;
        std::cout << "New function time: " << new_function_time << " seconds" << std::endl;
        std::cout << "Total test case time: " << total_test_time_new << " seconds" << std::endl;
    }

    std::cout << "End of intense function runs." << std::endl;
    std::cout << "Average old function time: " << average_old_time << " seconds" << std::endl;
    std::cout << "Average new function time: " << average_new_time << " seconds" << std::endl;

    std::cout << "Test performance intensive case for old function took " << average_old_time
              << " seconds but new function took " << average_new_time << " seconds" << std::endl;

    if (average_old_time < average_new_time)
    {
        std::cout << "Old is faster than new" << std::endl;
    }
    else if (average_old_time == average_new_time)
    {
        std::cout << "They take the same time" << std::endl;
    }
    else
    {
        std::cout << "New is faster than old" << std::endl;
    }
}

int main() {
    test_performance_intensive_case();
    return 0;
}

This outputted to the terminal (shortened, removing the logs for the currently empty deduplicate_keywords_new function) this:

started old function!
removed duplicates in 9.64173 seconds
Sorted by length in 3.36138 seconds
Filtered in 0.366933 seconds
Total function time: 21.0037 seconds
old function time: 405.213 seconds
Total test case time: 405.213 seconds

Aside from the general speed reduction, the logs seem to indicate that returning the function (return unique keywords) seems to take extremely long. This can be deduced by the fact that std::cout << "old function time: " << old_function_time << " seconds" << std::endl; exists outside the function and so the only thing that could cause the slowdown is the return statement in the original code. Why is this? And what can I do to stop this? I am fairly new to C++ coding and development, so it would be great if your explanations were detailed and thorough.

Compilation and other factors

The compilation flags that I was using were -g -O3 -std=c++17. It would seem that this would not be the cause of part of the issue, however, it seems that it did. Because I don't use C++ that often, I stuck to using VSCode and used the C/C++ extension with a build task that contained those arguments. I did this for ease of use - now I could just click 1 button and the code would compile and run. Turns out that because I clicked that button, the code was run in a semi-debug mode, with no debug functionality, but the code was run equally slowly. This led to the horrendous times. This problem was mostly mitigated through the use of running the code manually through the terminal.

So lesson learned... when coding in C++ in VSCode, don't speed test code by clicking the run button, just build first and then run the executable, through the terminal.

This did provide a significant speedup. Now the code performs almost as fast, but still slower than the Python code, as can be seen in this new console log:

started old function!
removed duplicates in 3.2457 seconds
Sorted by length in 1.83285 seconds
Filtered in 0.327369 seconds
Total function time: 5.60095 seconds
old function time: 6.46725 seconds
Total test case time: 6.46725 seconds

Effects of running repeats

Now that my code's run time had reduced to a reasonable size, I decided to run 100 repeats of the C++ code to get an accurate picture of my code's run time. Turns out my initial code was somewhat of an outlier in terms of runtime and the most representative times that I could get was:

Average old function time: 5.96633 seconds

As you can see, the 5.96633 seconds it took is longer than the 5.7 seconds that the Python code took. Why does this gap still exist? And how could I rewrite the code to make it faster? The code in both seems logically identical at first glance.

Solution

The "sort and remove duplicates" part is unnecessarily complicated. It could be done in a few lines:

// Sort by length, and then alphabetically for strings of same length
std::sort(keywords.begin(), keywords.end(),
    [](const std::string& a, const std::string& b) {
        return std::forward_as_tuple(a.length(), a) <
               std::forward_as_tuple(b.length(), b);
    });
// Remove duplicates
keywords.erase(std::unique(keywords.begin(), keywords.end()), keywords.end());

Demo. Takes ~1.3 sec with the generation loop of 50,000 and an average over 5 runs (I had to reduce the parameters enough so the online compiler wouldn't time out).