I have a huge set of binary strings, all of the same length. I know the 1 step neighborhood of every binary string (meaning, all the other binary strings that differ by only 1 bit). This relationships essentially creates a graph. Now, starting from any node in the graph, I would like to know what is the minimum number of steps I have to take before I reach a differing node based on some derived value (not based on the binary string. I'm using some dummy function GetABC() in my example below). If we meet a dead end before finding a different node, we assume the minimum number of steps to reach a different node is infinite. A dead end is when no new nodes can be visited (based on the history of visited nodes). If our entire population has the same derived value, then the worst case is we would visit every node then return infinite (-1).
My first attempt was using recursive function and keeping track of all visited nodes through the recursion by passing a reference to a vector of visited nodes. It worked with small dataset, but with 1 million binary string I ran into stack overflow error (literally).
Then, I refactored my code to run in loops instead by building iteratively the set of neighbors to visit then going over all of them and check if any has a different derived value (GetABC()), then start again from all the neighbors we just visited. Again, works on a small dataset but with 1 million records my code has been running for 4 days and still hasn't completed (using parallel for_each on 12 threads to do this analysis on all nodes).
//Previously had this function as a recursive algorithm, but it triggered stack overflow on big dataset
int MyClass::FindDifferentNodeInNeighborhood2(uint64_t node_hash) {
std::vector<uint64_t> visited;//init empty visited vector
visited.push_back(node_hash);
int level = 0;
bool found_diff_node = false;
std::unordered_map<int, std::vector<std::pair<uint64_t, uint64_t>>> to_visit;//<level, vec<node_to_visit, parent_node>>
to_visit[level].push_back(std::make_pair(node_hash, 0));//0 because no parent. Not used in first loop anyway so the value doesn't matter
while (!found_diff_node && !to_visit[level].empty()) {
for (auto n : to_visit[level]) {
//_negihborhood is a member variable of MyClass defined as
//std::unordered_map<uint64_t, std::unordered_set<uint64_t>> _neighborhood;
for (const auto neighbor : _neighborhood[n.first]) {
//make sure we didn't visit this neighbor already
if (std::find(visited.begin(), visited.end(), neighbor) == visited.end()) {
//if not, add to next set of node to visit
to_visit[level + 1].push_back(std::make_pair(neighbor, n.first));
}
}
}
level++;
for (auto n : to_visit[level]) {
//auto node = n.first;
//auto parent = n.second;
visited.push_back(n.first);
//_population is the map of the actual binary strings objects
//We retrieve the object using it's hash
//GetABC() is just a function that gets a specific value from the node
if (_population[n.first].GetABC() != _population[n.second].GetABC()) {
found_diff_node = true;
break;
}
}
}
if (!found_diff_node) {
return -1;
}
return level;
}
I will debug further to see if it's not an issue with infinite loop, but I thought I should ask here if I'm not overlooking some easy algorithm that could achieve this.
thanks
Found out my first loops where adding the same node multiple times since from the set of nodes to visit it is very likely that some of them have a few neighbors in common. Easily fixed by erasing duplicates in the set of nodes to visit next. I also implemented the suggestions made in the other answer. Here is the result:
int MyClass::FindDifferentNodeInNeighborhood2(uint64_t node_hash) {
std::unordered_set<uint64_t> visited;//init empty visited vector
visited.insert(node_hash);
int level = 0;
bool found_diff_node = false;
std::vector<std::pair<uint64_t, uint64_t>> visit_now;
std::vector<std::pair<uint64_t, uint64_t>> visit_next;
visit_now.emplace_back(std::make_pair(node_hash, 0));
while (!found_diff_node && !visit_now.empty()) {
for (auto n : visit_now) {
for (const auto neighbor : _neighborhood[n.first]) {
if(visited.find(neighbor) == visited.end()){
visit_next.emplace_back(std::make_pair(neighbor, n.first));
}
}
}
level++;
//Remove duplicates
std::sort(visit_next.begin(), visit_next.end());
visit_next.erase(std::unique(visit_next.begin(), visit_next.end()), visit_next.end());
visit_now.clear();
visit_now = std::move(visit_next);
visit_next.clear();
for (auto n : visit_now) {
visited.insert(n.first);
if (_population[n.first].GetABC() != _population[n.second].GetABC()) {
found_diff_node = true;
break;
}
}
}
if (!found_diff_node) {
return -1;
}
return level;
}