Search code examples
phppythonalgorithmsynonymdata-analysis

Synonym finder algorithm


I think example will be much better than loooong description :)

Let's assume we have an array of arrays:

("Server1", "Server_1", "Main Server", "192.168.0.3")
("Server_1", "VIP Server", "Main Server")
("Server_2", "192.168.0.4")
("192.168.0.3", "192.168.0.5")
("Server_2", "Backup")

Each line contains strings which are synonyms. And as a result of processing of this array I want to get this:

("Server1", "Server_1", "Main Server", "192.168.0.3", "VIP Server", "192.168.0.5")
("Server_2", "192.168.0.4", "Backup")

So I think I need a kind of recursive algorithm. Programming language actually doesn't matter — I need only a little help with idea in general. I'm going to use php or python.

Thank you!


Solution

  • This problem can be reduced to a problem in graph theory where you find all groups of connected nodes in a graph.

    An efficient way to solve this problem is doing a "flood fill" algorithm, which is essentially a recursive breath first search. This wikipedia entry describes the flood fill algorithm and how it applies to solving the problem of finding connected regions of a graph.

    To see how the original question can be made into a question on graphs: make each entry (e.g. "Server1", "Server_1", etc.) a node on a graph. Connect nodes with edges if and only if they are synonyms. A matrix data structure is particularly appropriate for keeping track of the edges, provided you have enough memory. Otherwise a sparse data structure like a map will work, especially since the number of synonyms will likely be limited.

    • Server1 is Node #0
    • Server_1 is Node #1
    • Server_2 is Node #2

    Then edge[0][1] = edge[1][0] = 1, indicated that there is an edge between nodes #0 and #1 ( which means that they are synonyms ). While edge[0][2] = edge[2][0] = 0, indicating that Server1 and Server_2 are not synonyms.

    Complexity Analysis

    Creating this data structure is pretty efficient because a single linear pass with a lookup of the mapping of strings to node numbers is enough to crate it. If you store the mapping of strings to node numbers in a dictionary then this would be a O(n log n) step.

    Doing the flood fill is O(n), you only visit each node in the graph once. So, the algorithm in all is O(n log n).