I have a big structure with items and relations between the items. I need to find all transitive relations for all items. I duplicate all links and use transitive closure. E.g.:
A --- B --- C E --- F --- G
|
|
D
As a result I need to get the pairs:
A-B, A-C, A-D, B-A, B-C, B-D, C-A, C-B, C-D, D-A, D-B, D-C,
E-F, E-G, F-E, F-G, G-E, G-F
For using transitive closure I should use pairs [A-B, B-A, B-C, C-B, B-D, D-B, E-F, F-E, F-G, G-F]. It's big problem for me because the dataset is very large. The best way to solve my problem would be an algorithm, that allows get all relations using only one-side links (A-B, B-C, C-D, E-F, F-G). Are there any algorithms to get all relations for each element of the graph without duplicate links?
You may model this problem as a graph problem, and traverse the entire dataset you have, using either DFS(depth-first search) or BFS(breadth-first-search). During traversal, you may assign a component number to each tree in the forest of data you are investigating, and as a result, you can find all the connected components of this graph of data you have. Then for each connected compnent, you may simply form groups of 2 using its members, and use those to describe the relation. If there are odd number of elements, you can pick an already use item and link it to the last remaining one.
This obviously assumes that your goal is to find the connected components alone, and not print the relations, as you put it, in a specific manner. For instance, if you were trying to print the links so that the maximum distance between the items would be as minimal as possible, the problem becomes much more complex.
Another approach which shares the same assumption I mentioned above would be to use the method of union-find, also known as the data structure called disjoint set. You can start with N sets which have N of your items. Then, as you traverse these relations, for each relation (x, y)
, you unite the sets which contain the items x
and y
. In the end, all the connected components will be in the same set.
The first approach has O(V + E)
time complexity, V
and E
being the number of items and relations in your data, respectively. The second approach has O(V + E . k(V))
time complexity, where k is a function called Inverse Ackermann, that increases really slowly. (i.e. even slower than logarithmic function)