Search code examples
rgraph-theorysubgraphedge-list

To find number of patterns


For a dataset like:

21 79 
78 245 
21 186 
65 522 
4 21 
3 4 
4 212 
4 881 
124 303 
28 653 
28 1231 
7 464 
7 52 
17 102 
16 292 
65 837 
28 203 
28 1689 
136 2216 
7 1342 
56 412 

I need to find the number of associated patterns. For example 21-79 and 21-186 have 21 in common. So they form 1 pattern. Also 21 is present in 4-21. This edge also contributes to the same pattern. Now 4-881, 4-212, 3-4 have 4 in their edge. So also contribute to the same pattern. Thus edges 21-79, 21-186, 4-21, 4-881, 4-212, 3-4 form 1 pattern. Similarly there are other patterns. Thus we need to group all edges that have any 1 node common to form a pattern (or subgraph). For the dataset given there are total 4 patterns.

I need to write code (preferably in R) that will find such no. of patterns.


Solution

  • Since you're describing the data as subgraphs, why not use the igraph package which is very knowledgeable about graphs. So here's your data in data.frame form

    dd <- structure(list(V1 = c(21L, 78L, 21L, 65L, 4L, 3L, 4L, 4L, 124L, 
      28L, 28L, 7L, 7L, 17L, 16L, 65L, 28L, 28L, 136L, 7L, 56L), V2 = c(79L, 
      245L, 186L, 522L, 21L, 4L, 212L, 881L, 303L, 653L, 1231L, 464L, 
      52L, 102L, 292L, 837L, 203L, 1689L, 2216L, 1342L, 412L)), .Names = c("V1", 
      "V2"), class = "data.frame", row.names = c(NA, -21L))
    

    We can treat each value as a vertex name so the data you provide is really like an edge list. Thus we create our graph with

    library(igraph)
    gg <- graph.edgelist(cbind(as.character(dd$V1), as.character(dd$V2)), 
        directed=F)
    

    That defines the nodes and vertex resulting in the following graph (plot(gg))

    graph data

    Now you wanted to know the number of "patterns" which are really represented as connected subgraphs in this data. You can extract that information with the clusters() command. Specifically,

    clusters(gg)$no
    # [1] 10
    

    Which shows there are 10 clusters in the data you provided. But you only want the ones that have more than two vertices. That we can get with

    sum(clusters(gg)$csize>2)
    # [1] 4
    

    Which is 4 as you were expecting.