Identifying and summarizing discrete groups of nodes in R

I am working on a network analysis problem related to family/household composition. I have multiple edge tables, each containing id1, id2 and a relationship code giving the type of relationship between the two IDs. These tables are large, upwards of 7 million rows each. I also have a node table which contains the same IDs and various attributes.

What I want to produce is a summary table (a cross-tabulation of households by number of adults and number of children), something like this:

                      Children

             1  2  3  4   total 
            --------------------
          1 | 1  0  1  0    2
            |
 Adults   2 | 3  5  4  1    13  
            |
          3 | 1  2  0  0    3
            |
      total | 5  7  5  1    18

Essentially I want to be able to identify and count distinct networks in my data.

My data is in the form:

             ID1  ID2   Relationship_Code

              X1   X2    Married 
              X1   X3    Parent/Child
              X1   X4    Parent/Child 
              X5   X6    Married
              X5   X7    Parent/Child 
              X6   X5    Married
               .    .     .
               .    .     .
               .    .     .

I also have a node table which contains date of birth and other variables from which adult/child status can be identified.

Any tips/hints on how to extract this summary information from the graph data frame would be very helpful and much appreciated.

Thanks

Solution

Part of the work needed to get your final table depends on the node table, which you have not shown, but I can get you most of the way there.

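I think the key to getting your result is identifying the households. You can do this in igraph with components(): each connected component of the graph is one household. Note that graph_from_edgelist() builds a directed graph by default, but components() uses weak connectivity by default, so edge direction does not affect the result. I will illustrate with a slightly more elaborate version of your example.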

Data:

Census = read.table(text="ID1  ID2   Relationship_Code
              X1   X2    Married 
              X2   X1    Married 
              X1   X3    Parent/Child
              X1   X4    Parent/Child 
              X2   X3    Parent/Child
              X2   X4    Parent/Child 
              X5   X6    Married
              X5   X7    Parent/Child 
              X6   X7    Parent/Child 
              X6   X5    Married
              X8   X9    Married
              X9   X8    Married",
    header=TRUE)

Now turn it into a graph, find the components and check by plotting.

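library(igraph)
# Use only the two ID columns as the edge list
EL = as.matrix(Census[,1:2])
Pop = graph_from_edgelist(EL)
# Each connected component is treated as one household
Households = components(Pop)
# Color the vertices by household to check the grouping visually
plot(Pop, vertex.color=rainbow(3, alpha=0.5)[Households$membership])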
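
Since you mention having several edge tables, here is a minimal sketch of stacking them before looking for components. Edges1, Edges2, AllEdges, G and HH are hypothetical names and the data are made up; I am assuming every table has the same three columns as in your question.

```r
library(igraph)

# Hypothetical edge tables standing in for the several large tables in the question
Edges1 = data.frame(ID1 = c("X1", "X1", "X2"),
                    ID2 = c("X2", "X3", "X3"),
                    Relationship_Code = c("Married", "Parent/Child", "Parent/Child"))
Edges2 = data.frame(ID1 = c("X5", "X2"),
                    ID2 = c("X6", "X1"),
                    Relationship_Code = c("Married", "Married"))

# Stack the tables; duplicated or reversed pairs do not change the components
AllEdges = rbind(Edges1, Edges2)

# The first two columns are used as edge endpoints, the rest become edge attributes
G = graph_from_data_frame(AllEdges, directed = FALSE)
HH = components(G)

HH$no       # number of households
HH$csize    # number of people in each household
```

With millions of rows you would probably want to skip the plot and inspect HH$csize instead, for example with table(HH$csize).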

You said that you can identify whether each node is an adult or a child. I will assume that we have such a labeling. From it, it is easy to count the adults and the children in each household and to tabulate household composition by number of adults and number of children.
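
If the labels have to be derived from the node table, something along these lines might work. This is only a sketch: the node table Nodes, its column names ID and DOB, the census date and the age-18 cut-off are all my assumptions, since the real node table is not shown.

```r
library(igraph)

# Small edge list so that the sketch runs on its own
EL = matrix(c("X1", "X2",
              "X1", "X3",
              "X5", "X6"), ncol = 2, byrow = TRUE)
Pop = graph_from_edgelist(EL, directed = FALSE)

# Hypothetical node table; ID and DOB are assumed column names,
# and the rows are deliberately not in vertex order
Nodes = data.frame(ID  = c("X6", "X5", "X3", "X2", "X1"),
                   DOB = as.Date(c("1982-03-01", "1980-07-15", "2012-05-20",
                                   "1979-11-02", "1978-01-30")))

# Assumed rule: adult if born on or before this date
# (18 years before an assumed census date of 2021-03-21)
Cutoff = as.Date("2003-03-21")

# match() aligns the rows of the node table with the graph's vertex order
idx = match(V(Pop)$name, Nodes$ID)
V(Pop)$AdultChild = ifelse(Nodes$DOB[idx] <= Cutoff, "A", "C")
```

The match() step matters because the node table will generally not be sorted in the same order as the vertices. In the toy example here, the labels are simply typed in.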

# Labels are given in vertex order (X1, ..., X9); in practice they would come from the node table
V(Pop)$AdultChild = c('A', 'A', 'C', 'C', 'A', 'A', 'C', 'A', 'A')
AdultsByHousehold = aggregate(V(Pop)$AdultChild, list(Households$membership), 
    function(p) sum(p=='A'))
AdultsByHousehold
  Group.1 x
1       1 2
2       2 2
3       3 2

ChildrenByHousehold = aggregate(V(Pop)$AdultChild, list(Households$membership), 
    function(p) sum(p=='C'))
ChildrenByHousehold
  Group.1 x
1       1 2
2       2 1
3       3 0

table(AdultsByHousehold$x, ChildrenByHousehold$x)
    0 1 2
  2 1 1 1

In my toy example, all households have two adults, so the table has only one row.
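
To get closer to the layout in your question (empty categories kept, plus totals), one option is to fix the factor levels and add margins. This is a sketch rather than the only way to do it: the membership and label vectors are written out by hand so that it runs without igraph, and the ranges 1:3 and 0:4 are assumptions that you would adapt to your data.

```r
# Household membership and adult/child labels for the nine people in the example
Membership = c(1, 1, 1, 1, 2, 2, 2, 3, 3)
AdultChild = c("A", "A", "C", "C", "A", "A", "C", "A", "A")

# One cross-tabulation gives adults and children per household at once;
# fixing the levels guarantees both columns exist even if one label is absent
Counts = table(Membership, factor(AdultChild, levels = c("A", "C")))

# Fixed factor levels keep empty categories in the final table.
# Households outside the assumed ranges would become NA and be dropped.
Tab = table(Adults   = factor(Counts[, "A"], levels = 1:3),
            Children = factor(Counts[, "C"], levels = 0:4))

# addmargins() appends row and column totals, labelled "Sum"
addmargins(Tab)
```

With the real graph you would use Households$membership and V(Pop)$AdultChild in place of the two hand-written vectors.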