redgessna

Creating pairs from Group Membership Data


First of all, sorry for this probably stupid question, but I'm getting really frustrated and desperate after 7 hours of using Google and doing trial and error...

I have a list of user IDs and the groups they belong to. I need a list of all combinations of users that share a group (an edge list to create a network graph). I found this right away and was really happy because it's exactly what I need. I've never used R before, but it seemed like it could solve my problem very easily. The code provided in the other thread works perfectly fine as it is, but after I started to customize it for my needs, and especially for my data input, I ran into problems:

#import a csv, the column "group" consists of groupID, column "user" the userID
group <- read.csv("E:/r/input.csv")[ ,c('group')]
user <- read.csv("E:/r/input.csv")[ ,c('user')]
data.frame(group,user)

the output in R gives me this:

       group       user
1  596230112 1514748421
2  596230112 1529087841
3  596230112 1518194516
4  596230112 1514852264
5  596230112 1514748421
6  596230112 1511768387
7  596230112 1514748421
8  596230112 1514852264
9  596230112 1511768387
10 596231111 1535990615
11 596232665 1536087573
12 596232665 1488758238
13 596232665 1536087573
14 596234505 1511768387
15 596234505 1535990615

So far, so good! The next step should pair the users, e.g.

1514748421 -> 1529087841
1514748421 -> 1518194516

and so on... The code I used is:

#create pairs
pairs <- do.call(rbind, sapply(split(user, group), function(x) t(combn(x,2))))

The error I get is:

Error : cannot allocate vector of size 5.7 Gb
In addition: Warning messages:
1: In combn(x, 2) :
  Reached total allocation of 3981Mb: see help(memory.size)
2: In combn(x, 2) :
  Reached total allocation of 3981Mb: see help(memory.size)
3: In combn(x, 2) :
  Reached total allocation of 3981Mb: see help(memory.size)
4: In combn(x, 2) :
  Reached total allocation of 3981Mb: see help(memory.size)

The dataset I want to work with in the end is pretty big, but to start with I tried just the 15 user/group entries I posted above, and even that doesn't work... What am I not seeing here? The memory limit is already set to the maximum of my computer (4 GB), and I also did everything the help function or any R website suggested.

R version 3.3.1, Platform: x86_64-w64-mingw32/x64


Solution

  • The problem is

    combn(x,2)
    

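    When x is a single positive integer rather than a vector of several values, combn treats it as a count: it creates the sequence 1 ... x and returns all pairs from that sequence, which will be a huge array if x is large. This happens whenever a group has a single user in it. In the sample data, group 596231111 contains only user 1535990615, so combn tries to build the sequence 1 ... 1535990615. As an integer vector at 4 bytes per element, that sequence alone needs about 5.7 Gb, which matches the size in the error message.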

    A solution is to filter out all groups that only have one user:

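    #create pairs, skipping groups with fewer than two users
    #lapply always returns a list, so do.call(rbind, ...) also works
    #when every remaining group happens to have the same size
    groups <- Filter(function(x) length(x) > 1, split(user, group))
    pairs <- do.call(rbind, lapply(groups, function(x) t(combn(x, 2))))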
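
To see the pitfall and the fix side by side, here is a minimal sketch on made-up data. The data frame `df`, its values, and the helper name `pairs_for_group` are hypothetical and only serve as illustration. The call to `unique()` is an assumption that repeated memberships (the sample data lists some users several times in the same group) should not produce duplicate or self pairs; drop it if repeats are meaningful.

```r
# Hypothetical sample data with the same two columns as the question
df <- data.frame(
  group = c(1, 1, 1, 1, 2, 3, 3),
  user  = c(101, 102, 103, 101, 104, 105, 106)
)

# The pitfall: a single number is treated as the sequence 1..x,
# so this returns pairs drawn from 1:5, not from the value 5
n_pairs_from_scalar <- ncol(combn(5, 2))

# Hypothetical helper: all user pairs within one group
pairs_for_group <- function(x) {
  x <- unique(x)                    # assumption: repeated memberships are noise
  if (length(x) < 2) return(NULL)   # single-user groups yield no pairs
  t(combn(x, 2))                    # one row per pair
}

# Split users by group, build pairs per group, stack them into one edge list
by_group <- split(df$user, df$group)
pairs <- do.call(rbind, lapply(by_group, pairs_for_group))
colnames(pairs) <- c("from", "to")
```

Because rbind ignores NULL entries, the single-user group simply contributes no rows. Checking `length(x)` after `unique()` also covers a group whose only entries are the same user repeated, which the plain `length(x) > 1` filter would let through.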