
Intersecting two columns with different lengths


I have a dataset1 containing 5000 user_ids from Twitter. I want to intersect these user_ids with those in another dataset2, which also contains Twitter user_ids, and at the same time create a new column in dataset1 in which each user_id gets the score '1' (if it intersects) or '0' (if it does not). I tried the code below, but the new column 'intersect' just contains a few (seemingly random) zeros followed by a lot of NAs.

for(i in 1:ncol(data1)){
  
  #intersect with other data
  ids_intersect = intersect(data1$user_id, data2$user_id)
  if(length(ids_intersect == 0)){
    data1[i, "intersect"] <- 0 # no intersect
  } else {
    data1[i, "intersect"] <- 1 # intersect
  }
}

I also tried another approach, which I find more intuitive, but it fails because the two datasets have different numbers of rows ("replacement has 3172 rows, data has 5181"). As above, the intention is that the new column 'intersect' holds the score 1 if there is an intersect and 0/NA if there is not. However, I'm not sure how to implement that in the following code:

data1$intersect <- intersect(data1$user_id, data2$user_id)

Is there a way to assign either 1 or 0 to each user_id in a new column, depending on whether there is an intersect/match?


Solution

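  • A convenient option is mutate() from the dplyr package combined with base R's %in% operator, as follows. Note that the %<>% assignment pipe used in the code comes from the magrittr package, not from dplyr, so both packages need to be loaded.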

    Data

    data1 <- data.frame(user_id = c("Test1",
                                    "Test2",
                                    "Test4",
                                    "Test5"))
    data2 <- data.frame(user_id = c("Test1",
                                    "Test3",
                                    "Test4"))
    

    Code

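    library(dplyr)
    library(magrittr)  # provides the %<>% compound assignment pipe

    # 1 if the user_id also occurs in data2, otherwise 0
    data1 %<>%
      mutate(Existence = ifelse(user_id %in% data2$user_id, 1, 0))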
    

    Output

    > data1
      user_id Existence
    1   Test1         1
    2   Test2         0
    3   Test4         1
    4   Test5         0
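
The same flag can also be built in base R without loading any package. The block below is a minimal sketch on the toy data from this answer, not on the asker's real data, and it uses the column name 'intersect' from the question. It also hints at why the two attempts in the question likely went wrong: the loop runs over 1:ncol(data1) rather than over the rows, and length(ids_intersect == 0) has a misplaced parenthesis, so the condition is true whenever any id is shared and the few rows that are touched get 0 while the rest stay NA. The second attempt fails because intersect() returns only the shared ids, so its result is shorter than the column it is assigned to.

```r
# Toy data mirroring the example above (an assumption, not the real datasets)
data1 <- data.frame(user_id = c("Test1", "Test2", "Test4", "Test5"))
data2 <- data.frame(user_id = c("Test1", "Test3", "Test4"))

# intersect() returns only the ids present in both vectors, so its length
# generally differs from nrow(data1) and it cannot be assigned as a column
shared_ids <- intersect(data1$user_id, data2$user_id)
length(shared_ids) == nrow(data1)  # FALSE for this toy data

# %in% returns one TRUE/FALSE per element of its left-hand side,
# so the result always has exactly nrow(data1) entries
data1$intersect <- as.integer(data1$user_id %in% data2$user_id)

print(data1)
```

Because %in% is vectorised, no loop is needed, and as.integer() turns TRUE/FALSE into 1/0 directly.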