Tags: r, dplyr, intersect

Get intersection in data.frame of some variables without omitting others


I have a huge dataframe (15 million rows), e.g.

    data = data.frame(
       human = c(1,0,0,1,1,0,0,0,0,1,1),
       hair = c(3,1,5,3,1,1,3,4,4,5,5),
       eye_colour = c(1,4,2,1,4,3,1,3,3,3,3),  # 11th value is assumed; the original listed only 10, which makes data.frame() fail
       fuel = c(1,2,3,3,4,7,5,6,1,4,6)
    )

and I want to find the combinations of hair and eye_colour that occur for both human == 0 and human == 1. A row should be kept only if its hair and eye_colour combination appears at least once with human == 0 and at least once with human == 1, and each such combination should be marked with a cyclon_individual id. For my application, one cyclon_individual is therefore somebody who is recorded at least once as human == 1 and at least once as human == 0 with the same hair and eye_colour coding, i.e. the following result:

    cyclon_individual human hair eye_colour fuel
    1                 1     3    1          1
    1                 1     3    1          3
    1                 0     3    1          5
    2                 0     1    4          2
    2                 1     1    4          4

I think I have taken an awkward route, and I still haven't found a clever way to code cyclon_individual with dplyr:

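    library(dplyr)
    # Split the data by the human flag
    hum = subset(data, human == 1)
    non_hum = subset(data, human == 0)
    feature_intersection = c("hair", "eye_colour")

    # hair/eye_colour combinations that occur in both subsets
    cyclon = intersect(hum[,feature_intersection],non_hum[,feature_intersection])
    # For each combination, pull the matching rows from the full data (this is the slow part)
    cyclon_data = cyclon %>%
                    rowwise() %>%
                    do(filter(data,hair==.$hair,eye_colour==.$eye_colour))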

So is there a more direct way to get cyclon_data? The current code will take at least 26 hours on the full data. And is there a clever way to add the variable cyclon_individual without looping over all rows of cyclon?


Solution

  • You can simply group by hair and eye_colour and keep the groups where human takes both values, 0 and 1, i.e.

    library(dplyr)
    
    data %>% 
     group_by(hair, eye_colour) %>% 
     filter(length(unique(human)) > 1)
    

    which gives,

    # A tibble: 5 x 4
    # Groups:   hair, eye_colour [2]
      human  hair eye_colour  fuel
      <dbl> <dbl>      <dbl> <dbl>
    1     1     3          1     1
    2     0     1          4     2
    3     1     3          1     3
    4     1     1          4     4
    5     0     3          1     5
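
The question also asks for a cyclon_individual id, which the output above does not include. A minimal sketch of one way to add it, assuming dplyr 1.0.0 or later (for `cur_group_id()`) and assuming the 11th eye_colour value is 3 (the question lists only 10 values). The ids are arbitrary labels assigned in sorted order of hair and eye_colour, so they need not match the numbering in the question's expected output.

```r
library(dplyr)

# Example data from the question; the last eye_colour value is an assumption
data <- data.frame(
  human      = c(1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1),
  hair       = c(3, 1, 5, 3, 1, 1, 3, 4, 4, 5, 5),
  eye_colour = c(1, 4, 2, 1, 4, 3, 1, 3, 3, 3, 3),
  fuel       = c(1, 2, 3, 3, 4, 7, 5, 6, 1, 4, 6)
)

cyclon_data <- data %>%
  group_by(hair, eye_colour) %>%
  # keep only groups in which human takes more than one value (0 and 1)
  filter(length(unique(human)) > 1) %>%
  # number the remaining groups 1, 2, ... (one id per hair/eye_colour combination)
  mutate(cyclon_individual = cur_group_id()) %>%
  ungroup() %>%
  arrange(cyclon_individual) %>%
  # move the id to the first column, as in the question's layout
  select(cyclon_individual, everything())

print(cyclon_data)
```

Because the id is computed after the filter, it counts only the groups that survive, so no loop over the rows of cyclon is needed.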