I have a huge dataframe (15 million rows), e.g.
data = data.frame(
human = c(1,0,0,1,1,0,0,0,0,1,1),
hair = c(3,1,5,3,1,1,3,4,4,5,5),
eye_colour = c(1,4,2,1,4,3,1,3,3,3),
fuel = c(1,2,3,3,4,7,5,6,1,4,6)
)
and I want to find the intersection of hair and eye_colour between human being 0 and 1 (that is, I want to keep a row only if its combination of hair and eye_colour occurs at least once with human==0 and at least once with human==1) and mark it with a cyclon_individual. So for my application, one cyclon_individual is somebody who is recorded at least once as human==1 and at least once as human==0 with the same hair and eye_colour coding, i.e. the following result:
cyclon_individual human hair eye_colour fuel
1 1 3 1 1
1 1 3 1 3
1 0 3 1 5
2 0 1 4 2
2 1 1 4 4
I think I have taken an awkward route, and I haven't yet found a clever way to code the cyclon_individual with dplyr:
require('dplyr')
hum = subset(data, human == 1)
non_hum = subset(data, human == 0)
feature_intersection = c("hair", "eye_colour")
cyclon = intersect(hum[,feature_intersection],non_hum[,feature_intersection])
cyclon_data = cyclon %>%
  rowwise() %>%
  do(filter(data, hair == .$hair, eye_colour == .$eye_colour))
So is there a more direct way to get to cyclon_data, since the current code would take at least 26 hours? And is there a clever way to include the variable cyclon_individual without looping over all rows of cyclon?
You can simply group by hair and eye_colour and keep the groups in which human takes both values 0 and 1, i.e.
library(dplyr)
data %>%
  group_by(hair, eye_colour) %>%
  filter(length(unique(human)) > 1)
which gives,
# A tibble: 5 x 4
# Groups:   hair, eye_colour [2]
  human  hair eye_colour  fuel
  <dbl> <dbl>      <dbl> <dbl>
1     1     3          1     1
2     0     1          4     2
3     1     3          1     3
4     1     1          4     4
5     0     3          1     5
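The question also asks for a cyclon_individual label, which the pipeline above does not add. One possible way, sketched here under the assumption that dplyr >= 1.0.0 is available (for cur_group_id()), is to number the groups that survive the filter. Note that eye_colour in the question has only 10 values, so the example pads it with an assumed 11th value of 3 purely to make the data frame constructible. The ids follow the sort order of the grouping columns, so they may not match the numbering in the question's expected result, but each hair/eye_colour combination still gets one distinct id.

```r
library(dplyr)

# Example data from the question. The 11th eye_colour value (3) is an
# assumption: the original vector has only 10 elements.
data <- data.frame(
  human      = c(1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1),
  hair       = c(3, 1, 5, 3, 1, 1, 3, 4, 4, 5, 5),
  eye_colour = c(1, 4, 2, 1, 4, 3, 1, 3, 3, 3, 3),
  fuel       = c(1, 2, 3, 3, 4, 7, 5, 6, 1, 4, 6)
)

cyclon_data <- data %>%
  group_by(hair, eye_colour) %>%
  # keep only combinations recorded with both human == 0 and human == 1
  filter(n_distinct(human) > 1) %>%
  # number the remaining groups; filter() has already dropped empty groups
  mutate(cyclon_individual = cur_group_id()) %>%
  ungroup() %>%
  # put the label first, as in the question's expected result
  select(cyclon_individual, everything())

print(cyclon_data)
```

This stays a single grouped pass over the data, so it avoids the row-by-row do() call entirely. Whether it is fast enough for 15 million rows depends on the number of distinct hair/eye_colour combinations and is worth timing on a sample first.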