I have a column (category) in my df that contains either "Yes" or "No". To create a more balanced sample, I want to select the first 500 rows where category is "Yes" and the first 500 rows where it is "No".
I've tried this code:
top_n(df,500, category=="Yes")
But this selects ALL cases of "Yes" instead of only the first 500. I also tried the following, which gave me an error (though I'm fairly sure it makes no sense anyway):
df %>% filter(top_n(500, category == "Yes") & top_n(500, category=="No"))
Could someone point me in the right direction?
I'd probably just use head() for this, and filter directly on the data frame:
df1 <- head(df[df$category == "Yes",], 500)
df2 <- head(df[df$category == "No",], 500)
# to combine
out <- rbind(df1, df2)
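As far as I can tell, top_n() won't do this on its own: it ranks rows by a weighting variable and keeps ties, so passing category == "Yes" as the weight returns every "Yes" row once there are at least 500 of them. I expect there is a nicer way with dplyr, but this should work :)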
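For the dplyr route, something along these lines might do it. This is only a minimal sketch, assuming dplyr 1.0.0 or later (where slice_head() is available); the toy df and its id column are made up for illustration and are not from the question.

```r
library(dplyr)

# Toy data, made up for illustration: 1000 "Yes" rows and 2000 "No" rows
df <- data.frame(
  id = 1:3000,
  category = rep(c("Yes", "No", "No"), times = 1000)
)

# Keep the first 500 rows of each category,
# preserving their original order within each group
out <- df %>%
  group_by(category) %>%
  slice_head(n = 500) %>%
  ungroup()

# Check that the result is balanced
table(out$category)
```

If a category has fewer than 500 rows, slice_head() simply returns all of them, so the sample is only balanced when both groups have at least 500 rows.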