r dplyr subset top-n

How to use top_n for conditional extraction


I have a column (category) in my df that contains either "Yes" or "No". To create a more balanced sample, I want to select the first 500 rows where it is "Yes" and the first 500 rows where it is "No".

I've tried this code:

top_n(df,500, category=="Yes")

But this selects ALL cases of "Yes" instead of only the first 500. I also tried the following, which gave me an error (I suspect it makes no sense anyway):

df %>% filter(top_n(500, category == "Yes") & top_n(500, category=="No"))

I need a bit of help in the right direction.


Solution

  • I'd probably just use head for this, and filter directly on the data frame

    df1 <- head(df[df$category == "Yes",], 500)
    df2 <- head(df[df$category == "No",], 500)
    
    # to combine
    out <- rbind(df1, df2)
    

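    top_n is not quite the same thing: it ranks rows by a weighting variable and keeps ties, which is why the call in the question returns every "Yes" row. I expect there is a nicer way with dplyr, but this should work :)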
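
For what it's worth, a dplyr version might look like the sketch below. It assumes dplyr 1.0.0 or later (for slice_head) and runs on a made-up toy data frame: the id column and the n_per_group variable are hypothetical, with n_per_group standing in for the 500 from the question.

```r
library(dplyr)

# Toy data for illustration only: 'id' is a hypothetical column,
# 'category' mirrors the column described in the question.
df <- data.frame(
  id = 1:10,
  category = c("Yes", "No", "Yes", "Yes", "No", "Yes", "No", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

n_per_group <- 3  # would be 500 for the real data

# Group by category, keep the first n rows of each group in their
# original order, then drop the grouping again.
out <- df %>%
  group_by(category) %>%
  slice_head(n = n_per_group) %>%
  ungroup()

print(out)
```

If a group has fewer than n_per_group rows, slice_head simply returns all of them, so the result may be unbalanced in that case.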