
Is there a way to balance data in R without reordering a dataframe?


First, here is some toy data:

df <- data.frame(
  "stim" = c("face", "object", "pareidolia", "face", "face", "object", "pareidolia", "object"),
  "RT" = c(23, 24, 25, 26, 27, 22, 25, 23),
  "Opac" = c(70, 60, 80, 65, 60, 61, 59, 70)
)

I want the dataset to contain an equal number of rows for each level of stim. I am using the following code to attempt this:

library(dplyr)

newdf <- df %>%
  mutate(mn = min(table(stim))) %>%
  group_by(stim) %>%
  sample_n(mn[1]) %>%
  ungroup()

This works almost perfectly, except that it reorders the data. My desired output would look like the following:

stim        RT  Opac
face        23  70
object      24  60
pareidolia  25  80
face        26  65
object      22  61
pareidolia  25  59

But this code outputs this:

stim        RT  Opac
face        23  70
face        26  65
object      24  60
object      22  61
pareidolia  25  80
pareidolia  25  59

I realize that this is likely happening because I am using table(), but I'm not sure how else to go about this. Any suggestions would be appreciated.

Also, a bonus side question: is there a way (a function, code snippet, etc.) to determine which row numbers are dropped as part of this process?


Solution

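  • You could use a filtering strategy rather than sample_n(), which returns the rows sorted by group. Within each group, sample(seq_along(stim)) is a random permutation of 1..n, so keeping the rows whose permuted index is <= mn retains exactly mn rows per group without changing their original order.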

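    library(dplyr)  # the .by argument of filter() needs dplyr >= 1.1.0

    df %>%
      mutate(mn = min(table(stim))) %>%  # size of the smallest stim group
      # keep mn randomly chosen rows per group, in their original order
      filter(sample(seq_along(stim)) <= mn, .by = stim)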
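For the bonus question, one possible approach is to record each row's original position before filtering and then compare against the full set of positions afterwards. The sketch below assumes dplyr >= 1.1.0; the helper column `row`, the object names `balanced` and `dropped_rows`, and the seed value are illustrative choices, not part of the original answer.

```r
library(dplyr)

# Toy data from the question
df <- data.frame(
  stim = c("face", "object", "pareidolia", "face", "face", "object", "pareidolia", "object"),
  RT   = c(23, 24, 25, 26, 27, 22, 25, 23),
  Opac = c(70, 60, 80, 65, 60, 61, 59, 70)
)

set.seed(1)  # arbitrary seed, only so the random draw is reproducible

balanced <- df %>%
  mutate(
    row = row_number(),       # original row position (hypothetical helper column)
    mn  = min(table(stim))    # size of the smallest stim group
  ) %>%
  # keep mn randomly chosen rows per group, leaving row order untouched
  filter(sample(seq_along(stim)) <= mn, .by = stim)

# Row numbers of the original data frame that were removed
dropped_rows <- setdiff(seq_len(nrow(df)), balanced$row)
dropped_rows
```

With this toy data the smallest group (pareidolia) has two rows, so two rows are dropped in total: one face and one object. Which ones depends on the random draw. The helper columns can be removed afterwards with `select(-row, -mn)` if they are not needed.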