
Is there a way to show the "zero-counts" by using dplyr on sample data?


Hello Ladies and Gentlemen, I have a problem summarizing my data sample: I also want to see the "zero-counts" that my current approach drops. My data looks like this:

library(dplyr)
set.seed(529)
sampledata <- data.frame(StartPos = rep(1:10, times = 10),
                         Velocity = sample(c(-36, 36), 100, replace = TRUE),
                         Response = c(sample(c("H", "M", "W"), 50, replace = TRUE),
                                      sample(c("M", "W"), 50, replace = TRUE)))

The data consists of 100 rows with start positions ranging from 1 to 10, each occurring 10 times (some could occur 20 times, e.g. StartPos 3). Each start position also has a response, which can be H for Hit, M for Miss or W for Wrong. It is possible that there are no H responses for certain start positions. There is also a column called Velocity with the values -36 and 36, which describe the direction of the stimulus that started at the given StartPos (-36 to the right, 36 to the left).

The only things I really care about here are the StartPos and Velocity values with hits, for the percentage calculation that follows.

To calculate the number of test trials that were run per side, I created the following filter/counter:

numbofrunsperside <- sampledata %>%
  mutate(Direction = case_when( # add direction
    Velocity < 0 ~ "Right",
    Velocity > 0 ~ "Left",
    TRUE ~ "None")) %>%
  group_by(StartPos, Direction) %>% # for each combination
  count(Velocity, .drop=FALSE) # count
numbofrunsperside

For the Hit-Counts with their respective StartPos and Direction (Left/Right):

sampledata_hit_counts <- sampledata %>%
  mutate(Direction = case_when( # add direction 
    Velocity < 0 ~ "Right",
    Velocity > 0 ~ "Left",
    TRUE ~ "None")) %>% 
  filter(Response == "H") %>% 
  group_by(StartPos, Direction, .drop=FALSE) %>% # for each combination 
  count(StartPos, .drop=FALSE) # count
sampledata_hit_counts

The problem occurs here: the numbofrunsperside data frame has 20 rows, while sampledata_hit_counts has only 12.

I get the following error message when I try to calculate the percentage of hits using:

sampledata_hit_counts$PTest = sampledata_hit_counts$n / numbofrunsperside$n

Error in $<-.data.frame(*tmp*, PTest, value = c(0.2, 0.2, 0.25, 0.166666666666667, :
  replacement has 20 rows, data has 12
In addition: Warning message:
In sampledata_hit_counts$n/numbofrunsperside$n :
  longer object length is not a multiple of shorter object length

One fix would be to include the "zero-counts" for the missing StartPos/Direction combinations in sampledata_hit_counts, so that both data frames have the same number of rows. Sadly, I don't know how to do this. Help would be greatly appreciated!


Solution

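  • You can do a left join. It keeps every row of numbofrunsperside, leaves n_hits as NA where there is no matching row in sampledata_hit_counts, and those NAs are then treated as 0: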

    library(dplyr)
    
    numbofrunsperside %>%
        left_join(
            sampledata_hit_counts, 
            by = c("StartPos", "Direction"), 
            suffix = c("_runs", "_hits")
        ) %>% 
        mutate(
            p_test = ifelse(is.na(n_hits), 0, n_hits) / n_runs
        ) %>% 
        pull(p_test)
    #[1] 0.2000000 0.0000000 0.0000000 0.1666667 0.0000000 0.0000000 0.3333333 0.1428571 0.0000000 0.1250000 0.1666667 0.5000000 0.2000000
    #[14] 0.4000000 0.1666667 0.0000000 0.0000000 0.3333333 0.5000000 0.0000000
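
As a possible alternative to joining two separately counted tables, the runs and the hits can be counted in a single grouped summary, so every StartPos/Direction combination that has at least one run gets a row, with a hit count of 0 where nothing was hit. This is only a sketch of that idea, not part of the answer above: the name hit_rates and the columns n_runs, n_hits and p_test are chosen here for illustration, and it assumes that combinations with no runs at all do not need a row.

```r
library(dplyr)

# Same simulated data as in the question
set.seed(529)
sampledata <- data.frame(
  StartPos = rep(1:10, times = 10),
  Velocity = sample(c(-36, 36), 100, replace = TRUE),
  Response = c(sample(c("H", "M", "W"), 50, replace = TRUE),
               sample(c("M", "W"), 50, replace = TRUE))
)

# hit_rates is an illustrative name, not from the original post
hit_rates <- sampledata %>%
  mutate(Direction = case_when(        # add direction, as in the question
    Velocity < 0 ~ "Right",
    Velocity > 0 ~ "Left",
    TRUE ~ "None")) %>%
  group_by(StartPos, Direction) %>%    # for each combination
  summarise(
    n_runs = n(),                      # all trials in this combination
    n_hits = sum(Response == "H"),     # hits only; 0 when there are none
    p_test = n_hits / n_runs,          # share of hits
    .groups = "drop"
  )

hit_rates
```

Because no rows are filtered out before grouping, the zero counts never get lost, and the percentage sits in the same row as the counts it was computed from.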