Search code examples
rdplyrunique

Find first example of unique values and return row number


I have this data frame:

df <- structure(list(Name = c("Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", 
                          "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", 
                          "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub1", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", 
                          "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", 
                          "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2", "Sub2"), 
                 StimulusName = c("Stim1", "Stim1", "Stim1", "Stim1", "Stim1", "Stim1", "Stim1", "Stim1", "Stim1", "Stim1", "Stim1", "Stim1", 
                                  "Stim1", "Stim1", "Stim1", "Stim1", "Stim1", "Stim1", "Stim1", "Stim2", "Stim2", "Stim2", "Stim2", "Stim2", 
                                  "Stim2", "Stim2", "Stim2", "Stim2", "Stim2", "Stim2", "Stim2", "Stim2", "Stim2", "Stim2", "Stim2", "Stim2", 
                                  "Stim2", "Stim2", "Stim2", "Stim1", "Stim1", "Stim1", "Stim1", "Stim1", "Stim1", "Stim1", "Stim1", "Stim1", 
                                  "Stim1", "Stim1", "Stim1", "Stim1", "Stim1", "Stim1", "Stim1", "Stim1", "Stim1", "Stim1", "Stim1", "Stim2", 
                                  "Stim2", "Stim2", "Stim2", "Stim2", "Stim2", "Stim2", "Stim2", "Stim2", "Stim2", "Stim2", "Stim2", "Stim2", 
                                  "Stim2", "Stim2", "Stim2", "Stim2", "Stim2", "Stim2", "Stim2"), 
                 Fixation = c(NA, NA, 1L, 1L, NA, NA, 2L, 2L, 3L, 3L, NA, NA, NA, NA, NA, 4L, 4L, 5L, 5L, NA, NA, NA, NA, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
                              2L, NA, NA, NA, 3L, 3L, 3L, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, 1L, 2L, 2L, NA, NA, 3L, 3L, 3L, 4L, 4L, 4L, NA, NA, 1L, 1L, NA, 
                              NA, 2L, 2L, 3L, 3L, NA, NA, NA, NA, NA, 4L, 4L, 5L, 5L, NA)), 
            row.names = c(NA, -79L), class = c("tbl_df", "tbl", "data.frame"))

There are 3 columns: Name, StimulusName, and Fixation.

I'd like to be able to return the row number for the first examples of unique values in column Fixation and group these by Name, and StimulusName.

Here's what I've tried so far (based on a partial solution found elsewhere):

# function to return rows
Unique_Indices <- function(Values){
  unik <- !duplicated(Values)  ## logical vector of unique values
  return(seq_along(Values)[unik])  ## indices
}

But when I use it with a dplyr chain it doesn't return original row numbers but rather starts the row count afresh by the grouping:

library(tidyr)

# This doesn't work
Unique_Index <- df %>%
  group_by(Name, StimulusName) %>%
  summarise(Indices = list(Unique_Indices(Fixation))) %>%
  unnest()

The incorrect output looks like this:

enter image description here

You can see that Indices doesn't contain the original row numbers once it moves onto the next StimulusName dues to the group_byinstruction. Is there any way I can group_by as I desire while retaining the original row number from the df?


Solution

  • You can filter directly the non-duplicated values of Fixation per group and first turn the rownames to a proper column to keep the indices.

    library(dplyr)
    library(tibble)
    
    df %>% 
      rownames_to_column() %>% 
      group_by(Name, StimulusName) %>%
      filter(!duplicated(Fixation))
    
    # A tibble: 21 x 4
    # Groups:   Name, StimulusName [4]
    #    rowname Name  StimulusName Fixation
    #    <chr>   <chr> <chr>           <int>
    #  1 1       Sub1  Stim1              NA
    #  2 3       Sub1  Stim1               1
    #  3 7       Sub1  Stim1               2
    #  4 9       Sub1  Stim1               3
    #  5 16      Sub1  Stim1               4
    #  6 18      Sub1  Stim1               5
    #  7 20      Sub1  Stim2              NA
    #  8 24      Sub1  Stim2               1
    #  9 28      Sub1  Stim2               2
    # 10 37      Sub1  Stim2               3
    # ... with 11 more rows
    

    With Ronak Shah's suggestion a dplyr-only solution can look like this:

    df %>% 
      mutate(Index = row_number()) %>% 
      group_by(Name, StimulusName) %>%
      filter(!duplicated(Fixation))