Search code examples
rdplyrmissing-data

Add missing rows to data frame in R based on a maximum length


I'm a bit stuck on how to approach this problem. I've looked into the complete() function, but I'm not sure it works for my problem.

I have a data frame that is missing some data points that I would like to fill in. Basically, in this task, a participant reads a sentence word-by-word and responds YES or NO to a question. When they respond NO, the sentence ends and there are no data points left. What I'd like to do is fill in those missing data points with NOs.

Here is a sample data frame:

test <- tibble(
  participant = c(001, 001, 001, 002, 002, 003, 003, 003, 003, 001, 002, 002, 003, 003, 003, 003, 003),
  sentence_id = c("dog_sentence", "dog_sentence", "dog_sentence", "dog_sentence", "dog_sentence", "dog_sentence", "dog_sentence", "dog_sentence", "dog_sentence", "plant_sentence", "plant_sentence", "plant_sentence", "plant_sentence", "plant_sentence", "plant_sentence", "plant_sentence", "plant_sentence"),
  word = c("the", "dog", "went", "the", "dog", "the", "dog", "went", "home.", "I", "I", "watered", "I", "watered", "my", "plants", "today."),
  word_position = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 1, 1, 2, 1, 2, 3, 4, 5),
  max_length = c(4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5),
  response = c("YES", "YES", "NO", "YES", "NO", "YES", "YES", "YES", "YES", "NO", "YES", "NO", "YES", "YES", "YES", "YES", "YES")
)

Which looks like this:

participant sentence_id word word_position max_length response
001 dog_sentence the 1 4 YES
001 dog_sentence dog 2 4 YES
001 dog_sentence went 3 4 NO
002 dog_sentence the 1 4 YES
002 dog_sentence dog 2 4 NO
003 dog_sentence the 1 4 YES
003 dog_sentence dog 2 4 YES
003 dog_sentence went 3 4 YES
003 dog_sentence home. 4 4 YES
001 plant_sentence I 1 5 NO
002 plant_sentence I 1 5 YES
002 plant_sentence watered 2 5 NO
003 plant_sentence I 1 5 YES
003 plant_sentence watered 2 5 YES
003 plant_sentence my 3 5 YES
003 plant_sentence plants 4 5 YES
003 plant_sentence today. 5 5 YES

I'd like to fill in the missing data points whenever a participant says NO so that each data point is filled in between the word position at which they said NO and the max_length position. So, for participant 001's plant_sentence, there should be three rows filled in underneath that have word positions 2, 3, 4, and 5 and retain all of the other static information in columns 1-5 (bonus points if "response" can be filled in with NO for the new rows, but I can just do that after the fact if the new rows just say NA).

Thank you for your help in advance!


Solution

  • Something like this (using dplyr_1.1.0 or newer):

    library(dplyr)
    group_by(test, participant, sentence_id, max_length) %>%
      reframe(word_position = seq_len(first(max_length))) %>%
      full_join(test) %>%
      mutate(response = coalesce(response, "NO"))
    # Joining with `by = join_by(participant, sentence_id, max_length, word_position)`
    # # A tibble: 27 × 6
    #    participant sentence_id    max_length word_position word  response
    #          <dbl> <chr>               <dbl>         <dbl> <chr> <chr>   
    #  1           1 dog_sentence            4             1 the   YES     
    #  2           1 dog_sentence            4             2 dog   YES     
    #  3           1 dog_sentence            4             3 went  NO      
    #  4           1 dog_sentence            4             4 <NA>  NO      
    #  5           1 plant_sentence          5             1 I     NO      
    #  6           1 plant_sentence          5             2 <NA>  NO      
    #  7           1 plant_sentence          5             3 <NA>  NO      
    #  8           1 plant_sentence          5             4 <NA>  NO      
    #  9           1 plant_sentence          5             5 <NA>  NO      
    # 10           2 dog_sentence            4             1 the   YES     
    # # ℹ 17 more rows
    # # ℹ Use `print(n = ...)` to see more rows