I'm a bit stuck on how to approach this problem. I've looked into the complete() function, but I'm not sure it works for my problem.
I have a data frame that is missing some data points that I would like to fill in. Basically, in this task, a participant reads a sentence word-by-word and responds YES or NO to a question. When they respond NO, the sentence ends and there are no data points left. What I'd like to do is fill in those missing data points with NOs.
Here is a sample data frame:
library(tibble)
test <- tibble(
  participant = c(001, 001, 001, 002, 002, 003, 003, 003, 003, 001, 002, 002, 003, 003, 003, 003, 003),
  sentence_id = c("dog_sentence", "dog_sentence", "dog_sentence", "dog_sentence", "dog_sentence", "dog_sentence", "dog_sentence", "dog_sentence", "dog_sentence", "plant_sentence", "plant_sentence", "plant_sentence", "plant_sentence", "plant_sentence", "plant_sentence", "plant_sentence", "plant_sentence"),
  word = c("the", "dog", "went", "the", "dog", "the", "dog", "went", "home.", "I", "I", "watered", "I", "watered", "my", "plants", "today."),
  word_position = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 1, 1, 2, 1, 2, 3, 4, 5),
  max_length = c(4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5),
  response = c("YES", "YES", "NO", "YES", "NO", "YES", "YES", "YES", "YES", "NO", "YES", "NO", "YES", "YES", "YES", "YES", "YES")
)
Which looks like this:
participant | sentence_id | word | word_position | max_length | response |
---|---|---|---|---|---|
001 | dog_sentence | the | 1 | 4 | YES |
001 | dog_sentence | dog | 2 | 4 | YES |
001 | dog_sentence | went | 3 | 4 | NO |
002 | dog_sentence | the | 1 | 4 | YES |
002 | dog_sentence | dog | 2 | 4 | NO |
003 | dog_sentence | the | 1 | 4 | YES |
003 | dog_sentence | dog | 2 | 4 | YES |
003 | dog_sentence | went | 3 | 4 | YES |
003 | dog_sentence | home. | 4 | 4 | YES |
001 | plant_sentence | I | 1 | 5 | NO |
002 | plant_sentence | I | 1 | 5 | YES |
002 | plant_sentence | watered | 2 | 5 | NO |
003 | plant_sentence | I | 1 | 5 | YES |
003 | plant_sentence | watered | 2 | 5 | YES |
003 | plant_sentence | my | 3 | 5 | YES |
003 | plant_sentence | plants | 4 | 5 | YES |
003 | plant_sentence | today. | 5 | 5 | YES |
I'd like to fill in the missing data points whenever a participant says NO, so that there is a row for every word position between the one at which they said NO and the max_length position. So, for participant 001's plant_sentence, there should be four rows filled in underneath that have word positions 2, 3, 4, and 5 and retain all of the other static information in columns 1-5 (bonus points if "response" can be filled in with NO for the new rows, but I can just do that after the fact if the new rows just say NA).
Thank you for your help in advance!
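Something like this should work (it needs dplyr 1.1.0 or newer, since reframe() was introduced in that version). Note that the word column stays NA in the newly added rows: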
library(dplyr)
group_by(test, participant, sentence_id, max_length) %>%
  reframe(word_position = seq_len(first(max_length))) %>%
  full_join(test) %>%
  mutate(response = coalesce(response, "NO"))
# Joining with `by = join_by(participant, sentence_id, max_length, word_position)`
# # A tibble: 27 × 6
# participant sentence_id max_length word_position word response
# <dbl> <chr> <dbl> <dbl> <chr> <chr>
# 1 1 dog_sentence 4 1 the YES
# 2 1 dog_sentence 4 2 dog YES
# 3 1 dog_sentence 4 3 went NO
# 4 1 dog_sentence 4 4 <NA> NO
# 5 1 plant_sentence 5 1 I NO
# 6 1 plant_sentence 5 2 <NA> NO
# 7 1 plant_sentence 5 3 <NA> NO
# 8 1 plant_sentence 5 4 <NA> NO
# 9 1 plant_sentence 5 5 <NA> NO
# 10 2 dog_sentence 4 1 the YES
# # ℹ 17 more rows
# # ℹ Use `print(n = ...)` to see more rows
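
Since the question mentions complete(), here is a rough sketch of how the same result could probably be reached with tidyr instead. It is not the approach from the answer above, just an alternative under a few assumptions: `demo` is a made-up, smaller stand-in for `test`, and `filled` is a name picked for illustration. The idea is to expand every participant/sentence combination to all positions up to the longest sentence, fill the new responses with NO, and then drop the positions that run past each sentence's own max_length.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for `test`: two participants, two sentences.
demo <- tibble(
  participant   = c(1, 1, 2, 1),
  sentence_id   = c("dog_sentence", "dog_sentence", "dog_sentence", "plant_sentence"),
  word          = c("the", "dog", "the", "I"),
  word_position = c(1, 2, 1, 1),
  max_length    = c(4, 4, 4, 5),
  response      = c("YES", "NO", "NO", "NO")
)

filled <- demo %>%
  complete(
    # only use participant/sentence/max_length combinations that occur in the data
    nesting(participant, sentence_id, max_length),
    # candidate positions: 1 up to the longest sentence in the data
    word_position = seq_len(max(max_length)),
    # rows created by complete() get "NO" as their response
    fill = list(response = "NO")
  ) %>%
  # drop positions beyond the end of each sentence
  filter(word_position <= max_length)

print(filled)
```

As in the answer above, `word` is left as NA in the added rows. If the words are needed as well, they would have to be joined in from a separate lookup of sentence_id and word_position.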