I'm looking for help with R. I want to add three columns to existing data frames that contain time series data and have a lot of NA values. The data is about test scores. The first column I want to add is the first test score available. In the second column, I want the last test score available. In the third column, I want to calculate the derivative for each row by dividing the difference between the first and last scores by the number of tests that have passed. Important is that some of these past tests have NA values but I still want to include these when dividing. However, NA values that come after the last available test score I don't want to count.
Some explanation of my data: A have a couple of data frames that all have test scores of different people. The different people are the rows and each column represents a test score. There are multiple test scores per person for the same test in the data frame. Column T1 shows their first score, T2 the second score, which was gathered a week later, and so on. Some people have started sooner than others and therefore have more test scores available. Also, some scores at the beginning and the middle are missing for various reasons. See the two example tables below where the index column is the actual index of the data frame and not a separate column. Some numbers are missing from the index (like 3) because this person had only NA values in their row, which I removed. It is important for me that the index stays this way.
Example 1 (test A):
INDEX | T1 | T2 | T3 | T4 | T5 | T6 |
---|---|---|---|---|---|---|
1 | NA | NA | NA | 3 | 4 | 5 |
2 | 57 | 57 | 57 | 57 | NA | NA |
4 | 44 | NA | NA | NA | NA | NA |
5 | 9 | 11 | 11 | 17 | 12 | NA |
Example 2 (test B):
INDEX | T1 | T2 | T3 | T4 |
---|---|---|---|---|
1 | NA | NA | NA | 17 |
2 | 11 | 16 | 20 | 20 |
4 | 1 | 20 | NA | NA |
5 | 20 | 20 | 20 | 20 |
My goal now is to add to these data frames the three columns mentioned before. For example 1 this would look like:
INDEX | T1 | T2 | T3 | T4 | T5 | T6 | FirstScore | LastScore | Derivative |
---|---|---|---|---|---|---|---|---|---|
1 | NA | NA | NA | 3 | 4 | 5 | 3 | 5 | 0.33 |
2 | 57 | 57 | 57 | 57 | NA | NA | 57 | 57 | 0 |
4 | 44 | NA | NA | NA | NA | NA | 44 | 44 | 0 |
5 | 9 | 11 | 11 | 17 | 12 | NA | 9 | 12 | 0.6 |
And for example 2:
INDEX | T1 | T2 | T3 | T4 | FirstScore | LastScore | Derivative |
---|---|---|---|---|---|---|---|
1 | NA | NA | NA | 17 | 17 | 17 | 0 |
2 | 11 | 16 | 20 | 20 | 11 | 20 | 2.25 |
4 | 1 | 20 | NA | NA | 1 | 20 | 9.5 |
5 | 20 | 20 | 20 | 20 | 20 | 20 | 0 |
I hope I have made myself clear and that someone can help me, thanks in advance!
I think you can use the following solution. It surprisingly turned out to be a little verbose and convoluted but I think it is quite effective. I assumed that if the Last
available score is not actually the last T
, so I need to detect its index and divide the difference by it meaning NA
values after the last one do not count. Otherwise it is divided by the number of all T
s available.
library(dplyr)
library(purrr)
df %>%
select(T1:T6) %>%
pmap(., ~ {x <- c(...)[!is.na(c(...))]; c(x[1], x[length(x)])}) %>%
exec(rbind, !!!.) %>%
as_tibble() %>%
set_names(c("First", "Last")) %>%
bind_cols(df) %>%
relocate(First, Last, .after = last_col()) %>%
rowwise() %>%
mutate(Derivative = ifelse(!is.na(T6) & T6 == Last, (Last - First)/(length(df)-1),
(Last - First)/last(which(c_across(T1:T6) == Last))))
# First Sample Data
# A tibble: 4 x 10
# Rowwise:
INDEX T1 T2 T3 T4 T5 T6 First Last Derivative
<int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
1 1 NA NA NA 3 4 5 3 5 0.333
2 2 57 57 57 57 NA NA 57 57 0
3 4 44 NA NA NA NA NA 44 44 0
4 5 9 11 11 17 12 NA 9 12 0.6
Second Sample Data
df2 %>%
select(T1:T4) %>%
pmap(., ~ {x <- c(...)[!is.na(c(...))]; c(x[1], x[length(x)])}) %>%
exec(rbind, !!!.) %>%
as_tibble() %>%
set_names(c("First", "Last")) %>%
bind_cols(df2) %>%
relocate(First, Last, .after = last_col()) %>%
rowwise() %>%
mutate(Derivative = ifelse(!is.na(T4) & T4 == Last, (Last - First)/(length(df2)-1),
(Last - First)/last(which(c_across(T1:T4) == Last))))
# A tibble: 4 x 8
# Rowwise:
INDEX T1 T2 T3 T4 First Last Derivative
<int> <int> <int> <int> <int> <int> <int> <dbl>
1 1 NA NA NA 17 17 17 0
2 2 11 16 20 20 11 20 2.25
3 4 1 20 NA NA 1 20 9.5
4 5 20 20 20 20 20 20 0