I have data frames which contain an Age difference column (AgeDiff
). The data frame looks like:
library("dplyr")
test <- data.frame("Age1"=c(42, 48, 58, 25, 53, 55, 32, 58, 71, 24, 48, 48, 64, 55, 45, 55, 34, 33, 51, 22),
"Age2"=c(8, 2, 1, 16, 14, 1, 11, 14, 0, 5, 2, 10, 16, 13, 3, 4, 8, 13, 8, 5))
test <- test %>%
mutate(AgeDiff = Age1 - Age2)
For the function I am writing, the user can enter a rejection minimum difference and/or a rejection maximum difference. Any age differences smaller than/larger than either threshold create an "out of scope" age difference.
In the work I am doing, any age difference less than 18 years or older than 54 years is "out of scope". By coincidence (I used a random number generator for both sets of ages), there are exactly two age differences that are too young, and two age differences that are too old.
I can find the maximum of "too young" or "too old", for example by comparing
TooYoung <- test %>%
filter(AgeDiff < 18) %>%
summarise(Count = n()) %>%
pull(Count)
with
TooOld <- test %>%
filter(AgeDiff > 54) %>%
summarise(Count = n()) %>%
pull(Count)
and then just working out which value is larger. The larger of TooYoung
and TooOld
gives me the number of rows I need to swap in the test
data.
I can (pre)sort the test
data frame so that AgeDiff
is ascending in value:
test <- test %>%
arrange(AgeDiff)
That gives me the order I want, where the data frame is sorted from most extreme AgeDiff
minimum to most extreme AgeDiff
maximum. Now what I want to do is to swap the top-most and bottom-most Age1
values for the maximum of TooYoung
and TooOld
. Because I have 2 as my maximum "out of scope" minima/maxima counts in this exanple, I need to swap:
Age1
in row 1 with Age1
in row 20Age1
in row 2 with Age1
in row 19It does not matter if the swap results in an "out of scope" AgeDiff
.
The data frame for the swap can be any length. The number of swaps to be made can be any number, including 0. So the problem becomes, for any swap value >0,
Age1
in row 1 with Age1
in nrow(foo)
Age1
in row 2 with Age1
in nrow(foo)-1
AgeDiff
is going to be recalculated after the swaps are made. There are other variables in my data frame, such as Sex
, so it is critical that only the Age1
values are swapped.
Reordering of the rows is unimportant. The only requirement is for the solution to swap the correct pairs of Age1
values.
I've searched for similar questions, but the ones I have found have been quite different. The other questions are a two-row swap for an initial percentage of the data frame, a swap of two known values for each other, swapping entire rows, swap of two randomly selected rows, value swaps based on grouping variables. In my problem, the number of swaps will be calculated with certainty, but the number varies between populations, the Age1
values to swap will differ, the number of Age1
values to swap must be exactly the maximum "out of scope" count, and there are no grouping variables.
Edited to add: assuming you have my data and have done the arrange, you will see that row 1 looks like:
Age1 Age2 AgeDiff
25 16 9
and row 20 looks like:
Age1 Age2 AgeDiff
71 0 71
Post swap these two rows would be: row 1:
Age1 Age2 AgeDiff
71 16 9
row 20:
Age1 Age2 AgeDiff
25 0 71
so just the two Age1
values are swapped.
Then the row 2 and row 19 swap, to end up with
row 2
Age1 Age2 AgeDiff
58 5 17
and for row 19
Age1 Age2 AgeDiff
22 1 57
The AgeDiff
column is ignored as it is recalculated after the finishing the swap(s).
(I also missed that the initial data frame should also have been called test, I have now fixed that.)
im sure theres a much neater way to do this but....
library("dplyr")
test <- data.frame("Age1"=c(42, 48, 58, 25, 53, 55, 32, 58, 71, 24, 48, 48, 64, 55, 45, 55, 34, 33, 51, 22),
"Age2"=c(8, 2, 1, 16, 14, 1, 11, 14, 0, 5, 2, 10, 16, 13, 3, 4, 8, 13, 8, 5))
test <- test %>%
mutate(AgeDiff = Age1 - Age2) %>%
arrange(AgeDiff) %>%
dplyr::mutate(row_no = row_number())
test
swap <- function(df) {
TooYoung <- df %>%
filter(AgeDiff < 18) %>%
summarise(Count = n()) %>%
pull(Count)
TooOld <- df %>%
filter(AgeDiff > 54) %>%
summarise(Count = n()) %>%
pull(Count)
top_bottom <- max(TooYoung, TooOld)
df2 <- df %>%
filter(row_number() > max(row_number()) - top_bottom | row_number() <= top_bottom) %>%
mutate(final_age1 = Age1) %>%
dplyr::select(final_age1, row_no)
df2$row_no <- sort(df2$row_no, decreasing = T)
df_final <- df %>%
left_join(df2) %>%
mutate(final_age1 = ifelse(is.na(final_age1), Age1, final_age1)) %>%
dplyr::select(-Age1, -row_no)
df_final
}
swap(test)
which i think gives you what you want?
# Joining, by = "row_no"
# Age2 AgeDiff final_age1
# 1 16 9 71
# 2 5 17 58
# 3 5 19 24
# 4 13 20 33
# 5 11 21 32
# 6 8 26 34
# 7 8 34 42
# 8 10 38 48
# 9 14 39 53
# 10 13 42 55
# 11 3 42 45
# 12 8 43 51
# 13 14 44 58
# 14 2 46 48
# 15 2 46 48
# 16 16 48 64
# 17 4 51 55
# 18 1 54 55
# 19 1 57 22
# 20 0 71 25