Here we can see some example data:
id <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5)
t <- c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3)
md <- c(NA,NA,NA,1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 4, 4)
dat <- data.frame(id, t, md)
dat
> dat
id t md
1 1 1 NA
2 1 2 NA
3 1 3 NA
4 2 1 1
5 2 2 1
6 2 3 1
7 3 1 1
8 3 2 1
9 3 3 1
10 4 1 1
11 4 2 1
12 4 3 1
13 5 1 4
14 5 2 4
15 5 3 4
We have a person id (id), three different points in time (t) and a mother id (md). One can see that one person has no mother in the sample data (person with ID "1"). The person with ID "1" is the mother of the persons with ID's 2-4. At the same time, the person with ID "4" is the mother of the person with ID "5".
Now to the question: Going on from this dataset, how to create new variables for the child ID's? The result should look like this:
> dat
id t md kd1 kd2 kd3 kd4
1 1 1 NA 2 3 4 NA
2 1 2 NA 2 3 4 NA
3 1 3 NA 2 3 4 NA
4 2 1 1 NA NA NA NA
5 2 2 1 NA NA NA NA
6 2 3 1 NA NA NA NA
7 3 1 1 NA NA NA NA
8 3 2 1 NA NA NA NA
9 3 3 1 NA NA NA NA
10 4 1 1 5 NA NA NA
11 4 2 1 5 NA NA NA
12 4 3 1 5 NA NA NA
13 5 1 4 NA NA NA NA
14 5 2 4 NA NA NA NA
15 5 3 4 NA NA NA NA
As wen can see, as many new variables are to be created as children are identifiable. For each identifiable child, the person-specific ID should be recorded in a separate variable. How could this be realized using dplyr?
Note: This is only example data, the real data set is much more complex. Therefore, a flexible approach would be desirable.
library(tidyverse)
# Make dataframe
id <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5)
t <- c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3)
md <- c(NA,NA,NA,1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 4, 4)
dat <- tibble(id, t, md)
dat
#> # A tibble: 15 × 3
#> id t md
#> <dbl> <dbl> <dbl>
#> 1 1 1 NA
#> 2 1 2 NA
#> 3 1 3 NA
#> 4 2 1 1
#> 5 2 2 1
#> 6 2 3 1
#> 7 3 1 1
#> 8 3 2 1
#> 9 3 3 1
#> 10 4 1 1
#> 11 4 2 1
#> 12 4 3 1
#> 13 5 1 4
#> 14 5 2 4
#> 15 5 3 4
# Find each child id for each mom
children_by_mom <- dat |>
group_by(md) |>
summarize(kids = unique(id)) |>
rename(mother = md)
#> `summarise()` has grouped output by 'md'. You can override using the `.groups`
#> argument.
children_by_mom
#> # A tibble: 5 × 2
#> # Groups: mother [3]
#> mother kids
#> <dbl> <dbl>
#> 1 1 2
#> 2 1 3
#> 3 1 4
#> 4 4 5
#> 5 NA 1
# Pivot the data wider to fit the column format
child_order <- children_by_mom |>
mutate(kid_no = paste0("kd", row_number())) |>
pivot_wider(id_cols = mother, values_from = kids, names_from = kid_no)
child_order
#> # A tibble: 3 × 4
#> # Groups: mother [3]
#> mother kd1 kd2 kd3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 3 4
#> 2 4 5 NA NA
#> 3 NA 1 NA NA
# Join child order with original data
combined_df <- dat |>
left_join(child_order, by = c("id" = "mother"))
combined_df
#> # A tibble: 15 × 6
#> id t md kd1 kd2 kd3
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 NA 2 3 4
#> 2 1 2 NA 2 3 4
#> 3 1 3 NA 2 3 4
#> 4 2 1 1 NA NA NA
#> 5 2 2 1 NA NA NA
#> 6 2 3 1 NA NA NA
#> 7 3 1 1 NA NA NA
#> 8 3 2 1 NA NA NA
#> 9 3 3 1 NA NA NA
#> 10 4 1 1 5 NA NA
#> 11 4 2 1 5 NA NA
#> 12 4 3 1 5 NA NA
#> 13 5 1 4 NA NA NA
#> 14 5 2 4 NA NA NA
#> 15 5 3 4 NA NA NA
Created on 2022-09-09 by the reprex package (v2.0.1)