I have an R data frame like this one (but data wouldn't be sorted by any column):
ppl <- structure(list(id = c("I0000", "I0001", "I0002", "I0003", "I0004","I0005", "I0006", "I0007", "I0008", "I0009"), Birth_Date = structure(c(NA, 517, -10246, -8723, 2349, -25125, NA, -12141, 2349, NA), class = "Date"), Father_id = c(NA, "I0002", "I0005", "I0037", "I0002", "I0018", "I0056", "I0005", "I0002", "I0005"), Mother_id = c(NA, "I0003", "I0006", "I0038", "I0003", "I0019", "I0057", "I0006", "I0003", "I0006"), marriage = structure(c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, -12119, -12119, NA_real_, NA_real_, NA_real_), class = "Date")), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))
> ppl
# A tibble: 10 x 5
id Birth_Date Father_id Mother_id marriage
<chr> <date> <chr> <chr> <date>
1 I0000 NA NA NA NA
2 I0001 1971-06-02 I0002 I0003 NA
3 I0002 1941-12-13 I0005 I0006 NA
4 I0003 1946-02-13 I0037 I0038 NA
5 I0004 1976-06-07 I0002 I0003 NA
6 I0005 1901-03-19 I0018 I0019 1936-10-27
7 I0006 NA I0056 I0057 1936-10-27
8 I0007 1936-10-05 I0005 I0006 NA
9 I0008 1976-06-07 I0002 I0003 NA
10 I0009 NA I0005 I0006 NA
Children and parents relationships are stablished through their different IDs.
For each individual (id) without a marriage date value, I want to estimate a date value for that column, based on the Birth_date of his/her first child (of course this is just an assumption, since for some people Birth_Date is not available).
So, in this example, some individues which would get a marriage date would be I0002 and I0003 (calculated marriage would be "1971-06-02" in rows 3 and 4, because it is the minimum Birth_Date of the 3 people which have Father_id=='I0002' and Mother_id=='I0003' -rows 2, 5 and 9-).
The same way, individues I0005 and I0006 would get marriage date "1936-10-05", which is the minimum known Birth_Date of their children (I0002, I0007 and I0009 -which has NA as Birth_Date-). But in this case, all children Birth_Date values should not be taken in account because the data frame has already a real marriage_date value for these individues ("1936-10-27").
As you can see, dataframe structure has not to be changed (same number of rows and same columns; but the last one gets some NA updated with a Date value).
Expected result:
> ppl
# A tibble: 10 x 5
id Birth_Date Father_id Mother_id marriage
<chr> <date> <chr> <chr> <date>
1 I0000 NA NA NA NA
2 I0001 1971-06-02 I0002 I0003 NA
3 I0002 1941-12-13 I0005 I0006 1971-06-02
4 I0003 1946-02-13 I0037 I0038 1971-06-02
5 I0004 1976-06-07 I0002 I0003 NA
6 I0005 1901-03-19 I0018 I0019 1936-10-27
7 I0006 NA I0056 I0057 1936-10-27
8 I0007 1936-10-05 I0005 I0006 NA
9 I0008 1976-06-07 I0002 I0003 NA
10 I0009 NA I0005 I0006 NA
Is it possible to accomplish this task avoiding a function to iterate the data frame?
I know there are libraries dealing with joins, like those mentioned here. But I still can't figure out how to use them to do this task.
I was thinking to calulate it row by row (one marriage date per iteration), but I guess there must be some fasters ways to do it. Please, elaborate a little bit your answer because I am a complete R-newbie. It's not just a matter of making it work, but of understanding how it works.
We can select a row with minimum value of Birth_Date
for each father and mother and join with the dataframe itself.
library(dplyr)
ppl %>%
#Keep only NA values
filter(is.na(marriage)) %>%
#For each father and mother
group_by(Father_id, Mother_id) %>%
#Select the minimum date
slice(which.min(Birth_Date)) %>%
#Get father and mother in same column
tidyr::pivot_longer(cols = c(Father_id, Mother_id)) %>%
#rename Birth_Date to marriage and select it with value
select(marriage = Birth_Date, value) %>%
#Join with the dataframe itself
right_join(ppl, by = c('value' = 'id')) %>%
#If marriage data is already present select that
mutate(marriage_date = coalesce(marriage.y, marriage.x)) %>%
#select only columns needed.
select(id = value, Birth_Date, Father_id, Mother_id, marriage_date)
id Birth_Date Father_id Mother_id marriage_date
<chr> <date> <chr> <chr> <date>
1 I0000 NA NA NA NA
2 I0001 1971-06-02 I0002 I0003 NA
3 I0002 1941-12-13 I0005 I0006 1971-06-02
4 I0003 1946-02-13 I0037 I0038 1971-06-02
5 I0004 1976-06-07 I0002 I0003 NA
6 I0005 1901-03-19 I0018 I0019 1936-10-27
7 I0006 NA I0056 I0057 1936-10-27
8 I0007 1936-10-05 I0005 I0006 NA
9 I0008 1976-06-07 I0002 I0003 NA
10 I0009 NA I0005 I0006 NA