I have a 300x300 df. Each row represents the data of a different patient, named with an id. The first 2 columns are the patients id´s. Column 'id_all' contains more participants than needed, and than 'id'. My goal is to keep only the patient information that corresponds with the first column, 'id'.
The first 9x9 looks like this:
id | id_all | MMSE BL | MMSE 12 | MMSE 24 | MMSE 36 | MMSE 48 | MMSE 60 | MMSE 72 |
---|---|---|---|---|---|---|---|---|
aaa002 | aaa000 | 22 | 18 | NA | NA | NA | NA | NA |
aaa003 | aaa002 | 29 | 28 | 28 | 29 | 30 | 29 | NA |
aaa005 | aaa003 | 30 | 29 | 30 | 30 | 30 | NA | NA |
aaa024 | aaa005 | 29 | 28 | 25 | NA | NA | 25 | NA |
aaa026 | aaa024 | 30 | 29 | 29 | 29 | NA | NA | NA |
aaa048 | aaa026 | 28 | 30 | 28 | 27 | 30 | 30 | NA |
aaa095 | aaa038 | 29 | 29 | 29 | 26 | NA | NA | NA |
aaa222 | aaa048 | 30 | 29 | 29 | 28 | 28 | 29 | NA |
So, based on the first column, I would like to somehow iterate through the second column, and delete all the information of the participants who do not have a match with the first column, 'id'. This means, in the first row, deleting columns 2:300, and keeping the first column intact.
At the end, I want my df to look like this:
id | id_all | MMSE BL | MMSE 12 | MMSE 24 | MMSE 36 | MMSE 48 | MMSE 60 | MMSE 72 |
---|---|---|---|---|---|---|---|---|
aaa002 | aaa002 | 29 | 28 | 28 | 29 | 30 | 29 | NA |
aaa003 | aaa003 | 30 | 29 | 30 | 30 | 30 | NA | NA |
aaa005 | aaa005 | 29 | 28 | 25 | NA | NA | 25 | NA |
aaa024 | aaa024 | 30 | 29 | 29 | 29 | NA | NA | NA |
aaa026 | aaa026 | 28 | 30 | 28 | 27 | 30 | 30 | NA |
aaa048 | aaa048 | 30 | 29 | 29 | 28 | 28 | 29 | NA |
aaa095 | ... | |||||||
aaa222 | ... |
In this example, id_all aaa000, and aaa038 dont have a match in the id column, so I would like the delete all the information about those participants. I can´t figure it out how to delete almost all the row (almost because only the first two cells remain), and to move all the cells up.
Another possibility will be, to move the first column down until id matches id_all, and then delete the whole row that has an empty cell in the id column.
Thanks!
I think this will serve the purpose
library(tidyverse)
df %>% filter(!is.na(match(id_all, df$id))) %>% select(-id)
id_all MMSE.BL MMSE.12 MMSE.24 MMSE.36 MMSE.48 MMSE.60 MMSE.72
1 aaa002 29 28 28 29 30 29 NA
2 aaa003 30 29 30 30 30 NA NA
3 aaa005 29 28 25 NA NA 25 NA
4 aaa024 30 29 29 29 NA NA NA
5 aaa026 28 30 28 27 30 30 NA
6 aaa048 30 29 29 28 28 29 NA
Since id == id_all
in the final data, there is no need for a duplicate column. If it is needed, it can be easily created by adding mutate(id = id_all)
at the end of the above syntax
df %>% filter(!is.na(match(id_all, df$id))) %>% select(-id) %>% mutate(id = id_all) %>%
select(id, everything())
id id_all MMSE.BL MMSE.12 MMSE.24 MMSE.36 MMSE.48 MMSE.60 MMSE.72
1 aaa002 aaa002 29 28 28 29 30 29 NA
2 aaa003 aaa003 30 29 30 30 30 NA NA
3 aaa005 aaa005 29 28 25 NA NA 25 NA
4 aaa024 aaa024 30 29 29 29 NA NA NA
5 aaa026 aaa026 28 30 28 27 30 30 NA
6 aaa048 aaa048 30 29 29 28 28 29 NA
dput
of df
used
df <- structure(list(id = c("aaa002", "aaa003", "aaa005", "aaa024",
"aaa026", "aaa048", "aaa095", "aaa222"), id_all = c("aaa000",
"aaa002", "aaa003", "aaa005", "aaa024", "aaa026", "aaa038", "aaa048"
), MMSE.BL = c(22L, 29L, 30L, 29L, 30L, 28L, 29L, 30L), MMSE.12 = c(18L,
28L, 29L, 28L, 29L, 30L, 29L, 29L), MMSE.24 = c(NA, 28L, 30L,
25L, 29L, 28L, 29L, 29L), MMSE.36 = c(NA, 29L, 30L, NA, 29L,
27L, 26L, 28L), MMSE.48 = c(NA, 30L, 30L, NA, NA, 30L, NA, 28L
), MMSE.60 = c(NA, 29L, NA, 25L, NA, 30L, NA, 29L), MMSE.72 = c(NA,
NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-8L))
> df
id id_all MMSE.BL MMSE.12 MMSE.24 MMSE.36 MMSE.48 MMSE.60 MMSE.72
1 aaa002 aaa000 22 18 NA NA NA NA NA
2 aaa003 aaa002 29 28 28 29 30 29 NA
3 aaa005 aaa003 30 29 30 30 30 NA NA
4 aaa024 aaa005 29 28 25 NA NA 25 NA
5 aaa026 aaa024 30 29 29 29 NA NA NA
6 aaa048 aaa026 28 30 28 27 30 30 NA
7 aaa095 aaa038 29 29 29 26 NA NA NA
8 aaa222 aaa048 30 29 29 28 28 29 NA