I am trying to move data from one column to another, due to the underlying forms being filled out incorrectly.
In the form it asks for information on a household and asks for their age(AGE) and gender(SEX) for each member, allowing up to 5 people per household. However some users have filled in information for person 1,3 and 4, but not filled in any info for person 2 because they filled out person 2 incorrectly, crossed out the details and have filled person 2 details into the person 3 boxes etc.
The data looks like this (ref 1 and 5 are correct in this data, all others are incorrect)
df <- data.frame(
ref = c(1, 2, 3, 4, 5, 6),
AGE1 = c(45, 36, 26, 47, 24, NA),
AGE2 = c(NA, 24, NA, 13, 57, 28),
AGE3 = c(NA, NA, 35, NA, NA, 26),
AGE4 = c(NA, NA, 15, 11, NA, NA),
AGE5 = c(NA, 15, NA, NA, NA, NA),
SEX1 = c("M", "F", "M", "M", "M", NA),
SEX2 = c(NA, "M", NA, "F", "F", "F"),
SEX3 = c(NA, NA, "M", NA, NA, "M"),
SEX4 = c(NA, NA, "F", "F", NA, NA),
SEX5 = c(NA, "F", NA, NA, NA, NA)
)
This is what the table looks like currently (I have replaced NA with - to make reading easier)
ref | AGE1 | AGE2 | AGE3 | AGE4 | AGE5 | SEX1 | SEX2 | SEX3 | SEX4 | SEX5 |
---|---|---|---|---|---|---|---|---|---|---|
1 | 45 | - | - | - | - | M | - | - | - | - |
2 | 36 | 24 | - | - | 15 | F | M | - | - | F |
3 | 26 | - | 35 | 15 | - | M | - | M | F | - |
4 | 47 | 13 | - | 11 | - | M | F | - | F | - |
5 | 24 | 57 | - | - | - | M | F | - | - | - |
6 | - | 28 | 26 | - | - | - | F | M | - | - |
but i would like it to look like this
ref | AGE1 | AGE2 | AGE3 | AGE4 | AGE5 | SEX1 | SEX2 | SEX3 | SEX4 | SEX5 |
---|---|---|---|---|---|---|---|---|---|---|
1 | 45 | - | - | - | - | M | - | - | - | - |
2 | 36 | 24 | 15 | - | - | F | M | F | - | - |
3 | 26 | 35 | 15 | - | - | M | M | F | - | - |
4 | 47 | 13 | 11 | - | - | M | F | F | - | - |
5 | 24 | 57 | - | - | - | M | F | - | - | - |
6 | 28 | 26 | - | - | - | F | M | - | - | - |
Is there a way of correcting this using dplyr
? If not, is there another way in R of correcting the data
Here is a way using dplyr
and tidyr
. The approach involves pivoting the data to longer format, sorting the NA
values to the end, renumbering the column names, and the pivoting to wide form again.
library(dplyr)
library(tidyr)
df <- data.frame(ref, AGE1, AGE2, AGE3, AGE4, AGE5,
SEX1, SEX2, SEX3, SEX4, SEX5)
df %>%
mutate(across(starts_with("AGE"), as.character)) %>%
pivot_longer(2:11) %>%
separate(name, into = c("cat", "num"), 3) %>%
arrange(is.na(value)) %>%
group_by(ref, cat) %>%
mutate(num = seq_along(value)) %>%
ungroup() %>%
arrange(cat) %>%
unite(name, cat, num, sep = "") %>%
pivot_wider(id_cols = ref) %>%
mutate(across(starts_with("AGE"), as.numeric))
# A tibble: 6 x 11
ref AGE1 AGE2 AGE3 AGE4 AGE5 SEX1 SEX2 SEX3 SEX4 SEX5
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 1 45 NA NA NA NA M NA NA NA NA
2 2 36 24 15 NA NA F M F NA NA
3 3 26 35 15 NA NA M M F NA NA
4 4 47 13 11 NA NA M F F NA NA
5 5 24 57 NA NA NA M F NA NA NA
6 6 28 26 NA NA NA F M NA NA NA