Search code examples
rdplyrdata-cleaningcolumnsorting

How to move data from one column to another in R


I am trying to move data from one column to another, due to the underlying forms being filled out incorrectly.

In the form it asks for information on a household and asks for their age(AGE) and gender(SEX) for each member, allowing up to 5 people per household. However some users have filled in information for person 1,3 and 4, but not filled in any info for person 2 because they filled out person 2 incorrectly, crossed out the details and have filled person 2 details into the person 3 boxes etc.

The data looks like this (ref 1 and 5 are correct in this data, all others are incorrect)

df <- data.frame(
  ref = c(1, 2, 3, 4, 5, 6),
  AGE1 = c(45, 36, 26, 47, 24, NA),
  AGE2 = c(NA, 24, NA, 13, 57, 28),
  AGE3 = c(NA, NA, 35, NA, NA, 26),
  AGE4 = c(NA, NA, 15, 11, NA, NA),
  AGE5 = c(NA, 15, NA, NA, NA, NA),
  SEX1 = c("M", "F", "M", "M", "M", NA),
  SEX2 = c(NA, "M", NA, "F", "F", "F"),
  SEX3 = c(NA, NA, "M", NA, NA, "M"),
  SEX4 = c(NA, NA, "F", "F", NA, NA),
  SEX5 = c(NA, "F", NA, NA, NA, NA)
)

This is what the table looks like currently (I have replaced NA with - to make reading easier)

ref AGE1 AGE2 AGE3 AGE4 AGE5 SEX1 SEX2 SEX3 SEX4 SEX5
1 45 - - - - M - - - -
2 36 24 - - 15 F M - - F
3 26 - 35 15 - M - M F -
4 47 13 - 11 - M F - F -
5 24 57 - - - M F - - -
6 - 28 26 - - - F M - -

but i would like it to look like this

ref AGE1 AGE2 AGE3 AGE4 AGE5 SEX1 SEX2 SEX3 SEX4 SEX5
1 45 - - - - M - - - -
2 36 24 15 - - F M F - -
3 26 35 15 - - M M F - -
4 47 13 11 - - M F F - -
5 24 57 - - - M F - - -
6 28 26 - - - F M - - -

Is there a way of correcting this using dplyr? If not, is there another way in R of correcting the data


Solution

  • Here is a way using dplyr and tidyr. The approach involves pivoting the data to longer format, sorting the NA values to the end, renumbering the column names, and the pivoting to wide form again.

    library(dplyr)
    library(tidyr)
    
    df <-  data.frame(ref, AGE1, AGE2, AGE3, AGE4, AGE5,
                      SEX1, SEX2, SEX3, SEX4, SEX5)
    df %>% 
      mutate(across(starts_with("AGE"), as.character)) %>% 
      pivot_longer(2:11) %>%
      separate(name, into = c("cat", "num"), 3) %>%
      arrange(is.na(value)) %>%
      group_by(ref, cat) %>%
      mutate(num = seq_along(value)) %>%
      ungroup() %>%
      arrange(cat) %>%
      unite(name, cat, num, sep = "") %>%
      pivot_wider(id_cols = ref) %>%
      mutate(across(starts_with("AGE"), as.numeric))
    
    
    # A tibble: 6 x 11
        ref  AGE1  AGE2  AGE3  AGE4  AGE5 SEX1  SEX2  SEX3  SEX4  SEX5 
      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
    1     1    45    NA    NA    NA    NA M     NA    NA    NA    NA   
    2     2    36    24    15    NA    NA F     M     F     NA    NA   
    3     3    26    35    15    NA    NA M     M     F     NA    NA   
    4     4    47    13    11    NA    NA M     F     F     NA    NA   
    5     5    24    57    NA    NA    NA M     F     NA    NA    NA   
    6     6    28    26    NA    NA    NA F     M     NA    NA    NA