Search code examples
rdplyrdata-wrangling

Add values across dataframe columns


I have a dataframe where missingness in indicated by "Z" (there may also be some "z" and NA entries present in the data), and values are entered as characters ("0", "1", etc). I need to create scores ("updrs1", "updrs2", updrs3") that add up the non-missing values across columns selected by colname prefix ("NP1", "NP2", "NP3").

dummy data:

dummy_df <- data.frame(
  subject_id = seq(1,6,1), 
  OTHERV1 = c(1,1,0,0,1,1),
  NP1VAR1 = c("Z","0","Z","Z","Z","Z"),
  NP1VAR2 = c("Z","0","Z","Z","Z","Z"),
  NP1VAR3 = c("Z","3","Z","Z","Z","Z"),
  NP2VAR1 = c("Z","2","Z","Z","Z","Z"), 
  NP2VAR2 = c("Z","0","Z","Z","Z","Z"),
  NP2VAR3 = c("Z","0","Z","Z","Z","Z"),
  NP3VAR1 = c("Z","4","Z","Z","Z","Z"),
  NP3VAR2 = c("Z","0","Z","Z","z","Z"),
  NP3VAR3 = c("Z","0","Z","Z","Z",NA),
  OTHERV2 = c(NA,NA,NA,NA,NA,NA)
)

desired output:

subject_id updrs1 updrs2 updrs3
1 1 Z Z Z
2 2 3 2 4
3 3 Z Z Z
4 4 Z Z Z
5 5 Z Z Z
6 6 Z Z Z

NOTE: all values in the output are characters

NOTE: treating NA/Z/z as 0 (i.e., transforming all values with as.numeric()) is problematic.

I've tried variations on this answer, with no luck.

I have used a combination of tidyr::pivot_longer(), dplyr::group_by(), and summarize():

desired_output <- select(dummy_df, c(subject_id, starts_with("NP"))) %>%
  mutate(across(all_of(everything()), ~ifelse(. %in% c("Z", "z"), NA, .))) %>%
  pivot_longer(cols = starts_with("NP"), names_to = c(".value", "np_var"), names_sep = "VAR") %>%
  group_by(subject_id) %>%
  summarize(updrs1 = sum(as.numeric(NP1), na.rm = FALSE),
            updrs2 = sum(as.numeric(NP2), na.rm = FALSE), 
            updrs3 = sum(as.numeric(NP3), na.rm = FALSE), .groups = "drop") %>%
  mutate(across(all_of(everything()), as.character)) %>%
  replace(is.na(.), "Z")

This works. L Tyrone's answer also works, so I've accepted it. Thanks all.


Solution

  • If you don't care whether:

    • the missingness values are upper or lower case
    • Z/z are the only missingness values
    • existing NA values can be represented as Z

    then this works. There is likely to be a more direct way to do this, but I find a step-wise approach easier to follow. Note that it will return a warning, which you can ignore:

    library(dplyr)
    library(tidyr)
    
    dummy_df |>
      pivot_longer(
        cols = starts_with("NP"),
        names_to = c("grp", "var"),
        names_pattern = "^(NP\\d+)(VAR\\d+)$",
        values_to = "value") |>
      mutate(value = if_else(grepl("^[0-9]+$", value), as.integer(value), NA)) |>
      summarise(value = sum(value), .by = c(subject_id, grp)) |>
      pivot_wider(id_cols = subject_id,
                  names_from = grp,
                  values_from = value) |>
      rename(updrs1 = NP1, updrs2 = NP2, updrs3 = NP3) |>
      mutate(across(where(is.numeric), as.character),
             across(everything(), ~replace_na(.x, "Z")))
        
    # # A tibble: 6 × 4
    #   subject_id updrs1 updrs2 updrs3
    #   <chr>      <chr>  <chr>  <chr> 
    # 1 1          Z      Z      Z     
    # 2 2          3      2      4     
    # 3 3          Z      Z      Z     
    # 4 4          Z      Z      Z     
    # 5 5          Z      Z      Z     
    # 6 6          Z      Z      Z