Search code examples
rdplyrtidyrdata-wrangling

Swap column names with values by splitting multiple strings


my example data contains a column holding multiple text strings separated by a comma (charges), and then two adjacent columns that contain the charge(s) the person was found guilty or not_guilty of:

dat <- data.frame(decision_id = c("001", "002", "003", "004", "005"),
                  charges = c("theft", "speeding,theft,robbery", "robbery,drunkedness", NA, "speeding"),
                  guilty = c(NA, "robbery", "robbery,drunkedness", "theft", NA),
                  not_guilty = c("theft", "speeding,theft", NA, NA, "speeding"))

  decision_id                charges              guilty     not_guilty
1         001                  theft                <NA>          theft
2         002 speeding,theft,robbery             robbery speeding,theft
3         003    robbery,drunkedness robbery,drunkedness           <NA>
4         004                   <NA>               theft           <NA>
5         005               speeding                <NA>       speeding

You'll notice it also contains some errors like in decision 004 where the charges have not been added to the charges column (but I think this is unlikely to matter).

I'd like to swap the column names and values so that the values in the 'charges' column become column names, and the existing column names ('guilty' and 'not_guilty') become values. I'd like the following kind of output:

dat2 <- data.frame(decision_id = c("001", "002", "003", "004", "005"),
                   theft = c("not_guilty", "not_guilty", NA, "guilty", NA),
                   speeding = c(NA, "not_guilty", NA, NA, "not_guilty"),
                   robbery = c(NA, "guilty", "guilty", NA, NA),
                   drunkedness = c(NA, NA, "guilty", NA, NA))

  decision_id      theft   speeding robbery drunkedness
1         001 not_guilty       <NA>    <NA>        <NA>
2         002 not_guilty not_guilty  guilty        <NA>
3         003       <NA>       <NA>  guilty      guilty
4         004     guilty       <NA>    <NA>        <NA>
5         005       <NA> not_guilty    <NA>        <NA>

I've found some examples that involved mutating with strsplit(), unnesting, mutating a value, and then pivoting wider, but this only seems to work on a single column, and I can only get it to output 0/1 or TRUE/FALSE values.

EDIT Here is an example of what the whole dataset looks like:

dat <- data.frame(decision_id = c("001", "002", "003", "004", "005"),
                  date = c("2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05"),
                  majority_verdict = c("YES", "NO", "YES", "YES", "NO"),
                  judge = c("A. Smith", "A. Smith", "B . Williams", "C. Roberts", "D. Brown"),
                  charges = c("theft", "speeding,theft,robbery", "robbery,drunkedness", NA, "speeding"),
                  guilty = c(NA, "robbery", "robbery,drunkedness", "theft", NA),
                  not_guilty = c("theft", "speeding,theft", NA, NA, "speeding"),
                  previous_record = c(TRUE, FALSE, FALSE, TRUE, FALSE))

  decision_id       date majority_verdict        judge                charges              guilty     not_guilty previous_record
1         001 2023-01-01              YES     A. Smith                  theft                <NA>          theft            TRUE
2         002 2023-01-02               NO     A. Smith speeding,theft,robbery             robbery speeding,theft           FALSE
3         003 2023-01-03              YES B . Williams    robbery,drunkedness robbery,drunkedness           <NA>           FALSE
4         004 2023-01-04              YES   C. Roberts                   <NA>               theft           <NA>            TRUE
5         005 2023-01-05               NO     D. Brown               speeding                <NA>       speeding           FALSE

Solution

  • You may get the data in long format (pivot_longer), split comma separated values in separate rows (separate_longer_delim) and get data in wide format (pivot_wider).

    library(dplyr)
    library(tidyr)
    
    dat %>%
      select(-charges) %>%
      pivot_longer(cols = -decision_id, values_drop_na = TRUE) %>%
      separate_longer_delim(value, ",") %>%
      pivot_wider(names_from = value, values_from = name)
    
    #decision_id theft      robbery speeding   drunkedness
    #  <chr>       <chr>      <chr>   <chr>      <chr>      
    #1 001         not_guilty NA      NA         NA         
    #2 002         not_guilty guilty  not_guilty NA         
    #3 003         NA         guilty  NA         guilty     
    #4 004         guilty     NA      NA         NA         
    #5 005         NA         NA      not_guilty NA