my example data contains a column holding multiple text strings separated by a comma (charges), and then two adjacent columns that contain the charge(s) the person was found guilty or not_guilty of:
dat <- data.frame(decision_id = c("001", "002", "003", "004", "005"),
charges = c("theft", "speeding,theft,robbery", "robbery,drunkedness", NA, "speeding"),
guilty = c(NA, "robbery", "robbery,drunkedness", "theft", NA),
not_guilty = c("theft", "speeding,theft", NA, NA, "speeding"))
decision_id charges guilty not_guilty
1 001 theft <NA> theft
2 002 speeding,theft,robbery robbery speeding,theft
3 003 robbery,drunkedness robbery,drunkedness <NA>
4 004 <NA> theft <NA>
5 005 speeding <NA> speeding
You'll notice it also contains some errors like in decision 004 where the charges have not been added to the charges column (but I think this is unlikely to matter).
I'd like to swap the column names and values so that the values in the 'charges' column become column names, and the existing column names ('guilty' and 'not_guilty') become values. I'd like the following kind of output:
dat2 <- data.frame(decision_id = c("001", "002", "003", "004", "005"),
theft = c("not_guilty", "not_guilty", NA, "guilty", NA),
speeding = c(NA, "not_guilty", NA, NA, "not_guilty"),
robbery = c(NA, "guilty", "guilty", NA, NA),
drunkedness = c(NA, NA, "guilty", NA, NA))
decision_id theft speeding robbery drunkedness
1 001 not_guilty <NA> <NA> <NA>
2 002 not_guilty not_guilty guilty <NA>
3 003 <NA> <NA> guilty guilty
4 004 guilty <NA> <NA> <NA>
5 005 <NA> not_guilty <NA> <NA>
I've found some examples that involved mutating with strsplit(), unnesting, mutating a value, and then pivoting wider, but this only seems to work on a single column, and I can only get it to output 0/1 or TRUE/FALSE values.
EDIT Here is an example of what the whole dataset looks like:
dat <- data.frame(decision_id = c("001", "002", "003", "004", "005"),
date = c("2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05"),
majority_verdict = c("YES", "NO", "YES", "YES", "NO"),
judge = c("A. Smith", "A. Smith", "B . Williams", "C. Roberts", "D. Brown"),
charges = c("theft", "speeding,theft,robbery", "robbery,drunkedness", NA, "speeding"),
guilty = c(NA, "robbery", "robbery,drunkedness", "theft", NA),
not_guilty = c("theft", "speeding,theft", NA, NA, "speeding"),
previous_record = c(TRUE, FALSE, FALSE, TRUE, FALSE))
decision_id date majority_verdict judge charges guilty not_guilty previous_record
1 001 2023-01-01 YES A. Smith theft <NA> theft TRUE
2 002 2023-01-02 NO A. Smith speeding,theft,robbery robbery speeding,theft FALSE
3 003 2023-01-03 YES B . Williams robbery,drunkedness robbery,drunkedness <NA> FALSE
4 004 2023-01-04 YES C. Roberts <NA> theft <NA> TRUE
5 005 2023-01-05 NO D. Brown speeding <NA> speeding FALSE
You may get the data in long format (pivot_longer
), split comma separated values in separate rows (separate_longer_delim
) and get data in wide format (pivot_wider
).
library(dplyr)
library(tidyr)
dat %>%
select(-charges) %>%
pivot_longer(cols = -decision_id, values_drop_na = TRUE) %>%
separate_longer_delim(value, ",") %>%
pivot_wider(names_from = value, values_from = name)
#decision_id theft robbery speeding drunkedness
# <chr> <chr> <chr> <chr> <chr>
#1 001 not_guilty NA NA NA
#2 002 not_guilty guilty not_guilty NA
#3 003 NA guilty NA guilty
#4 004 guilty NA NA NA
#5 005 NA NA not_guilty NA