I'm new to R, so please bear with me. I am looking at incarceration data, and have a variable conviction
, which is a messy string that looks like this:
[1] "Ct. 1: Conspiracy to distribute"
[2] "Aggravated Assault"
[3] "Ct. 1: Possession of prohibited object; Ct. 2: criminal forfeiture"
[4] "Ct. 1-6: Human Trafficking; Cts. 7, 8 Unlawful contact; Ct. 11: Involuntary Servitude; Ct. 36: Smuggling"
Ideally, I want to do two things. First, I want to parse on Ct.
into multiple columns. For the first three rows, the data would look like this:
convictions conviction_1 conviction_2
[1,] "Ct. 1: Conspiracy to distribute" "Conspiracy to distribute" NA
[2,] "Aggravated Assault" "Aggravated Assault" NA
[3,] "Ct. 1: Possession of prohibited object" "Possession of prohibited object" "criminal forfeiture"
but things get hairy when I get to the third row, because I would want to parse the first part of the string(Ct. 1-6: Human Trafficking
) into 6 columns, and then Ct. 7,8: Unlawful contact
into 2 more columns.
The second part is that I then want to generate a variable convictions_total
which would find the highest number in the conviction
string that follows after Ct:
. for the three example entries I included here, convictions_total
would look like:
[1] 1 2 36
This is the code I used to parse a much more straight-forward string variable, but I'm unsure how to tweak it for this variable:
cols <- data.frame(str_split_fixed(data$convictions`,",",Inf))
colnames(cols) <- paste0("conviction_",rep(1:length(cols)))
data <- cbind(data,cols)
Thank you in advance!
After going down a two-day rabbit hole, I figured out a tidy version of @LMc's code, which ended up working better because calling plyr
was messing up other code I had written:
test_data <-
tibble(id = 1:5,
convictions = c("Ct. 1: Conspiracy to distribute" ,
"Aggravated Assault" ,
"Ct. 1: Possession of prohibited object; Ct. 2: criminal forfeiture" ,
"Ct. 1-6: Human Trafficking; Cts. 7, 8 Unlawful contact; Ct. 11: Involuntary Servitude; Ct. 36: Smuggling 50 grams",
"Ct. 1: Conspiracy; Cts. 2-7: Wire Fraud; Cts. 8-28: Money Laundering"))
test_data <- test_data %>%
mutate(c2 = convictions) #this just duplicates the original variable convictions because I want to preserve it
test_data <- test_data %>%
separate_rows(c2, sep = ";") %>%
mutate(c2 = str_remove(c2, "Ct(s)?(\\. )(\\d|-|:|,|\\s)+")) %>%
group_by(id) %>%
mutate(conviction_number = paste0("c_", row_number())) %>%
pivot_wider(values_from = c2, names_from = conviction_number)
test_data <- test_data %>%
mutate(c2 = convictions) #again, just preserving the original variable
test_data <- test_data %>%
separate_rows(c2, sep = ";") %>%
mutate(total_counts = as.numeric(ifelse(is.na(str_extract(c2, "((?<=\\-)\\d+)")), str_extract(c2, "\\d+"), str_extract(c2, "((?<=\\-)\\d+)")))) %>%
mutate(total_counts = ifelse(is.na(total_counts), 1, total_counts)) %>%
group_by(id) %>%
slice_max(total_counts)
which produces the following dataframe:
id convictions c_1 c_2 c_3 c_4 c2 total_counts
<int> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 1 Ct. 1: Conspiracy to distribute Conspiracy to dis~ NA NA NA "Ct. 1: Conspirac~ 1
2 2 Aggravated Assault Aggravated Assault NA NA NA "Aggravated Assau~ 1
3 3 Ct. 1: Possession of prohibited object; Ct. 2: criminal for~ Possession of pro~ " criminal f~ NA NA " Ct. 2: criminal~ 2
4 4 Ct. 1-6: Human Trafficking; Cts. 7, 8 Unlawful contact; Ct.~ Human Trafficking " Unlawful c~ " Involuntary~ " Smuggling~ " Ct. 36: Smuggli~ 36
5 5 Ct. 1: Conspiracy; Cts. 2-7: Wire Fraud; Cts. 8-28: Money ~ Conspiracy " Wire Fraud" " Money Laund~ NA " Cts. 8-28: Mon~ 28
The first chunk of code parses the counts into separate rows, and then pivots back to the c_
columns. The second code chunk does the same parsing, but then looks across each entry to parse out the digits, instead of the words.
//d+
looks for any digit, but it turns out I had data that looked like Cts. 2-7
where I wanted the value 7, and not 2.
((?<=\\-)\\d+)"))
Looks for the hyphen, and then parses the digits after it. If there is no hyphen, it defaults back to \\d+
.
Finally, slice_max
collapses the data down to 1 entry per ID based on the highest value of total_counts
.