Search code examples
rstring-parsing

How can I parse a variable into multiple columns according to multiple conditions in R?


I'm new to R, so please bear with me. I am looking at incarceration data, and have a variable conviction, which is a messy string that looks like this:

[1] "Ct. 1: Conspiracy to distribute"                                                                         
[2] "Aggravated Assault"                                                                                      
[3] "Ct. 1: Possession of prohibited object; Ct. 2: criminal forfeiture"                                      
[4] "Ct. 1-6: Human Trafficking; Cts. 7, 8 Unlawful contact; Ct. 11: Involuntary Servitude; Ct. 36: Smuggling"

Ideally, I want to do two things. First, I want to parse on Ct. into multiple columns. For the first three rows, the data would look like this:

     convictions                              conviction_1                      conviction_2                    
[1,] "Ct. 1: Conspiracy to distribute"        "Conspiracy to distribute"        NA                   
[2,] "Aggravated Assault"                     "Aggravated Assault"              NA                   
[3,] "Ct. 1: Possession of prohibited object" "Possession of prohibited object" "criminal forfeiture"

but things get hairy when I get to the third row, because I would want to parse the first part of the string(Ct. 1-6: Human Trafficking) into 6 columns, and then Ct. 7,8: Unlawful contact into 2 more columns.

The second part is that I then want to generate a variable convictions_total which would find the highest number in the conviction string that follows after Ct:. for the three example entries I included here, convictions_total would look like:

[1]  1  2 36

This is the code I used to parse a much more straight-forward string variable, but I'm unsure how to tweak it for this variable:

cols <- data.frame(str_split_fixed(data$convictions`,",",Inf))
colnames(cols) <- paste0("conviction_",rep(1:length(cols)))
data <- cbind(data,cols)

Thank you in advance!


Solution

  • After going down a two-day rabbit hole, I figured out a tidy version of @LMc's code, which ended up working better because calling plyr was messing up other code I had written:

    test_data <- 
      tibble(id = 1:5, 
             convictions = c("Ct. 1: Conspiracy to distribute"    ,                                                                     
                             "Aggravated Assault"              ,                                                                        
                             "Ct. 1: Possession of prohibited object; Ct. 2: criminal forfeiture"  ,                                    
                             "Ct. 1-6: Human Trafficking; Cts. 7, 8 Unlawful contact; Ct. 11: Involuntary Servitude; Ct. 36: Smuggling 50 grams",
                             "Ct. 1: Conspiracy; Cts. 2-7: Wire Fraud; Cts. 8-28:  Money Laundering"))
    test_data <- test_data %>% 
      mutate(c2 = convictions) #this just duplicates the original variable convictions because I want to preserve it
    
    test_data <- test_data %>%
      separate_rows(c2, sep = ";") %>%
      mutate(c2 = str_remove(c2, "Ct(s)?(\\. )(\\d|-|:|,|\\s)+")) %>%
      group_by(id) %>%
      mutate(conviction_number = paste0("c_", row_number())) %>%
      pivot_wider(values_from = c2, names_from = conviction_number) 
    
    
    test_data <- test_data %>% 
      mutate(c2 = convictions) #again, just preserving the original variable
    
    test_data <- test_data %>%
      separate_rows(c2, sep = ";") %>% 
      mutate(total_counts = as.numeric(ifelse(is.na(str_extract(c2, "((?<=\\-)\\d+)")), str_extract(c2, "\\d+"), str_extract(c2, "((?<=\\-)\\d+)")))) %>% 
      mutate(total_counts = ifelse(is.na(total_counts), 1, total_counts)) %>% 
      group_by(id) %>% 
      slice_max(total_counts) 
    

    which produces the following dataframe:

         id convictions                                                  c_1                c_2           c_3            c_4          c2                 total_counts
      <int> <chr>                                                        <chr>              <chr>         <chr>          <chr>        <chr>                     <dbl>
    1     1 Ct. 1: Conspiracy to distribute                              Conspiracy to dis~  NA            NA             NA          "Ct. 1: Conspirac~            1
    2     2 Aggravated Assault                                           Aggravated Assault  NA            NA             NA          "Aggravated Assau~            1
    3     3 Ct. 1: Possession of prohibited object; Ct. 2: criminal for~ Possession of pro~ " criminal f~  NA             NA          " Ct. 2: criminal~            2
    4     4 Ct. 1-6: Human Trafficking; Cts. 7, 8 Unlawful contact; Ct.~ Human Trafficking  " Unlawful c~ " Involuntary~ " Smuggling~ " Ct. 36: Smuggli~           36
    5     5 Ct. 1: Conspiracy; Cts. 2-7: Wire Fraud; Cts. 8-28:  Money ~ Conspiracy         " Wire Fraud" " Money Laund~  NA          " Cts. 8-28:  Mon~           28
    

    The first chunk of code parses the counts into separate rows, and then pivots back to the c_ columns. The second code chunk does the same parsing, but then looks across each entry to parse out the digits, instead of the words.

    //d+ looks for any digit, but it turns out I had data that looked like Cts. 2-7 where I wanted the value 7, and not 2.

    ((?<=\\-)\\d+)")) Looks for the hyphen, and then parses the digits after it. If there is no hyphen, it defaults back to \\d+.

    Finally, slice_max collapses the data down to 1 entry per ID based on the highest value of total_counts.