Search code examples
rregexstrsplittibble

Spreading character string into multiple columns with strsplit pattern matching


This is my first time scraping text from a PDF document. I'm presenting what I think is the most useful format for the data for what I'm doing, but I could be wrong. Once I cleaned the PDF text, I formatted it into a tibble (below).

I tried utilizing strsplit(dmt, \\s+) to split the character string into three separate columns but that just separated everything completely. I had used str_squish() to eliminate the spaces in the middle text portion of the string but that did not help the pattern matching.

The first numeric part of the character string sometimes ends with a ) or a number. Here is what I'm working with:

dmt
# A tibble: 612 x 1
   datamatrixtest[,1]                                             
   <chr>                                                          
 1 110.05          Human Service Vehicle Inspection Reqd         6
 2 23.33(12)(b)    ATV-Fail/Stop for Law Enforce. Official       1
 3 23.33(6)(a)     ATV-Fail/Display Lighted Headlamp             1
 4 341.03          Oper Veh After Sus/Rev or Can of Reg      8,862
 5 341.04(1)       Non-Registration of Vehicle              10,125
 6 341.04(2)       Improper Registration of Vehicle              4
 7 341.15(1)       Fail/Display Vehicle License Plates       2,010
 8 341.15(1m)(a)   Fail/Attach Rear Regis. Decal/Tag             3
 9 341.15(1m)(b)   Fail/Attach Front Regis. Decal/Tag            2
10 341.15(2)       Improperly Attached License Plates            7
# ... with 602 more rows

Ideally, I could utilize strsplit with an accurate pattern match to put the data into three separate columns.

dmt
# A tibble: 612 x 3
   statute         offense                                    cases
   <chr>           <chr>                                       <num>        
 1 110.05          Human Service Vehicle Inspection Reqd         6
 2 23.33(12)(b)    ATV-Fail/Stop for Law Enforce. Official       1
 3 23.33(6)(a)     ATV-Fail/Display Lighted Headlamp             1
 4 341.03          Oper Veh After Sus/Rev or Can of Reg      8,862
 5 341.04(1)       Non-Registration of Vehicle              10,125
 6 341.04(2)       Improper Registration of Vehicle              4
 7 341.15(1)       Fail/Display Vehicle License Plates       2,010
 8 341.15(1m)(a)   Fail/Attach Rear Regis. Decal/Tag             3
 9 341.15(1m)(b)   Fail/Attach Front Regis. Decal/Tag            2
10 341.15(2)       Improperly Attached License Plates            7

Solution

  • I am assuming that your data is essentially as presented, with multiple spaces between columns. In other words, check that your dmt is comparable to the one I've created below. In that case, we can just split each line on any section of more than one space with \\s{2,} like this. If your data is not like this, or if any individual field happens to contain multiple spaces, then use dput and head to provide a sample so we can find a more precise pattern that will work.

    library(tidyverse)
    dmt <- read_lines(
    "110.05          Human Service Vehicle Inspection Reqd         6
    23.33(12)(b)    ATV-Fail/Stop for Law Enforce. Official       1
    23.33(6)(a)     ATV-Fail/Display Lighted Headlamp             1
    341.03          Oper Veh After Sus/Rev or Can of Reg      8,862
    341.04(1)       Non-Registration of Vehicle              10,125
    341.04(2)       Improper Registration of Vehicle              4
    341.15(1)       Fail/Display Vehicle License Plates       2,010
    341.15(1m)(a)   Fail/Attach Rear Regis. Decal/Tag             3
    341.15(1m)(b)   Fail/Attach Front Regis. Decal/Tag            2
    1341.15(2)       Improperly Attached License Plates            7"
    ) %>%
      enframe(name = NULL, value = "line")
    
    dmt %>%
      separate(line, c("statute", "offense", "cases"), sep = "\\s{2,}") %>%
      mutate(cases = cases %>% str_remove_all(",") %>% as.integer)
    #> # A tibble: 10 x 3
    #>    statute       offense                                 cases
    #>    <chr>         <chr>                                   <int>
    #>  1 110.05        Human Service Vehicle Inspection Reqd       6
    #>  2 23.33(12)(b)  ATV-Fail/Stop for Law Enforce. Official     1
    #>  3 23.33(6)(a)   ATV-Fail/Display Lighted Headlamp           1
    #>  4 341.03        Oper Veh After Sus/Rev or Can of Reg     8862
    #>  5 341.04(1)     Non-Registration of Vehicle             10125
    #>  6 341.04(2)     Improper Registration of Vehicle            4
    #>  7 341.15(1)     Fail/Display Vehicle License Plates      2010
    #>  8 341.15(1m)(a) Fail/Attach Rear Regis. Decal/Tag           3
    #>  9 341.15(1m)(b) Fail/Attach Front Regis. Decal/Tag          2
    #> 10 1341.15(2)    Improperly Attached License Plates          7
    

    Created on 2019-09-23 by the reprex package (v0.3.0)