This is my first time scraping text from a PDF document. I'm presenting what I think is the most useful format for the data for what I'm doing, but I could be wrong. Once I cleaned the PDF text, I formatted it into a tibble
(below).
I tried utilizing strsplit(dmt, \\s+)
to split the character string into three separate columns but that just separated everything completely. I had used str_squish()
to eliminate the spaces in the middle text portion of the string but that did not help the pattern matching.
The first numeric part of the character string sometimes ends with a )
or a number
. Here is what I'm working with:
dmt
# A tibble: 612 x 1
datamatrixtest[,1]
<chr>
1 110.05 Human Service Vehicle Inspection Reqd 6
2 23.33(12)(b) ATV-Fail/Stop for Law Enforce. Official 1
3 23.33(6)(a) ATV-Fail/Display Lighted Headlamp 1
4 341.03 Oper Veh After Sus/Rev or Can of Reg 8,862
5 341.04(1) Non-Registration of Vehicle 10,125
6 341.04(2) Improper Registration of Vehicle 4
7 341.15(1) Fail/Display Vehicle License Plates 2,010
8 341.15(1m)(a) Fail/Attach Rear Regis. Decal/Tag 3
9 341.15(1m)(b) Fail/Attach Front Regis. Decal/Tag 2
10 341.15(2) Improperly Attached License Plates 7
# ... with 602 more rows
Ideally, I could utilize strsplit
with an accurate pattern match to put the data into three separate columns.
dmt
# A tibble: 612 x 3
statute offense cases
<chr> <chr> <num>
1 110.05 Human Service Vehicle Inspection Reqd 6
2 23.33(12)(b) ATV-Fail/Stop for Law Enforce. Official 1
3 23.33(6)(a) ATV-Fail/Display Lighted Headlamp 1
4 341.03 Oper Veh After Sus/Rev or Can of Reg 8,862
5 341.04(1) Non-Registration of Vehicle 10,125
6 341.04(2) Improper Registration of Vehicle 4
7 341.15(1) Fail/Display Vehicle License Plates 2,010
8 341.15(1m)(a) Fail/Attach Rear Regis. Decal/Tag 3
9 341.15(1m)(b) Fail/Attach Front Regis. Decal/Tag 2
10 341.15(2) Improperly Attached License Plates 7
I am assuming that your data is essentially as presented, with multiple spaces between columns. In other words, check that your dmt
is comparable to the one I've created below. In that case, we can just split each line on any section of more than one space with \\s{2,}
like this. If your data is not like this, or if any individual field happens to contain multiple spaces, then use dput
and head
to provide a sample so we can find a more precise pattern that will work.
library(tidyverse)
dmt <- read_lines(
"110.05 Human Service Vehicle Inspection Reqd 6
23.33(12)(b) ATV-Fail/Stop for Law Enforce. Official 1
23.33(6)(a) ATV-Fail/Display Lighted Headlamp 1
341.03 Oper Veh After Sus/Rev or Can of Reg 8,862
341.04(1) Non-Registration of Vehicle 10,125
341.04(2) Improper Registration of Vehicle 4
341.15(1) Fail/Display Vehicle License Plates 2,010
341.15(1m)(a) Fail/Attach Rear Regis. Decal/Tag 3
341.15(1m)(b) Fail/Attach Front Regis. Decal/Tag 2
1341.15(2) Improperly Attached License Plates 7"
) %>%
enframe(name = NULL, value = "line")
dmt %>%
separate(line, c("statute", "offense", "cases"), sep = "\\s{2,}") %>%
mutate(cases = cases %>% str_remove_all(",") %>% as.integer)
#> # A tibble: 10 x 3
#> statute offense cases
#> <chr> <chr> <int>
#> 1 110.05 Human Service Vehicle Inspection Reqd 6
#> 2 23.33(12)(b) ATV-Fail/Stop for Law Enforce. Official 1
#> 3 23.33(6)(a) ATV-Fail/Display Lighted Headlamp 1
#> 4 341.03 Oper Veh After Sus/Rev or Can of Reg 8862
#> 5 341.04(1) Non-Registration of Vehicle 10125
#> 6 341.04(2) Improper Registration of Vehicle 4
#> 7 341.15(1) Fail/Display Vehicle License Plates 2010
#> 8 341.15(1m)(a) Fail/Attach Rear Regis. Decal/Tag 3
#> 9 341.15(1m)(b) Fail/Attach Front Regis. Decal/Tag 2
#> 10 1341.15(2) Improperly Attached License Plates 7
Created on 2019-09-23 by the reprex package (v0.3.0)