I extracted a table from a PDF using pdftools in R. The table in the PDF has multi-line text in its columns. I replaced every run of two or more spaces with "|" to make the columns easier to split. The problem I'm running into is that, because of the multi-line cells and the way the table is laid out in the PDF, the data comes in out of order. The original looks like this:
The data that I extracted looks like this:
scale_definitions <- c("", " to lack passion easily annoyed",
" Excitable", " to lack a sense of urgency emotionally volatile",
"", " naive mistrustful",
" Skeptical", " gullible cynical",
"", " overly confident too conservative",
" Cautious", " to make risky decisions risk averse",
"", " to avoid conflict aloof and remote",
" Reserved", " too sensitive indifferent to others' feelings",
"", " unengaged uncooperative",
" Leisurely", " self-absorbed stubborn",
"", " unduly modest arrogant",
" Bold", " self-doubting entitled and self-promoting",
"", " over controlled charming and fun",
" Mischievous", " inflexible careless about commitments",
"", " repressed dramatic",
" Colorful", " apathetic noisy",
"", " too tactical impractical",
" Imaginative", " to lack vision eccentric",
"", " careless about details perfectionistic",
" Diligent", " easily distracted micromanaging",
"", " possibly insubordinate respectful and deferential",
" Dutiful", " too independent eager to please"
)
scale_definitions <- scale_definitions %>% str_replace_all("\\s{2,}", "|")
How do I best put this into a data frame?
Unfortunately, a reprex would be too complex, so here is a description of how you can get a structured data frame:
I am afraid you have to use pdftools::pdf_data() instead of pdftools::pdf_text().
This way you get a list with one data frame per page. Each data frame has one row per word on the page, together with its exact location (plus its width and height, IIRC). With this at hand you can write a parser to accomplish your task. It will be a bit of work, but this is the only way I know to solve this sort of problem.
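To give an idea of what such a parser could look like, here is a minimal base R sketch. It does not read a real PDF: the `words` data frame is a made-up stand-in for one page of pdf_data() output (only the `x`, `y` and `text` columns are used), and the coordinates, the column boundaries in `breaks` and the names col1/col2/col3 are assumptions for illustration. For a real page you would have to inspect the `x` values to find suitable boundaries.

```r
# Hypothetical stand-in for one page of pdftools::pdf_data() output:
# one row per word, with made-up x/y coordinates.
words <- data.frame(
  x    = c(150, 165, 190, 300, 335,
           20,
           150, 165, 190, 200, 235, 250, 300, 365),
  y    = c(100, 100, 100, 100, 100,
           106,
           112, 112, 112, 112, 112, 112, 112, 112),
  text = c("to", "lack", "passion", "easily", "annoyed",
           "Excitable",
           "to", "lack", "a", "sense", "of", "urgency",
           "emotionally", "volatile"),
  stringsAsFactors = FALSE
)

# Assign each word to a table column by its horizontal position.
# The boundaries (100 and 280) are assumptions for this toy page.
words$column <- cut(words$x, breaks = c(0, 100, 280, Inf),
                    labels = c("col1", "col2", "col3"))

# Sort so that words within a cell are pasted in reading order.
words <- words[order(words$y, words$x), ]

# Collapse the words of each (line, column) cell into one string.
cells <- aggregate(text ~ y + column, data = words,
                   FUN = paste, collapse = " ")
cells <- cells[order(cells$column, cells$y), ]

# The toy page holds a single scale, so its name is recycled over both lines.
result <- data.frame(
  col1 = cells$text[cells$column == "col1"],
  col2 = cells$text[cells$column == "col2"],
  col3 = cells$text[cells$column == "col3"],
  stringsAsFactors = FALSE
)
print(result)
```

With several scales on a page you would additionally need to group the lines that belong to one scale, for example by the vertical distance between them.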
I found a readr function that helps in your case, since we can assume a fixed length (nchar()) for the column positions:
library(tidyverse)
# I() marks the vector as literal data; readr >= 2.0 would otherwise
# treat the strings as file paths
I(scale_definitions) %>%
  # parse into columns by width, and therefore implicitly by start position
  # (empty strings are skipped as blank lines, leaving 3 rows per scale)
  readr::read_fwf(readr::fwf_widths(c(39, 40, 40), c("col1", "col2", "col3"))) %>%
  # build group ID from row number
  dplyr::mutate(grp = (dplyr::row_number() - 1) %/% 3) %>%
  # form groups
  dplyr::group_by(grp) %>%
  # impute missing values in col1
  tidyr::fill(col1, .direction = "downup") %>%
  # remove grouping to prevent unwanted behaviour downstream
  dplyr::ungroup() %>%
  # remove auxiliary variable
  dplyr::select(-grp) %>%
  # convert to long format (safer for removing NAs)
  tidyr::pivot_longer(-col1, names_to = "cols", values_to = "vals") %>%
  # remove NAs
  dplyr::filter(!is.na(vals))
# A tibble: 44 x 3
   col1      cols  vals
   <chr>     <chr> <chr>
 1 Excitable col2  to lack passion
 2 Excitable col3  easily annoyed
 3 Excitable col2  to lack a sense of urgency
 4 Excitable col3  emotionally volatile
 5 Skeptical col2  naive
 6 Skeptical col3  mistrustful
 7 Skeptical col2  gullible
 8 Skeptical col3  cynical
 9 Cautious  col2  overly confident
10 Cautious  col3  too conservative
# ... with 34 more rows
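If it helps to see what parsing by fixed widths does under the hood, here is a small base R sketch of the same idea. The three sample lines and the widths (10, 30, 30) are made up for illustration and are not the widths of your PDF; `slice_line` is a hypothetical helper name.

```r
# Made-up lines, padded with sprintf() so the column widths are known
lines <- c(
  sprintf("%-10s%-30s%s", "", "to lack passion", "easily annoyed"),
  sprintf("%-10s%-30s%s", "Excitable", "", ""),
  sprintf("%-10s%-30s%s", "", "to lack a sense of urgency", "emotionally volatile")
)

# Turn widths into start and end character positions
widths <- c(10, 30, 30)
ends   <- cumsum(widths)
starts <- ends - widths + 1

# Cut one line into its three cells and strip the padding
slice_line <- function(line) trimws(substring(line, starts, ends))

parsed <- as.data.frame(do.call(rbind, lapply(lines, slice_line)),
                        stringsAsFactors = FALSE)
names(parsed) <- c("col1", "col2", "col3")

# Empty cells become NA, as they do with read_fwf()
parsed[parsed == ""] <- NA
print(parsed)
```

This leaves the scale name on its own row with NA in the other columns, which is why the pipeline above fills col1 within each group and drops the NAs afterwards.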