
How do I combine some vector elements in the same vector using R?


I extracted a table from a PDF using pdftools in R. The table's columns contain multi-line text. I replaced every run of two or more whitespace characters with "|" to make the lines easier to split. The problem I'm running into is that, because of the multi-line cells and the way the table is formatted in the PDF, the data comes in out of order. The original looks like this:

[image: the original table in the PDF]

The data that I extracted looks like this:

    scale_definitions <- c("", "                                        to lack passion                        easily annoyed", 
"      Excitable", "                                        to lack a sense of urgency             emotionally volatile", 
"", "                                        naive                                  mistrustful", 
"      Skeptical", "                                        gullible                               cynical", 
"", "                                        overly confident                       too conservative", 
"      Cautious", "                                        to make risky decisions                risk averse", 
"", "                                        to avoid conflict                      aloof and remote", 
"      Reserved", "                                        too sensitive                          indifferent to others' feelings", 
"", "                                        unengaged                              uncooperative", 
"      Leisurely", "                                        self-absorbed                          stubborn", 
"", "                                        unduly modest                          arrogant", 
"      Bold", "                                        self-doubting                          entitled and self-promoting", 
"", "                                        over controlled                        charming and fun", 
"      Mischievous", "                                        inflexible                             careless about commitments", 
"", "                                        repressed                              dramatic", 
"      Colorful", "                                        apathetic                              noisy", 
"", "                                        too tactical                           impractical", 
"      Imaginative", "                                        to lack vision                         eccentric", 
"", "                                        careless about details                 perfectionistic", 
"      Diligent", "                                        easily distracted                      micromanaging", 
"", "                                        possibly insubordinate                 respectful and deferential", 
"      Dutiful", "                                        too independent                        eager to please"
)

library(stringr)  # provides str_replace_all() and re-exports %>%

scale_definitions <- scale_definitions %>% str_replace_all("\\s{2,}", "|")

How do I best put this into a data frame?


Solution

  • Unfortunately, a reprex would be too complex, so here is a description of how you can achieve a structured df:

    I am afraid you have to use pdftools::pdf_data() instead of pdftools::pdf_text().

    This way you get a list with one df per page. Each df has one row per word on the page, together with its exact position (x, y) and its size (width, height). With this in hand you can write a parser to accomplish your task. That will be a bit of work, but it is the only way I know to solve this sort of problem.
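
    A minimal sketch of such a parser, in base R. It does not read a real PDF: the words data frame below is a hand-made stand-in for one element of pdftools::pdf_data(), and its coordinates as well as the column breaks (150 and 350) are invented for illustration. The idea is to bin each word into a table column by its x value, paste the words that share a column and a y value back into cells, and attach each cell to the vertically nearest scale label.

    ```r
    # Stand-in for one page of pdftools::pdf_data(): one row per word.
    # All coordinates are made up; a real page would supply its own.
    words <- data.frame(
      x = c(200, 215, 240, 400, 440,
            30,
            200, 215, 240, 250, 280, 295, 400, 470),
      y = c(100, 100, 100, 100, 100,
            106,
            112, 112, 112, 112, 112, 112, 112, 112),
      text = c("to", "lack", "passion", "easily", "annoyed",
               "Excitable",
               "to", "lack", "a", "sense", "of", "urgency", "emotionally", "volatile"),
      stringsAsFactors = FALSE
    )

    # Assign each word to a table column from its x position (breaks are assumed).
    words$col <- cut(words$x, breaks = c(0, 150, 350, Inf),
                     labels = c("col1", "col2", "col3"))

    # Rebuild each cell: words sharing a column and a y value, pasted in x order.
    words <- words[order(words$y, words$x), ]
    cells <- aggregate(text ~ y + col, data = words, FUN = paste, collapse = " ")

    # Attach every body cell to the scale label that is vertically closest.
    labels  <- cells[cells$col == "col1", ]
    body    <- cells[cells$col != "col1", ]
    nearest <- vapply(body$y, function(v) which.min(abs(labels$y - v)), integer(1))
    body$scale <- labels$text[nearest]

    # One row per printed line: second and third column side by side.
    left   <- body[body$col == "col2", c("scale", "y", "text")]
    right  <- body[body$col == "col3", c("y", "text")]
    result <- merge(left, right, by = "y", suffixes = c("_col2", "_col3"))
    print(result)
    ```

    On a real page the y values of words on the same line may differ slightly, so rounding or binning y before grouping may be needed.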

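    Update:

    I found a readr function that helps in your case, since we can assume a fixed length (nchar()) for the column positions. Note that it has to run on the original scale_definitions vector, before the runs of spaces are replaced with "|":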

    library(tidyverse)
    
    # I() marks the vector as literal data; readr >= 2.0 would otherwise treat it as file paths
    I(scale_definitions) %>%
        # parse into columns by length and therefore implicitly by start position
        readr::read_fwf(readr::fwf_widths(c(39, 40, 40), c("col1", "col2", "col3"))) %>%
        # build group ID from row number (three non-empty lines per scale)
        dplyr::mutate(grp = (dplyr::row_number() - 1) %/% 3) %>%
        # form the groups
        dplyr::group_by(grp) %>%
        # impute the missing values in col1 within each group
        tidyr::fill(col1, .direction = "downup") %>%
        # remove groupings to prevent unwanted behaviour downstream
        dplyr::ungroup() %>%
        # remove auxiliary variable
        dplyr::select(-grp) %>%
        # convert to long format (safer for removing NAs)
        tidyr::pivot_longer(-col1, names_to = "cols", values_to = "vals") %>%
        # remove NAs
        dplyr::filter(!is.na(vals))
    
    # A tibble: 44 x 3
       col1      cols  vals
       <chr>     <chr> <chr>
     1 Excitable col2  to lack passion
     2 Excitable col3  easily annoyed
     3 Excitable col2  to lack a sense of urgency
     4 Excitable col3  emotionally volatile
     5 Skeptical col2  naive
     6 Skeptical col3  mistrustful
     7 Skeptical col2  gullible
     8 Skeptical col3  cynical
     9 Cautious  col2  overly confident
    10 Cautious  col3  too conservative
    # ... with 34 more rows
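
    For comparison, the same fixed-width split can be sketched in base R with substr(). This is only an illustration of what the widths 39, 40 and 40 mean, not a replacement for the pipeline above. It rebuilds two scales from the question's vector with sprintf(), on the assumption that the second column starts at character 41 and the third at character 80, as in the extracted lines. cell_line() and label_line() are hypothetical helpers used only to make that layout explicit.

    ```r
    # Hypothetical helpers that reproduce the layout of the extracted lines:
    # 40 leading blanks, a 39-character second column, then the third column.
    cell_line  <- function(a, b) sprintf("%-40s%-39s%s", "", a, b)
    label_line <- function(s) paste0(strrep(" ", 6), s)

    x <- c(
      "", cell_line("to lack passion", "easily annoyed"),
      label_line("Excitable"), cell_line("to lack a sense of urgency", "emotionally volatile"),
      "", cell_line("naive", "mistrustful"),
      label_line("Skeptical"), cell_line("gullible", "cynical")
    )

    lines <- x[nzchar(x)]  # drop the empty strings

    # Same widths as fwf_widths(c(39, 40, 40)): characters 1-39, 40-79, 80-119
    col1 <- trimws(substr(lines, 1, 39))
    col2 <- trimws(substr(lines, 40, 79))
    col3 <- trimws(substr(lines, 80, 119))

    # Every three non-empty lines belong to one scale; the label is on the middle line
    grp   <- (seq_along(lines) - 1) %/% 3
    scale <- col1[nzchar(col1)][grp + 1]

    # Keep only the lines that carry cell text
    keep   <- nzchar(col2)
    result <- data.frame(scale = scale[keep], col2 = col2[keep], col3 = col3[keep],
                         stringsAsFactors = FALSE)
    print(result)
    ```

    This keeps the two cells of each printed line side by side in one row, whereas the pipeline above returns them in long format.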