How do I combine some vector elements in the same vector using R?

I extracted a table from a PDF using pdftools in R. The table in the PDF has multi-line text in its columns. I replaced every run of two or more spaces with "|" to make the columns easier to split. The problem I'm running into is that, because of the multi-line cells and the way the table is formatted in the PDF, the data comes in out of order. The original looks like this

The data that I extracted looks like this:

    scale_definitions <- c("", "                                        to lack passion                        easily annoyed", 
"      Excitable", "                                        to lack a sense of urgency             emotionally volatile", 
"", "                                        naive                                  mistrustful", 
"      Skeptical", "                                        gullible                               cynical", 
"", "                                        overly confident                       too conservative", 
"      Cautious", "                                        to make risky decisions                risk averse", 
"", "                                        to avoid conflict                      aloof and remote", 
"      Reserved", "                                        too sensitive                          indifferent to others' feelings", 
"", "                                        unengaged                              uncooperative", 
"      Leisurely", "                                        self-absorbed                          stubborn", 
"", "                                        unduly modest                          arrogant", 
"      Bold", "                                        self-doubting                          entitled and self-promoting", 
"", "                                        over controlled                        charming and fun", 
"      Mischievous", "                                        inflexible                             careless about commitments", 
"", "                                        repressed                              dramatic", 
"      Colorful", "                                        apathetic                              noisy", 
"", "                                        too tactical                           impractical", 
"      Imaginative", "                                        to lack vision                         eccentric", 
"", "                                        careless about details                 perfectionistic", 
"      Diligent", "                                        easily distracted                      micromanaging", 
"", "                                        possibly insubordinate                 respectful and deferential", 
"      Dutiful", "                                        too independent                        eager to please"
)

scale_definitions <- scale_definitions %>% str_replace_all("\\s{2,}", "|")

How do I best put this into a data frame?

Solution

Unfortunately, a reprex would be too complex, so here is a description of how you can achieve a structured df:

I am afraid you have to use pdftools::pdf_data() instead of pdftools::pdf_text().

This way you get a list with one df per page. Each df has one row per word on the page, together with its exact location (plus its width and height, IIRC). With this in hand you can write a parser to accomplish your task. That will be a bit of work, but it is the only way I know to solve this sort of problem.
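A minimal sketch of what such a parser could look like, in base R. It does not read a PDF: `words` is a hand-made stand-in for one page of `pdftools::pdf_data()` output (which has `x`, `y` and `text` columns, among others), the coordinates are invented, and `col_breaks` is a hypothetical set of column boundaries you would have to read off your own document.

```r
# Stand-in for one page of pdftools::pdf_data() output: one row per word.
# All coordinates are invented for illustration.
words <- data.frame(
  x    = c(200L, 215L, 240L, 400L, 440L, 40L),
  y    = c(100L, 100L, 100L, 100L, 100L, 106L),
  text = c("to", "lack", "passion", "easily", "annoyed", "Excitable"),
  stringsAsFactors = FALSE
)

# Hypothetical x positions at which each table column starts.
col_breaks <- c(0, 150, 350, Inf)
words$col <- cut(words$x, breaks = col_breaks,
                 labels = c("scale", "low", "high"), right = FALSE)

# Sort words top to bottom, then left to right, so they paste in reading order.
words <- words[order(words$y, words$x), ]

# One cell per (line, column): paste together the words that share both.
cells <- aggregate(text ~ y + col, data = words, FUN = paste, collapse = " ")
cells <- cells[order(cells$y, cells$col), ]
print(cells)
```

In a real PDF, words on the same visual line can have slightly different `y` values, so you would probably need to round or bin `y` before grouping.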

Update:

I found a readr function that helps in your case, since we can assume a fixed length (nchar()) for the column positions:

library(tidyverse)

# start from the raw vector, i.e. before the "|" replacement;
# I() marks it as literal data (as far as I know readr >= 2.0 otherwise
# treats a plain character vector as file paths)
I(scale_definitions) %>%
    # parse into columns by width and therefore, implicitly, by start position
    readr::read_fwf(fwf_widths(c(39, 40, 40), c("col1", "col2", "col3"))) %>%
    # build group ID from row number (3 non-empty lines per scale)
    dplyr::mutate(grp = (dplyr::row_number() - 1) %/% 3) %>%
    # apply the grouping
    dplyr::group_by(grp) %>%
    # impute missing values in col1 within each group
    tidyr::fill(col1, .direction = "downup") %>%
    # remove the grouping to prevent unwanted behaviour downstream
    dplyr::ungroup() %>%
    # remove auxiliary variable
    dplyr::select(-grp) %>%
    # convert to long format (makes it safer to remove NAs)
    tidyr::pivot_longer(-col1, names_to = "cols", values_to = "vals") %>%
    # remove NAs
    dplyr::filter(!is.na(vals))

# A tibble: 44 x 3
   col1      cols  vals
   <chr>     <chr> <chr>
 1 Excitable col2  to lack passion
 2 Excitable col3  easily annoyed
 3 Excitable col2  to lack a sense of urgency
 4 Excitable col3  emotionally volatile
 5 Skeptical col2  naive
 6 Skeptical col3  mistrustful
 7 Skeptical col2  gullible
 8 Skeptical col3  cynical
 9 Cautious  col2  overly confident
10 Cautious  col3  too conservative
# ... with 34 more rows
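
The widths 39, 40, 40 rest on the assumption that every column starts at the same character position in every line. A rough way to check that in base R is sketched below. The two sample strings are rebuilt with `strrep()` to mimic the layout in the question (assuming 40 leading spaces before the second column), so run the same check on your own raw vector rather than trusting these numbers.

```r
# Two sample lines rebuilt to mimic the question's layout (an assumption).
x <- c(
  paste0(strrep(" ", 40), "to lack passion", strrep(" ", 24), "easily annoyed"),
  paste0(strrep(" ", 6), "Excitable")
)

# Start position of every chunk of text whose words are separated by
# single spaces; chunks separated by 2+ spaces count as different cells.
starts <- gregexpr("\\S+( \\S+)*", x, perl = TRUE)
print(as.integer(starts[[1]]))  # starts of the two cells in the first line
print(as.integer(starts[[2]]))  # start of the scale name in the second line

# The same split done by hand with the widths used above.
col1 <- trimws(substr(x, 1, 39))
col2 <- trimws(substr(x, 40, 79))
col3 <- trimws(substr(x, 80, 119))
print(data.frame(col1, col2, col3, stringsAsFactors = FALSE))
```

If the start positions differ between lines, fixed widths will cut words apart and the word-level approach with pdf_data() is the safer route.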