
How do I combine lines of text in R based on column alignment


I am trying to parse text data from a questionnaire that I pulled out of a PDF with {pdftools}. I end up with a data frame that looks like this aligned text nightmare:

example <- data.frame(
  lines = c("Beverages",
            "What beverages did you drink?",
            "  Please check the box next to each beverage that you drank at least once in the past 12 months.",
            "        \uf06f Tomato juice or vegetable juice",
            "        \uf06f Orange juice or grapefruit juice",
            "        \uf06f Grape juice",
            "        \uf06f Other 100% fruit juices or 100% fruit juice mixtures (such as apple, pineapple, or others)",
            "        \uf06f Fruit or vegetable smoothies",
            "        \uf06f Other fruit drinks, regular or diet (such as Hi-C, fruit punch, lemonade, or cranberry",
            "            cocktail)",
            "        \uf06f Milk as a beverage (NOT in coffee, tea, or cereal; including soy, rice, almond, and",
            "            coconut milk; NOT including chocolate milk, hot chocolate, and milkshake)",
            "        \uf06f Chocolate milk or hot chocolate",
            "Tomato juice or vegetable juice",
            "        \uf06f You drank tomato juice or vegetable juice in the past 12 months.",
            "  Over the past 12 months, how often did you drink tomato juice or vegetable juice?",
            "        \uf0a1 1 time per month or less",
            "        \uf0a1 2-3 times per month"
            )
)

Each response begins with a box (\uf06f), and some responses are long enough to wrap onto a second line.

Can anybody offer advice on how to concatenate the text when a response is split over two lines?


Solution

  • You could use

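    library(dplyr)
    library(stringr)

    example %>%
      # Running ids: each match starts a new group that the rows below inherit
      group_by(
        category = cumsum(str_detect(lines, "^[^\\s]")),
        group_1  = cumsum(str_detect(lines, "^\\s{2}(?!\\s)")),
        group_3  = cumsum(str_detect(lines, "\uf06f|\uf0a1"))) %>%
      mutate(
        # Trim unmarked lines once the first marker has been seen
        lines = ifelse(group_3 > 0 & !str_detect(lines, "\uf06f|\uf0a1"), str_trim(lines), lines),
        # Paste each response and its continuation lines into one string
        lines = case_when(
          group_3 > 0 ~ str_c(lines, collapse = " "),
          TRUE ~ lines
          )
        ) %>%
      # The pasted string is repeated on every row of its group; keep one copy
      distinct() %>%
      ungroup() %>%
      select(lines)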
    

    to get

    # A tibble: 16 x 1
       lines                                                                                                    
       <chr>                                                                                                    
     1 "Beverages"                                                                                              
     2 "What beverages did you drink?"                                                                          
     3 "  Please check the box next to each beverage that you drank at least once in the past 12 months."       
     4 "        \uf06f Tomato juice or vegetable juice"                                                         
     5 "        \uf06f Orange juice or grapefruit juice"                                                        
     6 "        \uf06f Grape juice"                                                                             
     7 "        \uf06f Other 100% fruit juices or 100% fruit juice mixtures (such as apple, pineapple, or others)"
     8 "        \uf06f Fruit or vegetable smoothies"                                                            
     9 "        \uf06f Other fruit drinks, regular or diet (such as Hi-C, fruit punch, lemonade, or cranberry cocktail)"
    10 "        \uf06f Milk as a beverage (NOT in coffee, tea, or cereal; including soy, rice, almond, and coconut milk; NOT including chocolate milk, hot chocolate, and milkshake)"
    11 "        \uf06f Chocolate milk or hot chocolate"                                                         
    12 "Tomato juice or vegetable juice"                                                                        
    13 "        \uf06f You drank tomato juice or vegetable juice in the past 12 months."                        
    14 "Over the past 12 months, how often did you drink tomato juice or vegetable juice?"                      
    15 "        \uf0a1 1 time per month or less"                                                                
    16 "        \uf0a1 2-3 times per month" 
    

    What are we trying to do?

    1. First we build a "category": rows that do not start with a whitespace character, matched by "^[^\\s]". ^ anchors the match at the start of the string and [^\\s] means "not a whitespace character".
    2. The next grouping level is rows that start with exactly two whitespace characters not followed by a third, hence "^\\s{2}(?!\\s)".
    3. The last grouping level is rows containing one of the two marker characters, "\uf06f|\uf0a1". Rows without a marker inherit the id of the marker row above them, which is what lets a wrapped line be pasted onto its response.
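
To see how the running ids behave, here is a minimal sketch on a made-up five-line input. It is not part of the answer's code: the vector `toy` is hypothetical, "*" stands in for the box glyph, and base R's grepl(..., perl = TRUE) is swapped in for str_detect() so the snippet runs without extra packages.

```r
# Hypothetical toy input; "*" is a stand-in for the \uf06f marker.
toy <- c("Heading",
         "  Question text",
         "    * first answer, wrapped",
         "      onto a second line",
         "    * second answer")

# Each TRUE bumps the running total, so a matching row and the
# non-matching rows below it share one id.
category <- cumsum(grepl("^[^\\s]", toy, perl = TRUE))        # 1 1 1 1 1
question <- cumsum(grepl("^\\s{2}(?!\\s)", toy, perl = TRUE)) # 0 1 1 1 1
answer   <- cumsum(grepl("*", toy, fixed = TRUE))             # 0 0 1 1 2

# Paste together the rows that share an answer id (ids > 0 only).
keep   <- answer > 0
merged <- as.vector(tapply(trimws(toy[keep]), answer[keep],
                           paste, collapse = " "))
print(merged)
```

The wrapped line gets the same `answer` id as the marked line above it, so the two collapse into "* first answer, wrapped onto a second line". The answer's pipeline does the same thing inside group_by() and additionally keeps the heading and question rows.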