I am trying to parse text data from a questionnaire that I pulled out of a PDF with {pdftools}. I end up with a data frame that looks like this aligned-text nightmare:
example <- data.frame(
lines = c("Beverages",
"What beverages did you drink?",
" Please check the box next to each beverage that you drank at least once in the past 12 months.",
" Tomato juice or vegetable juice",
" Orange juice or grapefruit juice",
" Grape juice",
" Other 100% fruit juices or 100% fruit juice mixtures (such as apple, pineapple, or others)",
" Fruit or vegetable smoothies",
" Other fruit drinks, regular or diet (such as Hi-C, fruit punch, lemonade, or cranberry",
" cocktail)",
" Milk as a beverage (NOT in coffee, tea, or cereal; including soy, rice, almond, and",
" coconut milk; NOT including chocolate milk, hot chocolate, and milkshake)",
" Chocolate milk or hot chocolate",
"Tomato juice or vegetable juice",
" You drank tomato juice or vegetable juice in the past 12 months.",
" Over the past 12 months, how often did you drink tomato juice or vegetable juice?",
" 1 time per month or less",
" 2-3 times per month"
)
)
Each response begins with a box glyph (\uf06f), and sometimes a response is long enough to wrap onto a second line.
Can anybody offer advice on how to concatenate the text when a response is split over two lines?
You could use
library(dplyr)
library(stringr)

example %>%
  group_by(
    category = cumsum(str_detect(lines, "^[^\\s]")),
    group_1 = cumsum(str_detect(lines, "^\\s{2}(?!\\s)")),
    group_3 = cumsum(str_detect(lines, "\uf06f|\uf0a1"))) %>%
  mutate(
    lines = ifelse(group_3 > 0 & !str_detect(lines, "\uf06f|\uf0a1"), str_trim(lines), lines),
    lines = case_when(
      group_3 > 0 ~ str_c(lines, collapse = " "),
      TRUE ~ lines
    )
  ) %>%
  distinct() %>%
  ungroup() %>%
  select(lines)
to get
# A tibble: 16 x 1
lines
<chr>
1 "Beverages"
2 "What beverages did you drink?"
3 " Please check the box next to each beverage that you drank at least once in the past 12 months."
4 " \uf06f Tomato juice or vegetable juice"
5 " \uf06f Orange juice or grapefruit juice"
6 " \uf06f Grape juice"
7 " \uf06f Other 100% fruit juices or 100% fruit juice mixtures (such as apple, pineapple, or others)"
8 " \uf06f Fruit or vegetable smoothies"
9 " \uf06f Other fruit drinks, regular or diet (such as Hi-C, fruit punch, lemonade, or cranberry cocktail)"
10 " \uf06f Milk as a beverage (NOT in coffee, tea, or cereal; including soy, rice, almond, and coconut milk; NOT including chocolate milk, hot chocolate, and milkshake)"
11 " \uf06f Chocolate milk or hot chocolate"
12 "Tomato juice or vegetable juice"
13 " \uf06f You drank tomato juice or vegetable juice in the past 12 months."
14 "Over the past 12 months, how often did you drink tomato juice or vegetable juice?"
15 " \uf0a1 1 time per month or less"
16 " \uf0a1 2-3 times per month"
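What are we trying to do? Each pattern flags the lines that start a new group:
category uses "^[^\\s]". Here ^ means "starting with" and [^\\s] means "not a space character", so it matches lines with no leading whitespace.
group_1 uses "^\\s{2}(?!\\s)". It matches lines that start with exactly two whitespace characters; the negative lookahead (?!\\s) rejects a third.
group_3 uses "\uf06f|\uf0a1". It matches lines that contain either response glyph.
cumsum() turns each match into a running group id. A wrapped continuation line matches none of the patterns, so it keeps the ids of the response above it, and str_c(lines, collapse = " ") pastes the two together.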
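To see what the three patterns from the pipeline above match, here is a minimal sketch on a handful of made-up lines. The vector x and the helper names starts_category, starts_question and starts_response are hypothetical. The indentation widths (0, 2 and 4 or more spaces) are also an assumption and may not match what pdftools produced for your file.

```r
library(stringr)

# Hypothetical lines; the indentation widths are assumed for illustration.
x <- c(
  "Beverages",
  "  Please check the box next to each beverage.",
  "    \uf06f Other fruit drinks (such as cranberry",
  "      cocktail)",
  "    \uf06f Grape juice"
)

starts_category <- str_detect(x, "^[^\\s]")         # no leading whitespace
starts_question <- str_detect(x, "^\\s{2}(?!\\s)")  # exactly two leading whitespace characters
starts_response <- str_detect(x, "\uf06f|\uf0a1")   # contains one of the two response glyphs

# cumsum() bumps the running id at every TRUE, so a continuation line
# (FALSE for all three patterns) inherits the ids of the line above it.
groups <- data.frame(
  category = cumsum(starts_category),
  group_1  = cumsum(starts_question),
  group_3  = cumsum(starts_response)
)
groups
```

The third and fourth lines end up with the same combination of ids, which is why grouping by all three columns lets the wrapped "cocktail)" line be collapsed into the response it belongs to.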