I am new to R and have a very large irregular column in a data frame like this:
x <- data.frame(section = c("BOOK I: Introduction", "Page one: presentation", "Page two: acknowledgments", "MAGAZINE II: Considerations", "Page one: characters", "Page two: index", "BOOK III: General Principles", "BOOK III: General Principles", "Page one: invitation"))
section
BOOK I: Introduction
Page one: presentation
Page two: acknowledgments
MAGAZINE II: Considerations
Page one: characters
Page two: index
BOOK III: General principles
BOOK III: General principles
Page one: invitation
I need to concatenate this column to look like this:
section
BOOK I: Introduction
BOOK I: Introduction / Page one: presentation
BOOK I: Introduction / Page two: acknowledgments
MAGAZINE II: Considerations
MAGAZINE II: Considerations / Page one: characters
MAGAZINE II: Considerations / Page two: index
BOOK III: General Principles
BOOK III: General Principles
BOOK III: General Principles / Page one: invitation
Basically the goal is to extract the value of the upper string based in a condition and then concatenate with the lower actualizing the value with a regex expression, but I really don't know how to do it.
Thanks in advance.
Here is one method:
x <- data.frame(section = c("BOOK I: Introduction", "Page one: presentation", "Page two: acknowledgments", "MAGAZINE II: Considerations", "Page one: characters", "Page two: index", "BOOK III: General Principles", "BOOK III: General Principles", "Page one: invitation"))
x <- dplyr::mutate(x,
isSection = stringr::str_starts(section, "Page", negate = TRUE),
sectionNum = cumsum(isSection)
) |>
dplyr::group_by(sectionNum) |>
dplyr::mutate(newSection = dplyr::if_else(
condition = isSection,
true = section,
false = paste(dplyr::first(section), section, sep = " / ")
)) |>
ungroup()
x
#> # A tibble: 9 × 4
#> section isSection sectionNum newSection
#> <chr> <lgl> <int> <chr>
#> 1 BOOK I: Introduction TRUE 1 BOOK I: Introduction
#> 2 Page one: presentation FALSE 1 BOOK I: Introduction / Page…
#> 3 Page two: acknowledgments FALSE 1 BOOK I: Introduction / Page…
#> 4 MAGAZINE II: Considerations TRUE 2 MAGAZINE II: Considerations
#> 5 Page one: characters FALSE 2 MAGAZINE II: Considerations…
#> 6 Page two: index FALSE 2 MAGAZINE II: Considerations…
#> 7 BOOK III: General Principles TRUE 3 BOOK III: General Principles
#> 8 BOOK III: General Principles TRUE 4 BOOK III: General Principles
#> 9 Page one: invitation FALSE 4 BOOK III: General Principle…
Created on 2022-03-25 by the reprex package (v2.0.1)
Here, we first determine if the section
is a section title or a page title and save that as TRUE
or FALSE
.
Then, we label the pages belonging to a section by using cumsum()
(cumulative sum). When we add up TRUE
and FALSE
values, TRUE
(here, sections) become 1
and increment the cumulative sum, but FALSE
(here, pages) become 0
and don't increment the cumulative sum, so all of the pages within a specific section receive the same value.
Lastly, we make a new section variable, this time using group_by()
and if_else()
to conditionally set the value. If isSection
is TRUE
, we just keep the existing value of section
(the section title). If isSection
is FALSE
, we concatenate the first value of section
from the group with the existing value of section
, separated by " / "
.