I'm using tidytext (and the tidyverse) to analyze some text data (as in Tidy Text Mining with R).
My input text file, myfile.txt, looks like this:
# Section 1 Name
Lorem ipsum dolor
sit amet ... (et cetera)
# Section 2 Name
<multiple lines here again>
with 60 or so sections.
I would like to generate a column section_name with the strings "Category 1 Name" or "Category 2 Name" as values for the corresponding lines. For instance, I have
library(tidyverse)
library(tidytext)
library(stringr)
fname <- "myfile.txt"
all_text <- readLines(fname)
all_lines <- tibble(text = all_text)
tidiedtext <- all_lines %>%
  mutate(linenumber = row_number(),
         section_id = cumsum(str_detect(text, regex("^#", ignore_case = TRUE)))) %>%
  filter(!str_detect(text, regex("^#"))) %>%
  ungroup()
which adds a column in tidiedtext for the corresponding section number for each line.
Is it possible to add a single line to the call to mutate() to add such a column? Or is there another approach I ought to be using?
Here's an approach using grepl (for simplicity) with if_else and tidyr::fill, but there's nothing wrong with the original approach; it's pretty similar to one used in the tidytext book. Also note that filtering after adding line numbers leaves gaps in the numbering where the header lines were removed. If that matters, add the line numbers after filter.
library(tidyverse)
text <- '# Section 1 Name
Lorem ipsum dolor
sit amet ... (et cetera)
# Section 2 Name
<multiple lines here again>'
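# tibble() replaces the deprecated data_frame()
all_lines <- tibble(text = read_lines(text))
tidied <- all_lines %>%
  mutate(line = row_number(),
         # keep the header text on header rows, NA elsewhere
         section = if_else(grepl('^#', text), text, NA_character_)) %>%
  fill(section) %>%               # carry each header down to the lines below it
  filter(!grepl('^#', text))      # drop the header rows themselves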
tidied
#> # A tibble: 3 × 3
#>   text                         line section
#>   <chr>                       <int> <chr>
#> 1 Lorem ipsum dolor               2 # Section 1 Name
#> 2 sit amet ... (et cetera)        3 # Section 1 Name
#> 3 <multiple lines here again>     5 # Section 2 Name
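If the gaps in line matter, or the leading "# " should not end up in the section name, the pipeline above could be adjusted roughly as follows. This is only a sketch on the same sample text, supplied here as a character vector: the names tidied2 and section_name, and the use of str_remove() to strip the marker, are assumptions for illustration rather than part of the original code.

```r
library(tidyverse)

# Same sample text as above, as an in-memory character vector
text <- c("# Section 1 Name",
          "Lorem ipsum dolor",
          "sit amet ... (et cetera)",
          "# Section 2 Name",
          "<multiple lines here again>")

tidied2 <- tibble(text = text) %>%
  # header rows keep their text minus the leading "# "; other rows get NA
  mutate(section_name = if_else(str_detect(text, "^#"),
                                str_remove(text, "^#\\s*"),
                                NA_character_)) %>%
  # carry each header down to the lines below it
  fill(section_name) %>%
  # drop the header rows themselves
  filter(!str_detect(text, "^#")) %>%
  # number the lines only after filtering, so the numbering has no gaps
  mutate(line = row_number())

tidied2
```

Whether to number before or after filter depends on whether line should refer to positions in the original file or to positions among the remaining content lines.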
Or if you just want to format the numbers you've already got, just add section_name = paste('Category', section_id, 'Name') to your mutate call.
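Put together with the code from the question, that might look like the sketch below. It assumes the sample lines are supplied as a character vector in place of reading myfile.txt, and it drops the regex() wrapper and the ungroup() call, neither of which changes the result here.

```r
library(tidyverse)

# Stand-in for readLines("myfile.txt")
all_lines <- tibble(text = c("# Section 1 Name",
                             "Lorem ipsum dolor",
                             "sit amet ... (et cetera)",
                             "# Section 2 Name",
                             "<multiple lines here again>"))

tidiedtext <- all_lines %>%
  mutate(linenumber = row_number(),
         # running count of header lines seen so far
         section_id = cumsum(str_detect(text, "^#")),
         # mutate() evaluates its arguments in order, so section_id is usable here
         section_name = paste("Category", section_id, "Name")) %>%
  filter(!str_detect(text, "^#"))

tidiedtext
```

Because section_name is built from section_id alone, this produces labels of the form "Category 1 Name" regardless of what the header lines actually say.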