Tags: r, tidyverse, tidytext

Simple section labeling with tidytext for plain text input


I'm using tidytext (and the tidyverse) to analyze some text data (as in Tidy Text Mining with R).

My input text file, myfile.txt, looks like this:

# Section 1 Name
Lorem ipsum dolor
sit amet ... (et cetera)
# Section 2 Name
<multiple lines here again>

with 60 or so sections.

I would like to generate a column section_name with the strings "Category 1 Name" or "Category 2 Name" as values for the corresponding lines. For instance, I have

library(tidyverse)
library(tidytext)
library(stringr)

fname <- "myfile.txt"
all_text <- readLines(fname)
all_lines <- tibble(text = all_text)
tidiedtext <- all_lines %>%
  mutate(linenumber = row_number(),
         section_id = cumsum(str_detect(text, regex("^#", ignore_case = TRUE)))) %>%
  filter(!str_detect(text, regex("^#"))) %>%
  ungroup()

which adds a section_id column to tidiedtext holding the section number for each line.

Is it possible to add a single line to the call to mutate() to add such a column? Or is there another approach I ought to be using?


Solution

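  • Here's an approach that uses grepl (for simplicity) together with if_else and tidyr::fill: copy the header text into a new column on header rows, leave NA everywhere else, then fill the NAs downward. There's nothing wrong with your original approach, though; it's pretty similar to the one used in the tidytext book. Also note that filtering after adding line numbers leaves gaps in the numbering, because the numbers assigned to the header rows are dropped along with those rows. If that matters, add the line numbers after the filter.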

    library(tidyverse)
    
    text <- '# Section 1 Name
    Lorem ipsum dolor
    sit amet ... (et cetera)
    # Section 2 Name
    <multiple lines here again>'
    
    all_lines <- tibble(text = read_lines(text))  # data_frame() is deprecated in favour of tibble()
    
    tidied <- all_lines %>%
      mutate(line = row_number(),
             section = if_else(grepl('^#', text), text, NA_character_)) %>%
      fill(section) %>%
      filter(!grepl('^#', text))
    
    tidied
    #> # A tibble: 3 × 3
    #>                          text  line          section
    #>                         <chr> <int>            <chr>
    #> 1           Lorem ipsum dolor     2 # Section 1 Name
    #> 2    sit amet ... (et cetera)     3 # Section 1 Name
    #> 3 <multiple lines here again>     5 # Section 2 Name
    

    Or, if you only want to format the numbers you've already got, add section_name = paste('Category', section_id, 'Name') to your mutate call.
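
The points above can be combined into one pipeline. What follows is only a rough sketch, not the one right way to do it. It keeps the section_id counter from the question, stores the header text without its leading "#", and numbers the lines after the filter so the numbering has no gaps. The helper column is_header and the inlined sample input (instead of reading myfile.txt) are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Sample input from the question, one element per line
# (stands in for readLines("myfile.txt")).
all_text <- c("# Section 1 Name",
              "Lorem ipsum dolor",
              "sit amet ... (et cetera)",
              "# Section 2 Name",
              "<multiple lines here again>")

tidied <- tibble(text = all_text) %>%
  mutate(is_header = str_detect(text, "^#"),   # hypothetical helper column
         section_id = cumsum(is_header),       # running count of headers seen so far
         # Header text without the leading "#" and spaces; NA on non-header rows.
         section_name = if_else(is_header,
                                str_remove(text, "^#\\s*"),
                                NA_character_)) %>%
  fill(section_name) %>%                       # carry each header down to its lines
  filter(!is_header) %>%                       # drop the header rows themselves
  mutate(linenumber = row_number()) %>%        # number after filtering: no gaps
  select(-is_header)

tidied
```

Computing the header test once and reusing it for the counter, the name, and the filter keeps the three steps consistent if the header pattern ever changes.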