
How to subset several paragraphs between two unique phrases from a text in R?


Here is my data:

text <- "text I do not want 

A:paragraph1

paragraph2

B:text I do not want

A:paragraph3

paragraph4

B:text I do not want"

Desired Output: Paragraph1Paragraph2,Paragraph3Paragraph4

I scanned the data (a text file) line by line. How can I subset all paragraphs between "A:" and "B:" in the whole text to get this: Paragraph1Paragraph2,Paragraph3Paragraph4? Thanks a lot!


Solution

  • So, cleaning up the question a little, here are two methods you can use to get this:

    text <- "text I do not want
    
    A:paragraph1
    
    paragraph2
    
    B:text I do not want
    
    A:paragraph3
    
    paragraph4
    
    B:text I do not want"
    

    Method A: Use vectors of matches to group lines. Split your text on newlines so you have a vector of lines. Find the lines that match the start marker and the lines that match the end marker. Adding the two logical vectors and taking the cumsum gives a running count of markers, which is odd from each A: line up to (but not including) the next B: line, and those are the lines you extract.

    The downside is that this breaks if A: and B: appear on the same line; the worst case is B: appearing before A: on a line. (See Method B to get around this.)

    # Method A: No newline substitution, split on newlines and use cumsum to handle groups
    
    # We split so that we can more easily work across lines
    split_text <- strsplit(text, "\n")[[1]]
    
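    # Note: this method implicitly assumes that A: and B: each start a line and never share a line.
    #  If that is not the case, splitting on newlines is not enough; Method B handles it.
    has_start <- grepl("A:", split_text, fixed = TRUE)
    has_end <- grepl("B:", split_text, fixed = TRUE)
    
    # The running count of markers is odd from an A: line up to (not including) the next B: line
    in_block <- cumsum(has_start + has_end) %% 2 == 1
    # Drop the leading "A:" marker so only the paragraph text remains
    method_a <- paste(sub("^A:", "", split_text[in_block]), collapse = "\n")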
    print(method_a)
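
    If you want one string per A:/B: block, as in the desired output from the question, the same cumsum idea can also label each block. This is only a sketch, and it assumes the file has already been read into a character vector with one element per line (for example with readLines) and that the markers start their lines. The names lines, inside, block_id, and blocks are illustrative and not part of the original answer.

    ```r
    # One element per line, as you would get from reading the file line by line
    lines <- c("text I do not want", "", "A:paragraph1", "", "paragraph2", "",
               "B:text I do not want", "", "A:paragraph3", "", "paragraph4", "",
               "B:text I do not want")

    is_start <- startsWith(lines, "A:")
    is_end <- startsWith(lines, "B:")

    # TRUE from each A: line up to (not including) the matching B: line
    inside <- (cumsum(is_start) - cumsum(is_end)) == 1

    # Number the blocks: every A: line starts a new block
    block_id <- cumsum(is_start)[inside]

    # Strip the marker, then glue each block's lines together
    kept <- sub("^A:", "", lines[inside])
    blocks <- vapply(split(kept, block_id), paste, character(1), collapse = "")

    result <- paste(blocks, collapse = ",")
    print(result)
    ```

    Keeping blocks as a vector until the last step means you can choose a different separator, or keep the blocks separate, without changing the grouping logic.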
    

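    Method B: Use a regex across the whole text. A pattern is matched against each string separately, and depending on the regex engine, . may not match a newline (with perl = TRUE it does not unless (?s) is set). One way to sidestep this is to replace the newlines with a placeholder so everything sits on a single line, run the regex, and then add the newlines back in at the end.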

    # Method B: newline substitution, then use regex to find what's in between.
    # Substituting a placeholder for each newline means the regex never has to match across lines.
    # The only downside is a slight risk that your text already contains the literal ___newline___
    
    # replace newlines so that everything is on a single line
    newline_substitute <- "___newline___"
    newline_substituted_text <- gsub("\n", newline_substitute, text, fixed = TRUE)
    
    # Find all matches
    # Here I want everything between A: and B:
    # if it's possible that A: and B: appear in the paragraph text
    # you may need to be more specific, find A: and B: 
    # where there is a ___newline___ before A, etc...
    match_regex <- "A:(.*?)B:"
    matches <- gregexpr(match_regex, newline_substituted_text)
    
    # Extract matches 
    extracted_text <- unlist(regmatches(newline_substituted_text, matches))
    
    # Concatenate together
    extracted_text <- paste(extracted_text, collapse = "")
    
    # replace starting and ending blocks
    # this removes A: and B: so we just get paragraphs
    extracted_text <- gsub(match_regex, "\\1", extracted_text)
    
    # add newlines back in
    method_b <- gsub(newline_substitute, "\n", extracted_text)
    print(method_b)
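
    If you would rather avoid the placeholder, one alternative is to let the regex span newlines directly. This is a sketch and not part of the original answer: it assumes the PCRE engine (perl = TRUE), where the inline flag (?s) makes . match newlines as well. The names pattern, blocks, and result are illustrative.

    ```r
    # Build the multi-line input as a single string
    text <- paste(c("text I do not want", "", "A:paragraph1", "", "paragraph2", "",
                    "B:text I do not want", "", "A:paragraph3", "", "paragraph4", "",
                    "B:text I do not want"), collapse = "\n")

    # (?s) lets . match newlines; .*? is lazy so each match stops at the first B:
    pattern <- "(?s)A:(.*?)B:"
    blocks <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]

    # Remove the markers at both ends of every match
    blocks <- sub("^A:", "", blocks)
    blocks <- sub("B:$", "", blocks)

    # Drop the newlines so the paragraphs of a block are concatenated
    blocks <- gsub("\n", "", blocks, fixed = TRUE)

    result <- paste(blocks, collapse = ",")
    print(result)
    ```

    As with Method B, this matches any A: and B: in the text, so if the markers can also appear inside a paragraph you may need a stricter pattern.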