Search code examples
rdplyrsubset

subset rows between two rows containing specific values?


I have multiple data frames with the generic layout below. The strings of text vary in length from a few words to multiple sentences. The title strings on each data frame all vary slightly but they all share a word in common (for example, all of the TitleBs on each data frame share the word “code” in common and all TitleCs share the word “write” in common).

|element|NumbID|String   |
|-—————-|-————-|-———————-|
|header |1     |TitleA   |
|para   |2     |TxtStrng |
|header |3     |TitleB   |
|header |4     |Subtit1  |
|para   |5     |TxtStrng |
|header |6     |Subtit2  |
|para   |7     |TxtStrng |
|header |8     |TitleC   |

I am trying to figure out how to write a code that can be used on all the data frames and will allow me to extract all the rows starting at TitleB and just before TitleC, as in the example below.

|element|NumbID|String   |
|:————-:|:———-:| :————--:|
|header |3     |TitleB   |
|header |4     |Subtit1  |
|para   |5     |TxtStrng |
|header |6     |Subtit2  |
|para   |7     |TxtStrng |

I thought maybe I could use subset() in some way to do this but I’m really struggling to figure out how to make it work.


Solution

  • So, you need a way to identify TitleB and TitleC strings. From you're description, I'll use grepl("code", String) for TitleB and grepl("write", String) for TitleC.

    Then we need to identify rows where a TitleB has already occurred but a TitleC hasn't: we can use cumsum for this to generate a cumulative count of occurrences:

    result = subset(
      your_data,
      cumsum(grepl("code", String)) > 0 &
        cumsum(grepl("write", String)) == 0
      )
    

    If you need more help, please make your example more reproducible, preferably using dput() to share a copy/pasteable version of the data in valid R syntax.