r, text-parsing

Find the closest previous element that matches a certain pattern in R


Given a vector such as:

c("node 1",
  "primary",
  "sports, improve",
  "music, improve",
  "painting, improve",
  "surrogate",
  "music",
  "node 2", 
  "primary", 
  "music, improve",
  "painting, improve",
  "node 3", 
  "primary",
  "sports, improve")

I want to combine each name listed under "primary" with its corresponding node into a single string. For example, the first node ("node 1", the first element of the vector above) should produce three strings: "node 1 sports", "node 1 music", and "node 1 painting". "node 2" should produce two: "node 2 music" and "node 2 painting". The real data is much larger than this vector, so indexing by hand and building the strings manually is not practical. My initial idea was to find every element that contains "improve" with grepl, but I can't find a way to assign the elements found this way to their corresponding node.


Solution

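  • With the vector stored as 'v1', create a grouping index from the occurrences of 'node': grepl gives a logical vector, and cumsum turns it into a running group number. split 'v1' into a list by that index, paste the first element of each group (the node) with the part before the comma of every element that contains 'improve', and stack the result into a two-column data.frame. The trailing [2:1] reverses the column order so that 'ind' comes first.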

    stack(lapply(split(v1, cumsum(grepl('node', v1))), 
       function(x) paste(x[1], sub(",.*", "", x[grep('improve', x)]))))[2:1]
    

    -output

    #  ind          values
    #1   1   node 1 sports
    #2   1    node 1 music
    #3   1 node 1 painting
    #4   2    node 2 music
    #5   2 node 2 painting
    #6   3   node 3 sports
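
The one-liner above can be easier to follow when the intermediate steps are named. The sketch below is a vectorised variant, not the answer's exact code. It reuses the same cumsum(grepl(...)) grouping but looks up the node label by group id instead of calling split and lapply. The names is_node, grp, keep, node_labels and result are illustrative, and it assumes the vector starts with a "node" element, so that every group id is at least 1.

```r
# Sample input from the question; the name v1 follows the answer above.
v1 <- c("node 1", "primary", "sports, improve", "music, improve",
        "painting, improve", "surrogate", "music",
        "node 2", "primary", "music, improve", "painting, improve",
        "node 3", "primary", "sports, improve")

# Step 1: TRUE wherever a new node starts. '^node' anchors the match to
# the start of the string (slightly stricter than the answer's 'node').
is_node <- grepl("^node", v1)

# Step 2: running count of node headers = group id of every element.
# Assumption: v1[1] is a node, so no element gets group id 0.
grp <- cumsum(is_node)

# Step 3: elements to keep, identified by the word 'improve'.
keep <- grepl("improve", v1)

# Step 4: look up each kept element's node label by its group id and
# paste it onto the name (everything before the first comma).
node_labels <- v1[is_node]
result <- paste(node_labels[grp[keep]], sub(",.*", "", v1[keep]))
result
```

This returns a plain character vector rather than a data.frame. If the group column is also needed, data.frame(ind = grp[keep], values = result) should give a comparable two-column layout, although 'ind' is then an integer rather than the factor that stack produces.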