
R: find a specific string next to another string with for loop


I have the text of a novel in a single vector, novel.vector.words, which has been split into words. I am looking for all instances of the string "blood of". However, since the vector is split by words, each word is its own string, and I don't know how to search for adjacent strings in a vector.

I have a basic understanding of what for loops do, and following some instructions from a textbook, I can use this for loop to target all positions of "blood" and the context around it to create a tab-delimited KWIC (key words in context) display.

node.positions <- grep("blood", novel.vector.words)

output.conc <- "D:/School/U Alberta/Classes/Winter 2019/LING 603/dracula_conc.txt"
cat("LEFT CONTEXT\tNODE\tRIGHT CONTEXT\n", file=output.conc) # tab-delimited header

#This establishes the range of how many words we can see in our KWIC display
context <- 10 # specify a window of ten words before and after the match

for (i in seq_along(node.positions)){ # access each match (seq_along is safe when there are no matches)
  # access the current match
  node <- novel.vector.words[node.positions[i]]
  # access the left context of the current match
  left.context <- novel.vector.words[(node.positions[i]-context):(node.positions[i]-1)]
  # access the right context of the current match
  right.context <- novel.vector.words[(node.positions[i]+1):(node.positions[i]+context)]
  # concatenate and print the results
  cat(left.context,"\t", node, "\t", right.context, "\n", file=output.conc, append=TRUE)}

What I am not sure how to do, however, is use something like an if statement to capture only the instances of "blood" that are followed by "of". Do I need another variable in the for loop? Basically, for every instance of "blood" that the loop finds, I want to check whether the word that immediately follows it is "of". I want the loop to find all of those instances and tell me how many there are in my vector.


Solution

  • You can create an index using dplyr::lead to match 'of' following 'blood':

    library(dplyr)
    
    novel.vector.words <- c("blood", "of", "blood", "red", "blood", "of", "blue", "blood")
    
    # Anchor the patterns so that words such as "bloody" or "off" are not matched
    which(grepl("^blood$", novel.vector.words) & grepl("^of$", lead(novel.vector.words)))
    
    [1] 1 5
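
The question also asks how many matches there are. The sketch below is a base-R variant of the same idea that needs no extra packages; it assumes exact, lower-case tokens (so "Blood" or "blood," would not match), and the names blood.of.positions and blood.of.count are made up for illustration.

```r
novel.vector.words <- c("blood", "of", "blood", "red", "blood", "of", "blue", "blood")

# Line each word up with its successor: drop the last word on one side
# and the first word on the other, then compare element by element.
current.word <- head(novel.vector.words, -1)
next.word    <- tail(novel.vector.words, -1)

# Positions of "blood" that are immediately followed by "of"
blood.of.positions <- which(current.word == "blood" & next.word == "of")

# Number of "blood of" sequences in the vector
blood.of.count <- length(blood.of.positions)

print(blood.of.positions)
print(blood.of.count)
```

With the toy vector above this reports positions 1 and 5, for a count of 2.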
    

    In response to the question in the comments:

    This could certainly be done with a loop-based approach, but there is little point in reinventing the wheel when existing packages are better designed and optimized for the heavy lifting in text-mining tasks.
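
For completeness, a minimal sketch of what such a loop might look like, using an if statement as the question suggests. It assumes exact, lower-case tokens; blood.of.positions is an illustrative name, not something from the original code.

```r
novel.vector.words <- c("blood", "of", "blood", "red", "blood", "of", "blue", "blood")

# Exact matches of "blood" (unlike grep(), this does not pick up "bloody")
node.positions <- which(novel.vector.words == "blood")

# Collect the positions of "blood" that are followed by "of"
blood.of.positions <- integer(0)

for (pos in node.positions) {
  # Guard against "blood" being the very last word, which has no successor
  if (pos < length(novel.vector.words) && novel.vector.words[pos + 1] == "of") {
    blood.of.positions <- c(blood.of.positions, pos)
  }
}

# How many times "blood" is immediately followed by "of"
print(length(blood.of.positions))
```

Growing a vector inside a loop is fine at this scale; for a whole corpus the vectorized comparison is usually the more idiomatic choice in R.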

    Here is an example of how to find how frequently the words 'blood' and 'of' appear within five words of each other in Bram Stoker's Dracula using the tidytext package.

    library(tidytext)
    library(dplyr)
    library(stringr)
    
    ## Read Dracula into dataframe and add explicit line numbers
    fulltext <- data.frame(text=readLines("https://www.gutenberg.org/ebooks/345.txt.utf-8", encoding = "UTF-8"), stringsAsFactors = FALSE) %>%
      mutate(line = row_number())
    
    ## Pair of words to search for and word distance
    word1 <- "blood"
    word2 <- "of"
    word_distance <- 5
    
    ## Create ngrams using skip_ngrams token
    blood_of <- fulltext %>% 
      unnest_tokens(output = ngram, input = text,  token = "skip_ngrams", n = 2, k = word_distance - 1) %>%
      filter(str_detect(ngram, paste0("\\b", word1, "\\b")) & str_detect(ngram, paste0("\\b", word2, "\\b"))) 
    
    ## Return count
    blood_of %>%
      nrow
    
    [1] 54
    
    ## Inspect first six line number indices
    head(blood_of$line)
    
    [1]  999 1279 1309 2192 3844 4135
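
To tie this back to the KWIC display in the question, the matched positions can feed the same context-window logic. The following is only a sketch under stated assumptions: kwic.line is a hypothetical helper, the sample vector and the window of three words are made up, and the output is printed to the console rather than written to a file. Using head() and tail() keeps the window inside the vector, so matches near the start or end do not produce negative indices or NA values.

```r
novel.vector.words <- c("the", "blood", "of", "the", "count", "was", "cold", "blood", "of")
context <- 3 # illustrative window size

# Hypothetical helper: one tab-delimited KWIC line for the two-word node
# that starts at position pos ("blood" at pos, "of" at pos + 1).
kwic.line <- function(words, pos, context) {
  left  <- tail(head(words, pos - 1), context)     # up to `context` words before "blood"
  node  <- words[pos:(pos + 1)]                    # the node itself: "blood" "of"
  right <- head(tail(words, -(pos + 1)), context)  # up to `context` words after "of"
  paste(paste(left, collapse = " "),
        paste(node, collapse = " "),
        paste(right, collapse = " "),
        sep = "\t")
}

# Positions of "blood" immediately followed by "of" (exact matches only)
positions <- which(head(novel.vector.words, -1) == "blood" &
                   tail(novel.vector.words, -1) == "of")

kwic.lines <- vapply(positions,
                     function(p) kwic.line(novel.vector.words, p, context),
                     character(1))

cat("LEFT CONTEXT\tNODE\tRIGHT CONTEXT\n")
cat(kwic.lines, sep = "\n")
```

To write to a file as in the question, the two cat() calls could take file= and append= arguments in the same way as the original loop.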