R - Delete unique rows in "neighborhood"


I have input data in the following format:

 stress word
 0      hello
 1      hello
 1      this
 1      is
 1      a
 1      normal
 0      normal
 1      test
 1      hello

I want to get this output:

 stress word       stress_pos
 0      hello      2
 1      hello      2
 1      normal     1
 0      normal     1

The dataset is a large list of words in which the first column marks the position of each word's stress: if the k-th row containing a word has a 1 in the first column, the stress falls on the k-th syllable. Words may appear in multiple places in the list, so I would like to remove rows whose word is not duplicated within a window of 3 rows (for each row, look at the previous and the next row). I'm only dealing with disyllabic words, which is why I only look at the direct neighbors.

I can't use duplicated() or unique() (or I don't know how) because they would process the whole table and not only a small part of it.

The third column gives the position of the stress within the word, which can be derived from the first column.

Is there any way to not use loops? And what would be a good way to go about this?


Solution

  • First, let's consider how to remove all words that are not duplicated in a directly neighboring row (the row immediately before or after them). You can test whether each word matches the word at offset d from it with:

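    # For each element of `words`, is it equal to the element d positions away?
    # d = -1 compares with the previous row, d = 1 with the next (d must be nonzero).
    # Positions without such a neighbor are padded with "" and so never match,
    # assuming no word is the empty string.
    matches <- function(words, d) {
      words <- as.character(words)
      if (d < 0) {
        # Drop the last -d words and pad the front.
        words == c(rep("", -d), head(words, d))
      } else {
        # Drop the first d words and pad the end.
        words == c(tail(words, -d), rep("", d))
      }
    }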
    

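    As a quick check of what `matches()` returns, the sketch below runs it on the sample words from the question. The `words` vector is typed in by hand from the question, and `prev_match` and `next_match` are illustrative names, not part of the original answer:

    ```r
    # Same helper as above, repeated so this snippet runs on its own.
    matches <- function(words, d) {
      words <- as.character(words)
      if (d < 0) {
        words == c(rep("", -d), head(words, d))
      } else {
        words == c(tail(words, -d), rep("", d))
      }
    }

    # Sample words copied from the question.
    words <- c("hello", "hello", "this", "is", "a",
               "normal", "normal", "test", "hello")

    prev_match <- matches(words, -1)  # TRUE where the previous row holds the same word
    next_match <- matches(words, 1)   # TRUE where the next row holds the same word

    # Rows that have an identical direct neighbor on either side.
    which(prev_match | next_match)
    # [1] 1 2 6 7
    ```

    The final "hello" in row 9 is dropped because neither of its neighbors matches it, even though the word occurs elsewhere in the list.
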
    Then you could grab the appropriate rows of your data with:

    (out <- dat[rowSums(sapply(c(-1, 1), function(d) matches(dat$word, d))) > 0,])
    #   stress   word
    # 1      0  hello
    # 2      1  hello
    # 6      1 normal
    # 7      0 normal
    

    All that remains is to determine the syllable that is stressed:

    out$word <- as.character(out$word)
    out$stress_pos <- ave(out$stress, out$word, FUN=function(x) min(which(x == 1)))
    out
    #   stress   word stress_pos
    # 1      0  hello          2
    # 2      1  hello          2
    # 6      1 normal          1
    # 7      0 normal          1
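
One caveat: `ave()` above groups by word over the whole of `out`, so a word that survives the filter in two separate places would be treated as a single group. The following is a minimal sketch of one possible alternative, not the answer's method: it uses `rle()` to label each run of adjacent identical words and computes the stress position per run. The construction of `dat` is an assumption reconstructed from the question's sample, and `run_id` is a hypothetical helper column.

```r
# Sample data reconstructed from the question (assumed layout of `dat`).
dat <- data.frame(
  stress = c(0, 1, 1, 1, 1, 1, 0, 1, 1),
  word   = c("hello", "hello", "this", "is", "a",
             "normal", "normal", "test", "hello"),
  stringsAsFactors = FALSE
)

# Label each run of consecutive identical words with its own id.
runs <- rle(dat$word)
dat$run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Keep only rows that belong to a run of length >= 2, i.e. rows whose
# previous or next row holds the same word.
out <- dat[rep(runs$lengths, runs$lengths) > 1, ]

# Position of the first stressed row within each run (NA if none is stressed).
out$stress_pos <- ave(out$stress, out$run_id,
                      FUN = function(x) which(x == 1)[1])
out$run_id <- NULL
out
#   stress   word stress_pos
# 1      0  hello          2
# 2      1  hello          2
# 6      1 normal          1
# 7      0 normal          1
```

On the sample data this gives the same result as the approach above. The two differ only when the same word forms more than one run.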