
Conditional loop for a dataframe in R


I have a dataframe (tibble) containing information about sentences. The dataframe has the following structure:

word   position  category  related_word  sentence
a      1         det       2             1
man    2         noun      3             1
sees   3         verb      0             1
a      4         det       5             1
horse  5         noun      3             1
and    6         conj      7             1
a      7         det       8             1
dog    8         noun      3             1
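
To make the example reproducible, the table above could be built as a tibble along these lines. This is only a minimal sketch: the name `df` is illustrative, and it assumes the three numeric columns are stored as integers.

```r
library(tibble)

# Example data from the table above: one sentence of eight tokens.
df <- tibble(
  word         = c("a", "man", "sees", "a", "horse", "and", "a", "dog"),
  position     = 1:8,
  category     = c("det", "noun", "verb", "det", "noun", "conj", "det", "noun"),
  related_word = c(2L, 3L, 0L, 5L, 3L, 7L, 8L, 3L),  # 0 = no related word
  sentence     = rep(1L, 8)
)
```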

I would like to create a loop that goes through every sentence in the dataframe (the sentence number is in the last column) and, for each noun (category == "noun"), finds its related word using the value of related_word in the same row as the noun. The value of related_word corresponds to the position of the related word. The loop would then put both words (the noun and its related word) in a new column, in the format "noun related-word" (e.g. "man sees").

For the dataframe I provided above, there are three nouns in the first sentence. So the loop would first take the first noun ("man") and find its related word by using the value of related_word (3). Since this value is 3, the related word is the one at position 3, i.e. "sees". The loop would then write the complete pair, "man sees", in the same row as the word "man", in a new column (called "pair").

For the remaining two nouns ("horse" and "dog"), the new column would hold the following values: "horse sees" and "dog sees".

How could I approach this? There are a few problems here, but the main one is how to use the value of related_word to look up the value of a different variable in another row. E.g. how can I get from "man" to "sees"?


Solution

  • You can join the table on itself (join on sentence, and on position equaling related_word). Here is a start; perhaps give us more information about what you want the output to look like?

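    library(dplyr)

    # Self-join: each noun row (.y) is matched to the row (.x) in the same
    # sentence whose position equals the noun's related_word.
    df %>%
      inner_join(filter(df, category == "noun"), by = c("sentence" = "sentence", "position" = "related_word")) %>%
      mutate(newcol = paste(word.y, word.x)) %>%
      select(sentence, newcol)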
    

    Output:

    # A tibble: 3 × 2
      sentence newcol    
         <int> <chr>     
    1        1 man sees  
    2        1 horse sees
    3        1 dog sees  
    

    If you want to keep every row of the original dataframe instead, wrap the above in a left_join() (notice that in this iteration I retain position.y in the final select statement, to facilitate the join):

    df %>% left_join(
      df %>%
        inner_join(filter(df,category=="noun"), by=c("sentence"="sentence", "position"="related_word")) %>% 
        mutate(newcol = paste(word.y,word.x)) %>% 
        select(sentence, position.y, newcol),
      by=c("sentence"="sentence", "position" = "position.y")
    )
    

    Output:

    # A tibble: 8 × 6
      word  position category related_word sentence newcol    
      <chr>    <int> <chr>           <int>    <int> <chr>     
    1 a            1 det                 2        1 NA        
    2 man          2 noun                3        1 man sees  
    3 sees         3 verb                0        1 NA        
    4 a            4 det                 5        1 NA        
    5 horse        5 noun                3        1 horse sees
    6 and          6 conj                7        1 NA        
    7 a            7 det                 8        1 NA        
    8 dog          8 noun                3        1 dog sees
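
If the goal is simply to look up the word sitting at position `related_word` within the same sentence, a grouped `match()` is another possible route that avoids the self-join. The following is a sketch, not a definitive implementation: it assumes `position` is unique within each sentence, rebuilds the example data so it runs on its own, names the new column `pair` as in the question, and uses `related` as a temporary helper column introduced here for illustration.

```r
library(dplyr)

# Example data from the question (assumed integer columns).
df <- tibble(
  word         = c("a", "man", "sees", "a", "horse", "and", "a", "dog"),
  position     = 1:8,
  category     = c("det", "noun", "verb", "det", "noun", "conj", "det", "noun"),
  related_word = c(2L, 3L, 0L, 5L, 3L, 7L, 8L, 3L),
  sentence     = rep(1L, 8)
)

result <- df %>%
  group_by(sentence) %>%
  mutate(
    # For each row, find the row in the same sentence whose position
    # equals related_word and take its word (NA if there is none).
    related = word[match(related_word, position)],
    # Only nouns with a resolvable related word get a pair.
    pair = if_else(category == "noun" & !is.na(related),
                   paste(word, related),
                   NA_character_)
  ) %>%
  ungroup() %>%
  select(-related)
```

Because `match()` returns `NA` when no position equals `related_word` (as with the value 0 for "sees"), rows without a related word end up with `NA` rather than raising an error.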