
R: how to extract n-grams from each row


I have a dataframe df:

userID Score  Task_Alpha Task_Beta Task_Charlie Task_Delta 
3108   -8.00  Easy       Easy      Easy         Easy
3207    3.00  Hard       Easy      Match        Match
3350    5.78  Hard       Easy      Hard         Hard
3961    10.00 Easy       Easy      Hard         Hard


1. userID is a factor variable
2. Score is numeric
3. All the 'Task_' features are factor variables with possible values 'Hard', 'Easy', 'Match'
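
For reproducibility, here is a minimal sketch of how the example `df` could be rebuilt from the table and the three points above. The order of the factor levels is an assumption; the question does not state it.

```r
# Rebuild the example data frame shown in the question.
# Assumption: the level order below is arbitrary, the post does not specify one.
task_levels <- c("Easy", "Hard", "Match")

df <- data.frame(
  userID       = factor(c(3108, 3207, 3350, 3961)),
  Score        = c(-8.00, 3.00, 5.78, 10.00),
  Task_Alpha   = factor(c("Easy", "Hard", "Hard", "Easy"),  levels = task_levels),
  Task_Beta    = factor(c("Easy", "Easy", "Easy", "Easy"),  levels = task_levels),
  Task_Charlie = factor(c("Easy", "Match", "Hard", "Hard"), levels = task_levels),
  Task_Delta   = factor(c("Easy", "Match", "Hard", "Hard"), levels = task_levels)
)

str(df)  # userID and the Task_ columns are factors, Score is numeric
```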

I want to check for a possible association between the task transitions (Task_Alpha, Task_Beta, Task_Charlie, Task_Delta) and Score.

My hypothesis is that the 2-gram (bigram) sequence Hard Hard is associated with a higher score, whereas the sequence Easy Easy is associated with a lower score.

In this example I have only considered 2-grams. In my actual code I want to try longer sequences as well. Just for reference, besides the repeated pairs (Easy Easy, Hard Hard, Match Match), the possible bigrams are:

Easy Hard
Hard Easy
Easy Match
Match Easy
Hard Match
Match Hard
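
With three task values there are 3^2 = 9 ordered bigrams once the repeated pairs are counted, and 3^n n-grams in general. A small sketch for enumerating them in base R follows; `all_ngrams` is a hypothetical helper name, not something from the original post.

```r
task_levels <- c("Easy", "Hard", "Match")

# Hypothetical helper: every ordered n-gram over the given levels.
all_ngrams <- function(levels, n) {
  grid <- expand.grid(rep(list(levels), n), stringsAsFactors = FALSE)
  # expand.grid varies its first column fastest; reverse the columns so the
  # pasted n-grams come out sorted by their first element.
  apply(grid[, rev(seq_len(n)), drop = FALSE], 1, paste, collapse = " ")
}

bigrams <- all_ngrams(task_levels, 2)
length(bigrams)  # 3^2 = 9
bigrams
```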

Question: As a first step, the output I need looks something like this:

Task   Task  Score 
Easy   Easy -8.00
Easy   Easy -8.00
Easy   Easy -8.00
Hard   Easy  3.00
Easy  Match  3.00
Match Match  3.00
Hard   Easy  5.78
Easy   Hard  5.78
Hard   Hard  5.78
Easy   Easy  10.00
Easy   Hard  10.00
Hard   Hard  10.00

Solution

  • I was able to solve this problem as follows:

    Step 1: As a first step, I have concatenated the columns:

     df$all = paste(df$Task_Alpha,
                  df$Task_Beta,
                  df$Task_Charlie,
                  df$Task_Delta,
                  sep="-")
    
    userID  Score Task_Alpha Task_Beta Task_Charlie Task_Delta all
    3108   -8.00  Easy       Easy      Easy         Easy       Easy-Easy-Easy-Easy
    3207    3.00  Hard       Easy      Match        Match      Hard-Easy-Match-Match
    3350    5.78  Hard       Easy      Hard         Hard       Hard-Easy-Hard-Hard
    3961    10.00 Easy       Easy      Hard         Hard       Easy-Easy-Hard-Hard
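
If the number of Task_ columns may change, the concatenation in Step 1 could be written without naming each column. This is a sketch that assumes every column to concatenate starts with the prefix `Task_` and that the columns already appear in sequence order.

```r
# Example data from the question (strings converted to factors).
df <- data.frame(
  userID       = factor(c(3108, 3207, 3350, 3961)),
  Score        = c(-8.00, 3.00, 5.78, 10.00),
  Task_Alpha   = c("Easy", "Hard", "Hard", "Easy"),
  Task_Beta    = c("Easy", "Easy", "Easy", "Easy"),
  Task_Charlie = c("Easy", "Match", "Hard", "Hard"),
  Task_Delta   = c("Easy", "Match", "Hard", "Hard"),
  stringsAsFactors = TRUE
)

# Assumption: all sequence columns share the "Task_" prefix.
task_cols <- grep("^Task_", names(df), value = TRUE)

# paste() converts factors to their labels, so this matches the manual version.
df$all <- do.call(paste, c(df[task_cols], sep = "-"))
df$all
```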
    

    Step 2: To get a more general solution, I used an n-gram-based approach, which splits the strings into n-grams of any size I want:

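    library(tidytext)
    library(dplyr)
    
    df = as_tibble(df)
    # Split `all` into overlapping bigrams; tokens are lowercased by default.
    # drop = FALSE keeps the `all` column, as in the output below.
    df_test = df %>%
       unnest_tokens(bigram, all, token = "ngrams", n = 2, drop = FALSE)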
    

    This gives me the output:

    userID Score Task_Alpha Task_Beta Task_Charlie Task_Delta all                   bigram
    3108   -8.00 Easy       Easy      Easy         Easy       Easy-Easy-Easy-Easy   easy easy
    3108   -8.00 Easy       Easy      Easy         Easy       Easy-Easy-Easy-Easy   easy easy
    3108   -8.00 Easy       Easy      Easy         Easy       Easy-Easy-Easy-Easy   easy easy
    3207    3.00 Hard       Easy      Match        Match      Hard-Easy-Match-Match hard easy
    3207    3.00 Hard       Easy      Match        Match      Hard-Easy-Match-Match easy match
    3207    3.00 Hard       Easy      Match        Match      Hard-Easy-Match-Match match match
    3350    5.78 Hard       Easy      Hard         Hard       Hard-Easy-Hard-Hard   hard easy
    3350    5.78 Hard       Easy      Hard         Hard       Hard-Easy-Hard-Hard   easy hard
    3350    5.78 Hard       Easy      Hard         Hard       Hard-Easy-Hard-Hard   hard hard
    3961   10.00 Easy       Easy      Hard         Hard       Easy-Easy-Hard-Hard   easy easy
    3961   10.00 Easy       Easy      Hard         Hard       Easy-Easy-Hard-Hard   easy hard
    3961   10.00 Easy       Easy      Hard         Hard       Easy-Easy-Hard-Hard   hard hard
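
To get from this table to the three-column layout asked for in the question, the bigram column can be split into its two tasks. The sketch below uses base R on a small stand-in for `df_test`; the column names `Task_from` and `Task_to` are hypothetical. Note that `unnest_tokens()` lowercases tokens by default, and passing `to_lower = FALSE` should preserve the original capitalization.

```r
# Stand-in for a few rows of df_test (only the columns needed here).
df_test <- data.frame(
  userID = c(3207, 3207, 3207),
  Score  = c(3.00, 3.00, 3.00),
  bigram = c("hard easy", "easy match", "match match"),
  stringsAsFactors = FALSE
)

# Split each bigram on the space; rbind gives one row per bigram.
parts <- do.call(rbind, strsplit(df_test$bigram, " ", fixed = TRUE))

result <- data.frame(
  Task_from = parts[, 1],  # hypothetical column names
  Task_to   = parts[, 2],
  Score     = df_test$Score,
  stringsAsFactors = FALSE
)
result
```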
    

    Step 3: This solution meets my requirements, even when I want to increase the size of the n-grams. For example, 3-grams can be obtained simply with:

    df = as_tibble(df)
    # Same call with n = 3; drop = FALSE keeps the `all` column in the result.
    df_test = df %>%
      unnest_tokens(trigram, all, token = "ngrams", n = 3, drop = FALSE)
    

    Which will yield:

    userID Score Task_Alpha Task_Beta Task_Charlie Task_Delta all                   trigram
    3108   -8.00 Easy       Easy      Easy         Easy       Easy-Easy-Easy-Easy   easy easy easy
    3108   -8.00 Easy       Easy      Easy         Easy       Easy-Easy-Easy-Easy   easy easy easy
    3207    3.00 Hard       Easy      Match        Match      Hard-Easy-Match-Match hard easy match
    3207    3.00 Hard       Easy      Match        Match      Hard-Easy-Match-Match easy match match
    3350    5.78 Hard       Easy      Hard         Hard       Hard-Easy-Hard-Hard   hard easy hard
    3350    5.78 Hard       Easy      Hard         Hard       Hard-Easy-Hard-Hard   easy hard hard
    3961   10.00 Easy       Easy      Hard         Hard       Easy-Easy-Hard-Hard   easy easy hard
    3961   10.00 Easy       Easy      Hard         Hard       Easy-Easy-Hard-Hard   easy hard hard
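
As a possible next step toward the original hypothesis, the n-grams can be summarised by their mean Score. The sketch below is a base R alternative to the tidytext approach rather than a copy of it: `seq_ngrams` and `ngram_scores` are hypothetical helpers, and case is preserved because no lowercasing is applied. With only four users this is purely descriptive.

```r
df <- data.frame(
  userID = c(3108, 3207, 3350, 3961),
  Score  = c(-8.00, 3.00, 5.78, 10.00),
  all    = c("Easy-Easy-Easy-Easy", "Hard-Easy-Match-Match",
             "Hard-Easy-Hard-Hard", "Easy-Easy-Hard-Hard"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sliding-window n-grams over one "-"-separated sequence.
seq_ngrams <- function(x, n, sep = "-") {
  tokens <- strsplit(x, sep, fixed = TRUE)[[1]]
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Hypothetical helper: one row per (user, n-gram), carrying the user's Score.
ngram_scores <- function(df, n) {
  grams <- lapply(df$all, seq_ngrams, n = n)
  data.frame(
    userID = rep(df$userID, lengths(grams)),
    Score  = rep(df$Score, lengths(grams)),
    ngram  = unlist(grams),
    stringsAsFactors = FALSE
  )
}

bigrams <- ngram_scores(df, 2)

# Mean Score per bigram, a first descriptive look at the hypothesis.
bigram_means <- aggregate(Score ~ ngram, data = bigrams, FUN = mean)
bigram_means
```

Changing the second argument of `ngram_scores()` gives longer sequences, for example `ngram_scores(df, 3)` for trigrams.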