I have a dataframe df:
userID Score Task_Alpha Task_Beta Task_Charlie Task_Delta
3108 -8.00 Easy Easy Easy Easy
3207 3.00 Hard Easy Match Match
3350 5.78 Hard Easy Hard Hard
3961 10.00 Easy Easy Hard Hard
1. userID is a factor variable
2. Score is numeric
3. All the 'Task_' features are factor variables with possible values 'Hard', 'Easy', 'Match'
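For anyone who wants to reproduce this, the example data can be rebuilt roughly as follows. This is only a sketch: the name `task_levels` and the level order are my assumptions, the values are copied from the table above.

```r
# Assumed level order for the Task_ factors (not specified in the post)
task_levels <- c("Easy", "Hard", "Match")

# Rebuild the four example rows shown above
df <- data.frame(
  userID       = factor(c(3108, 3207, 3350, 3961)),
  Score        = c(-8.00, 3.00, 5.78, 10.00),
  Task_Alpha   = factor(c("Easy", "Hard", "Hard", "Easy"), levels = task_levels),
  Task_Beta    = factor(c("Easy", "Easy", "Easy", "Easy"), levels = task_levels),
  Task_Charlie = factor(c("Easy", "Match", "Hard", "Hard"), levels = task_levels),
  Task_Delta   = factor(c("Easy", "Match", "Hard", "Hard"), levels = task_levels)
)

str(df)
```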
I want to see a possible association between the transitions (Task_Alpha, Task_Beta, Task_Charlie, Task_Delta) and Score.
My hypothesis is that the 2-gram (bi-gram) sequence Hard Hard could be associated with a higher score, whereas the sequence Easy Easy would be related to a lower score.
In this example I have only considered 2-grams; in my actual code I want to try longer sequences as well. Just for reference, besides the repeated pairs (Easy Easy, Hard Hard, Match Match), the possible bi-grams would be:
Easy Hard
Hard Easy
Easy Match
Match Easy
Hard Match
Match Hard
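With three levels there are 3^n ordered n-grams, so 9 bi-grams once repeated pairs such as Hard Hard are counted. A small sketch for enumerating them for any n (`all_ngrams` and `task_levels` are hypothetical names used only for illustration):

```r
task_levels <- c("Easy", "Hard", "Match")

# Enumerate every ordered n-gram that can be formed from the given levels
all_ngrams <- function(levels, n) {
  # One column per position in the n-gram, one row per combination
  grid <- expand.grid(rep(list(levels), n), stringsAsFactors = FALSE)
  # Paste the columns together row by row
  do.call(paste, unname(as.list(grid)))
}

all_ngrams(task_levels, 2)          # all bi-grams, including repeated pairs
length(all_ngrams(task_levels, 3))  # number of possible tri-grams
```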
Question: As a first step my required overall output is something like:
Task Task Score
Easy Easy -8.00
Easy Easy -8.00
Easy Easy -8.00
Hard Easy 3.00
Easy Match 3.00
Match Match 3.00
Hard Easy 5.78
Easy Hard 5.78
Hard Hard 5.78
Easy Easy 10.00
Easy Hard 10.00
Hard Hard 10.00
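For the bi-gram case specifically, this shape can also be produced without any text tokenizing by pairing each task column with its right-hand neighbour. A minimal base R sketch; the column names `Task_from`, `Task_to` and `position` are my own labels, not part of the original data:

```r
# Example data from the question
df <- data.frame(
  userID       = factor(c(3108, 3207, 3350, 3961)),
  Score        = c(-8.00, 3.00, 5.78, 10.00),
  Task_Alpha   = c("Easy", "Hard", "Hard", "Easy"),
  Task_Beta    = c("Easy", "Easy", "Easy", "Easy"),
  Task_Charlie = c("Easy", "Match", "Hard", "Hard"),
  Task_Delta   = c("Easy", "Match", "Hard", "Hard")
)
task_cols <- c("Task_Alpha", "Task_Beta", "Task_Charlie", "Task_Delta")

# Pair every task column with the one immediately after it
pairs <- lapply(seq_len(length(task_cols) - 1), function(i) {
  data.frame(
    userID    = df$userID,
    Task_from = as.character(df[[task_cols[i]]]),
    Task_to   = as.character(df[[task_cols[i + 1]]]),
    Score     = df$Score,
    position  = i  # which transition this is (1 = Alpha -> Beta, ...)
  )
})

# Stack the pairs and restore the per-user, left-to-right order
bigrams <- do.call(rbind, pairs)
bigrams <- bigrams[order(bigrams$userID, bigrams$position), ]
rownames(bigrams) <- NULL
bigrams[, c("Task_from", "Task_to", "Score")]
```

This keeps the original capitalisation, but it does not generalise to longer n-grams as easily as a tokenizer-based approach.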
I have been able to solve this problem as follows:
Step 1:
First, I concatenated the task columns into a single string:
df$all = paste(df$Task_Alpha,
               df$Task_Beta,
               df$Task_Charlie,
               df$Task_Delta,
               sep = "-")
userID Score Task_Alpha Task_Beta Task_Charlie Task_Delta all
3108 -8.00 Easy Easy Easy Easy Easy-Easy-Easy-Easy
3207 3.00 Hard Easy Match Match Hard-Easy-Match-Match
3350 5.78 Hard Easy Hard Hard Hard-Easy-Hard-Hard
3961 10.00 Easy Easy Hard Hard Easy-Easy-Hard-Hard
Step 2:
As a second step (to have a more general solution), I tried an n-gram based approach, where I split the strings into n-grams of any size I want:
library(tidytext)
library(dplyr)
df = as_tibble(df)
# Split `all` into overlapping 2-grams; tokens are lowercased by default (to_lower = TRUE)
df_test = df %>%
    unnest_tokens(bigram, all, token = "ngrams", n = 2)
This gives me the output:
userID Score Task_Alpha Task_Beta Task_Charlie Task_Delta all bigram
3108 -8.00 Easy Easy Easy Easy Easy-Easy-Easy-Easy easy easy
3108 -8.00 Easy Easy Easy Easy Easy-Easy-Easy-Easy easy easy
3108 -8.00 Easy Easy Easy Easy Easy-Easy-Easy-Easy easy easy
3207 3.00 Hard Easy Match Match Hard-Easy-Match-Match hard easy
3207 3.00 Hard Easy Match Match Hard-Easy-Match-Match easy match
3207 3.00 Hard Easy Match Match Hard-Easy-Match-Match match match
3350 5.78 Hard Easy Hard Hard Hard-Easy-Hard-Hard hard easy
3350 5.78 Hard Easy Hard Hard Hard-Easy-Hard-Hard easy hard
3350 5.78 Hard Easy Hard Hard Hard-Easy-Hard-Hard hard hard
3961 10.00 Easy Easy Hard Hard Easy-Easy-Hard-Hard easy easy
3961 10.00 Easy Easy Hard Hard Easy-Easy-Hard-Hard easy hard
3961 10.00 Easy Easy Hard Hard Easy-Easy-Hard-Hard hard hard
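As a first rough look at the hypothesis, the bi-gram rows can be summarised by mean Score. The sketch below retypes only the Score and bigram columns from the output above so that it runs on its own; `summary_tbl` is a hypothetical name. Note that each user contributes several rows, so these means are descriptive only and not a formal test of association.

```r
# Score and bigram columns copied from the output above
df_test <- data.frame(
  Score  = rep(c(-8.00, 3.00, 5.78, 10.00), each = 3),
  bigram = c("easy easy", "easy easy", "easy easy",
             "hard easy", "easy match", "match match",
             "hard easy", "easy hard", "hard hard",
             "easy easy", "easy hard", "hard hard")
)

# Mean Score for every distinct bi-gram
summary_tbl <- aggregate(Score ~ bigram, data = df_test, FUN = mean)

# Sort from lowest to highest mean Score
summary_tbl <- summary_tbl[order(summary_tbl$Score), ]
summary_tbl
```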
Step 3:
This solution meets my requirements, even when I want to increase the size of the n-grams. For example, for 3-grams I can simply use:
df = as_tibble(df)
df_test = df %>%
    unnest_tokens(trigram, all, token = "ngrams", n = 3)
Which will yield:
userID Score Task_Alpha Task_Beta Task_Charlie Task_Delta all trigram
3108 -8.00 Easy Easy Easy Easy Easy-Easy-Easy-Easy easy easy easy
3108 -8.00 Easy Easy Easy Easy Easy-Easy-Easy-Easy easy easy easy
3207 3.00 Hard Easy Match Match Hard-Easy-Match-Match hard easy match
3207 3.00 Hard Easy Match Match Hard-Easy-Match-Match easy match match
3350 5.78 Hard Easy Hard Hard Hard-Easy-Hard-Hard hard easy hard
3350 5.78 Hard Easy Hard Hard Hard-Easy-Hard-Hard easy hard hard
3961 10.00 Easy Easy Hard Hard Easy-Easy-Hard-Hard easy easy hard
3961 10.00 Easy Easy Hard Hard Easy-Easy-Hard-Hard easy hard hard
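The n-grams above come out lowercased because unnest_tokens lowercases tokens by default. If the original capitalisation matters, one alternative is a small base R helper that slides a window of size n over the concatenated string. This is only a sketch; `ngrams_from_string` is a hypothetical helper and assumes the "-" separator from Step 1:

```r
# Split a "-"-separated string into overlapping n-grams, keeping case
ngrams_from_string <- function(x, n, sep = "-") {
  tokens <- strsplit(x, sep, fixed = TRUE)[[1]]
  # Fewer tokens than n means no n-gram can be formed
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  # Join each window of n consecutive tokens with a space
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

ngrams_from_string("Hard-Easy-Match-Match", 3)
ngrams_from_string("Easy-Easy-Hard-Hard", 2)
```

Applied per row (for example with lapply over df$all), this gives the same windows as the tokenizer-based approach for any n up to the number of task columns.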