I am trying to apply a character n-gram model to a string to compute the string's probability under this model.
I created a character bigram model with stringdist::qgrams():
library(tidyverse)
library(stringdist)
ref_corpus <- c("This is a sample sentence", "Other sentences from the reference corpus", "Many other ones")
bigram_ref <- qgrams(ref_corpus, q = 2) # collecting all bigrams
bigram_model <- log(bigram_ref/sum(bigram_ref)) # computing the log probabilities of each
bigram_model
# Th hi is s sa se te th
# V1 -4.356709 -4.356709 -3.663562 -3.258097 -4.356709 -3.663562 -3.663562 -3.258097
Now, I want to use this model to compute the probability of a new string within the model:
bigram_string <- qgrams("This one", q = 2)
bigram_string
# Th hi is s on ne o
# V1 1 1 1 1 1 1 1
I can't figure out how to multiply these two named matrices so that the counts in bigram_string are multiplied by the matching log probabilities in bigram_model.
Expected output:
bigram_string %*% bigram_model
# Th hi is s ...
# V1 -4.356709 -4.356709 -3.663562 -3.258097 ...
# Actual output:
# Error in bigram_string %*% bigram_model : non-conformable arguments
I made some progress with subsetting:
bigram_model["V1",][bigram_string]
# But output:
# Th Th Th Th Th Th Th
# -4.356709 -4.356709 -4.356709 -4.356709 -4.356709 -4.356709 -4.356709
Perhaps we need to subset bigram_model by the column names of bigram_string, so that the two line up bigram by bigram:
bigram_model[, colnames(bigram_string)] * bigram_string
Output:
Th hi is s on ne o
V1 -4.356709 -4.356709 -3.663562 -3.258097 -4.356709 -4.356709 -3.663562
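The element-wise product gives one term per bigram; to get a single log probability for the string, those terms still need to be summed. The column subsetting also stops with a "subscript out of bounds" error when the new string contains a bigram that never occurs in the reference corpus. Below is a minimal sketch of one way to handle both, not a definitive implementation. The helper name score_string and its unseen argument are hypothetical, and the toy matrices are made-up stand-ins shaped like the one-row output of qgrams().

```r
# Hypothetical helper (not part of stringdist): sums count * log-probability
# over the bigrams of a string.
# `model` and `counts` are assumed to be 1-row matrices whose column names
# are bigrams, like the matrices returned by stringdist::qgrams().
score_string <- function(model, counts, unseen = -Inf) {
  grams <- colnames(counts)
  # match() gives NA for bigrams that are absent from the model
  idx <- match(grams, colnames(model))
  logp <- model[1, , drop = TRUE][idx]
  # Assumption: unseen bigrams get the log probability `unseen`
  # (-Inf by default, i.e. probability zero under the unsmoothed model)
  logp[is.na(idx)] <- unseen
  sum(counts[1, ] * logp)
}

# Toy stand-ins for qgrams() output; the values are invented for illustration
toy_model <- matrix(log(c(0.5, 0.25, 0.25)), nrow = 1,
                    dimnames = list("V1", c("ab", "bc", "cd")))
toy_counts <- matrix(c(2, 1), nrow = 1,
                     dimnames = list("V1", c("ab", "bc")))

score_string(toy_model, toy_counts)
```

With the objects from the question, the call would be score_string(bigram_model, bigram_string). Passing a finite value for unseen, for example log(1e-6), is a crude floor rather than proper smoothing; whether that is acceptable depends on how the scores are used.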