Tags: r, matrix-multiplication, n-gram, stringdist

Multiply two named vectors/matrices, applying an n-gram model (stringdist::qgrams)


I am trying to apply an n-gram character model on a string to compute its probability in this model.

I created a character bigram model with stringdist::qgrams():

library(tidyverse)
library(stringdist)

ref_corpus   <- c("This is a sample sentence", "Other sentences from the reference corpus", "Many other ones")
bigram_ref   <- qgrams(ref_corpus, q = 2)       # collecting all bigrams
bigram_model <- log(bigram_ref/sum(bigram_ref)) # computing the log probability of each bigram

bigram_model
#           Th        hi        is        s         sa        se        te        th
# V1 -4.356709 -4.356709 -3.663562 -3.258097 -4.356709 -3.663562 -3.663562 -3.258097

Now, I want to use this model to compute the probability of a new string within the model:

bigram_string <- qgrams("This one", q = 2) 
bigram_string
#    Th hi is s  on ne  o
# V1  1  1  1  1  1  1  1

I can't work out how to multiply these two named matrices/vectors so that each count in bigram_string is multiplied by the log probability of the same bigram in bigram_model.

Expected output:

bigram_string %*% bigram_model
#            Th        hi        is         s  ...
# V1  -4.356709 -4.356709 -3.663562 -3.258097  ...

# Actual output:
# Error in bigram_string %*% bigram_model : non-conformable arguments

I made some progress with subsetting:

bigram_model["V1",][bigram_string]

# But the output is:
#        Th        Th        Th        Th        Th        Th        Th 
# -4.356709 -4.356709 -4.356709 -4.356709 -4.356709 -4.356709 -4.356709
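The repeated Th appears to come from how the index is interpreted: bigram_string holds counts, and every count here is 1, so R reads the subscript as "element 1, seven times". A minimal base-R sketch of the difference between indexing by those counts and indexing by name follows; `model_row` and `counts` are hypothetical stand-ins built from the printed values above, not the real stringdist objects.

```r
# Hypothetical stand-ins for the question's objects; values copied from its printed output.
model_row <- c(Th = -4.356709, hi = -4.356709, is = -3.663562)
counts <- matrix(c(1, 1, 1), nrow = 1,
                 dimnames = list("V1", c("Th", "hi", "is")))

# Indexing with the counts themselves: every count is 1,
# so this is the same as model_row[c(1, 1, 1)].
by_position <- model_row[counts]

# Indexing with the column names looks each bigram up by name instead.
by_name <- model_row[colnames(counts)]

print(by_position)
print(by_name)
```

Only the second form ties each log probability to its own bigram.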

Solution

  • Subset the model by the column names of bigram_string, then multiply elementwise with * (not %*%):

    bigram_model[, colnames(bigram_string)] * bigram_string
    

    Output:

                  Th        hi        is        s         on        ne         o
    V1 -4.356709 -4.356709 -3.663562 -3.258097 -4.356709 -4.356709 -3.663562
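One caveat with subsetting by name: every bigram of "This one" happens to occur in the reference corpus, but a matrix subset such as bigram_model[, "Qu"] would stop with a "subscript out of bounds" error for a bigram the model has never seen. The sketch below shows one way to guard against that and to sum the products into a single log score. It uses base R only so that it runs without stringdist; `char_bigrams` and `score_string` are hypothetical helpers written for this illustration, not functions of the package.

```r
# Hypothetical base-R helper standing in for stringdist::qgrams(x, q = 2):
# returns a named numeric vector of character-bigram counts.
char_bigrams <- function(x) {
  grams <- unlist(lapply(x, function(s) {
    n <- nchar(s)
    if (n < 2) character(0) else substring(s, 1:(n - 1), 2:n)
  }))
  tab <- table(grams)
  setNames(as.numeric(tab), names(tab))
}

ref_corpus <- c("This is a sample sentence",
                "Other sentences from the reference corpus",
                "Many other ones")
ref_counts <- char_bigrams(ref_corpus)
log_prob   <- log(ref_counts / sum(ref_counts))

# Hypothetical scorer: sums count * log probability over the bigrams the model
# knows, and lists the unknown ones instead of failing on them.
# Note: seen_log_prob is only a partial sum when `unseen` is non-empty.
score_string <- function(s, log_prob) {
  counts <- char_bigrams(s)
  seen   <- intersect(names(counts), names(log_prob))
  list(seen_log_prob = sum(counts[seen] * log_prob[seen]),
       unseen        = setdiff(names(counts), names(log_prob)))
}

res_known   <- score_string("This one", log_prob)  # all bigrams are in the model
res_unknown <- score_string("Quiz", log_prob)      # none of its bigrams are

print(res_known)
print(res_unknown)
```

How to treat the unseen bigrams (for example, scoring the string as -Inf or applying some form of smoothing) is a modelling decision the sketch deliberately leaves open.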