
Definition of the output of the quanteda findSequences function - R package for text analysis


quick question:

The findSequences function in the R text analysis package quanteda gives the following output, and I can't find documentation for some of the columns:

seqs <- findSequences(tokens, types_upper, count_min=2)
head(seqs, 3)
              sequence len          z         p       mue
     3         first time   2 -0.4159751 0.6612859 -165.7366
     8  political parties   2 -0.4159751 0.6612859 -165.7366
     9   preserve protect   2 -0.4159751 0.6612859 -165.7366

Can someone help with definitions of z, p and mue? Is p a probability, and if so, how is it calculated? The help says, "This algorithm is based on Blaheta and Johnson's “Unsupervised Learning of Multi-Word Verbs”," but provides no further detail on the output components.

Looks like an interesting function, but more information would help.


Solution

  • Looking at the function code and then checking the paper, z is calculated as lambda (the log-odds ratio) divided by sigma (its asymptotic standard error). It's a z-score, as Pierre commented, and p is a probability, 1 - stats::pnorm(z). A short sketch of how these quantities fit together is given at the end of this answer.

    mue is explained in the second paragraph in section 2.3 of Blaheta and Johnson's "Unsupervised Learning of Multi-Word Verbs." "µ = λ − 3.29σ.... This corresponds to setting the measures µ and µ1 to the lower bound of a 0.001 confidence interval for λ..., which is a systematic way of trading recall for precision in the face of noisy data (Johnson, 2001)."

    If you go to section 2.3, you can see further details:

    We propose two different measures of association µ and µ1, which we call the “all subtuples” and “unigram subtuples” measures below. As we explain below, they seem to identify very different kinds of collocations, so both are useful in certain circumstances. These measures are estimates of λ and λ1 respectively, which are particular parameters of certain log-linear models. In cases where the counts are small the estimates of λ and λ1 may be noisy, and so high values from small count data should be discounted in some way when being compared with values from large count data. We do this by also estimating the asymptotic standard error σ and σ1 of λ and λ1 respectively, and set µ = λ − 3.29σ and µ1 = λ1 − 3.29σ1. This corresponds to setting the measures µ and µ1 to the lower bound of a 0.001 confidence interval for λ and λ1 respectively, which is a systematic way of trading recall for precision in the face of noisy data (Johnson, 2001).

    The details (and additional references) pertaining to calculating λ and σ are also in section 2.3.
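
    To make the relationships concrete, here is a minimal R sketch (not quanteda's internal code) showing how the three reported columns follow from lambda and sigma. The lambda and sigma values below are not taken from the estimation itself; they are back-solved from the first output row above, assuming mue = lambda - 3.29 * sigma as quoted from the paper, so they only illustrate the arithmetic:

    lambda <- -18.603   # log-odds ratio (back-solved, illustrative only)
    sigma  <-  44.721   # asymptotic standard error of lambda (illustrative only)

    z   <- lambda / sigma          # "z" column: z-score, about -0.416
    p   <- 1 - stats::pnorm(z)     # "p" column: probability, about 0.661
    mue <- lambda - 3.29 * sigma   # "mue" column: lower bound of a 0.001
                                   #   confidence interval for lambda, about -165.7
    round(c(z = z, p = p, mue = mue), 4)

    Recovering the printed p from the printed z in this way is at least consistent with the 1 - stats::pnorm(z) interpretation; in findSequences the actual lambda and sigma are estimated from the sequence and token counts following Blaheta and Johnson.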