
How many singular values to keep in the R package lsa


I used the lsa() function from the R package lsa to build a semantic space from a term-document matrix. The problem is that dimcalc_share(), which lsa() uses by default, seems to be wrong. Its help page says the function "finds the first position in the descending sequence of singular values where their sum meets or exceeds the specified share." I read this as: the function keeps the n largest singular values, where n is the smallest number for which their sum reaches the given share of the sum of all singular values. The function's source code is

function(share=0.5)
{
    function(s){
        if(any(which(cumsum(s/sum(s))<=share))){
            d=max(which(cumsum(s/sum(s))<=share))+1
        }
        else{
            d=length(s)
        }
        return(d)
    }
}

I have two questions about the source code: 1. Why add 1 to d? 2. If the first singular value's share is already larger than share, the function keeps all singular values, whereas I would expect it to keep only the first one.


Solution

  • Your first question is "why the + 1?"

    Let's look at how these functions work:

    library(lsa)

    # create some files
    td = tempfile()
    dir.create(td)
    write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/") )
    write( c("ham", "mouse", "sushi"), file=paste(td, "D2", sep="/") )
    write( c("dog", "pet", "pet"), file=paste(td, "D3", sep="/") )
    
    # LSA
    data(stopwords_en)
    myMatrix = textmatrix(td, stopwords=stopwords_en)
    myMatrix = lw_logtf(myMatrix) * gw_idf(myMatrix)
    myLSAspace = lsa(myMatrix, dims=dimcalc_share())
    as.textmatrix(myLSAspace)
    
                 D1         D2         D3
    cat   0.3616693  0.6075489  0.3848429
    dog   0.4577219  0.2722711  1.2710784
    mouse 0.5942734  1.3128719  0.1357196
    ham   0.6075489  1.5336529 -0.1634938
    sushi 0.6075489  1.5336529 -0.1634938
    pet   0.6099616 -0.2591316  2.6757285
    

    So lsa() calls the function returned by dimcalc_share() (share = 0.5 by default) to decide how many dimensions to keep, and runs a singular value decomposition to map the original term-document matrix into the new LSA space.

    Those dimensions are the number of singular values for the dimensionality reduction in LSA. dimcalc_share() finds the first position in the descending sequence of singular values s where their sum (divided by the sum of all values) meets or exceeds the specified share.
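
    To make the role of that number concrete, here is a minimal base-R sketch of truncating an SVD to k singular values. The toy matrix m and the hand-picked k are assumptions for illustration; this is not the lsa package's own code, only the general idea of keeping the first k singular values.

```r
# Toy term-document matrix (rows = terms, columns = documents); values are made up
m <- matrix(c(1, 0, 1,
              1, 1, 0,
              0, 1, 2,
              0, 1, 0),
            nrow = 4, byrow = TRUE)

sv <- svd(m)  # sv$d holds the singular values in descending order
k <- 2        # number of singular values to keep (chosen by hand here)

# Rank-k approximation built from the first k singular values and vectors
m_k <- sv$u[, 1:k, drop = FALSE] %*%
  diag(sv$d[1:k], nrow = k) %*%
  t(sv$v[, 1:k, drop = FALSE])

dim(m_k)  # same shape as m
```

    A function such as dimcalc_share() only supplies the value of k; the decomposition and truncation are separate steps.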

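    The function is written so that d is one more than the last position at which the cumulative share is still <= share. (In the breakdown below, s is set to the weighted matrix itself, which gives 18 values to step through; as far as I can tell, inside lsa() the function receives the vector of singular values instead.)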

    > # Break it apart
    > s <- myMatrix
    > share <- .5
    > 
    > any(which(cumsum(s/sum(s)) <= share)) #TRUE
    [1] TRUE
    > cumsum(s/sum(s)) <= share
     [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    > d = max(which(cumsum(s/sum(s)) <= share)) + 1
    > d
    [1] 10
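
    The same three steps can be run on a short vector of made-up singular values, which is easier to follow by hand. The vector s below is hypothetical; only base R is used.

```r
# Hypothetical singular values, already sorted in descending order
s <- c(4, 3, 2, 1)
share <- 0.5

cum_share <- cumsum(s / sum(s))     # cumulative share: 0.4 0.7 0.9 1.0
below <- which(cum_share <= share)  # positions still at or below the share: 1
d <- max(below) + 1                 # first position past the share: 2

# Without the + 1 we would stop at position 1, where only 40% is covered
```

    Here the cumulative share first reaches 0.5 at position 2, and max(which(...)) + 1 returns exactly that position.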
    

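    If you dropped the + 1, you would get 9 instead of 10, a position where the cumulative share is still <= share, so the threshold would not yet be met. Passing that value to lsa() does not work here either, although note that the error itself comes from asking for more dimensions than this three-document matrix can provide: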

    > myMatrix = lw_logtf(myMatrix) * gw_idf(myMatrix)
    > myLSAspace2 = lsa(myMatrix, dims=d-1)
    Error in SVD$u[, 1:dims] : subscript out of bounds
    

    Equivalently

    > dims = 9
    > myLSAspace = lsa(myMatrix, dims)
    Error in SVD$u[, 1:dims] : subscript out of bounds
    

    So dimcalc_share() is correct in using + 1.

    Your second question, adapted to this example, is "would dimcalc_share() return 18 instead of 1 if the first value were > share?"

    If the first value were > share, the if condition would be FALSE and, as you hypothesized, the function would return length(s), which is 18 here.

    You might follow up with a question on Cross Validated to confirm your intuition that it should return 1 (that makes sense to me). If so, it would be simple to rewrite the function with d = 1 in the else branch.
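
    A minimal sketch of that rewrite, assuming the only change wanted is the else branch. The name dimcalc_share_first is made up for this example and is not part of the lsa package; whether lsa() behaves sensibly when asked for a single dimension is not checked here.

```r
# Hypothetical variant of dimcalc_share(): same logic as the quoted source,
# except that the else branch returns 1 instead of length(s).
dimcalc_share_first <- function(share = 0.5) {
  function(s) {
    below <- which(cumsum(s / sum(s)) <= share)
    if (length(below) > 0) {
      max(below) + 1
    } else {
      # the first singular value alone already exceeds the share
      1
    }
  }
}

pick <- dimcalc_share_first(0.5)
pick(c(4, 3, 2, 1))  # 2: cumulative shares are 0.4, 0.7, 0.9, 1.0
pick(c(9, 1))        # 1: the first value already covers 90%
```

    It could then be passed to lsa() the same way as the default, for example lsa(myMatrix, dims = dimcalc_share_first()).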