I used the lsa() function from the R package lsa to build a semantic space. The input is a term-document matrix. The problem is that dimcalc_share(), which lsa() uses by default, seems to be wrong. Its help page says the function "finds the first position in the descending sequence of singular values where their sum meets or exceeds the specified share." I take this to mean that the function keeps the n largest singular values such that their sum meets or exceeds a given fraction of the sum of all singular values. The function's source code is:
function(share=0.5)
{
    # returns a function of s, the vector of singular values
    function(s){
        if(any(which(cumsum(s/sum(s))<=share))){
            d=max(which(cumsum(s/sum(s))<=share))+1
        }
        else{
            d=length(s)
        }
        return(d)
    }
}
I have two questions about the source code: 1. Why add 1 to d? 2. If the share of the first singular value alone is larger than share, the function keeps all singular values, whereas I would expect it to keep only the first one.
Your first question is "why the + 1?"
Let's look at how these functions work:
# create some files
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/") )
write( c("ham", "mouse", "sushi"), file=paste(td, "D2", sep="/") )
write( c("dog", "pet", "pet"), file=paste(td, "D3", sep="/") )
# LSA
data(stopwords_en)
myMatrix = textmatrix(td, stopwords=stopwords_en)
myMatrix = lw_logtf(myMatrix) * gw_idf(myMatrix)
myLSAspace = lsa(myMatrix, dims=dimcalc_share())
as.textmatrix(myLSAspace)
D1 D2 D3
cat 0.3616693 0.6075489 0.3848429
dog 0.4577219 0.2722711 1.2710784
mouse 0.5942734 1.3128719 0.1357196
ham 0.6075489 1.5336529 -0.1634938
sushi 0.6075489 1.5336529 -0.1634938
pet 0.6099616 -0.2591316 2.6757285
So, lsa() gets its dimensions from dimcalc_share() based on the input matrix and a given share (.5 by default) and runs a Singular Value Decomposition to map the original TDM to a new LSAspace.
Those dimensions are the number of singular values for the dimensionality reduction in LSA.
dimcalc_share() finds the first position in the descending sequence of singular values s where their sum (divided by the sum of all values) meets or exceeds the specified share.
The function is written so that d is one more than the max() position at which the cumulative share is still <= share:
> # Break it apart
> s <- myMatrix
> share <- .5
>
> any(which(cumsum(s/sum(s)) <= share)) #TRUE
[1] TRUE
> cumsum(s/sum(s)) <= share
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> d = max(which(cumsum(s/sum(s)) <= share)) + 1
> d
[1] 10
If you used d - 1 instead, which would give you 9 rather than 10, you'd have a position where the cumsum is still <= share. That wouldn't work:
> myMatrix = lw_logtf(myMatrix) * gw_idf(myMatrix)
> myLSAspace2 = lsa(myMatrix, dims=d-1)
Error in SVD$u[, 1:dims] : subscript out of bounds
Equivalently:
> dims = 9
> myLSAspace = lsa(myMatrix, dims)
Error in SVD$u[, 1:dims] : subscript out of bounds
So the function dimcalc_share() is correct in using + 1.
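To see the effect of the + 1 in isolation, here is a minimal sketch on a made-up vector of singular values. The values in s are hypothetical, chosen only so the arithmetic is easy to follow; they are not the singular values of myMatrix.

```r
# Hypothetical singular values in descending order (not from the example above)
s <- c(4, 3, 2, 1)
share <- 0.5

# Cumulative share of the total, roughly 0.4 0.7 0.9 1.0
cum_share <- cumsum(s / sum(s))

# Last position whose cumulative share is still <= share
last_below <- max(which(cum_share <= share))

# One step further is the first position whose cumulative share exceeds share
d <- last_below + 1
```

Here last_below is 1 (cumulative share 0.4) and d is 2 (cumulative share 0.7), so without the + 1 the kept values would still fall short of the requested share.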
Your second question, modified for this example, is "would dimcalc_share() return 18 instead of 1 if the first value were > share?"
If the first value were > share, then the if condition would be false and, as you hypothesized, the function would instead use length(s), which is 18.
You might follow up with a question on CrossValidated to confirm that your intuition that it should = 1 is correct (though that makes sense to me). If so, it would be simple to rewrite the function with d = 1 as the else branch.
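A minimal sketch of that rewrite, assuming the intended behavior is to keep only the first singular value when it alone exceeds share. The name dimcalc_share_alt is hypothetical and not part of the lsa package, and the min() cap is an added assumption to keep d within length(s) when share is 1.

```r
# Hypothetical variant of dimcalc_share(); not part of the lsa package
dimcalc_share_alt <- function(share = 0.5) {
  function(s) {
    # positions whose cumulative share is still at or below the target
    below <- which(cumsum(s / sum(s)) <= share)
    if (length(below) > 0) {
      # step one past the last such position, capped at the
      # number of singular values (assumption, see lead-in)
      d <- min(max(below) + 1, length(s))
    } else {
      # the first singular value alone already exceeds share
      d <- 1
    }
    d
  }
}

# Made-up singular values for illustration
d_spread   <- dimcalc_share_alt()(c(4, 3, 2, 1))    # shares 0.4, 0.7, ...
d_dominant <- dimcalc_share_alt()(c(9, 0.5, 0.5))   # first share is 0.9
```

With these inputs d_spread is 2 and d_dominant is 1. Since lsa() accepts a function for dims, this variant could presumably be passed the same way as dimcalc_share(), though how lsa() handles a result of 1 is worth checking before relying on it.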