r pca svd dimensionality-reduction lsa

R: reduce dimensionality for LSA


I am following an SVD example, but I still don't know how to reduce the dimensionality of the final matrix:

a <- sample(nrow(iris), 10)  # 10 distinct row indices (round(runif(10)*100) can return 0 or repeats)
dat <- as.matrix(iris[a,-5])
rownames(dat) <- c(1:10)

s <- svd(dat)

pc.use <- 1
recon <- s$u[,pc.use] %*% diag(s$d[pc.use], length(pc.use), length(pc.use)) %*% t(s$v[,pc.use])

But recon still has the same dimensions as dat. I need to use this for semantic analysis.


Solution

  • The code you provided does not reduce the dimensionality. Instead, it takes the first principal component of your data, discards the remaining components, and then reconstructs the data from that single PC.

    You can check that this is happening by inspecting the rank of the final matrix:

    library(Matrix)
    rankMatrix(dat)
    as.numeric(rankMatrix(dat))
    [1] 4
    as.numeric(rankMatrix(recon))
    [1] 1
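
The difference between a low-rank reconstruction and an actual reduction can be sketched with base R alone. This is a minimal illustration, not the asker's exact data: it assumes the first 10 rows of iris stand in for the random sample, and `k` and `scores` are names introduced here for illustration.

```r
# Assumption: the first 10 rows of iris replace the random sample from the question.
dat <- as.matrix(iris[1:10, -5])
s <- svd(dat)
k <- 1  # number of components to keep (illustrative choice)

# Rank-k reconstruction: same shape as dat, but only k independent directions
recon <- s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], k, k) %*% t(s$v[, 1:k, drop = FALSE])
dim(recon)      # 10 rows, 4 columns: unchanged
qr(recon)$rank  # numerical rank of the reconstruction

# Reduced representation: one column per retained component
scores <- dat %*% s$v[, 1:k, drop = FALSE]
dim(scores)     # 10 rows, k columns
```

The reconstruction multiplies back by `t(s$v)`, which restores the original number of columns; stopping before that step is what yields fewer columns.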
    

    If you want to reduce dimensionality (the number of features), you can select a few principal components and compute the scores of your data on those components instead.

    But first let's make some things clear about your data - it seems you have 10 samples (rows) with 4 features (columns). Dimensionality reduction will reduce the 4 features to a smaller set of features.

    So you can start by transposing your matrix for svd():

    dat <- t(dat)
    dat
                   1   2   3   4   5   6   7   8   9  10
    Sepal.Length 6.7 6.1 5.8 5.1 6.1 5.1 4.8 5.2 6.1 5.7
    Sepal.Width  3.1 2.8 4.0 3.8 3.0 3.7 3.0 4.1 2.8 3.8
    Petal.Length 4.4 4.0 1.2 1.5 4.6 1.5 1.4 1.5 4.7 1.7
    Petal.Width  1.4 1.3 0.2 0.3 1.4 0.4 0.1 0.1 1.2 0.3
    

    Now you can repeat the svd. Centering the data before this procedure is advisable:

    s <- svd(dat - rowMeans(dat))
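
Why subtracting `rowMeans(dat)` centers each feature may not be obvious, so here is a small check. It assumes the first 10 rows of iris in place of the random sample, transposed so that features are rows; `centered` and `centered2` are names introduced for illustration.

```r
# Assumption: first 10 iris rows, transposed so features are rows (4 x 10).
dat <- t(as.matrix(iris[1:10, -5]))

# A vector of length nrow(dat) is recycled down each column,
# so this subtracts each feature's mean from its own row.
centered <- dat - rowMeans(dat)

# Equivalent, more explicit form using sweep() over rows (MARGIN = 1)
centered2 <- sweep(dat, 1, rowMeans(dat), "-")

all.equal(centered, centered2)
max(abs(rowMeans(centered)))  # close to zero
```

The recycling trick only works in this orientation. With samples in rows, `dat - colMeans(dat)` would recycle incorrectly, and `scale(dat, center = TRUE, scale = FALSE)` or `sweep(dat, 2, colMeans(dat), "-")` would be the safer choice.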
    

    The scores can be obtained by projecting the centered data onto the PCs (the columns of s$u):

    PCs <- t(s$u) %*% (dat - rowMeans(dat))
    

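    svd() returns the singular values in decreasing order, so the first rows of PCs carry the most variance. If you want to reduce dimensionality by eliminating PCs with low variance, you can do so like this:

    dat2 <- PCs[1:2,] # keeps the scores on the first two PCs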
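
How many components to keep can be guided by the share of variance each one captures. The sketch below is one possible approach, assuming the first 10 rows of iris in place of the random sample; the 95% threshold and the names `var_explained`, `k`, and `reduced` are illustrative choices, not part of the original answer.

```r
# Assumption: first 10 iris rows, features in rows (4 x 10).
dat <- t(as.matrix(iris[1:10, -5]))
centered <- dat - rowMeans(dat)
s <- svd(centered)

# Proportion of variance captured by each component
var_explained <- s$d^2 / sum(s$d^2)
cumsum(var_explained)

# Keep the smallest number of components that reaches the (hypothetical) 95% threshold
k <- which(cumsum(var_explained) >= 0.95)[1]
reduced <- t(s$u[, 1:k, drop = FALSE]) %*% centered
dim(reduced)  # k rows (components) by 10 columns (samples)

# Cross-check against prcomp(), which expects samples in rows;
# component signs are arbitrary, so compare absolute values
p <- prcomp(t(dat), center = TRUE, scale. = FALSE)
all.equal(abs(unname(t(p$x[, 1:k, drop = FALSE]))), abs(unname(reduced)))
```

`prcomp()` performs the centering and the SVD in one call, so in practice it may be the more convenient route once the orientation of the data (samples in rows) is settled.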