R: how to map test data into the LSA space created from training data


I am trying to do text analysis using LSA. I've read many other posts about LSA on StackOverflow, but I have not found one similar to mine yet. If you know of one, please kindly point me to it. Much appreciated!

here's my reproducible code with sample data created:

creating sample data train & test sets

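library(data.table)  # data.table()
library(tm)          # Corpus(), TermDocumentMatrix(), stopwords()
library(SnowballC)   # wordStem()
library(lsa)         # lsa(), fold_in(), dimcalc_share()
library(caret)       # train(), trainControl()
sentiment = c(1,1,0,1,0,1,0,0,1,0)
length(sentiment) #10
text = c('im happy', 'this is good', 'what a bummer X(', 'today is kinda okay day for me', 'i somehow messed up big time', 
         'guess not being promoted is not too bad :]', 'stayhing home is boring :(', 'kids wont stop crying QQ', 'warriors are legendary!', 'stop reading my tweets!!!')
train_data = data.table(sentiment = as.factor(sentiment), text)  # name the column so it prints as "sentiment"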
> train_data
    sentiment                                text
 1:  1                                   im happy
 2:  1                               this is good
 3:  0                           what a bummer X(
 4:  1             today is kinda okay day for me
 5:  0               i somehow messed up big time
 6:  1 guess not being promoted is not too bad :]
 7:  0                 stayhing home is boring :(
 8:  0                   kids wont stop crying QQ
 9:  1                    warriors are legendary!
10:  0                  stop reading my tweets!!!

sentiment = c(0,1,0,0)
text = c('running out of things to say...', 'if you are still reading, good for you!', 'nothing ended on a good note today', 'seriously sleep deprived!! >__<')
test_data = data.table(sentiment = as.factor(sentiment), text)
> test_data
   sentiment                                    text
1:         0         running out of things to say...
2:         1 if you are still reading, good for you!
3:         0      nothing ended on a good note today
4:         0         seriously sleep deprived!! >__<

preprocessing for training data set

corpus.train = Corpus(VectorSource(train_data$text))

create a term document matrix for training set

tdm.train = TermDocumentMatrix(
  corpus.train,
  control = list(
    removePunctuation = TRUE,
    stopwords = stopwords(kind = "en"),
    stemming = function(word) wordStem(word, language = "english"),
    removeNumbers = TRUE, 
    tolower = TRUE,
    weighting = weightTfIdf)
)

convert into matrix (for later use)

train_matrix = as.matrix(tdm.train)

create an lsa space using train data

lsa.train = lsa(tdm.train, dimcalc_share())

set the number of dimensions (I picked one arbitrarily here because the data set is too small to produce an elbow shape)

k = 6

project train matrix into the new LSA space

projected.train = fold_in(docvecs = train_matrix, LSAspace = lsa.train)[1:k,]

convert above projected data into a matrix

projected.train.matrix = matrix(projected.train, 
                                nrow = dim(projected.train)[1],
                                ncol = dim(projected.train)[2])

train the random forest model (this step no longer works with such a small sample; it is not the main point of this question, but if you can help me with that error too, that'd be fantastic! I tried googling it, without success)

trcontrol_rf = trainControl(method = "boot", p = .75, trim = T)
model_train_caret = train(x = t(projected.train.matrix), y = train_data$sentiment, method = "rf", trControl = trcontrol_rf)

preprocessing for test data set

Basically I'm repeating what I did for the training data set, except that I do not use the test set to create its own LSA space.

corpus.test = Corpus(VectorSource(test_data$text))

create a term document matrix for test set

tdm.test = TermDocumentMatrix(
  corpus.test,
  control = list(
    removePunctuation = TRUE,
    stopwords = stopwords(kind = "en"),
    stemming = function(word) wordStem(word, language = "english"),
    removeNumbers = TRUE, 
    tolower = TRUE,
    weighting = weightTfIdf)
)

convert into matrix (for later use)

test_matrix = as.matrix(tdm.test)

project test matrix into the trained LSA space (here's where the question is)

projected.test = fold_in(docvecs = test_matrix, LSAspace = lsa.train)

but I get this error: Error in crossprod(docvecs, LSAspace$tk) : non-conformable arguments

I am not finding any useful Google results for this error (there is only one page of results). Any help is much appreciated! Thank you!


Solution

  • When you build the LSA model, you are using the vocabulary of the training data. But when you build the TermDocumentMatrix for the test data, you are using the vocabulary of the test data, so its rows (terms) do not match the rows of the training matrix. The LSA model only knows how to handle documents tabulated against the vocabulary of the training data.

    One way to remedy this is to create your test TDM with dictionary set to the vocabulary of the training data:

    tdm.test = TermDocumentMatrix(
        corpus.test,
        control = list(
            removeNumbers = TRUE, 
            tolower = TRUE,
            stopwords = stopwords("en"),
            stemming = TRUE,
            removePunctuation = TRUE,
            weighting = weightTfIdf,
            dictionary=rownames(tdm.train)
        )
    )
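
To see why matching the vocabulary removes the error, here is a minimal base-R sketch. It is only an illustration under assumptions: the toy matrices `train_tdm` and `test_tdm`, the plain `svd()` standing in for the `lsa()` space, and the names `tk`, `sk`, `aligned` and `test_coords` are all made up here. The only thing taken from the question is the `crossprod(docvecs, tk)` product named in the error message.

```r
# Toy term-document matrices (terms in rows, documents in columns).
# All names and values are hypothetical, not taken from the question's data.
train_tdm <- matrix(
  c(1, 0, 2,
    0, 1, 1,
    3, 0, 0,
    0, 2, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("good", "happi", "stop", "read"), c("d1", "d2", "d3"))
)
test_tdm <- matrix(
  c(1, 0,
    0, 2,
    1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("good", "sleep", "read"), c("t1", "t2"))
)

# Truncated SVD of the training matrix; tk stands in for LSAspace$tk.
k  <- 2
s  <- svd(train_tdm)
tk <- s$u[, 1:k, drop = FALSE]
rownames(tk) <- rownames(train_tdm)
sk <- s$d[1:k]

# The test matrix has 3 term rows and tk has 4, so the product is undefined.
err <- tryCatch(crossprod(test_tdm, tk),
                error = function(e) conditionMessage(e))
print(err)

# Re-index the test matrix on the training vocabulary:
# terms unseen in training ("sleep") are dropped,
# training terms absent from the test set become all-zero rows.
aligned <- matrix(0, nrow = nrow(train_tdm), ncol = ncol(test_tdm),
                  dimnames = list(rownames(train_tdm), colnames(test_tdm)))
shared <- intersect(rownames(train_tdm), rownames(test_tdm))
aligned[shared, ] <- test_tdm[shared, ]

# With matching rows the product is defined:
# one row of k latent coordinates per test document.
test_coords <- crossprod(aligned, tk) %*% diag(1 / sk)
print(dim(test_coords))
```

Setting `dictionary = rownames(tdm.train)` should make tm do this re-indexing for you, so that `as.matrix(tdm.test)` has the same term rows as the training space and can be passed to `fold_in`.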