I don't know what to make of the result of the predict function

Introduction

At my school I have to take part in a challenge to check whether I have understood how text mining works in R.

For this, we have 1050 files of different types (shopping, home, account, etc.).

The goal of the exercise is to develop a script that finds the type of an HTML page with a classifier; both run time and accuracy are very important.

My team and I started with a k-nearest-neighbours (kppv) classifier, but we got a 40% error rate with it. So we decided to use an SVM classifier instead.

Research

Using several pieces of documentation, and a lot of patience, we wrote a script that builds an SVM model from all the documents. When we check whether a file that was used to build the model is recognized, it works.

But when we give it a new HTML page, the output changes, and we don't know what to do with it.

Code

main.r

library("e1071")
library("tm")

splash=function(x){
    res=NULL
    for (i in x) res=paste(res, i)
    res
}

# Remove scripts (<script ... </script>)
removeScript=function(t){
    sp=strsplit(t, "<script")
    vec=sapply(sp[[1]], gsub, pattern=".*</script>", replacement=" ")
    PlainTextDocument(splash(vec))
}

# Remove all tags
removeBalises=function(x){
    t1=gsub("<[^>]*>", " ", x)
    PlainTextDocument(gsub("[ \t]+"," ",t1))
}

clean_corpus = function(corp)
{
    corp<-tm_map(corp,content_transformer(tolower))
    corp<-tm_map(corp,content_transformer(splash))
    corp<-tm_map(corp,content_transformer(removeScript))
    corp<-tm_map(corp,content_transformer(removeBalises))
    corp<-tm_map(corp,removeNumbers)
    corp<-tm_map(corp,removeWords,words=stopwords('en'))
    corp<-tm_map(corp,stemDocument)
    corp<-tm_map(corp,removePunctuation)

    corp
}


training_set = readRDS(file = "training_set.rds")
term20 = readRDS(file = "term20.rds")

classes = rep(1:7, each = 150)

model <-svm(x=training_set[,ncol(training_set)],y=classes,type='C',kernel='linear', cost=1, gamma=1)

summary(model)

pred = predict(model, classes)
pred

testingFile = function()
{
    src = DirSource("testing")
    corp = VCorpus(src)
    clean_corpus(corp)
}

testCorpus = testingFile()
testCorpus

testdtm = DocumentTermMatrix(testCorpus, control=list(weighting=weightTf))
testmat = as.matrix(testdtm)

testpreds = sapply(1, function(i)
{
    v = testmat[i, ][term20]
    #v[is.na(v)] = 0
    predict(model, v)
})

testpreds
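
The indexing step `testmat[i, ][term20]` gives NA for every training term that does not occur in the test page. As a minimal base-R sketch of one way to line a page up with the training vocabulary, the example below starts from zeros and copies in the counts that exist. The vocabulary and the counts are made-up values for illustration; they are not taken from the real data.

```r
# Hypothetical training vocabulary (stands in for term20).
term20 <- c("account", "airport", "audio", "avail")

# Hypothetical term counts for one test page. "zebra" is not in the
# vocabulary, and "airport" and "avail" do not occur in the page.
page_counts <- c(account = 3, audio = 1, zebra = 2)

# Start from an all-zero vector over the training vocabulary, then copy
# over the counts of the terms the page actually contains.
v <- setNames(numeric(length(term20)), term20)
shared <- intersect(names(page_counts), term20)
v[shared] <- page_counts[shared]

# One row per document, one column per training term.
mat <- matrix(v, nrow = 1, dimnames = list(NULL, term20))
mat
```

tm can also restrict the columns when the matrix is built: `DocumentTermMatrix` accepts a `dictionary` entry in its `control` list, for example `control = list(dictionary = term20)`. The column order should still be checked against the training matrix before predicting.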

Script for extracting the text

library("tm")
library("magrittr")
library("SnowballC")
library("nnet")

acc<-VCorpus(DirSource("training2016/", recursive=TRUE))
#acc<-VCorpus(DirSource("trainingLight/", recursive=TRUE))

[...]


dtm = DocumentTermMatrix(clean_corpus(acc))
dtm

term20 = findFreqTerms(dtm, lowfreq = 20)
freqs = sapply(1:50, function(i) length(findFreqTerms(dtm, lowfreq = i)))
plot(freqs)

dtm20 = dtm[, term20]
dim(dtm20)

m = as.matrix(dtm20)


classes = rep(1:7, each = 150)
#classes =  c(rep(1,150), rep(2,150), rep(3,150))
training_set = cbind(m, classes)

saveRDS(training_set, file = "training_set.rds")
saveRDS(term20, file = "term20.rds")

Result

When we give it only one file, it outputs a list of words, each with a value (which is a class).

This output may be useful, but we don't know how.

We would like to know how to use this output.

The output

accessori   "5" 
account     "1" 
ahead       "1" 
airport     "4" 
also        "1" 
amp         "1" 
anyon       "1" 
appl        "7" 
around      "1" 
audio       "1" 
australia   "1" 
avail       "1" 
...

Solution

After some research, I learned that the predict function must be given a matrix of the word counts, and nothing but the word counts.

So I just added this to my script:

v = testmat[1, ][term20]
names(v) = term20
v[is.na(v)] = 0
mat = matrix(v,nrow=1)
pred = predict(model, mat)
tableau = table(pred)
names(tableau)[[which.max(tableau)]]

This turns my vector into a matrix, replaces the NA values with 0, and returns a single value, which is the class of the file I sent to the SVM model.
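
To illustrate the idea on something small, here is a sketch with made-up data (the term names, counts and class labels are all hypothetical). It assumes the model is trained on the term columns only, with the labels passed separately, and that a new document is passed as a one-row matrix with the same columns. As far as I understand e1071, a plain vector given to `predict` is treated as a single column, that is, one observation per element, which would explain getting one class per word.

```r
library(e1071)

set.seed(1)

# Hypothetical toy data: 20 documents, 3 term-count columns, 2 classes.
terms <- c("account", "airport", "audio")
x <- rbind(
  matrix(rpois(30, lambda = 5), ncol = 3),  # documents of class 1
  matrix(rpois(30, lambda = 1), ncol = 3)   # documents of class 2
)
colnames(x) <- terms
y <- factor(rep(c(1, 2), each = 10))

# Train on the term columns only; the labels go in y as a factor.
model <- svm(x = x, y = y, type = "C-classification",
             kernel = "linear", cost = 1)

# One new document = one row, same columns in the same order.
new_doc <- matrix(c(6, 4, 5), nrow = 1, dimnames = list(NULL, terms))
pred <- predict(model, new_doc)

# A single predicted class for the single document.
as.character(pred)
```

With one row in, `predict` returns one class, so the `table` and `which.max` step is not needed in this case; `as.character(pred)` already gives the label.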