At my school I have to take part in a challenge to show that I understand how text mining works in R.
For this, we have 1050 files of different types (shopping, home, account, etc.).
The goal of the exercise is to develop a script that finds the type of an HTML page with a classifier; both running time and accuracy are very important.
My team and I started with a k-nearest-neighbours (kppv) classifier, but it gave a 40% error rate, so we decided to use an SVM classifier instead.
With the help of several documents and a lot of patience, we wrote a script that builds an SVM model from all the documents. When we check whether a file that was used to build the model is recognized, it works.
But when we submit a new HTML page, the output changes, and we don't know what to do with it.
library("e1071")
library("tm")
splash=function(x){
res=NULL
for (i in x) res=paste(res, i)
res
}
# Remove scripts (<script ... </script>)
removeScript=function(t){
sp=strsplit(t, "<script")
vec=sapply(sp[[1]], gsub, pattern=".*</script>", replacement=" ")
PlainTextDocument(splash(vec))
}
# Remove all tags
removeBalises=function(x){
t1=gsub("<[^>]*>", " ", x)
PlainTextDocument(gsub("[ \t]+"," ",t1))
}
clean_corpus = function(corp)
{
corp<-tm_map(corp,content_transformer(tolower))
corp<-tm_map(corp,content_transformer(splash))
corp<-tm_map(corp,content_transformer(removeScript))
corp<-tm_map(corp,content_transformer(removeBalises))
corp<-tm_map(corp,removeNumbers)
corp<-tm_map(corp,removeWords,words=stopwords('en'))
corp<-tm_map(corp,stemDocument)
corp<-tm_map(corp,removePunctuation)
corp
}
training_set = readRDS(file = "training_set.rds")
term20 = readRDS(file = "term20.rds")
classes = c(rep(1,150), rep(2,150), rep(3,150), rep(4,150), rep(5,150), rep(6,150), rep(7,150))
model <-svm(x=training_set[,ncol(training_set)],y=classes,type='C',kernel='linear', cost=1, gamma=1)
summary(model)
pred = predict(model, classes)
pred
testingFile = function()
{
src = DirSource("testing")
corp = VCorpus(src)
clean_corpus(corp);
}
testCorpus = testingFile()
testCorpus
testdtm = DocumentTermMatrix(testCorpus, control=list(weighting=weightTf))
testmat = as.matrix(testdtm)
testpreds = sapply(1, function(i)
{
v = testmat[i, ][term20]
#v[is.na(v)] = 0
predict(model, v)
})
testpreds
library("tm")
library("magrittr")
library("SnowballC")
library("nnet")
acc<-VCorpus(DirSource("training2016/", recursive=TRUE))
#acc<-VCorpus(DirSource("trainingLight/", recursive=TRUE))
[...]
dtm = DocumentTermMatrix(clean_corpus(acc))
dtm
term20 = findFreqTerms(dtm, lowfreq = 20)
freqs = sapply(1:50, function(i) length(findFreqTerms(dtm, lowfreq = i)))
plot(freqs)
dtm20 = dtm[, term20]
dim(dtm20)
m = as.matrix(dtm20)
classes = c(rep(1,150), rep(2,150), rep(3,150), rep(4,150), rep(5,150), rep(6,150), rep(7,150))
#classes = c(rep(1,150), rep(2,150), rep(3,150))
training_set = cbind(m, classes)
saveRDS(training_set, file = "training_set.rds")
saveRDS(term20, file = "term20.rds")
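One thing worth checking in the scripts above: `training_set` is built with `cbind(m, classes)`, so its last column holds the class labels, and `training_set[, ncol(training_set)]` selects only that column. A minimal sketch of separating the term columns from the labels before fitting, using a made-up toy matrix (the term names and counts here are hypothetical, not taken from the real data):

```r
# Toy stand-in for the document-term matrix m (hypothetical values).
m <- matrix(c(2, 0, 1,
              0, 3, 0,
              1, 1, 4,
              0, 2, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(NULL, c("account", "airport", "shop")))
classes <- c(1, 1, 2, 2)

# Same layout as in the script: the labels become the last column.
training_set <- cbind(m, classes)

# Features: every column except the last one.
x <- training_set[, -ncol(training_set), drop = FALSE]
# Labels: the last column, as a factor so it is treated as a class.
y <- factor(training_set[, ncol(training_set)])

print(dim(x))
print(levels(y))
```

With this split, `x` and `y` could be passed to `svm(x = x, y = y, ...)`, so that the model is fitted on the term counts rather than on the label column itself.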
When we feed in a single file, it outputs a list of words, each with a value (which is the class).
This output may be useful, but we don't know how.
How can we use this output?
accessori "5"
account "1"
ahead "1"
airport "4"
also "1"
amp "1"
anyon "1"
appl "7"
around "1"
audio "1"
australia "1"
avail "1"
...
After some research, I learned that the predict function must be given a matrix containing the terms, and only the terms.
So I just added this to my script:
v = testmat[1, ][term20]
names(v) = term20
v[is.na(v)] = 0
mat = matrix(v,nrow=1)
pred = predict(model, mat)
tableau = table(pred)
names(tableau)[[which.max(tableau)]]
This turns my vector into a matrix, replaces the NA values with 0, and returns a value, which is the class of the file sent to the SVM model.
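The idea can be sketched on toy data: the test document's term counts are reordered to match the training vocabulary, terms missing from the document are set to 0, and the result is passed to `predict` as a one-row matrix. The vocabulary, counts and `toy_model` below are invented for illustration; the model part runs only if e1071 is installed.

```r
# Toy vocabulary standing in for term20 (hypothetical values).
term20 <- c("account", "airport", "shop")

# Toy term counts for one test page; "hotel" is not in the vocabulary
# and "airport" does not occur in the page.
doc_counts <- c(shop = 3, hotel = 1, account = 2)

# Align the counts to the training vocabulary, in the same order.
v <- doc_counts[term20]
names(v) <- term20
v[is.na(v)] <- 0   # terms absent from the page get a count of 0

# One document = one row, one column per training term.
mat <- matrix(v, nrow = 1, dimnames = list(NULL, term20))
print(mat)

if (requireNamespace("e1071", quietly = TRUE)) {
  # Tiny made-up training set with the same columns as mat.
  x <- matrix(c(5, 0, 0,
                4, 1, 0,
                0, 5, 1,
                0, 4, 0,
                1, 0, 5,
                0, 1, 4),
              ncol = 3, byrow = TRUE, dimnames = list(NULL, term20))
  y <- factor(c(1, 1, 2, 2, 3, 3))
  toy_model <- e1071::svm(x = x, y = y, type = "C-classification",
                          kernel = "linear", cost = 1)
  # A one-row matrix gives exactly one predicted class.
  pred <- predict(toy_model, mat)
  print(as.character(pred))
}
```

Because the matrix has a single row, `predict` returns a single class here, so the `table` / `which.max` step is only needed when several rows are predicted at once.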