I am using R.
I have two sets of data (securityj and securityc) and I want to find the cosine
similarity between them.
I used the following code, which relies on the lsa library:
library(lsa)

# First pair: write each vector to its own file in a temporary directory
databasfile = tempfile()
dir.create(databasfile)
write(databasej, file = paste(databasfile, "D1", sep = "/"))
write(databasec, file = paste(databasfile, "D2", sep = "/"))
# Build the document-term matrix from that directory and compare its two columns
myMatrix = textmatrix(databasfile)
databaseRes <- lsa::cosine(myMatrix[,1], myMatrix[,2])

# Second pair: same steps, but textmatrix() fails here
securityfile = tempfile()
dir.create(securityfile)
write(securityj, file = paste(securityfile, "D1", sep = "/"))
write(securityc, file = paste(securityfile, "D2", sep = "/"))
securityMatrix = textmatrix(securityfile)
securityRes <- lsa::cosine(securityMatrix[,1], securityMatrix[,2])
I get this error when running textmatrix(securityfile):
Error in FUN(X[[i]], ...) : [lsa] - could not open file C:\Users\AAA\AppData\Local\Temp\RtmpIDmcl7\file1898438fde2/D1 due to encoding problems of the file.
With databasfile everything works, but with securityfile I get the error, even though both data sets are taken from the same original file. Note that I create the file and then read it immediately. I tried changing the encoding of the original file and made sure it is UTF-8, but nothing changed.
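One way to check whether the vector itself contains characters that could trigger an encoding error is to look for bytes outside the ASCII range before writing the file. This is only a minimal sketch using base R: `bigrams` is a hypothetical stand-in for securityj, and `has_non_ascii` is a helper name made up for this example.

```r
# Hypothetical stand-in for securityj; the second entry contains a
# non-breaking space (U+00A0), which looks like a normal space when printed.
bigrams <- c("risk assessment", "best\u00a0practices", "kill chain")

# TRUE for every element that contains at least one byte above 0x7f,
# i.e. at least one character outside plain ASCII.
has_non_ascii <- function(x) {
  vapply(x, function(s) any(charToRaw(s) > as.raw(0x7f)),
         logical(1), USE.NAMES = FALSE)
}

# Positions and values of the suspicious entries
suspicious <- which(has_non_ascii(bigrams))
print(suspicious)
print(bigrams[suspicious])

# validUTF8() additionally reports strings whose bytes are not valid UTF-8
print(validUTF8(bigrams))
```

Running the same check on databasej and databasec should show whether only the failing pair contains such characters.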
textmatrix is a function in the lsa library, and my data is two lists of bigrams taken from cleaned job ads.
Both pairs, (databasej, databasec) and (securityj, securityc), came from the same text file; it worked for the first pair, but I get the error for the second.
As for the separator sep="/", it matches what the function's documentation asks for.
Sample input from securityj:
[333] "risk assessment" "beginning darkmatter"
[335] "best practices" "create dream"
[337] "darkmatter agile" "darkmatter bring"
[339] "darkmatter impossible" "darkmatter place"
[341] "drive lead" "education drive"
[343] "experience education" "forensic analysis"
[345] "freedom create" "knowledge network"
[347] "lead missing" "missing freedom"
[349] "offers personal" "perl python"
[351] "related security" "security risks"
[353] "standard operating" "windows linux"
[355] "security controls" "systems security"
[357] "advice guidance" "application penetration"
[359] "certified information" "forensics malware"
[361] "guidance areas" "networks applications"
[363] "new era" "practice advice"
[365] "provisioning best" "security certified"
[367] "web application" "government oil"
[369] "kill chain" "network based"
[371] "risk assessments" "technical experience"
[373] "audit compliance" "business units"
I changed the file encoding to ANSI, and it worked.
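An alternative that avoids changing the file encoding by hand is to strip non-ASCII characters from the vector before writing it. The following is a minimal sketch, assuming those characters are not needed for the comparison; `bigrams`, `clean` and `outdir` are hypothetical names, and iconv() is base R.

```r
# Hypothetical stand-in for securityj with one non-ASCII character (U+00A0)
bigrams <- c("risk assessment", "best\u00a0practices", "kill chain")

# Convert to ASCII; bytes that cannot be converted are replaced by `sub`
# (here a space) instead of turning the whole element into NA.
clean <- iconv(bigrams, from = "UTF-8", to = "ASCII", sub = " ")

# Write the cleaned vector the same way as in the code above
outdir <- tempfile()
dir.create(outdir)
write(clean, file = paste(outdir, "D1", sep = "/"))
```

The directory in `outdir` can then be passed to textmatrix() as before. Note that replacing characters can change tokens, so it is worth inspecting the affected entries first.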