r, cosine-similarity, file-encodings

Unable to read temp file in R: file encoding error


I am using R and have two sets of data (securityj and securityc). I want to find the cosine similarity between them.

I used the following code, which relies on the lsa library:

library(lsa)

# First data set: write each vector to its own file in a fresh temp directory
databasfile <- tempfile()
dir.create(databasfile)
write(databasej, file = paste(databasfile, "D1", sep = "/"))
write(databasec, file = paste(databasfile, "D2", sep = "/"))
myMatrix <- textmatrix(databasfile)

databaseRes <- lsa::cosine(myMatrix[, 1], myMatrix[, 2])

# Second data set: same steps, but textmatrix() fails here
securityfile <- tempfile()
dir.create(securityfile)
write(securityj, file = paste(securityfile, "D1", sep = "/"))
write(securityc, file = paste(securityfile, "D2", sep = "/"))
securityMatrix <- textmatrix(securityfile)

securityRes <- lsa::cosine(securityMatrix[, 1], securityMatrix[, 2])

I get this error when running textmatrix(securityfile):

Error in FUN(X[[i]], ...) : [lsa] - could not open file C:\Users\AAA\AppData\Local\Temp\RtmpIDmcl7\file1898438fde2/D1 due to encoding problems of the file.

With databasfile everything works, but with securityfile I get the error, even though both data sets are taken from the same original file. I create the file and then read it immediately. I tried changing the encoding of the original file and made sure it is UTF-8, but nothing changed.

textmatrix is a function in the lsa library. My data is two lists of bigrams taken from cleaned job ads. Both pairs, (databasej, databasec) and (securityj, securityc), came from the same text file; the first pair works, but the second gives the error. As for the separator sep="/", it is what the function's documentation asks for.
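
For reference, the cosine similarity between two bigram lists can also be computed in memory with base R, which avoids the temp-file round trip altogether. This is only a sketch under assumptions: `j` and `c_ads` are hypothetical stand-ins for securityj and securityc, each whole bigram is counted as one term, and the result is not expected to match what lsa produces from the files.

```r
# Hypothetical stand-ins for securityj and securityc.
j     <- c("risk assessment", "best practices", "kill chain", "risk assessment")
c_ads <- c("risk assessment", "web application", "kill chain")

# Count every bigram over a shared vocabulary so both vectors line up.
vocab   <- sort(unique(c(j, c_ads)))
count_j <- as.numeric(table(factor(j, levels = vocab)))
count_c <- as.numeric(table(factor(c_ads, levels = vocab)))

# Cosine similarity: dot product divided by the product of the vector norms.
cosine_sim <- sum(count_j * count_c) /
  (sqrt(sum(count_j^2)) * sqrt(sum(count_c^2)))
cosine_sim
```

Building both count vectors from the same factor levels is what keeps the positions comparable; bigrams missing from one list simply get a count of zero.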

Sample input in securityj:

 [333] "risk assessment"               "beginning darkmatter"         
 [335] "best practices"                "create dream"                 
 [337] "darkmatter agile"              "darkmatter bring"             
 [339] "darkmatter impossible"         "darkmatter place"             
 [341] "drive lead"                    "education drive"              
 [343] "experience education"          "forensic analysis"            
 [345] "freedom create"                "knowledge network"            
 [347] "lead missing"                  "missing freedom"              
 [349] "offers personal"               "perl python"                  
 [351] "related security"              "security risks"               
 [353] "standard operating"            "windows linux"                
 [355] "security controls"             "systems security"             
 [357] "advice guidance"               "application penetration"      
 [359] "certified information"         "forensics malware"            
 [361] "guidance areas"                "networks applications"        
 [363] "new era"                       "practice advice"              
 [365] "provisioning best"             "security certified"           
 [367] "web application"               "government oil"               
 [369] "kill chain"                    "network based"                
 [371] "risk assessments"              "technical experience"         
 [373] "audit compliance"              "business units"               
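
Since the error message points at encoding, one thing worth checking is whether a vector like the one above contains any non-ASCII characters, for example a non-breaking space or a curly quote carried over from the job ads. A minimal sketch in base R follows. `bigrams` is a hypothetical three-element vector standing in for securityj, and the non-breaking space in it is invented for illustration, not taken from the real data.

```r
# Hypothetical stand-in for securityj; the second entry contains a
# non-breaking space (U+00A0) purely for illustration.
bigrams <- c("risk assessment", "best\u00a0practices", "kill chain")

# Try to convert to ASCII; entries that cannot be converted become NA.
as_ascii  <- iconv(bigrams, from = "UTF-8", to = "ASCII")
non_ascii <- which(is.na(as_ascii))

non_ascii            # positions of the suspicious entries
bigrams[non_ascii]   # the entries themselves
Encoding(bigrams)    # encoding R has declared for each string
```

If `non_ascii` is empty for one data set and not for the other, that would be consistent with one set writing cleanly and the other failing.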

Solution

  • I changed the file encoding to ANSI, and it worked
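
The answer above changed the encoding of the source file. A related option, not taken from the answer and not tested against lsa::textmatrix, is to reduce the strings to plain ASCII in R before writing them, so the temp file contains the same bytes whether it is read as UTF-8 or as an ANSI code page. A rough sketch, assuming the bigrams are English text and that losing non-ASCII characters is acceptable; `securityj_demo`, `out_dir` and `out_file` are hypothetical names:

```r
# Hypothetical stand-in for securityj, with one non-ASCII character (U+00A0).
securityj_demo <- c("risk assessment", "best\u00a0practices")

# Replace every byte that has no ASCII equivalent with a space, then
# collapse the runs of spaces left behind by multi-byte characters.
ascii_only <- iconv(securityj_demo, from = "UTF-8", to = "ASCII", sub = " ")
ascii_only <- gsub(" +", " ", ascii_only)

# Write to a fresh temp directory, mirroring the layout used in the question.
out_dir <- tempfile()
dir.create(out_dir)
out_file <- paste(out_dir, "D1", sep = "/")
writeLines(ascii_only, out_file)

# Read the file back and confirm the round trip is lossless.
back <- readLines(out_file)
identical(back, ascii_only)
```

This conversion is lossy by design. If accented or other non-ASCII characters matter for the analysis, fixing the encoding at the source, as in the answer, is the safer route.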