
How to convert a set of Unicode .txt files to ANSI for text analysis in R


I am using R on Windows 10 x64. I am trying to read a set of .txt files into R to do text analysis, using the following code:

library(tm)

setwd(inputdir)
files <- DirSource(directory = inputdir, encoding = "UTF-8")
docs <- VCorpus(x = files)
writeLines(as.character(docs[[2]]))

The last line is intended to print the content of document #2, but it comes out empty (as do all the other documents in the set), and I am not sure why. I checked the encoding of the .txt files (open in Notepad, then choose "Save As") and it is listed as "Unicode." When I manually re-save a file as "ANSI", writeLines(as.character(docs[[2]])) prints its content correctly. So I thought I should convert all files to ANSI. How can I do that in R for all .txt files in my "inputdir"?


Solution

  • Get all .txt files:

    files <- list.files(path=getwd(), pattern="*.txt", full.names=T, recursive=FALSE)
    

    Loop over the files, convert the encoding, and overwrite each file:

    for (i in seq_along(files)) {
      # Notepad's "Unicode" is UTF-16LE; read with that encoding
      con <- file(files[i], encoding = "UTF-16LE")
      input <- readLines(con)
      close(con)
      # re-encode to the Windows "ANSI" code page (CP1252 on Western systems)
      converted_input <- iconv(input, to = "CP1252")
      writeLines(converted_input, files[i])
    }
    

    The available encoding names can be listed with the iconvlist() command.
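
    Since the exact encoding names returned by iconvlist() vary by platform, a small sketch for narrowing down the ones relevant here (the "1252" and "UTF-16" name fragments are assumptions based on common Windows setups):

    # List all encodings R's iconv knows about, then filter for the two
    # families involved: Windows "ANSI" (usually CP1252) and Notepad's
    # "Unicode" (UTF-16LE)
    encs <- iconvlist()
    grep("1252", encs, value = TRUE)
    grep("UTF-16", encs, value = TRUE)

    Picking names from these filtered lists avoids guessing a label that iconv() on your system does not recognize.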