R: Reading a UCS-2 LE BOM file from GitHub


I have a program which creates and stores files automatically on GitHub. An example is https://raw.githubusercontent.com/VIC-Laboratory-ExperimentalData/test/master/test-999-666.txt

However, the files are encoded on a DOS/Windows machine as UCS-2 LE BOM (according to Notepad++).

I am trying to read this text file into R but to no avail:

repo <- "https://raw.githubusercontent.com/VIC-Laboratory-ExperimentalData/test/master"
file <- "test-999-666.txt"
myurl  <- paste(repo, file, sep="/")
library(RCurl)
cnt <- getURL(myurl)

I get an error:

Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) : 
 embedded nul in string: '<ff><fe>*'

How can I configure getURL to read this file? I also tried httr::GET, but received empty content.


Solution

  • This seems to be a relatively common pain point when working with files produced on Windows. To be honest, the solution I'm presenting is probably not the best one: rather than getting everything into the right encoding, it works on the raw bytes directly.

    Using the same variables as you:

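    # Download as raw bytes so curl does not try to build a string
    cnt <- getURLContent(myurl, binary = TRUE)
    # Drop the null bytes, then convert what is left to a character string
    cnt <- rawToChar(cnt[cnt != as.raw(0)])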
    

    This should produce a parsable string.

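    The idea is that instead of having curl read the file as text, we let it treat the content as binary and deal with the encoding later. This gives us a vector of type raw. In UCS-2 LE each ASCII character is stored as two bytes, the second of which is 00, and those null bytes seem to be the main issue. So we simply exclude them from cnt before coercing it from raw to character. Note that this shortcut is only safe when the text is plain ASCII, because other characters have a non-zero second byte.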

    In the end, from your example, I get:

    "ÿþ*** Header Start ***\r\nVersionPersist: 1\r\nLevelName: Session\r\nLevelName: Block\r\nLevelName: Trial\r\nLevelName: SubTrial\r\nLevelName: LogLevel5\r\nLevelName: LogLevel6\r\nLevelName: LogLevel7\r\nLevelName: LogLevel8\r\nLevelName: LogLevel9\r\nLevelName: LogLevel10\r\nExperiment: test\r\nSessionDate: 07-04-2019\r\nSessionTime: 12:35:06\r\nSessionStartDateTimeUtc: 2019-07-04 16:35:06\r\nSubject: 999\r\nSession: 666\r\nDataFile.Basename: test-999-666\r\nRandomSeed: -1018314635\r\nGroup: 1\r\nDisplay.RefreshRate: 60.005\r\n*** Header End ***\r\nLevel: 1\r\n*** LogFrame Start ***\r\nExperiment: test\r\nSessionDate: 07-04-2019\r\nSessionTime: 12:35:06\r\nSessionStartDateTimeUtc: 2019-07-04 16:35:06\r\nSubject: 999\r\nSession: 666\r\nDataFile.Basename: test-999-666\r\nRandomSeed: -1018314635\r\nGroup: 1\r\nDisplay.RefreshRate: 60.005\r\nClock.Information: <?xml version=\"1.0\"?>\\n<Clock xmlns:dt=\"urn:schemas-microsoft-com:datatypes\"><Description dt:dt=\"string\">E-Prime Primary Realtime Clock</Description><StartTime><Timestamp dt:dt=\"int\">0</Timestamp><DateUtc dt:dt=\"string\">2019-07-04T16:35:05Z</DateUtc></StartTime><FrequencyChanges><FrequencyChange><Frequency dt:dt=\"r8\">2742255</Frequency><Timestamp dt:dt=\"r8\">492902384024</Timestamp><Current dt:dt=\"r8\">0</Current><DateUtc dt:dt=\"string\">2019-07-04T16:35:05Z</DateUtc></FrequencyChange></FrequencyChanges></Clock>\\n\r\nStudioVersion: 2.0.10.252\r\nRuntimeVersion: 2.0.10.356\r\nRuntimeVersionExpected: 2.0.10.356\r\nRuntimeCapabilities: Professional\r\nExperimentVersion: 1.0.0.543\r\nExperimentStuff.RT: 2555\r\n*** LogFrame End ***\r\n"
    

    This seems to contain all the right content. The leading ÿþ is the byte order mark (bytes FF FE) displayed as Latin-1 characters.

    If you want, you can try adding options(encoding = "UCS-2LE-BOM") before this code. I don't know whether it changes anything, but it seems to affect rawToChar.
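
As an alternative to dropping the null bytes, the raw vector can be converted with base R's iconv(), which accepts a list of raw vectors. The following is only a minimal sketch: decode_utf16le is a hypothetical helper name, sample_bytes is a made-up stand-in for the downloaded content, and it assumes the local iconv implementation knows the "UTF-16LE" encoding name. Unlike the null-stripping approach, it removes the byte order mark and keeps non-ASCII characters intact.

```r
# Hypothetical helper (not part of RCurl): decode UTF-16LE bytes to UTF-8 text.
decode_utf16le <- function(bytes) {
  bom <- as.raw(c(0xff, 0xfe))
  # Drop the byte order mark if the data starts with one
  if (length(bytes) >= 2 && identical(bytes[1:2], bom)) {
    bytes <- bytes[-(1:2)]
  }
  # iconv() takes a list of raw vectors, so the null bytes are not a problem
  iconv(list(bytes), from = "UTF-16LE", to = "UTF-8")
}

# Stand-in for the downloaded content: BOM, "H", e-acute, CR, LF in UTF-16LE
sample_bytes <- as.raw(c(0xff, 0xfe, 0x48, 0x00, 0xe9, 0x00,
                         0x0d, 0x00, 0x0a, 0x00))
txt <- decode_utf16le(sample_bytes)

# Split the decoded text into lines on the Windows line ending
lines <- strsplit(txt, "\r\n", fixed = TRUE)[[1]]
```

With the variables from the question, the call would presumably be decode_utf16le(getURLContent(myurl, binary = TRUE)), though I have not run it against the real file.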