r windows azure unix azure-machine-learning-service

Azure Machine Learning batch request returns malformed umlauts on Unix systems

I have deployed my code in Azure Machine Learning and run the batch request from R on different operating systems, such as Unix and Windows 10. For some reason, the output is properly encoded only when I run R on Windows 10; on Unix systems I cannot get properly encoded output. The only way to get correct output on all systems is to download the file manually through the Azure GUI. On Windows 10, I can get the correctly encoded file directly from Rscript/RStudio. In R, I have used system("defaults write org.R-project.R force.LANG en_US.UTF-8"), as hinted here, to specify the encoding explicitly, but this has no effect on the batch request R script that is executed on the Azure servers run by Microsoft.

What is happening is that the UTF-8 bytes of each character are interpreted as Latin-1 characters, for example

ö as Ã¶

ä as Ã¤

å as Ã¥

as can be demonstrated and tested with this tool here about Latin-1 characters. So what are the best ways to deal with this encoding issue? Can it be addressed somehow inside Azure ML? Where can I file bug reports? Is there a tool in R to convert Latin-1 to UTF-8?
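On the last question: base R's iconv and Encoding can usually undo this kind of damage after the fact. The following is only a minimal sketch; the helper name fix_mojibake is made up for illustration, and it assumes the text was mis-decoded exactly once as Latin-1 (if the bytes were read as Windows-1252 instead, the `to` argument would need to change accordingly).

```r
# Reproduce the problem: the UTF-8 bytes of "ö" (0xC3 0xB6) read as Latin-1.
utf8_bytes <- as.raw(c(0xC3, 0xB6))
garbled <- rawToChar(utf8_bytes)
Encoding(garbled) <- "latin1"       # wrongly declare the bytes as Latin-1
garbled_utf8 <- enc2utf8(garbled)   # the two characters "Ã¶", stored as UTF-8

# Hypothetical helper (not part of any package): reverse a single
# Latin-1 mis-decoding of UTF-8 text.
fix_mojibake <- function(x) {
  # Map each garbled character back to the single byte it came from ...
  bytes_as_latin1 <- iconv(x, from = "UTF-8", to = "latin1")
  # ... and declare the resulting byte sequence to be UTF-8 again.
  Encoding(bytes_as_latin1) <- "UTF-8"
  bytes_as_latin1
}

repaired <- fix_mojibake(garbled_utf8)
charToRaw(repaired)   # the original UTF-8 bytes of "ö"
```

This repairs strings that are already garbled; fixing the encoding at download time is still preferable.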

How can I get properly encoded UTF-8 files with umlauts from R batch requests in Azure ML (instead of Latin-1 characters)?

Solution

The batch request R code has a saveBlobToFile function. The problem is that saveBlobToFile calls getURL without specifying an encoding, so the response is decoded incorrectly. The encoding needs to be passed to getURL explicitly. Make the following change:

blobContent = getURL(blobUrl, .encoding="UTF-8")

Without .encoding, the output is treated as ISO-8859-1 ('latin1') or as something inherited from your system.
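For completeness, here is a sketch of how the download and the file write could be kept in UTF-8 end to end. It is not the original saveBlobToFile code: the function names save_blob_utf8 and write_utf8 and their arguments are hypothetical, and the only RCurl call used is getURL with the .encoding argument shown above. Writing through a binary connection with useBytes = TRUE is one way to keep R from re-encoding the text to the session locale on output.

```r
# Hypothetical helper: write text to a file as UTF-8 bytes, independent of
# the locale of the R session.
write_utf8 <- function(text, output_file) {
  con <- file(output_file, open = "wb")   # binary mode: no newline translation
  on.exit(close(con))
  # enc2utf8 ensures UTF-8 bytes; useBytes = TRUE writes them untouched.
  writeLines(enc2utf8(text), con, sep = "", useBytes = TRUE)
}

# Hypothetical replacement for the download step (requires the RCurl package;
# it is only looked up when the function is actually called).
save_blob_utf8 <- function(blob_url, output_file) {
  # Tell RCurl explicitly that the response body is UTF-8.
  blob_content <- RCurl::getURL(blob_url, .encoding = "UTF-8")
  write_utf8(blob_content, output_file)
}

# Local check of the writing step, without any network access.
tmp <- tempfile(fileext = ".csv")
write_utf8("\u00f6", tmp)
written <- readBin(tmp, what = "raw", n = 10)   # UTF-8 bytes of "ö"
```

Calling save_blob_utf8(blobUrl, "results.csv") would then take the place of the unmodified download-and-write step, assuming blobUrl holds the result URL returned by the batch job.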