rhadoop

String character in RHDFS output


The hdfs.write() command in rhdfs creates a file with a leading non-unicode character. The documentation doesn't describe the file type being written.

Steps to recreate: open R, initialize rhdfs, then run:

> ofile = hdfs.file("brian.txt", "w")
> hdfs.write("hi",ofile)
> hdfs.close(ofile)

This creates a file called "brian.txt", which I would expect to contain a single string, "hi". But printing the file reveals an extra character at the beginning.

$ hdfs dfs -cat brian.txt
X
    hi

I have no idea what file type is created and rhdfs doesn't show any file type options. This makes the output very difficult to use.


Solution

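  • If you look at the hdfs.write function in the source code, you can see that it accepts raw bytes as well as ordinary R objects. Anything that is not already raw gets serialized by R first, and the leading "X" you are seeing is the header of R's binary serialization format, not part of your string. To write plain text, convert the string to raw bytes yourself: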

    ofile = hdfs.file("brian.txt", "w")
    hdfs.write(charToRaw("hi"), ofile)
    hdfs.close(ofile)
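
You can see the difference without touching HDFS at all. The sketch below uses only base R and assumes that hdfs.write falls back to something equivalent to serialize(object, NULL) for non-raw input, which is what the source code suggests; the variable names are just for illustration.

```r
# Base R only; no Hadoop connection is needed for this comparison.

# What R produces when it serializes the string for you
# (assumed to match what hdfs.write does for non-raw objects).
serialized <- serialize("hi", connection = NULL)

# What you get when you convert the string to raw bytes yourself.
plain <- charToRaw("hi")

# The serialized form starts with the bytes 0x58 0x0a: "X" plus a newline.
print(serialized[1:2])

# The plain form holds only the two bytes of the text itself.
print(plain)

# The serialized form is much longer than the text it carries.
print(length(serialized) > length(plain))

# Both forms recover the original string, each with its own decoder.
print(unserialize(serialized))
print(rawToChar(plain))
```

So if another R process will read the file, the serialized output is still usable through unserialize(). If anything else will read it (hdfs dfs -cat, Hive, a MapReduce job), writing the charToRaw() bytes is the safer choice.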