
How to import a file to MongoDB with escaped unicode sequences?


I'm working in R with a dataframe containing a column with escaped unicode sequences:

d <- data.frame(id = 1, norm = "m\u0350pini\u0306\u030Ds\u0313u")

I ultimately need to import the data frame into MongoDB (I'm using Compass) so that the corresponding Unicode characters are displayed correctly. I tried saving it to a simple tab-delimited text file, but MongoDB treats the Unicode column as a plain string, so I tried saving it as JSON instead:

library(jsonlite)

j <- toJSON(d, dataframe = "rows", pretty = TRUE)

write(j,"jstest.json")

However, this automatically adds another backslash to the escaped sequences, giving m\\u0350pini\\u0306\\u030Ds\\u0313u, which again MongoDB does not interpret as unicode.
If I manually insert a document into MongoDB with single backslashes, the Unicode symbols appear, but this is very impractical for me (thousands of documents).
What am I doing wrong?
Thanks for the help.
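
The doubled backslash described above can be reproduced in isolation. The following is a minimal sketch, assuming the real column holds a literal backslash followed by "u0350" (for example because the text was read from a file) rather than a character produced by R's parser. The variable names `parsed` and `literal` are made up for illustration.

```r
library(jsonlite)

# R's parser turns \u0350 into a single combining character.
parsed <- "m\u0350p"

# Here the string really contains a backslash, then "u0350" as plain text.
literal <- "m\\u0350p"

nchar(parsed)   # 3 characters: m, U+0350, p
nchar(literal)  # 8 characters: m, \, u, 0, 3, 5, 0, p

# toJSON() must escape a real backslash to keep the JSON valid,
# so the literal version is written with a doubled backslash.
json_parsed  <- as.character(toJSON(parsed))
json_literal <- as.character(toJSON(literal))

cat(json_literal, "\n")
```

If `nchar()` on the real data matches the second case, the backslashes are part of the data itself and the JSON writer is only preserving them.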

I tried using mongoimport:

d <- data.frame(id = c(1, 2), unicode = c("m\u0350pini\u030Ds\u03131u", "a\u0350mpi\u030D\u03B7"))
js <- toJSON(d, dataframe = "rows", pretty = TRUE)
write(js, "jstest.json")

mongoimport -d test -c newcoll --type json --file jstest.json --jsonArray

However, the documents still don't display the characters:

{ "_id" : ObjectId("666b08603505c9daeb20edc7"), "id" : 1, "unicode" : "m\\u0350pini\\u030Ds\\u03131u" }
{ "_id" : ObjectId("666b08603505c9daeb20edc8"), "id" : 2, "unicode" : "a\\u0350mpi\\u030D\\u03B7" }

The only way I can get the result I want, which is exactly like @Konrad Rudolph's second comment, is to manually insert a document with single backslashes.


Solution

    1. Inside R, export the data to TSV (e.g. via ‘readr’):

      readr::write_tsv(d, 'd.tsv')
      
    2. Use mongoimport to import the data:

      mongoimport --db mydb --collection mycollection --type tsv --file d.tsv --headerline
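
The two steps can also be driven from a single R script. This is only a sketch under a few assumptions: the column holds real Unicode characters (as it does when R parses the `\u` escapes in the question's example), `readr` is installed, and `mongoimport` is on the PATH. The `run_import` switch and the temporary file path are hypothetical additions; the database and collection names are the placeholders used above.

```r
# Data frame from the question; R's parser converts the \u escapes
# into real combining characters.
d <- data.frame(id = 1, norm = "m\u0350pini\u0306\u030Ds\u0313u")

# Step 1: write the data as a UTF-8 TSV file.
tsv_path <- file.path(tempdir(), "d.tsv")
readr::write_tsv(d, tsv_path)

# Sanity check before importing: the header row is present and the
# data row contains no literal backslash sequences.
tsv_lines <- readLines(tsv_path, encoding = "UTF-8")
header_ok <- identical(tsv_lines[1], "id\tnorm")
has_backslash <- grepl("\\", tsv_lines[2], fixed = TRUE)

# Step 2: the same mongoimport call as above, built as an argument vector.
args <- c("--db", "mydb", "--collection", "mycollection",
          "--type", "tsv", "--file", shQuote(tsv_path), "--headerline")

run_import <- FALSE  # hypothetical switch; set to TRUE once MongoDB is running
if (run_import && nzchar(Sys.which("mongoimport"))) {
  system2("mongoimport", args)
}
```

If `has_backslash` is `TRUE` for the real data, the file still contains escaped text and the import will store it as written.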