Tags: java, linux, scala, apache-spark, nfs

Umlaut problems with Spark job writing to an NFSv3 mounted volume


I am trying to copy files to an NFSv3 mounted volume during a Spark job. Some of the file names contain umlauts. For example:

Malformed input or input contains unmappable characters: /import/nfsmountpoint/Währungszählmaske.pdf

The error occurs in the following line of Scala code:

// targetPath is a String and looks ok
import java.nio.file.Paths
val target = Paths.get(targetPath)

The file encoding is shown as ANSI X3.4-1968, although the Linux locale on the Spark machines is set to en_US.UTF-8.
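
To confirm which encodings the JVM actually picked up, the relevant system properties can be printed on the driver; this is just a diagnostic sketch, not code from the original job:

// file.encoding controls the default charset for reading and writing file contents
println(System.getProperty("file.encoding"))
// sun.jnu.encoding controls how the JVM encodes file names when talking to the OS
println(System.getProperty("sun.jnu.encoding"))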

I already tried to change the encoding settings for the Spark job itself using the following arguments:

--conf 'spark.executor.extraJavaOptions=-Dsun.jnu.encoding=UTF8 -Dfile.encoding=UTF8'

--conf 'spark.driver.extraJavaOptions=-Dsun.jnu.encoding=UTF8 -Dfile.encoding=UTF8'

This solves the error, but the filename on the target volume looks like this: /import/nfsmountpoint/W?hrungsz?hlmaske.pdf
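
To check whether the extraJavaOptions actually reach the executor JVMs, a small diagnostic job can collect the properties from the workers; this is only a sketch and assumes an existing SparkContext named sc:

// Run a tiny job and report the encodings the executor JVMs are using
val executorEncodings = sc.parallelize(Seq(1))
  .map(_ => (System.getProperty("file.encoding"), System.getProperty("sun.jnu.encoding")))
  .collect()
executorEncodings.foreach(println)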

The volume mountpoint is:

hnnetapp666.mydomain:/vol/nfsmountpoint on /import/nfsmountpoint type nfs (rw,nosuid,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=4.14.1.36,mountvers=3,mountport=4046,mountproto=udp,local_lock=none,addr=4.14.1.36)

Is there a way to fix this?


Solution

  • Solved this by setting the encoding settings as mentioned above and manually converting the file names from and to UTF-8 (see also the sketch below):

    Solution for encoding conversion

    Just using NFSv4 with UTF-8 support would have been an easier solution.
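
    The conversion code itself is not shown in the post; the following is only a minimal sketch of one common way to repair a name that was decoded with the wrong charset, assuming the bytes were UTF-8 but were interpreted as ISO-8859-1 (the charsets and the helper name are assumptions, not taken from the original solution):

    import java.nio.charset.StandardCharsets
    import java.nio.file.Paths

    // Hypothetical helper: re-decode a mis-decoded file name as UTF-8.
    // Assumes the underlying bytes were UTF-8 but were read as ISO-8859-1.
    def fixName(misDecoded: String): String =
      new String(misDecoded.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8)

    val fileName = fixName("WÃ¤hrungszÃ¤hlmaske.pdf") // yields "Währungszählmaske.pdf"
    val target   = Paths.get("/import/nfsmountpoint", fileName)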