Tags: java, javafx, vlcj

How can I ensure the correct display of Japanese and other foreign characters obtained from the MetaApi get() method?


I'm encountering an issue with encoding foreign characters when retrieving metadata using VLCJ in a JavaFX music player application.

This is the code I use to obtain media metadata (such as title and album) after having prepared and parsed the media:

@Override
public void mediaPlayerReady(MediaPlayer mediaPlayer) {
    long length = mediaPlayer.status().length();
    String formattedTotalDuration = StringFormatter.formatDuration(Duration.millis(length));
    MetaApi meta = mediaPlayer.media().meta();

    Platform.runLater(() -> {
        playbackController.setLblDuration(formattedTotalDuration);
        playbackController.setLblSongName(meta.get(Meta.TITLE));
        playbackController.setLblSongArtist(meta.get(Meta.ARTIST));
        playbackController.setLblSongAlbum(meta.get(Meta.ALBUM));
        playbackController.setCoverArt(new Image(meta.get(Meta.ARTWORK_URL)));

    });
}

The problem manifests when displaying titles, artists, and albums that contain characters from languages such as Japanese. For instance, meta.get(Meta.TITLE) for a song titled "01.私と浪漫ていすと" prints �?�?�浪漫�?��?��?��?� to the console, and the UI label shows the same garbled text. By contrast, text that contains only ASCII characters is displayed correctly.

I would like to know whether there is any way to ensure that the text returned by the MetaApi is decoded correctly. I've tried setting the system property System.setProperty("file.encoding", "UTF-8"); and re-encoding the result of meta.get(Meta.TITLE) as UTF-8 manually, but neither worked. Maybe the issue isn't encoding at all?
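For reference, the "manual UTF-8" repair I tried amounts to a round-trip like the sketch below (ISO-8859-1 stands in here for whichever single-byte charset mis-decoded the bytes, since it maps every byte losslessly). This only recovers the text when the original decode lost no information; once characters have been replaced with �, the original bytes are gone:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeRoundTrip {
    public static void main(String[] args) {
        String original = "私と浪漫ていすと";

        // Simulate the failure: UTF-8 bytes decoded with a wrong single-byte
        // charset (ISO-8859-1 maps every byte to a character, so the damage
        // is reversible; a decode that produced '?' would be lossy).
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        String mojibake = new String(utf8Bytes, StandardCharsets.ISO_8859_1);

        // The repair: recover the raw bytes, then re-decode them as UTF-8.
        byte[] recovered = mojibake.getBytes(StandardCharsets.ISO_8859_1);
        String repaired = new String(recovered, StandardCharsets.UTF_8);

        System.out.println(repaired.equals(original)); // true
    }
}
```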

I appreciate any guidance or suggestions.

My setup: Java 21, VLCJ 4.8.2, Windows 11.

UPDATE:

I looked into how to detect text encoding, as suggested by @Mike'Pomax'Kamermans. I wrote the output of meta.get(Meta.TITLE) to a .txt file and detected its encoding with juniversalchardet's UniversalDetector:

import org.mozilla.universalchardet.UniversalDetector;

import java.io.*;
import java.nio.charset.Charset;

public class StringEncodingConverter {

    public static void main(String[] args) {
        try {
            // Detect text encoding
            String filePath = "C:\\Users\\myUser\\Documents\\juniversalcharset\\data.txt";
            Charset detectedCharset = detectCharset(filePath);
            if (detectedCharset != null) {
                System.out.println(detectedCharset.toString()); // <- Got 'UTF-8'
                // Convert to Unicode

                String unicodeText = convertToUnicode(filePath, detectedCharset);
                System.out.println("Converted text:\n" + unicodeText);
            } else {
                System.out.println("Failed to detect text encoding.");
            }

        } catch (IOException e) {
            e.printStackTrace(); // don't swallow the exception silently
        }
    }

    private static Charset detectCharset(String filePath) throws IOException {
        try (FileInputStream fis = new FileInputStream(filePath); BufferedInputStream bis = new BufferedInputStream(fis)) {

            UniversalDetector detector = new UniversalDetector(null);

            byte[] buf = new byte[4096];
            int bytesRead;
            while ((bytesRead = bis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, bytesRead);
            }

            detector.dataEnd();
            String charsetName = detector.getDetectedCharset();
            if (charsetName != null) {
                return Charset.forName(charsetName);
            }
        }
        return null;
    }

    private static String convertToUnicode(String filePath, Charset charset) throws IOException {
        try (FileInputStream fis = new FileInputStream(filePath); InputStreamReader isr = new InputStreamReader(fis, charset); BufferedReader reader = new BufferedReader(isr)) {

            StringBuilder result = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                result.append(line).append("\n");
            }

            return result.toString();
        }
    }
}

It turns out that the text returned by the MetaApi was detected as UTF-8 already, so converting it to UTF-8, as expected, made no difference. Unfortunately, passing a string obtained from meta.get() through a UniversalDetector directly gave the same result. I'm no expert at all, but this leads me to believe that something may be wrong with the metadata itself or with how it is processed.

UPDATE 2:

I forgot to mention that I used to use jaudiotagger-3.0.1 to obtain metadata from songs before, and it worked as expected. I switched to using VLCJ's MetaAPI in an attempt to make my project more cohesive and decrease the number of dependencies, in addition to increasing performance by avoiding creating File instances for every song. In the worst case, I may go back to using it.


Solution

  • This is not a vlcj/native-streams issue specifically; rather, it became a problem when the JNA dependency version that vlcj and native-streams use was bumped past 5.9.0.

    With JNA 5.9.0, the strings are not garbled, but switching to 5.10.0 (or later) will lead to garbled strings.
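
    Until the root cause is addressed, one pragmatic workaround is to pin the older JNA version explicitly (assuming a Maven build; adjust accordingly for Gradle):

    ```xml
    <dependency>
        <groupId>net.java.dev.jna</groupId>
        <artifactId>jna</artifactId>
        <version>5.9.0</version>
    </dependency>
    ```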

    This change in JNA is a possible reason - "Update native encoding detection for JEP400", https://github.com/java-native-access/jna/issues/1393

    This is the code in JNA from 5.9.0, in Native.java:

        public static final Charset DEFAULT_CHARSET = Charset.defaultCharset();
        public static final String DEFAULT_ENCODING = Native.DEFAULT_CHARSET.name();
    

    This is the code in JNA from 5.10.0:

    static {
        // JNA used the defaultCharset to determine which encoding to use when
        // converting strings to native char*. The defaultCharset is set from
        // the system property file.encoding. Up to JDK 17 its value defaulted
        // to the system default encoding. From JDK 18 onwards its default value
        // changed to UTF-8.
        // JDK 18+ exposes the native encoding as the new system property
        // native.encoding, prior versions don't have that property and will
        // report NULL for it.
        // The algorithm is simple: If native.encoding is set, it will be used
        // else the original implementation of Charset#defaultCharset is used
        String nativeEncoding = System.getProperty("native.encoding");
        Charset nativeCharset = null;
        if (nativeEncoding != null) {
            try {
                nativeCharset = Charset.forName(nativeEncoding);
            } catch (Exception ex) {
                LOG.log(Level.WARNING, "Failed to get charset for native.encoding value : '" + nativeEncoding + "'", ex);
            }
        }
        if (nativeCharset == null) {
            nativeCharset = Charset.defaultCharset();
        }
        DEFAULT_CHARSET = nativeCharset;
        DEFAULT_ENCODING = nativeCharset.name();
    }
    

    You are using JDK version 21, so you would be impacted by this new code.

    You should check the value of the native.encoding system property when your program runs, and explicitly set it to "UTF-8" if it is not that already.
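
    A minimal sketch of that check follows. The property name is real (introduced in JDK 18); note that setting it programmatically only helps if it runs before any JNA class is loaded, because JNA captures the value in a static initializer:

    ```java
    import java.nio.charset.Charset;

    public class NativeEncodingCheck {
        public static void main(String[] args) {
            // JDK 18+ exposes the platform encoding here; older JDKs return null
            String nativeEncoding = System.getProperty("native.encoding");
            System.out.println("native.encoding = " + nativeEncoding);
            System.out.println("defaultCharset  = " + Charset.defaultCharset());

            // Force JNA to use UTF-8 for its char* conversions. This must run
            // before the first JNA class loads, since JNA reads the property
            // in a static initializer.
            if (!"UTF-8".equals(nativeEncoding)) {
                System.setProperty("native.encoding", "UTF-8");
            }
        }
    }
    ```

    On Windows with a Western locale, native.encoding typically reports the ANSI code page (e.g. Cp1252), which would explain the garbled CJK text.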