Writing a Japanese title into an NFO file produces garbled characters


Below is the original NFO file, in the format that Emby uses:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<movie>
  <plot />
  <outline />
  <lockdata>false</lockdata>
  <dateadded>2023-02-22 21:52:29</dateadded>
  <title>old title</title>
  <sorttitle>old title</sorttitle>
  <runtime>119</runtime>
  <fileinfo>
    <streamdetails>
      <video>
        <codec>h264</codec>
        <micodec>h264</micodec>
        <bitrate>5744052</bitrate>
        <width>1920</width>
        <height>1080</height>
        <aspect>16:9</aspect>
        <aspectratio>16:9</aspectratio>
        <framerate>29.96973</framerate>
        <language>und</language>
        <scantype>progressive</scantype>
        <default>True</default>
        <forced>False</forced>
        <duration>119</duration>
        <durationinseconds>7168</durationinseconds>
      </video>
      <audio>
        <codec>aac</codec>
        <micodec>aac</micodec>
        <bitrate>256000</bitrate>
        <language>und</language>
        <scantype>progressive</scantype>
        <channels>2</channels>
        <samplingrate>48000</samplingrate>
        <default>True</default>
        <forced>False</forced>
      </audio>
    </streamdetails>
  </fileinfo>
</movie>

I am trying to update the title with the Python script below:

import xml.etree.ElementTree as ET

title = "千と千尋の神隠し"

# Load the NFO file
filename = "movie.nfo"
tree = ET.parse(filename)
root = tree.getroot()

# Find the <title> tag and replace its text value with the new title
title_elem = root.find("title")
title_elem.text = title

# Write the updated XML structure to the NFO file
tree.write(filename, encoding="utf-8", xml_declaration=True)

But after I run the script, the title shows up as garbled characters:

<title>σìâπü¿σìâσ░ïπü«τÑ₧ΘÜáπüù</title>

I know this must be an encoding issue, but I do not know how to solve it.

The NFO file should be updated to:

<title>千と千尋の神隠し</title>

Solution

  • This is a case of mojibake: the UTF-8 bytes of the title are being displayed as if they were CP437:

    print("千と千尋の神隠し".encode('utf-8').decode('cp437'))
    
    σìâπü¿σìâσ░ïπü«τÑ₧ΘÜáπüù
    

    The problem is the .NFO file extension:

    The NFO file extension is used for a Warez Information File developed by THG. NFO file is basically pirated information pertaining to a software or program that is released and distributed by any organized group without the knowledge or permission of the creator or owner of such programs…

    Wikipedia's .nfo article says that NFO files often contain elaborate ANSI art (similar to ASCII art, but constructed from a larger set of 256 letters, numbers, and symbols: all the codes found in IBM code page 437, often referred to as extended ASCII).

    Oddly enough, *.nfo files are always opened as OEM-US (CP437), even in Notepad++ (see this issue on GitHub).

    Result: your file is correctly encoded as UTF-8. Only the program you use to view it decodes it as CP437.

    Proof #1:

    import xml.etree.ElementTree as ET
    
    # Load the NFO file
    filename = "movie.nfo"
    tree = ET.parse(filename)
    root = tree.getroot()
    
    # Find the <title> tag
    title_elem = root.find("title")
    print(title_elem.text)
    
    千と千尋の神隠し
    

    Proof #2:

    filename = "movie.nfo"
    with open(filename, mode='r', encoding='utf-8') as fnfo:
        lines = fnfo.readlines()
    
    print([line for line in lines if '<title>' in line])
    
    ['  <title>千と千尋の神隠し</title>\n']
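The same point can be checked at the byte level. The sketch below is only an illustration, not part of the original script: it writes a tiny stand-in document (a hypothetical two-element `<movie>`, much smaller than the real Emby file) to a temporary `movie.nfo` with the same `tree.write(...)` call as in the question, then decodes the identical bytes once as UTF-8 and once as CP437. It also shows that the garbled text can be turned back into the original title, which is only possible because no bytes were lost.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

title = "千と千尋の神隠し"

# Hypothetical stand-in for the real NFO content (assumption: only <title> matters here).
root = ET.fromstring("<movie><title>old title</title></movie>")
root.find("title").text = title

# Write the file the same way the question's script does, then read the raw bytes back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "movie.nfo")
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
    with open(path, "rb") as f:
        raw = f.read()

# One set of bytes, two interpretations.
as_utf8 = raw.decode("utf-8")    # what an XML parser (and Emby) should see
as_cp437 = raw.decode("cp437")   # what a viewer that assumes OEM-US shows

# The garbled form of the title, produced the same way as in the answer above.
mojibake = title.encode("utf-8").decode("cp437")

title_in_utf8 = title in as_utf8
mojibake_in_cp437 = mojibake in as_cp437

# Reversing the wrong decoding recovers the original title.
repaired = mojibake.encode("cp437").decode("utf-8")

print(title_in_utf8, mojibake_in_cp437, repaired == title)
```

If all three checks come out true, the script in the question is already writing the file correctly, and the remaining step is to make the viewer open the file as UTF-8 rather than to change the Python code.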