Tags: xml, applescript, javascript-automation

Writing non-ASCII characters to XML/UTF-8


I have a script that assembles an XML document via string manipulation (I wrote it before I discovered the XML Suite).

When the text includes certain characters, such as £, – (en dash) and — (em dash) (I suspect all non-ASCII characters), they're replaced with the Unicode replacement character (U+FFFD).

This only happens when the document starts with an XML declaration, i.e. <?xml. Making any change at all to that prefix fixes the problem, and the file then contains what I'd expect. My assumption is that AppleScript is trying to parse the string as XML, but I want it passed through as a plain string.

I'm writing in JXA, but I've included the AppleScript equivalent, as I think the issue is with OSA and there are likely more AppleScript users!

Edit: OK, I guess this is more of an encoding issue. Reading the file as UTF-8 (which the XML I'm generating should be) shows the replacement character, but reading it as Western or Mac Roman displays the characters correctly. UTF-8 definitely supports these characters, though, so I'm not sure of the best way forward.

Edit 2: Just to be clear, I think the non-ASCII characters are being encoded in something other than UTF-8, which makes my XML output invalid. How can I get AppleScript or JXA to encode non-ASCII characters as UTF-8?
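
The mismatch suspected in the edits can be reproduced outside OSA. The following is a minimal Node.js sketch, not output from the real script: it assumes the file ends up holding the single legacy byte 0xA3 for £ (its value in both Mac Roman and Latin-1), and the variable names are purely illustrative.

```javascript
// Assumption: a legacy single-byte encoding stored "£" as the lone byte 0xA3.
const legacyBytes = Buffer.from([0xA3])

// The same character encoded as UTF-8 takes two bytes: 0xC2 0xA3.
// ('\u00A3' is "£", written as an escape so this file's own encoding does not matter.)
const utf8Bytes = Buffer.from('\u00A3', 'utf8')

// A UTF-8 reader cannot decode a lone 0xA3, so it substitutes U+FFFD.
const misread = legacyBytes.toString('utf8')

// Decoding the real UTF-8 bytes round-trips to the original character.
const correct = utf8Bytes.toString('utf8')

console.log(misread === '\uFFFD') // true
console.log(correct === '\u00A3') // true
```

If that assumption holds, the characters are not lost when writing; they are just stored in an encoding that a UTF-8 reader (or an XML parser trusting a UTF-8 declaration) rejects.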

AppleScript

-- Sample text (same value as in the JXA version below)
set text1 to "<?xml £"
set dt to path to desktop as text
set filePath to dt & "test1.txt"

writeTextToFile(text1, filePath, true)

-- using the example handler from the Mac Automation Scripting Guide
on writeTextToFile(theText, theFile, overwriteExistingContent)
    try

        -- Convert the file to a string
        set theFile to theFile as string

        -- Open the file for writing
        set theOpenedFile to open for access file theFile with write permission

        -- Clear the file if content should be overwritten
        if overwriteExistingContent is true then set eof of theOpenedFile to 0

        -- Write the new content to the file
        write theText to theOpenedFile starting at eof

        -- Close the file
        close access theOpenedFile

        -- Return a boolean indicating that writing was successful
        return true

        -- Handle a write error
    on error

        -- Close the file
        try
            close access file theFile
        end try

        -- Return a boolean indicating that writing failed
        return false
    end try
end writeTextToFile

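JavaScript for Automation

// Standard Additions (openForAccess, write, ...) live on the current application
var app = Application.currentApplication()
app.includeStandardAdditions = true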

function writeTextToFile(text, file, overwriteExistingContent) {
    try {

        // Convert the file to a string
        var fileString = file.toString()

        // Open the file for writing
        var openedFile = app.openForAccess(Path(fileString), { writePermission: true })

        // Clear the file if content should be overwritten
        if (overwriteExistingContent) {
            app.setEof(openedFile, { to: 0 })
        }

        // Write the new content to the file
        app.write(text, { to: openedFile, startingAt: app.getEof(openedFile) })

        // Close the file
        app.closeAccess(openedFile)

        // Return a boolean indicating that writing was successful
        return true
    }
    catch(error) {

        try {
            // Close the file
            app.closeAccess(file)
        }
        catch(error) {
            // Report the error if closing failed
            console.log(`Couldn't close file: ${error}`)
        }

        // Return a boolean indicating that writing failed
        return false
    }
}

var text = "<?xml £"
var file = Path("/Users/benfrearson/Desktop/text.txt")

writeTextToFile(text, file, true)

Solution

  • In AppleScript, you’d use write theText to theFile as «class utf8» to write UTF-8-encoded text. You can’t do that in JXA, as it gives you no way to write raw AE codes.

    I generally recommend against JXA as it’s 1. buggy and crippled, and 2. abandoned. If you like JavaScript in general you’re far better off with Node. For application automation you’re best sticking to AppleScript: while it’s a crappy language and also moribund, at least it speaks Apple events right and has half-decent documentation and community support.

    If you must use JXA, the only workaround is to write your UTF-8 file via the Cocoa APIs instead. Generating XML via string-mashing is evil and bug-prone anyway, though, so you’d probably do as well to take the opportunity to rewrite your code to use a proper XML API. (Again, with Node you’re spoiled for choice, and the hardest part will be figuring out which NPM libraries are robust and easy to use and which are junk. With AS/JXA, it’s either System Events’ XML Suite, which is slow, or Cocoa’s XML APIs, which are complex.)
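
    As a rough illustration of the two JavaScript routes mentioned above, here is a minimal sketch rather than a definitive implementation. The helper name writeUtf8 is hypothetical. The JXA branch assumes the Objective-C bridge (ObjC, $) is available and that NSString's writeToFile:atomically:encoding:error: is reached under its usual bridged name; the Node.js branch uses only the built-in fs module.

```javascript
// Hypothetical helper (not from the answer): write `text` to `path` as UTF-8.
function writeUtf8(text, path) {
    // Accept either a plain string or a JXA Path object.
    const target = String(path)

    if (typeof ObjC !== 'undefined') {
        // JXA branch: hand the string to Cocoa and let NSString do the encoding.
        ObjC.import('Foundation')
        const str = $.NSString.alloc.initWithUTF8String(text)
        // Assumed bridged name for writeToFile:atomically:encoding:error:
        return str.writeToFileAtomicallyEncodingError(
            target, true, $.NSUTF8StringEncoding, null)
    }

    // Node.js branch: fs encodes the string as UTF-8 when asked to.
    require('fs').writeFileSync(target, text, 'utf8')
    return true
}
```

    Either branch replaces the file's contents rather than appending, so it corresponds to calling the question's handler with overwriteExistingContent set to true. A call might look like writeUtf8("<?xml £", "/tmp/test.xml"), where the path is only a placeholder.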