
How to deal with accents in Scala XML?


I have this code to load XML from an HTML webpage:

import scala.xml._ 
import scala.xml.factory.XMLLoader 
import scala.xml.parsing.NoBindingFactoryAdapter
import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl

object XmlUtils {
  def load(s: String) = {
    val adapter = new NoBindingFactoryAdapter
    val factory = (new SAXFactoryImpl())
    val loader = XML.withSAXParser(factory.newSAXParser())
    scala.xml.Utility.trim(loader.loadString(s))   
  }: Node
}

The code loads the XML well, except for the &Xacute; entities (such as &aacute;), which show up as '?' in the terminal output.

I'm new to the Java ecosystem and to Scala, so I'm pretty lost.

How can I fix that?

----- more info

I'm using Dispatch to fetch the HTML via HTTP

url(_url) <:< mapHeaders(headers)

The environment in which I'm running the program is Akka, and I use a plain println to output the data.

This is a simple example that runs outside the Akka framework:

val s = "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\"><HTML><HEAD></HEAD><BODY>&aacute;</BODY></HTML>"
val xml = XmlUtils.load(s)
println(xml.text)

Output: ?


Solution

  • I tweaked your code a little, but it's essentially the same:

    package scratch
    
    import scala.xml._
    import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
    
    object XmlUtils {
      // Parse (possibly malformed) HTML with TagSoup and return a trimmed scala.xml Node.
      def load(s: String): Node = {
        val factory = new SAXFactoryImpl()
        val loader = XML.withSAXParser(factory.newSAXParser())
        Utility.trim(loader.loadString(s))
      }
    
      def main(args: Array[String]): Unit = {
        // &aacute; is an HTML entity, so this source string contains only ASCII characters.
        val s = "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\"><HTML><HEAD></HEAD><BODY>&aacute;</BODY></HTML>"
        val xml = XmlUtils.load(s)
        println(xml.text)
      }
    }
    

    ... and changed the "Resource -> Text File Encoding" project setting in Eclipse to "UTF-8". It now produces the following output in a console on OS X 10.9.1:

    $ scala -classpath .:../lib/tagsoup-1.2.1.jar scratch.XmlUtils
    á
    

    I suspect the project setting corresponds to passing the -encoding option to scalac.
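
Because the test string holds only the ASCII entity `&aacute;`, the '?' may also come from the output side: when a Java `PrintStream` encodes text with a charset that cannot represent 'á', it substitutes '?'. The following is a minimal sketch for checking that, assuming the JVM default charset is the culprit. `EncodingCheck` and `encode` are hypothetical names used only for illustration; they are not part of the code above.

```scala
import java.io.{ByteArrayOutputStream, PrintStream}
import java.nio.charset.Charset

object EncodingCheck {
  // Write `text` through a PrintStream that uses the named charset
  // and return the raw bytes that would reach the terminal.
  def encode(text: String, charsetName: String): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new PrintStream(buffer, true, charsetName)
    out.print(text)
    out.flush()
    buffer.toByteArray
  }

  def main(args: Array[String]): Unit = {
    // The charset that a plain println falls back to on this JVM.
    println("Default charset: " + Charset.defaultCharset())

    // Print through an explicitly UTF-8 stream instead of the default one.
    // Whether this renders correctly still depends on the terminal's own encoding.
    val utf8Out = new PrintStream(System.out, true, "UTF-8")
    utf8Out.println("\u00e1")
  }
}
```

If the reported default charset is something like US-ASCII, `println(xml.text)` cannot emit 'á' regardless of how the XML was parsed. Passing the JVM system property `-Dfile.encoding=UTF-8` at launch is a common way to change that default, though its effect can vary between JDK versions.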