I have this code to load XML from an HTML webpage:
import scala.xml._
import scala.xml.factory.XMLLoader
import scala.xml.parsing.NoBindingFactoryAdapter
import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
object XmlUtils {
  def load(s: String) = {
    val adapter = new NoBindingFactoryAdapter
    val factory = (new SAXFactoryImpl())
    val loader = XML.withSAXParser(factory.newSAXParser())
    scala.xml.Utility.trim(loader.loadString(s))
  }: Node
}
The code loads the XML fine, except for the &Xacute; symbols (e.g. á), which are shown as '?' in the terminal output.
I'm new to the Java environment and Scala, so I'm pretty lost.
How can I fix that?
----- more info
I'm using Dispatch to fetch the HTML via HTTP
url(_url) <:< mapHeaders(headers)
The environment in which I'm running the program is Akka, and I use a simple println
to output the data
This is a simple example outside of the Akka framework:
val s = "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\"><HTML><HEAD></HEAD><BODY>á</BODY></HTML>"
val xml = XmlUtils.load(s)
println(xml.text)
Output:
?
I tweaked your code a little, but it's essentially the same:
package scratch

import scala.xml._
import scala.xml.factory.XMLLoader
import scala.xml.parsing.NoBindingFactoryAdapter
import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl

object XmlUtils {
  def load(s: String) = {
    val adapter = new NoBindingFactoryAdapter
    val factory = (new SAXFactoryImpl())
    val loader = XML.withSAXParser(factory.newSAXParser())
    val node = scala.xml.Utility.trim(loader.loadString(s))
    node
  }: Node

  def main(args: Array[String]): Unit = {
    val s = "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\"><HTML><HEAD></HEAD><BODY>á</BODY></HTML>"
    val xml = XmlUtils.load(s)
    println(xml.text)
  }
}
... and changed the "Resource->Text File Encoding" project setting in Eclipse to "UTF-8". It now produces output like this in a console on OS X 10.9.1:
$ scala -classpath .:../lib/tagsoup-1.2.1.jar scratch.XmlUtils
á
I suspect the project setting corresponds to passing the -encoding option to scalac.
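If you want to check which step is mangling the character in your setup, here is a rough sketch that separates the two usual suspects: the encoding scalac assumes for the source file, and the encoding used when writing to the console. It uses only the JDK, not TagSoup. The object name EncodingCheck and its helper methods are made up for this illustration, and I'm assuming a terminal that can display UTF-8.

```scala
import java.io.{ByteArrayOutputStream, PrintStream}

// Hypothetical helper object, for illustration only.
object EncodingCheck {
  // Written as a Unicode escape, so the value does not depend on the
  // encoding the compiler assumes for this source file.
  val expected: String = "\u00e1"

  // Render each UTF-16 code unit of a string as U+XXXX, to compare what
  // was actually parsed against what was expected.
  def codeUnits(s: String): String =
    s.map(c => f"U+${c.toInt}%04X").mkString(" ")

  // Encode text through a PrintStream with an explicit charset and return
  // the bytes that would reach the terminal.
  def encodeWith(s: String, charset: String): Array[Byte] = {
    val buf = new ByteArrayOutputStream()
    val out = new PrintStream(buf, true, charset)
    out.print(s)
    out.flush()
    buf.toByteArray
  }

  def main(args: Array[String]): Unit = {
    // If a string literal typed as á in your own source does not print
    // U+00E1 here, the source file was compiled with the wrong encoding.
    println(codeUnits(expected))

    // Writing through an explicit UTF-8 stream sidesteps the platform
    // default encoding that plain println uses.
    val utf8Out = new PrintStream(System.out, true, "UTF-8")
    utf8Out.println(expected)
  }
}
```

Calling codeUnits on xml.text from your example tells you whether the parsed value is correct. If it is U+00E1 and you still see '?', the output side is the likely culprit: a PrintStream whose charset cannot represent the character (US-ASCII, for instance) substitutes '?'.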