I am trying to read from an Oracle database that stores data in Windows-1252 encoding. I read that data using JDBC and write it to an XML file with UTF-8 encoding.
While writing to these files, I get '?' characters instead of the Latin characters; for example, instead of í I get a ?.
'Coquí' is being written to the XML as 'Coqu?'.
I use this file to upload to Solr later on. I have only included the relevant code rather than the whole method, since it is long and complicated (legacy code that I have inherited).
BufferedWriter result = new BufferedWriter(new FileWriter(OUTPUT_FILE));
stmt = conn.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE, ResultSet.CONCUR_READ_ONLY);
rst = stmt.executeQuery(sql);
if (rst.getFetchSize() < 1)
return;
rst.beforeFirst();
while (rst.next()) {
Profile p = new Profile();
p.business_name = rst.getString("business_name");
p.business_name_sort = rst.getString("business_name_sort");
result.write(p.business_name);
result.write(p.business_name_sort);
}
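By the sounds of it (you haven't given us the full code, so I can't be certain) the character set conversion isn't being handled properly when the file is written. A Java String is always Unicode (UTF-16) internally, and the JDBC driver normally decodes the database's Windows-1252 data into a String for you. What Java does not do is choose UTF-8 for you on output: new FileWriter(OUTPUT_FILE) encodes with the default charset (the platform's charset on JDKs before 18), and any character that charset cannot represent is written as '?'. You've got to name the output encoding yourself.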
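You can do the following to encode the text as UTF-8:
byte[] utf8Bytes = originalText.getBytes("UTF-8");
This assumes that originalText is a String that was decoded correctly from the Windows-1252 data, for example by rst.getString(...). Note that new String(originalText.getBytes("UTF-8"), "UTF-8") only round-trips the text and returns an equal String, so it converts nothing on its own.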
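In practice the simplest fix is usually to give the writer an explicit charset rather than to convert each string. The following is a minimal sketch of that idea, not a drop-in replacement for the legacy method: the class name Utf8WriterSketch and the helper writeUtf8 are hypothetical names made up for illustration, and the sketch assumes the String coming back from JDBC is already correct.

```java
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8WriterSketch {

    // Hypothetical helper: writes text to a file, always encoding as UTF-8
    // instead of relying on the default charset that FileWriter would use.
    static void writeUtf8(Path file, String text) throws IOException {
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(
                        new FileOutputStream(file.toFile()),
                        StandardCharsets.UTF_8))) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("profiles", ".xml");

        // "Coqu\u00ed" is "Coquí"; the escape keeps this source file
        // independent of the compiler's source encoding.
        writeUtf8(out, "Coqu\u00ed");

        // í (U+00ED) takes two bytes in UTF-8 (0xC3 0xAD), so the file
        // holds 6 bytes for the 5 characters.
        byte[] bytes = Files.readAllBytes(out);
        System.out.println(bytes.length); // prints 6
    }
}
```

In the code from the question, the equivalent change would be to build result from an OutputStreamWriter with StandardCharsets.UTF_8 in place of new FileWriter(OUTPUT_FILE). If the '?' still appears after that, the text was probably already damaged before it reached the writer, and the JDBC driver or database character set settings would be the next thing to check.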