I am extracting the text of PDF documents with Tika and sending it to Solr for indexing through an XML request, using ColdFusion 9. Here is my code:
<cfset gatt = new getallthetexts.textextractor()>
<cfset result= gatt.read(pdfpath)>
<cfset content = xmlFormat(result.text)>
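<!--- Strip characters that are not valid in XML 1.0 (the \x{...} range syntax needs Java 7 or later) --->
<cfset p = createObject("java","java.util.regex.Pattern").compile("[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\x{10000}-\x{10FFFF}]+")>
<!--- replaceAll() returns a new string, so the result has to be assigned back --->
<cfset content = p.matcher(content).replaceAll("")>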
<cfxml variable="xml">
<add>
<field name="content">#content#</field>
</add>
</cfxml>
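The stripping step above can also be sketched without a regex. The following is a minimal Java sketch, not the code from this post: the class name `XmlCharFilter` and its methods are hypothetical, and the allowed ranges are taken from the `Char` production of XML 1.0. It walks the string by code point, so it should behave the same on older Java versions where the regex syntax for supplementary ranges may not be available.

```java
public class XmlCharFilter {
    // True if the code point is allowed by the XML 1.0 "Char" production.
    public static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the input with every invalid code point removed.
    // Unpaired surrogates fall outside the allowed ranges and are dropped too.
    public static String stripInvalidXmlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: a NUL and a vertical tab embedded in extracted text.
        String dirty = "page\u0000 one\u000Btwo";
        System.out.println(stripInvalidXmlChars(dirty));
    }
}
```

If something like this were compiled into a jar, it could presumably be called from ColdFusion through `createObject("java", ...)` in the same way as `java.util.regex.Pattern`; that wiring is an assumption and is not shown here.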
Now I am getting the following error:
A decimal representation must immediately follow the "&#" in a character reference.
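For context, an XML parser reports this kind of error when it meets `&#` that is not followed by digits and a semicolon. The sketch below, in plain Java with the JDK's built-in parser, shows one way to check a fragment for that condition. The class name `CharRefCheck` and the sample strings are hypothetical, and whether this is the actual cause in the PDF text above is not established by this post.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class CharRefCheck {
    // Returns null if the XML is well-formed, otherwise the parser's error message.
    public static String parseError(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            // Well-formedness problems such as a broken character reference end up here.
            return e.getMessage();
        } catch (Exception e) {
            // Parser configuration or I/O trouble: not expected for an in-memory string.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "&#" followed by letters is not a valid character reference.
        System.out.println(parseError("<field>&#abc;</field>"));
        // The same text with the ampersand escaped parses cleanly.
        System.out.println(parseError("<field>&amp;#abc;</field>"));
    }
}
```

The exact wording of the message depends on the parser implementation, so the sketch only distinguishes "parses" from "does not parse".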
I used the example at the following link to extract the PDF content: https://github.com/cfjedimaster/getallthetexts/blob/master/test1.cfm
Can anyone please help me resolve this?
I am updating my question as Adam suggested. I now use OWASP to encode the text for XML.
I downloaded the latest version of the OWASP jar file from the following link: https://www.owasp.org/index.php/OWASP_Java_Encoder_Project
I load the jar file with JavaLoader. Here is the component, which has a function that encodes text for XML using OWASP:
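component {
    // Load the encoder jar that sits next to this component.
    public function init() {
        variables.javaloader = new javaloader.JavaLoader().init([getDirectoryFromPath(getCurrentTemplatePath()) & 'encoder.jar'],true);
        return this;
    }
    // Encode text so that it can be embedded in XML.
    public function parseTextForXML(required string inputText) {
        // var-scope the locals so they do not leak into the component's variables scope
        var esapi = variables.javaloader.create('org.owasp.esapi.ESAPI');
        var esapiEncoder = esapi.encoder();
        return esapiEncoder.encodeForXML(inputText);
    }
}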
Using this function together with CDATA fixed my problem. Here is the code:
<cfset gatt = new getallthetexts.textextractor()>
<cfset encoderObj = new encoder()>
<cfset result= gatt.read(pdfpath)>
<cfset content = encoderObj.parseTextForXML(result.text)>
<cfxml variable="xml">
<add>
<field name="content"><![CDATA[#content#]]></field>
</add>
</cfxml>
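Since the snippet above both encodes the text and wraps it in CDATA, it may be worth checking what a consumer of the XML actually receives. A minimal Java sketch using only the JDK parser follows. The class name `CdataCheck` and the sample text are hypothetical, and it illustrates general XML behaviour rather than what Solr did in this particular case. Inside a CDATA section entity references are not decoded, so text that was already encoded arrives with its `&amp;`-style sequences intact.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CdataCheck {
    // Parses a small XML document and returns the text content of its root element.
    public static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Plain element content: the entity is decoded, prints "Tom & Jerry".
        System.out.println(rootText("<field>Tom &amp; Jerry</field>"));
        // CDATA content: the entity is kept literally, prints "Tom &amp; Jerry".
        System.out.println(rootText("<field><![CDATA[Tom &amp; Jerry]]></field>"));
    }
}
```

If the indexed content turns out to contain literal entity sequences, using only one of the two mechanisms (encoding or CDATA) would be the thing to try.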