xml unicode solr coldfusion coldfusion-9

A decimal representation must immediately follow the "&#" in a character reference


I am extracting the text of PDF documents with Tika and sending it to Solr for indexing through an XML request in ColdFusion 9. Here is my code:

<cfset gatt = new getallthetexts.textextractor()>
<cfset result= gatt.read(pdfpath)>
<cfset content = xmlFormat(result.text)>
<!---escape unicode characters--->
<cfset p= createObject("java","java.util.regex.Pattern").compile("[^\\u0009\\u000A\\u000D\u0020-\\uD7FF\\uE000-\\uFFFD\\u10000-\\u10FFF]+")>
<cfset p.matcher(content).replaceAll("")>
<cfxml variable="xml">
 <add>
     <field name="content">#content#</field>
 </add>
</cfxml>

Now I am facing the following error:

A decimal representation must immediately follow the "&#" in a character reference.

I used the example at the following link to get the content of the PDF: https://github.com/cfjedimaster/getallthetexts/blob/master/test1.cfm

Can anyone please help me resolve this?


Solution

  • I am updating my answer as Adam suggested. I now use OWASP to encode the text for XML.

    I downloaded the latest version of the OWASP jar file from the following link: https://www.owasp.org/index.php/OWASP_Java_Encoder_Project

    I load the jar file using JavaLoader. Here is the component with the function that encodes text for XML using OWASP:

    component {
    
        public function init() {
            variables.javaloader = new javaloader.JavaLoader().init([getDirectoryFromPath(getCurrentTemplatePath()) & 'encoder.jar'],true);
            return this;
        }   
        public function parseTextForXML(required string inputText) {
            // var-scope the locals so they do not leak into the component's variables scope
            var esapi = variables.javaloader.create('org.owasp.esapi.ESAPI');
            var esapiEncoder = esapi.encoder();
            return esapiEncoder.encodeForXML(inputText);
        }
    
    }
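
To illustrate what encoding text for XML involves, here is a minimal sketch in plain Java, the language the component calls into. It is a hypothetical stand-in, not ESAPI's implementation: the names `XmlEscapeSketch` and `escapeXml` are made up for this example, and it only handles the five predefined XML entities.

```java
public class XmlEscapeSketch {

    /** Replaces the five characters that have predefined XML entities. */
    public static String escapeXml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A bare "&#" in the input is what an XML parser would otherwise
        // try to read as the start of a character reference.
        System.out.println(escapeXml("R&D <draft> &# notes"));
        // prints: R&amp;D &lt;draft&gt; &amp;# notes
    }
}
```

Text escaped this way is meant to sit directly in element content. Inside a CDATA section the parser does not expand entity references, so escaped text wrapped in CDATA reaches the consumer with the entities still spelled out.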
    

    Using this function together with CDATA fixed my problem. Here is the code:

    <cfset gatt = new getallthetexts.textextractor()>
    <cfset encoderObj = new encoder()>
    <cfset result= gatt.read(pdfpath)>
    <cfset content = encoderObj.parseTextForXML(result.text)>
    <cfxml variable="xml">
     <add>
         <field name="content"><![CDATA[#content#]]></field>
     </add>
    </cfxml>
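
For completeness, the character filtering attempted in the question can be sketched as follows. Three things stand out in the original snippet: the result of `replaceAll` is never assigned back to `content`, the `\u` regex escape takes exactly four hex digits (so `\u10000` does not mean U+10000), and the XML 1.0 upper bound is U+10FFFF rather than U+10FFF. The sketch below is plain Java and swaps the regular expression for a loop over code points, which avoids the escape syntax altogether. The names `XmlCharFilter`, `isXmlChar` and `strip` are hypothetical.

```java
public class XmlCharFilter {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of text with every disallowed code point dropped. */
    public static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);      // reads a surrogate pair as one code point
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);      // advance by 1 or 2 chars
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // NUL, vertical tab and U+FFFE are not legal in XML 1.0.
        System.out.println(strip("PDF\u0000 text\u000B with\uFFFE junk"));
        // prints: PDF text with junk
    }
}
```

Filtering like this only removes characters that no XML 1.0 document may contain. Markup characters such as `&` and `<` still need to be escaped or wrapped in CDATA afterwards.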