coldfusion canonicalization coldfusion-2018

Canonicalize() function converts chars to white space

I am using EncodeForHTML() to prevent Cross Site Scripting (XSS) attacks. As a result, a text field value such as:

step 1:   cost too much to keep. #3&#4 bad business decision

is stored in the database as :

step 2:   cost too much to keep. &#xd;&#xa;&#x23;3&amp;&#x23;4 bad business decision

Then I use canonicalize to get back the original string :

 #canonicalize(fieldName, false, false, true)#

which should return what was entered in step 1 above.

However, the &#4 is displayed as a whitespace character that looks almost like a square. This happens for any &# followed by a single digit.

This is ColdFusion 2018. Any ideas on how to get back the default #3&#4 ?

Solution

Okay, let's go through this:

Encoding `#3&#4` for HTML

# becomes `&#x23;` (hex entity)
3 becomes 3 (no encoding required)
& becomes `&amp;` (named entity)
# becomes `&#x23;` (hex entity)
4 becomes 4 (no encoding required)

Note: `&#xd;&#xa;` in your example is CarriageReturn and LineFeed, so there is a newline in front of #3&#4. We will ignore this for now.

Decoding `&#x23;3&amp;&#x23;4` for HTML

Regardless of whether you use decodeForHtml() or canonicalize():

`&#x23;` becomes #
3 becomes 3
`&amp;` becomes &
`&#x23;` becomes #
4 becomes 4

This is absolutely correct and there's no issue here. So...

Why am I seeing □?

It's simple: You are outputting the decoded value in HTML.

If you tell your browser to render #3&#4 as HTML, the browser will "smart-detect" an incomplete entity. Entities always start with &. This is why you are supposed to encode an actual ampersand as `&amp;`, so the browser recognizes it as a literal character. Nowadays most browsers automatically detect a single/standalone & and treat it as a literal ampersand. However, in your case, the & is followed by #4, so the browser assumes you meant to say `&#4;` (short for `&#04;` or `&#x04;`), which is the control character EOT (End of Transmission) and cannot be printed, resulting in a □.

The Solution

Whenever you want to display something in HTML, you have to encode the values. If you need to inspect a variable in ColdFusion, prefer <cfdump var="#value#"> (or writeDump(value)) over just outputting a value via <cfoutput>#value#</cfoutput> (or writeOutput(value)).
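
As a minimal sketch of that rule applied to the value from the question: decode the stored string once, then encode it again at the point of output. This assumes the stored string from step 2 without the leading newline; `storedValue` and `rawValue` are hypothetical variable names, not anything from your code.

```cfml
<!--- Hypothetical: the value as it was stored in the database.
      A literal # has to be escaped in ColdFusion by doubling it. --->
<cfset storedValue = "&##x23;3&amp;&##x23;4">

<!--- Decode once to get the raw text back --->
<cfset rawValue = decodeForHtml(storedValue)>

<!--- Encode again when writing into HTML, so the browser receives
      entities instead of a bare & followed by a number sign and a digit --->
<cfoutput>#encodeForHtml(rawValue)#</cfoutput>
```

With the output encoded, the browser receives `&amp;` rather than a bare `&`, so it has nothing left to misread as `&#4;`.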

Demo

https://cffiddle.org/app/file?filepath=6926a59a-f639-4100-b802-07a17ff79c53/5d545e2c-01a4-4c13-9f50-eb15777fba8c/6307a84e-89a3-411d-874f-7d32bd9a9874.cfm

<cfset charsToEncode = [
    "##", <!--- we have to escape # in ColdFusion by doubling it --->
    "3",
    "&",
    "##", <!--- we have to escape # in ColdFusion by doubling it --->
    "4"
]>

<h2>encodeForHtml</h2>
<cfloop array="#charsToEncode#" index="char">
    <cfdump var="#encodeForHtml(char)#"><br>
</cfloop>

<cfset charsToDecode = [
    "&##x23;", <!--- we have to escape # in ColdFusion by doubling it --->
    "3",
    "&amp;",
    "&##x23;", <!--- we have to escape # in ColdFusion by doubling it --->
    "4"
]>

<h2>decodeForHtml</h2>
<cfloop array="#charsToDecode#" index="char">
    <cfdump var="#decodeForHtml(char)#"><br>
</cfloop>

<h2>canonicalize</h2>
<cfloop array="#charsToDecode#" index="char">
    <cfdump var="#canonicalize(char, false, false)#"><br>
</cfloop>

<h2>encoding the output PROPERLY</h2>
<cfoutput>#encodeForHtml("##3&##4")#</cfoutput><br>
<cfoutput>#encodeForHtml(decodeForHtml("&##x23;3&amp;&##x23;4"))#</cfoutput><br>
Note: due to the mix of entities, canonicalize() has to guess the begin/end of each entity and is having issues with the ampersand here:<br>
<cfoutput>#encodeForHtml(canonicalize("&##x23;3&##x26;&##x23;4", false, false))#</cfoutput><br>

<h2>encoding the output INCORRECTLY</h2>
#3&#4<br>
<cfoutput>#decodeForHtml("&##x23;3&amp;&##x23;4")#</cfoutput><br>
<cfoutput>#canonicalize("&##x23;3&amp;&##x23;4", false, false)#</cfoutput><br>