
Unescape a twice-escaped title in RSS


I received an RSS feed with strangely escaped titles, for example:

<title>S&amp;amp;P 500 : Wall Street amorce un rebond, Binance fait l&amp;apos;objet d&amp;apos;une enquête de la SEC</title>

The whole feed: https://www.dailyfx.com/francais/feeds/actualites-marches-financiers

The Opera browser displays such titles correctly, as follows:

S&P 500 : Wall Street amorce un rebond, Binance fait l'objet d'une enquête de la SEC

How can I correctly unescape titles both in the normal case, where they are escaped once, and in the case above, where they are escaped twice?


Solution

  • The sequence &amp; encodes an & sign. But if the content is itself meant to be HTML, for example, it may contain further HTML escape sequences.

    For example, if the text to display contains &, in HTML it is encoded as &amp;. If you then insert this text into an XML document, its first character, &, must be escaped as well, which results in &amp;amp;.

    To get the human-readable text, parse the XML first and then decode the result as HTML. For the second step you may use html.UnescapeString().

    For example:

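    package main

    import (
        "encoding/xml"
        "fmt"
        "html"
    )

    func main() {
        const src = `<title>S&amp;amp;P 500 : Wall Street amorce un rebond, Binance fait l&amp;apos;objet d&amp;apos;une enquête de la SEC</title>`

        // First pass: the XML decoder removes the outer (XML) layer of escaping.
        var s string
        if err := xml.Unmarshal([]byte(src), &s); err != nil {
            panic(err)
        }
        fmt.Println(s)

        // Second pass: decode the remaining HTML escape sequences.
        s = html.UnescapeString(s)
        fmt.Println(s)
    }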

    This will output:

    S&amp;P 500 : Wall Street amorce un rebond, Binance fait l&apos;objet d&apos;une enquête de la SEC
    S&P 500 : Wall Street amorce un rebond, Binance fait l'objet d'une enquête de la SEC