Tags: asp.net, sql-server, regex, xml-parsing, sql-server-2014

How to convert all HTML img close tags to be XML-compliant? (<img> to <img/>)


The Task: We're scraping HTML for content via HttpWebRequest (some 6,000 calls). Each response is trimmed and stored as a string in a SQL Server 2014 database for processing as XML.

The Problem: In SQL Server, we get an XML parsing error, "...end tag does not match start tag", because the img tags are never closed.

Now, I have a rather inelegant and potentially flawed solution in SQL Server.

Sample String

<div someattr="aaa">
    <div class="bbb">Some Text</div>
    <img src="image.jpg" width="150">      <-- Notice the lack of />
</div>

Desired Results

<div someattr="aaa">
    <div class="bbb">Some Text</div>
    <img src="image.jpg" width="150"/>      <-- Notice the />
</div>

I've tried countless Regex combinations in ASP.Net, and I seem to do more harm than good. Any guidance or direction would be appreciated.

Respectfully,

John


Solution

  • I'd suggest using an HTML parser and storing the data in a better form than a plain string. But if you're going for a quick and dirty solution with a regular expression, this might help you:

    Look for this regex:

    (<img[^>]*?[^\/]\s*)(>)
    

    And replace it with:

    $1/$2
    
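    • [^>]*? matches any characters except >, but as few as possible (lazy matching).
    • [^\/]\s* requires a character other than a slash /, optionally followed by whitespace, directly before the >, so tags that already end in /> are skipped. Whitespace itself counts as a non-slash character, so a tag written as / > (slash, space, >) still matches and gets a second slash.
    • The two parts are captured as groups $1 and $2, and the replacement inserts a slash between them. The pattern only matches img tags that do not already end in a slash.
    • It won't work if an attribute value contains a > character, or if the tag is never closed at all, e.g. <img title="". A bare <img> with no attributes is also skipped, because the pattern needs at least one character after <img.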
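
    The question is about ASP.Net, but as a rough illustration of the search and replace described above, here is a minimal sketch in Python. The function name close_img_tags is made up for this example. Python's re module writes the backreferences as \1 and \2 instead of $1 and $2. In .NET the pattern and the $1/$2 replacement should carry over as they are, but I have only checked the Python version. The last line parses the result with an XML parser as a sanity check.

```python
import re
import xml.etree.ElementTree as ET

# Pattern from the answer; Python's re module accepts it unchanged.
IMG_OPEN_TAG = re.compile(r'(<img[^>]*?[^\/]\s*)(>)')


def close_img_tags(html: str) -> str:
    """Insert a slash before the closing > of img tags that lack one.

    close_img_tags is a hypothetical helper name, not an existing API.
    """
    # \1 and \2 are Python's spelling of the $1 and $2 backreferences.
    return IMG_OPEN_TAG.sub(r'\1/\2', html)


sample = '''<div someattr="aaa">
    <div class="bbb">Some Text</div>
    <img src="image.jpg" width="150">
</div>'''

fixed = close_img_tags(sample)
print(fixed)

# The original sample is rejected by an XML parser (mismatched tag);
# the fixed string parses without raising.
ET.fromstring(fixed)
```

    Running the function on its own output leaves the string unchanged, so it should be safe to apply to rows that have already been fixed.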

    Here is a live example: https://regex101.com/r/HIxIIR/1
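
    For the parser route recommended at the start of this answer, the following is only a sketch of the idea, again in Python and using just the standard library's html.parser. The names XmlSerializer and html_to_xml_string are invented for this example. It re-emits every tag it sees and writes void elements such as img in self-closing form. Assumptions: comments, doctypes and processing instructions are dropped, attributes without a value get their own name as the value, and other unbalanced tags (an unclosed <p>, for instance) are not repaired.

```python
from html import escape
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# The usual list of HTML void elements, which never take an end tag.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class XmlSerializer(HTMLParser):
    """Hypothetical helper: re-serializes HTML with void elements self-closed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def _open_tag(self, tag, attrs, self_close):
        # Attribute values arrive unescaped, so escape them again for XML.
        rendered = "".join(
            f' {name}="{escape(value if value is not None else name, quote=True)}"'
            for name, value in attrs
        )
        self.parts.append(f"<{tag}{rendered}{'/' if self_close else ''}>")

    def handle_starttag(self, tag, attrs):
        self._open_tag(tag, attrs, tag in VOID_ELEMENTS)

    def handle_startendtag(self, tag, attrs):
        # Tags already written as <img ... /> stay self-closed.
        self._open_tag(tag, attrs, True)

    def handle_endtag(self, tag):
        # A stray </img> would break the XML, so end tags of void elements are skipped.
        if tag not in VOID_ELEMENTS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives with character references decoded; escape &, < and > again.
        self.parts.append(escape(data, quote=False))


def html_to_xml_string(html: str) -> str:
    parser = XmlSerializer()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


sample = '''<div someattr="aaa">
    <div class="bbb">Some Text</div>
    <img src="image.jpg" width="150">
</div>'''

result = html_to_xml_string(sample)
print(result)
ET.fromstring(result)  # parses without raising
```

    Unlike the regex, this approach is not thrown off by a > inside a quoted attribute value, because the parser tracks the quotes.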