The Task: We're scraping HTML for content via HttpWebRequest (some 6,000 calls). Each response string is trimmed and stored in a SQL Server 2014 database for processing as XML.
The Problem: In SQL Server, we'll get an XML parsing error: "...end tag does not match start tag" due to the image tags.
Now, I have a rather inelegant and potentially flawed solution in SQL Server.
Sample String
<div someattr="aaa">
<div class="bbb">Some Text</div>
<img src="image.jpg" width="150"> <-- Notice the lack of />
</div>
Desired Results
<div someattr="aaa">
<div class="bbb">Some Text</div>
<img src="image.jpg" width="150"/> <-- Notice the />
</div>
I've tried countless Regex combinations in ASP.Net, and I seem to do more harm than good. Any guidance or direction would be appreciated.
Respectfully,
John
I'd suggest using an HTML parser and storing the data in a better way than just a string. But if you're going for a quick and dirty solution with a regular expression, this might help you:
Look for this regex:
(<img[^>]*?[^\/]\s*)(>)
And replace it with:
$1/$2
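To try the search-and-replace above outside of regex101, here is a minimal sketch in Python (chosen only for illustration; the question itself is about ASP.Net). The function name self_close_img_tags is made up for this example. Python writes the replacement as \1/\2, while .NET's Regex.Replace accepts $1/$2 as written above.

```python
import re

# Pattern from the answer: an <img ...> tag whose closing > is not
# directly preceded by a slash.
IMG_NOT_SELF_CLOSED = re.compile(r"(<img[^>]*?[^\/]\s*)(>)")


def self_close_img_tags(html_text):
    """Insert a slash before the closing > of img tags that lack one."""
    # \1/\2 is Python's spelling of the $1/$2 replacement.
    return IMG_NOT_SELF_CLOSED.sub(r"\1/\2", html_text)


sample = '<img src="image.jpg" width="150">'
print(self_close_img_tags(sample))                # <img src="image.jpg" width="150"/>
print(self_close_img_tags('<img src="a.jpg"/>'))  # already closed, left unchanged
```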
[^>]*? looks for any character except >, but as few as possible.
[^\/]\s* requires a character other than a slash /, optionally followed by white space, directly before the closing >.
The replacement inserts a slash between $1 and $2. It will only match if there is not already a slash and if it is an img tag.
This will fail if there is a > character as a string inside the <img ...> tag, or if the tag is not closed at all, as in <img title="".
Here is a live example: https://regex101.com/r/HIxIIR/1
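For the parser route recommended at the start of this answer, a rough sketch could look like the following. It uses Python's standard html.parser module purely as an illustration; the names VoidTagCloser and close_void_tags are invented here, and an ASP.Net project would need an HTML parsing library of its own. The sketch assumes comments and doctype declarations can be dropped, and it writes every void element (img, br, hr, and so on) in self-closed form.

```python
import xml.etree.ElementTree as ET
from html import escape
from html.parser import HTMLParser

# HTML void elements never have an end tag, so XML needs them self-closed.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class VoidTagCloser(HTMLParser):
    """Re-emit parsed HTML, writing void elements as <tag .../>."""

    def __init__(self):
        super().__init__()  # character references are decoded by default
        self.parts = []

    def _format_attrs(self, attrs):
        out = []
        for name, value in attrs:
            # XML requires a value; a bare attribute becomes name="name".
            value = name if value is None else value
            out.append(f' {name}="{escape(value, quote=True)}"')
        return "".join(out)

    def handle_starttag(self, tag, attrs):
        closer = "/>" if tag in VOID_TAGS else ">"
        self.parts.append(f"<{tag}{self._format_attrs(attrs)}{closer}")

    def handle_startendtag(self, tag, attrs):
        # Tags already written as <tag ... /> stay self-closed.
        self.parts.append(f"<{tag}{self._format_attrs(attrs)}/>")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:  # drop stray </img> and similar
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Re-escape text so &, < and > are safe inside XML.
        self.parts.append(escape(data, quote=False))


def close_void_tags(html_text):
    parser = VoidTagCloser()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


sample = ('<div someattr="aaa">\n<div class="bbb">Some Text</div>\n'
          '<img src="image.jpg" width="150">\n</div>')
fixed = close_void_tags(sample)
ET.fromstring(fixed)  # raises ParseError if the result is not well-formed
print(fixed)
```

Because the tags are tokenized by a parser rather than matched by a pattern, this does not depend on what the character before > happens to be. It still does not repair markup that is broken in other ways, such as unbalanced div tags.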