Tags: html, regex, string, powershell

PowerShell: remove HTML tags from string content


I have a large HTML data string separated into small chunks. I am trying to write a PowerShell script to remove all the HTML tags, but am finding it difficult to find the right regex pattern.

Example String:

<p>This is an example<br />of various <span style="color: #445444">html content</span>

I have tried using:

$string -replace '\<([^\)]+)\>',''

It works with simple examples, but with ones such as the above it matches the whole string.

Any suggestions on the best way to achieve this?


Solution

  • For a pure regex, it should be as easy as <[^>]+>:

    $string -replace '<[^>]+>',''
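
As a quick check, here is a minimal sketch that runs both patterns against the example string from the question. The variable names `$attempt` and `$stripped` are only for illustration.

```powershell
# Example string from the question
$string = '<p>This is an example<br />of various <span style="color: #445444">html content</span>'

# Original attempt: [^\)]+ matches any character except ')', including '>',
# so the greedy match runs from the first '<' to the last '>'
$attempt = $string -replace '\<([^\)]+)\>',''

# Suggested pattern: [^>]+ cannot cross a '>', so each tag is matched separately
$stripped = $string -replace '<[^>]+>',''

$attempt   # empty string
$stripped  # This is an exampleof various html content
```

Note that `<br />` is removed without leaving a space, so "example" and "of" run together. If that matters, replacing tags with a single space instead of an empty string is one option.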
    

    Note that this can fail with certain HTML comments (for example, ones containing a > character) or with the contents of <pre> tags.
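
To illustrate the comment case, here is a small sketch with a made-up input (the string in `$withComment` is hypothetical, not from the question):

```powershell
# Hypothetical input: a comment whose text contains '>'
$withComment = '<!-- if a > b --><b>text</b>'

# The first match ends at the '>' inside the comment, so ' b -->' is left behind
$result = $withComment -replace '<[^>]+>',''

$result  # " b -->text"
```

The `<b>` and `</b>` tags are removed as expected, but the tail of the comment survives because the pattern has no notion of where a comment really ends.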

    Instead, you could use the HTML Agility Pack, which is designed for use in .NET code. I've used it successfully in PowerShell before:

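    # Load the HTML Agility Pack assembly (adjust the path to your installed copy)
    Add-Type -Path 'C:\packages\HtmlAgilityPack.1.4.6\lib\Net40-client\HtmlAgilityPack.dll'
    
    # Parse the markup, then read the text content with the tags dropped
    $doc = New-Object HtmlAgilityPack.HtmlDocument
    $doc.LoadHtml($string)
    $doc.DocumentNode.InnerText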
    

    The HTML Agility Pack also copes well with malformed HTML.
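
Since the question mentions that the HTML arrives in small chunks, one way to apply the regex approach is to wrap it in a pipeline function. This is only a sketch: `Remove-HtmlTag` is a hypothetical helper name, the sample chunks are made up, and it assumes PowerShell 3.0 or later for the parameter attribute syntax.

```powershell
# Remove-HtmlTag is a hypothetical helper name, not a built-in cmdlet
function Remove-HtmlTag {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [AllowEmptyString()]
        [string]$Html
    )
    process {
        # Strip anything that looks like a tag from the current chunk
        $Html -replace '<[^>]+>',''
    }
}

# Hypothetical chunks standing in for the real data
$chunks = '<p>first chunk</p>', '<span style="color: #445444">second</span> chunk'
$clean = $chunks | Remove-HtmlTag
$clean
```

Each chunk is processed on its own, so a tag that is split across two chunks would not be matched. If that can happen in your data, joining the chunks before stripping is the safer route.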