
Extract URL from text file


I have a large text file that contains the text "View this email in your browser" followed by a URL. The URL varies, and sometimes part of it wraps onto the next line.

When the URL does wrap onto the next line, there is an equals sign at the end of the first line. That equals sign needs to be removed, but any other equals signs in the URL must be kept.

A few examples:

View this email in your browser (https://us15.campaign-archive.com/?e=3D1460&u=3Df6e2bb1612577510b&id=3D2c8be)

View this email in your browser <https://mail.com/?e=3D14=
60&u=3Df612577510b&id=3D2c8be>

View this email in your browser (https://eg.com/?e=3D1460&u=3Df6510b&id=3D2c8be)

I need to extract that URL using PowerShell, without the enclosing brackets (usually parentheses, but sometimes < >), so that I can download it as an HTML file. This is what I have tried so far:

 # Case 1: the URL is enclosed in (...)
 if ($str -match '(?<=\()https?://[^)]+') {
   # Remove any line breaks from the match, and output the result.
   $Matches[0] -replace '\r?\n'
 }

 # Case 2: the URL is enclosed in <...>
 if ($str -match '(?<=\<)https?://[^>]+') {
   # Remove any line breaks from the match, and output the result.
   $Matches[0] -replace '\r?\n'
 }

Solution

    • Since you're trying to match across lines, you need to make sure that your text file is read as a whole, i.e. as a single, multiline string, which you can do with the -Raw switch of the Get-Content cmdlet.

    • Apart from that, the only thing missing was in your -replace operation: its regex must also match and remove the = that precedes each newline.
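To illustrate the first point, here is a minimal sketch of how Get-Content behaves with and without -Raw. The file name raw-demo.txt and its location in the temp directory are assumptions made for this demo only; they are not part of the question.

```powershell
# Hypothetical sample file, written to the temp directory for this demo only.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'raw-demo.txt'
Set-Content -Path $path -Value @(
  'View this email in your browser <https://mail.com/?e=3D14='
  '60&u=3Df612577510b&id=3D2c8be>'
)

# Default: one string per line, so no single string contains the whole URL.
$lines = Get-Content -Path $path
$lines.Count            # 2

# With -Raw: one multiline string, so a regex can match across the line break.
$text = Get-Content -Path $path -Raw
$text -is [string]      # True
$text -match '=\r?\n'   # True: the line-ending '=' and the newline are both present

# Clean up the demo file.
Remove-Item -Path $path
```

Without -Raw, a regex applied to each line separately would only ever see half of the wrapped URL.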

    The following extracts all URLs from input file file.txt, and outputs them - with the newline and line-ending = removed - as an array of strings:

    # Note the '=' before '\r?\n'
    [regex]::Matches(
      (Get-Content -Raw file.txt),
      '(?<=[<(])https://[^>)]+'
    ).Value -replace '=\r?\n'
    
    • Direct use of the [regex]::Matches() .NET API allows you to extract all matches at once, whereas PowerShell's -match operator only ever looks for one match.

    • -replace is then used to remove newlines (\r?\n) from the matches, along with a preceding =.
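The difference between the two matching approaches can be seen in a small sketch. The sample text and the example hosts one.example and two.example are made up for illustration.

```powershell
# Made-up sample text containing two bracketed URLs.
$text    = 'a (https://one.example/x) b <https://two.example/y>'
$pattern = '(?<=[<(])https://[^>)]+'

# -match stops at the first match and reports it in $Matches.
if ($text -match $pattern) { $first = $Matches[0] }

# [regex]::Matches() returns every match; .Value collects the matched strings.
$all = [regex]::Matches($text, $pattern).Value

$first       # https://one.example/x
$all.Count   # 2
```

Because -replace also operates on each element of an array, it can be applied directly to the collected values.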

    For an explanation of the URL-matching regex and the ability to experiment with it, see this regex101.com page.


    Example with a multiline string literal:

    [regex]::Matches('
    View this email in your browser (https://us15.campaign-archive.com/?e=3D1460&u=3Df6e2bb1612577510b&id=3D2c8be)
    
    View this email in your browser <https://mail.com/?e=3D14=
    60&u=3Df612577510b&id=3D2c8be>
    
    View this email in your browser (https://eg.com/?e=3D1460&u=3Df6510b&id=3D2c8be)
      ',
      '(?<=[<(])https://[^>)]+'
    ).Value -replace '=\r?\n'
    

    Output:

    https://us15.campaign-archive.com/?e=3D1460&u=3Df6e2bb1612577510b&id=3D2c8be
    https://mail.com/?e=3D1460&u=3Df612577510b&id=3D2c8be
    https://eg.com/?e=3D1460&u=3Df6510b&id=3D2c8be
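Since the stated goal is to download each page as an HTML file, the following is a rough sketch of one possible next step. It rests on an assumption that the question does not confirm: that the text is quoted-printable encoded, in which case =3D stands for a literal = and should be decoded before the URL is requested. The function name Save-EmailPage and the page1.html, page2.html file naming are hypothetical.

```powershell
# Sample input taken from the question, with the wrapped URL.
$text = "View this email in your browser <https://mail.com/?e=3D14=`n60&u=3Df612577510b&id=3D2c8be>"

# Step 1: extract the URLs and remove the line-ending '=' plus newline.
$urls = @([regex]::Matches($text, '(?<=[<(])https://[^>)]+').Value -replace '=\r?\n')

# Step 2 (assumption: quoted-printable input): decode '=3D' to '='.
# This handles only the '=3D' sequence, not quoted-printable in general.
$decoded = @($urls -replace '=3D', '=')
$decoded   # https://mail.com/?e=1460&u=f612577510b&id=2c8be

# Hypothetical helper: saves each URL as page1.html, page2.html, ...
function Save-EmailPage {
  param(
    [string[]] $Url,
    [string]   $Directory = '.'
  )
  $i = 0
  foreach ($u in $Url) {
    $i++
    Invoke-WebRequest -Uri $u -OutFile (Join-Path $Directory "page$i.html")
  }
}

# Usage (requires network access, so it is not run here):
# Save-EmailPage -Url $decoded
```

If the =3D sequences turn out to be a genuine part of the URLs, skip step 2 and pass $urls to the helper instead.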