
Extract URL from text file next to certain string


I have a large text file that contains something like:

View this email in your browser (https://us15.campaign-archive.com/?e=3D1460&u=3Df6e2bb1612577510b&id=3D2c8be)

Sometimes, part of the URL goes onto the next line.

I simply need to extract that URL using PowerShell, without the brackets (parentheses), so that I can download it as an HTML file.

I've tried doing this in batch, which I'm most familiar with, but it's proving impossible; it seems this should be possible in PowerShell.


Solution

  • The following uses regex-based operators and .NET APIs.

    In both solutions, the -replace operator with the regex \r?\n removes any embedded newlines (line breaks) from the URL(s) found. \r?\n matches both Windows-format CRLF and Unix-format LF-only newlines, and because no replacement operand is given, each match is replaced with the empty string.

    # Sample multi-line input string.
    # To read such a string from a file, use, e.g.:
    #     $str = Get-Content -Raw file.txt
    $str = @'
      Initial text.
    
      View this email in your browser (https://us15.campaign-archive.com/?e=3D1460&u=3Df6e2b
    b1612577510b&id=3D2c8be)
    
      More text.
    '@
    
    # Find the (first) embedded URL...
    if ($str -match '(?<=\()https?://[^)]+') {
      # ... remove any line breaks from it, and output the result.
      $Matches.0 -replace '\r?\n'
    }
    
    # Extract *all* URLs and remove any embedded line breaks from each
    [regex]::Matches(
      $str, 
      '(?<=\()https?://[^)]+'
    ).Value -replace '\r?\n'
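
    The two commands above could be wrapped in a small reusable function, sketched below. The function name Get-ParenthesizedUrl, the sample text, and the file names file.txt and email.html are illustrative assumptions, not part of the original answer. The download step is left commented out because it needs network access.

    ```powershell
    # Hypothetical helper (the name is an assumption): returns every http(s) URL
    # that directly follows "(" in $Text, with embedded line breaks removed.
    function Get-ParenthesizedUrl {
      param(
        [Parameter(Mandatory)]
        [string] $Text
      )
      foreach ($m in [regex]::Matches($Text, '(?<=\()https?://[^)]+')) {
        $m.Value -replace '\r?\n'
      }
    }

    # Made-up sample text whose URL wraps onto a second line.
    $sample = "See (https://example.org/?a=1&b=`n2) for details."
    $urls = @(Get-ParenthesizedUrl -Text $sample)
    $urls   # the URL, with the line break removed

    # With a real file and network access (file names are assumptions):
    #   $urls = @(Get-ParenthesizedUrl -Text (Get-Content -Raw file.txt))
    #   Invoke-WebRequest -Uri $urls[0] -OutFile email.html
    ```

    Note that the =3D sequences in the question's sample URL look like the quoted-printable encoding of =. If that is the case in your file, an additional -replace '=3D', '=' may be needed before the URL can be downloaded.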
    

    For an explanation of the first regex and the ability to experiment with it, see this regex101.com page.
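
    As a rough breakdown, the regex can also be assembled from its three parts, each annotated with what it matches. The variable names and the test string below are illustrative only.

    ```powershell
    # Build the same pattern piece by piece; -join concatenates the parts.
    $pattern = -join @(
      '(?<=\()'    # lookbehind: match must directly follow a literal "(", which is not included
      'https?://'  # "http://" or "https://"
      '[^)]+'      # one or more characters other than ")", which may include line breaks
    )

    # Made-up test string with a URL that wraps onto a second line.
    $text = "Open (http://example.org/a`nb) now"
    if ($text -match $pattern) {
      # $Matches[0] holds the whole match; strip the embedded line break.
      $url = $Matches[0] -replace '\r?\n'
      $url
    }
    ```

    Because [^)]+ also matches newline characters, a URL that wraps onto the next line is still captured in full, which is why the line breaks have to be stripped afterwards.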