I have a large text file that contains something like:
View this email in your browser (https://us15.campaign-archive.com/?e=3D1460&u=3Df6e2bb1612577510b&id=3D2c8be)
Sometimes, part of the URL goes onto the next line.
I simply need to extract that URL using PowerShell, without the brackets (parentheses), so that I can download it as an HTML file.
I've tried doing this in batch, which I'm most familiar with, but it's proving impossible; it seems this would be possible in PowerShell.
The following uses regex-based operators and .NET APIs.
In both solutions, -replace '\r?\n' is used to remove any embedded newlines (line breaks) from the URL(s) found, using the -replace operator (\r?\n is a regex that matches both Windows-format CRLF and Unix-format LF-only newlines).
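As a quick illustration of that newline removal, here is a minimal sketch; the variable names and the sample value are made up for the demonstration:

```powershell
# Hypothetical sample: a URL broken across two lines by a CRLF.
$broken = "https://example.org/?a=1&b`r`n=2"

# With no replacement operand, -replace substitutes the empty string,
# so every match of the regex is simply removed.
$joined = $broken -replace '\r?\n'

$joined  # https://example.org/?a=1&b=2
```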
To extract a single URL, use the -match operator, which - if it returns $true - reports what was matched in the automatic $Matches variable:
# Sample multi-line input string.
# To read such a string from a file, use, e.g.:
# $str = Get-Content -Raw file.txt
$str = @'
Initial text.
View this email in your browser (https://us15.campaign-archive.com/?e=3D1460&u=3Df6e2b
b1612577510b&id=3D2c8be)
More text.
'@
# Find the (first) embedded URL...
if ($str -match '(?<=\()https?://[^)]+') {
  # ... remove any line breaks from it, and output the result.
  $Matches.0 -replace '\r?\n'
}
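To see what each part of that regex contributes, the following sketch runs it against a made-up one-line-break input (the $text value is an assumption for illustration only) and annotates the pieces:

```powershell
# Hypothetical input: a parenthesized URL with an LF in the middle.
$text = "before (http://example.org/a`nb) after"

# (?<=\()   : lookbehind; requires a "(" right before the match, without including it
# https?:// : "http://" or "https://"
# [^)]+     : one or more characters other than ")", which may include newlines
if ($text -match '(?<=\()https?://[^)]+') {
    $Matches[0]   # the matched text, without parentheses, line break still inside
}
```

Because [^)]+ stops at the first closing parenthesis, the match spans the line break, which is why the separate -replace step is still needed afterwards.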
To extract all URLs, direct use of the System.Text.RegularExpressions.Regex.Matches .NET API is required:
# Extract *all* URLs and remove any embedded line breaks from each
[regex]::Matches(
  $str,
  '(?<=\()https?://[^)]+'
).Value -replace '\r?\n'
For an explanation of the first regex and the ability to experiment with it, see this regex101.com page.
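Since the stated goal is to download the page, here is one possible sketch that combines the extraction with Invoke-WebRequest. The function names Get-EmbeddedUrl and Save-EmbeddedPage are hypothetical helpers, not part of the solutions above. The =3D handling rests on an assumption: that the file is quoted-printable email source, where =3D encodes a literal = character; drop that step if it does not apply to your file.

```powershell
# Hypothetical helper: emits every parenthesized http(s) URL found in $Text,
# with embedded line breaks removed.
function Get-EmbeddedUrl {
    param([Parameter(Mandatory)] [string] $Text)
    foreach ($m in [regex]::Matches($Text, '(?<=\()https?://[^)]+')) {
        $url = $m.Value -replace '\r?\n'
        # Assumption: the text is quoted-printable email source, in which
        # "=3D" encodes a literal "=". Remove this line if that is not the case.
        $url -replace '=3D', '='
    }
}

# Hypothetical helper: downloads the first embedded URL to an HTML file.
function Save-EmbeddedPage {
    param(
        [Parameter(Mandatory)] [string] $Text,
        [Parameter(Mandatory)] [string] $OutFile
    )
    $url = Get-EmbeddedUrl $Text | Select-Object -First 1
    if (-not $url) { throw 'No embedded URL found.' }
    Invoke-WebRequest -Uri $url -OutFile $OutFile
}

# Offline check with a made-up input; no request is sent here.
$sample = "See (https://example.org/?a=3D1&b`r`n=3D2) for details."
Get-EmbeddedUrl $sample   # https://example.org/?a=1&b=2
```

A call such as Save-EmbeddedPage -Text (Get-Content -Raw file.txt) -OutFile page.html would then perform the actual download; file.txt and page.html are placeholder names.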