Extract URLs from webpages

I want to extract URLs from a webpage that contains multiple URLs and save the extracted URLs to a text file.

The URLs in the webpage start with '127.0.0.1', but I want to remove the '127.0.0.1' prefix and extract only the URLs. When I run the PowerShell script below, it only saves '127.0.0.1'. Any help fixing this would be appreciated.

    $threatFeedUrl = " versions Anti-Malware List/AntiMalwareHosts.txt"
    # Download the threat feed data
    $threatFeedData = Invoke-WebRequest -Uri $threatFeedUrl
    # Define a regular expression pattern to match URLs starting with '127.0.0.1'
    $pattern = '127\.0\.0\.1(?:[^\s]*)'
    # Use the regular expression to find matches in the threat feed data
    $matches = [regex]::Matches($threatFeedData.Content, $pattern)
    # Create a list to store the matched URLs
    $urlList = @()
    # Populate the list with matched URLs
    foreach ($match in $matches) {
        $urlList += $match.Value
    }
    # Specify the output file path
    $outputFilePath = "output.txt"
    # Save the URLs to the output file
    $urlList | Out-File -FilePath $outputFilePath
    Write-Host "URLs starting with '127.0.0.1' extracted from threat feed have been saved to $outputFilePath."


  • Preface:

    • The target URL happens to be a (semi-structured) plain-text resource, so regex-based processing is appropriate.

    • In general, however, with HTML content, using a dedicated parser is preferable, given that regexes aren't capable of parsing HTML robustly.[1] See this answer for an example of extracting links from an HTML document.

    • You're mistakenly using a non-capturing group ((?:…)) rather than a capturing one ((…)).

    • In the downloaded content, there is a space after 127.0.0.1, which your pattern doesn't account for.

    • Therefore, use the following regex instead (\S is the simpler equivalent of [^\s], and + ensures that only a non-empty run of non-whitespace characters is matched):

      '127\.0\.0\.1 (\S+)'
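
To see the difference between the two patterns, both can be run against a single line in hosts-file format. This is only a sketch: the sample line and the host name `ads.example.com` are made up for illustration and are not taken from the feed.

```powershell
# Hypothetical sample line in hosts-file format; not taken from the actual feed.
$sampleLine = '127.0.0.1 ads.example.com'

# Original pattern: [^\s]* cannot cross the space, so only the IP address is matched.
$oldMatch = [regex]::Match($sampleLine, '127\.0\.0\.1(?:[^\s]*)')

# Revised pattern: the literal space is matched, then (\S+) captures the host name.
$newMatch = [regex]::Match($sampleLine, '127\.0\.0\.1 (\S+)')

$oldMatch.Value   # 127.0.0.1
$newMatch.Value   # 127.0.0.1 ads.example.com
```
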
    $matches = …
    • While it technically doesn't cause a problem here, $matches is also the name of the automatic $Matches variable (PowerShell variable names are case-insensitive), and therefore shouldn't be used for custom purposes.
    • $match.Value is the whole text that your regex matched, whereas you only want the text of the capture group.

    • Use $match.Groups[1].Value instead.
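
The naming and capture-group points can be sketched together as follows. The sample text and host names are assumptions for illustration, and `$matchList` and `$firstHost` are just example variable names.

```powershell
# Made-up sample text in hosts-file format; the host names are illustrative only.
$sampleText = "127.0.0.1 ads.example.com`n127.0.0.1 tracker.example.net"

# A custom name such as $matchList avoids reusing the automatic $Matches variable.
$matchList = [regex]::Matches($sampleText, '127\.0\.0\.1 (\S+)')

$matchList[0].Value             # whole match: 127.0.0.1 ads.example.com
$matchList[0].Groups[1].Value   # capture group only: ads.example.com

# For comparison, the -match operator fills the automatic $Matches variable itself.
if ($sampleText -match '127\.0\.0\.1 (\S+)') {
    $firstHost = $Matches[1]    # ads.example.com
}
```
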

    $urlList += 
    • Building an array iteratively with += is inefficient, because a new array must be allocated behind the scenes in every iteration. Instead, use the foreach statement as an expression and let PowerShell collect the results for you. See this answer for more information.
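
A minimal side-by-side sketch of the two approaches, using made-up values in place of the regex matches (`$items`, `$slowList` and `$fastList` are hypothetical names):

```powershell
# Made-up sample values standing in for regex matches.
$items = 'ads.example.com', 'tracker.example.net', 'malware.example.org'

# Inefficient: += allocates a new, larger array on every iteration.
$slowList = @()
foreach ($item in $items) {
    $slowList += $item.ToUpper()
}

# Preferred: assign the foreach statement's output directly; PowerShell collects it.
$fastList = foreach ($item in $items) {
    $item.ToUpper()
}
```

Both variables end up with the same three elements. If the loop can emit zero or one item, wrapping the statement in @(...) ensures the result is still an array.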
    Invoke-WebRequest -Uri $threatFeedUrl
    • Since you're only interested in the text content of the response, it is simpler to use Invoke-RestMethod rather than Invoke-WebRequest; the former returns the content directly (no need to access a .Content property).

    To put it all together:

    $threatFeedUrl = ' versions Anti-Malware List/AntiMalwareHosts.txt'
    # Download the threat feed data
    $threatFeedData = Invoke-RestMethod -Uri $threatFeedUrl
    # Define a regular expression pattern to match URLs preceded by '127.0.0.1 '
    $pattern = '127\.0\.0\.1 (\S+)'
    # Use the regular expression to find matches in the threat feed data
    $matchList = [regex]::Matches($threatFeedData, $pattern)
    # Create and populate the list with matched URLs
    $urlList = 
      foreach ($match in $matchList) {
        # Output only the text of the capture group
        $match.Groups[1].Value
      }
    # Specify the output file path
    $outputFilePath = 'output.txt'
    # Save the URLs to the output file
    $urlList | Out-File -FilePath $outputFilePath
    Write-Host "URLs starting with '127.0.0.1' extracted from threat feed have been saved to $outputFilePath."
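
To try the extraction logic without network access, the same steps can be run against an inline sample. This is a sketch under stated assumptions: the sample text, the host names and the temp-file name `output-sample.txt` are all made up, and the real feed's contents may differ.

```powershell
# Made-up stand-in for the downloaded feed; the real feed's contents may differ.
$threatFeedData = "# sample comment`n127.0.0.1 ads.example.com`n127.0.0.1 tracker.example.net"

# Same pattern and extraction steps as in the script above.
$matchList = [regex]::Matches($threatFeedData, '127\.0\.0\.1 (\S+)')
$urlList = foreach ($match in $matchList) { $match.Groups[1].Value }

# Write to a temporary file (hypothetical name) rather than the working directory.
$outputFilePath = Join-Path ([System.IO.Path]::GetTempPath()) 'output-sample.txt'
$urlList | Set-Content -Path $outputFilePath

# Read the file back to check the result.
$savedLines = Get-Content -Path $outputFilePath
```

With this sample, `$savedLines` should contain the two host names, without the 127.0.0.1 prefix and without the comment line.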

    [1] See this blog post for background information.