Extract URLs from webpages


I want to extract URLs from a webpage that contains multiple URLs and save the extracted URLs to a text file.

The URLs in the webpage start with '127.0.0.1', but I want to remove '127.0.0.1' from them and extract only the URLs. When I run the PowerShell script below, it only saves '127.0.0.1'. Any help fixing this would be appreciated.

    $threatFeedUrl = "https://raw.githubusercontent.com/DandelionSprout/adfilt/master/Alternate versions Anti-Malware List/AntiMalwareHosts.txt"
    
    # Download the threat feed data
    $threatFeedData = Invoke-WebRequest -Uri $threatFeedUrl
    
    # Define a regular expression pattern to match URLs starting with '127.0.0.1'
    $pattern = '127\.0\.0\.1(?:[^\s]*)'
    
    # Use the regular expression to find matches in the threat feed data
    $matches = [regex]::Matches($threatFeedData.Content, $pattern)
    
    # Create a list to store the matched URLs
    $urlList = @()
    
    # Populate the list with matched URLs
    foreach ($match in $matches) {
        $urlList += $match.Value
    }
    
    # Specify the output file path
    $outputFilePath = "output.txt"
    
    # Save the URLs to the output file
    $urlList | Out-File -FilePath $outputFilePath
    
    Write-Host "URLs starting with '127.0.0.1' extracted from threat feed have been saved to $outputFilePath."

Solution

  • Preface:

    • The target URL happens to be a (semi-structured) plain-text resource, so regex-based processing is appropriate.

    • In general, however, with HTML content, using a dedicated parser is preferable, given that regexes aren't capable of parsing HTML robustly.[1] See this answer for an example of extracting links from an HTML document.


    '127\.0\.0\.1(?:[^\s]*)'
    
    • You're mistakenly using a non-capturing group ((?:…)) rather than a capturing one ((…)).

    • In the downloaded content, there is a space after 127.0.0.1, so [^\s]* matches the empty string there and the overall match is just 127.0.0.1.

    • Therefore use the following regex instead (\S is the simpler equivalent of [^\s], and + matches only a non-empty run of non-whitespace characters):

      '127\.0\.0\.1 (\S+)'
      
    $matches = …
    
    • While it technically doesn't cause a problem here, $matches refers to the automatic $Matches variable (variable names are case-insensitive), which PowerShell populates after a successful -match operation, and therefore shouldn't be used for custom purposes.

    $match.Value
    
    • $match.Value is the whole text that your regex matched, whereas you only want the text of the capture group.

    • Use $match.Groups[1].Value instead.
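To illustrate the difference between the whole match and the capture group, here is a minimal sketch. The sample lines and the variable names $sample and $m are made up for illustration; they only imitate the "127.0.0.1 host" layout of the feed and are not real feed data.

```powershell
# Hypothetical sample in the same "127.0.0.1 <host>" layout as the feed.
$sample = @(
  '# comment line'
  '127.0.0.1 example.org'
  '127.0.0.1 ads.example.com'
) -join "`n"

foreach ($m in [regex]::Matches($sample, '127\.0\.0\.1 (\S+)')) {
  # .Value is the whole match; .Groups[1].Value is only what (\S+) captured.
  '{0} -> {1}' -f $m.Value, $m.Groups[1].Value
}
```

This should print one line per entry, with the full match (including the 127.0.0.1 prefix) on the left and only the captured host name on the right.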

    $urlList += 
    
    • Building an array iteratively with += is inefficient, because a new array must be allocated behind the scenes in every iteration; simply use the foreach statement as an expression, and let PowerShell collect the results for you. See this answer for more information.
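The two approaches can be compared side by side in a small sketch. The input values and the names $items, $slow and $fast are hypothetical stand-ins for the regex matches; wrapping the statement in @(...) is an optional addition here, assumed useful so that the result is an array even when there is only one item or none.

```powershell
# Hypothetical input values, standing in for the regex matches.
$items = 'a.example', 'b.example', 'c.example'

# Inefficient: += creates a new, larger array in every iteration.
$slow = @()
foreach ($item in $items) { $slow += $item.ToUpper() }

# Preferred: let PowerShell collect the foreach statement's output.
# @(...) guarantees an array even for zero or one result.
$fast = @(foreach ($item in $items) { $item.ToUpper() })
```

Both variables end up holding the same elements; the difference only becomes noticeable with large inputs.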

    Invoke-WebRequest -Uri $threatFeedUrl
    
    • Since you're only interested in the text content of the response, it is simpler to use Invoke-RestMethod rather than Invoke-WebRequest; the former returns the content directly (no need to access a .Content property).

    To put it all together:

    $threatFeedUrl = 'https://raw.githubusercontent.com/DandelionSprout/adfilt/master/Alternate versions Anti-Malware List/AntiMalwareHosts.txt'
        
    # Download the threat feed data
    $threatFeedData = Invoke-RestMethod -Uri $threatFeedUrl
        
    # Define a regular expression pattern to match URLs starting with '127.0.0.1'
    $pattern = '127\.0\.0\.1 (\S+)'
        
    # Use the regular expression to find matches in the threat feed data
    $matchList = [regex]::Matches($threatFeedData, $pattern)
        
    # Create and populate the list with matched URLs
    $urlList = 
      foreach ($match in $matchList) {
        $match.Groups[1].Value
      }
        
    # Specify the output file path
    $outputFilePath = 'output.txt'
        
    # Save the URLs to the output file
    $urlList | Out-File -FilePath $outputFilePath
        
    Write-Host "URLs starting with '127.0.0.1' extracted from threat feed have been saved to $outputFilePath."
    

    [1] See this blog post for background information.