regex powershell string-matching web-content

Find a URL in web page content using PowerShell


I need to find the URL https://cdn.windwardstudios.com/Archive/23.X/23.3.0/JavaRESTfulEngine-23.3.0.32.zip in the page at https://www.windwardstudios.com/version/version-downloads using PowerShell.

In other words, I need to match https://<anything>/JavaRESTfulEngine<anything>.zip

To start, I tried $regexPattern = 'https://cdn\.windwardstudios\.com/Archive/\d{2}\.X/\d+\.\d+\.\d+/JavaRESTfulEngine-.*?\.zip', which works and gives me the desired URL.

To generalize further, I tried $regexPattern = 'https://cdn\.windwardstudios\.com/Archive/([^/]+)/JavaRESTfulEngine-.*?\.zip', but this one no longer matches anything.

Below is my PowerShell script.

# URL of the website to scrape
$websiteUrl = 'https://www.windwardstudios.com/version/version-downloads'

# Use Invoke-WebRequest to fetch the web page content
$response = Invoke-WebRequest -Uri $websiteUrl

# Check if the request was successful
if ($response.StatusCode -eq 200) {
    # Parse the HTML content to find the zip file URL using a regular expression
    $htmlContent = $response.Content
    $regexPattern = 'https://cdn\.windwardstudios\.com/Archive/([^/]+)/JavaRESTfulEngine-.*?\.zip'
    $zipFileUrls = [regex]::Matches($htmlContent, $regexPattern) | ForEach-Object { $_.Value }

    if ($zipFileUrls.Count -gt 0) {
        Write-Host "Found zip file URLs:"
        $zipFileUrls | ForEach-Object { Write-Host $_ }
    } else {
        Write-Host "Zip file URLs not found on the page."
    }
} else {
    Write-Host "Failed to fetch the web page. Status code: $($response.StatusCode)"
}

Output:

Zip file URLs not found on the page.

Desired output:

https://cdn.windwardstudios.com/Archive/23.X/23.3.0/JavaRESTfulEngine-23.3.0.32.zip

Can you please suggest a fix?


Solution

  • You can use

    https://cdn\.windwardstudios\.com/Archive/(\S+?)/JavaRESTfulEngine-.*?\.zip
    


    Details:

    • https://cdn\.windwardstudios\.com/Archive/ - a literal https://cdn.windwardstudios.com/Archive/ string
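    • (\S+?) - Group 1: one or more non-whitespace chars, as few as possible. Unlike [^/]+, which stops at the first slash and therefore covers only one path segment, \S also matches /, so the group can span several segments such as 23.X/23.3.0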
    • /JavaRESTfulEngine- - a literal /JavaRESTfulEngine- string
    • .*? - any zero or more chars other than line break chars as few as possible
    • \.zip - a literal .zip string.
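
The difference between the two patterns can be checked offline with a minimal sketch like the one below. It assumes the page contains the link as a plain href attribute; the $htmlContent string is a hypothetical stand-in for the real page content, and the variable names $singleSegment, $anySegments and $versionPath are made up for illustration.

```powershell
# Hypothetical HTML fragment standing in for the downloaded page content
$htmlContent = '<a href="https://cdn.windwardstudios.com/Archive/23.X/23.3.0/JavaRESTfulEngine-23.3.0.32.zip">Java RESTful Engine</a>'

# Pattern from the question: [^/]+ covers exactly one path segment,
# so it cannot consume the two segments "23.X/23.3.0"
$singleSegment = 'https://cdn\.windwardstudios\.com/Archive/([^/]+)/JavaRESTfulEngine-.*?\.zip'

# Pattern from the answer: \S+? also matches "/", so it can span several segments
$anySegments = 'https://cdn\.windwardstudios\.com/Archive/(\S+?)/JavaRESTfulEngine-.*?\.zip'

$failed = [regex]::Matches($htmlContent, $singleSegment)
$found  = @([regex]::Matches($htmlContent, $anySegments) | ForEach-Object { $_.Value })

# Group 1 holds whatever sits between "Archive/" and "/JavaRESTfulEngine-"
$versionPath = [regex]::Match($htmlContent, $anySegments).Groups[1].Value

Write-Host "Single-segment pattern matches: $($failed.Count)"
Write-Host "Multi-segment pattern matches:  $($found.Count)"
$found | ForEach-Object { Write-Host $_ }
Write-Host "Captured path: $versionPath"
```

Because \S+? is lazy and cannot cross whitespace, the match stays inside a single URL as long as the page separates links with whitespace or line breaks. If the real page is laid out differently, the pattern may need tightening.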