Search code examples
powershellcsvmatchhref

Extract value from link in Powershell


I have a function in Powershell that gets the content of a file and breaks it up into fields to put into a CSV file. I'm wondering if there's a way to get a value from the link and add it to the columns that are sent to the CSV file while keeping the link column intact.

function Convert2CSV {
(Get-Content $input_path) -match "href" | % {
$data = ($_ -replace '(?:.*)href="(.*?)">Date:\s*([\w\.]+)\s*([\w\:]+)\s*Item:\s*(.*)</a>(?:.*)' , '$1;$2;$3;$4').Split(";")
New-Object psobject -Property @{
    "Link" = $data[0]
    "Date" = $data[1]
    "Time" = $data[2]
    "Item" = $data[3]
    }
} #| Export-Csv $output_file -NoTypeInformation
}

The value I'm looking for is either

FeedDefault_.*?(&) or _Feed.*?(&)

Am I correct in thinking that I can add some sort of if statement to the "Link" = $data[0] part?

Sample Output as requested.

Value in Link   |   Link                                                                    |   Date        |   Time    |   Item            |
--------------------------------------------------------------------------------------------------------------------------------------------|
bluepebbles     |   http://www.domain.com/page.html?FeedDefault_bluepebbles&something       |   2013-05-19  |   13:30   | Blue Pebbles      |
--------------------------------------------------------------------------------------------------------------------------------------------|
redpebbles      |   http://www.domain.com/page.html?Feed_redpebbles&something               |   2013-05-19  |   13:31   | Red Pebbles       |
--------------------------------------------------------------------------------------------------------------------------------------------|

CSV Formatted

Value in Link,Link,Date,Time,Item
"bluepebbles","http://www.domain.com/page.html?FeedDefault_bluepebbles&something","2013-05-19","13:30","Blue Pebbles"
"redpebbles","http://www.domain.com/page.html?Feed_redpebbles&something","2013-05-19","13:31","Red Pebbles"

So entering in

$input_path = 'f:\mockup\area51\files\link.html'
$output_file = 'f:\mockup\area51\files\db_csv.csv'

$tstampCulture = [Globalization.cultureinfo]::GetCultureInfo("en-GB")

$ie = New-Object -COM "InternetExplorer.Application"
$ie.Visible = $false

$ie.Navigate("file:///$input_path")

$ie.document.getElementsByTagName("a") | % {
  $_.innerText -match 'Date:\s*([\w\.]+)\s*([\w\:]+)\s*Item:\s*(.*)'
  $obj = New-Object psobject -Property @{
    "Link" = $_.href
    "Date" = $matches[1]
    "Time" = $matches[2]
    "Item" = $matches[3]
  }
  if ( $obj.Link -match '\?Feed(?:Default)?_(.*?)&' ) {
    $obj | Add-Member –Type "NoteProperty" –Name "LinkValue" –Value $matches[1]
  }
  $obj
} #| Export-Csv $output_file -NoTypeInformation

Returns an error:

You cannot call a method on a null-valued expression.
At line:12 char:38
+     $ie.document.getElementsByTagName <<<< ("a") | % {
+ CategoryInfo          : InvalidOperation: (getElementsByTagName:String) [], RuntimeException
+ FullyQualifiedErrorId : InvokeMethodOnNull

So I'm pretty sure that I probably messed something up. :)


Solution

  • First I'd suggest to use -match instead of -replace. The resulting $matches array already contains the submatches you're interested in, so there's no need to manually create this array.

    Get-Content $input_path | ? { $_.contains("href") } | % {
      $_ -match 'href="(.*?)">Date:\s*([\w\.]+)\s*([\w\:]+)\s*Item:\s*(.*)</a>'
      $obj = New-Object psobject -Property @{
        "Link" = $matches[1]
        "Date" = $matches[2]
        "Time" = $matches[3]
        "Item" = $matches[4]
      }
      $obj
    } #| Export-Csv $output_file -NoTypeInformation
    

    The additional information can be extracted from $obj.Link with a second -match and then added to the custom object via Add-Member:

    if ( $obj.Link -match '\?Feed(?:Default)?_(.*?)&' ) {
      $obj | Add-Member –Type "NoteProperty" –Name "LinkValue" –Value $matches[1]
    }
    

    Also, since your input files are probably HTML files you should consider using the InternetExplorer COM object, which will give you far better control over the extracted tags than processing the files line-by-line.

    $ie = New-Object -COM "InternetExplorer.Application"
    $ie.Visible = $false
    
    $ie.Navigate("file:///$input_path")
    while ( $ie.Busy ) { Start-Sleep -Milliseconds 100 }
    
    $ie.document.getElementsByTagName("a") | % {
      $_.innerText -match 'Date:\s*([\w\.]+)\s*([\w\:]+)\s*Item:\s*(.*)'
      $obj = New-Object psobject -Property @{
        "Link" = $_.href
        "Date" = $matches[1]
        "Time" = $matches[2]
        "Item" = $matches[3]
      }
      if ( $obj.Link -match '\?Feed(?:Default)?_(.*?)&' ) {
        $obj | Add-Member –Type "NoteProperty" –Name "LinkValue" –Value $matches[1]
      }
      $obj
    }