I have a function in Powershell that gets the content of a file and breaks it up into fields to put into a CSV file. I'm wondering if there's a way to get a value from the link and add it to the columns that are sent to the CSV file while keeping the link column intact.
function Convert2CSV {
(Get-Content $input_path) -match "href" | % {
$data = ($_ -replace '(?:.*)href="(.*?)">Date:\s*([\w\.]+)\s*([\w\:]+)\s*Item:\s*(.*)</a>(?:.*)' , '$1;$2;$3;$4').Split(";")
New-Object psobject -Property @{
"Link" = $data[0]
"Date" = $data[1]
"Time" = $data[2]
"Item" = $data[3]
}
} #| Export-Csv $output_file -NoTypeInformation
}
The value I'm looking for is either
FeedDefault_.*?(&) or _Feed.*?(&)
Am I correct in thinking that I can add some sort of if statement to the "Link" = $data[0] part?
Sample Output as requested.
Value in Link | Link | Date | Time | Item |
--------------------------------------------------------------------------------------------------------------------------------------------|
bluepebbles | http://www.domain.com/page.html?FeedDefault_bluepebbles&something | 2013-05-19 | 13:30 | Blue Pebbles |
--------------------------------------------------------------------------------------------------------------------------------------------|
redpebbles | http://www.domain.com/page.html?Feed_redpebbles&something | 2013-05-19 | 13:31 | Red Pebbles |
--------------------------------------------------------------------------------------------------------------------------------------------|
CSV Formatted
Value in Link,Link,Date,Time,Item
"bluepebbles","http://www.domain.com/page.html?FeedDefault_bluepebbles&something","2013-05-19","13:30","Blue Pebbles"
"redpebbles","http://www.domain.com/page.html?Feed_redpebbles&something","2013-05-19","13:31","Red Pebbles"
So entering in
$input_path = 'f:\mockup\area51\files\link.html'
$output_file = 'f:\mockup\area51\files\db_csv.csv'
$tstampCulture = [Globalization.cultureinfo]::GetCultureInfo("en-GB")
$ie = New-Object -COM "InternetExplorer.Application"
$ie.Visible = $false
$ie.Navigate("file:///$input_path")
$ie.document.getElementsByTagName("a") | % {
$_.innerText -match 'Date:\s*([\w\.]+)\s*([\w\:]+)\s*Item:\s*(.*)'
$obj = New-Object psobject -Property @{
"Link" = $_.href
"Date" = $matches[1]
"Time" = $matches[2]
"Item" = $matches[3]
}
if ( $obj.Link -match '\?Feed(?:Default)?_(.*?)&' ) {
$obj | Add-Member –Type "NoteProperty" –Name "LinkValue" –Value $matches[1]
}
$obj
} #| Export-Csv $output_file -NoTypeInformation
Returns an error:
You cannot call a method on a null-valued expression.
At line:12 char:38
+ $ie.document.getElementsByTagName <<<< ("a") | % {
+ CategoryInfo : InvalidOperation: (getElementsByTagName:String) [], RuntimeException
+ FullyQualifiedErrorId : InvokeMethodOnNull
So I'm pretty sure that I probably messed something up. :)
First I'd suggest to use -match
instead of -replace
. The resulting $matches
array already contains the submatches you're interested in, so there's no need to manually create this array.
Get-Content $input_path | ? { $_.contains("href") } | % {
$_ -match 'href="(.*?)">Date:\s*([\w\.]+)\s*([\w\:]+)\s*Item:\s*(.*)</a>'
$obj = New-Object psobject -Property @{
"Link" = $matches[1]
"Date" = $matches[2]
"Time" = $matches[3]
"Item" = $matches[4]
}
$obj
} #| Export-Csv $output_file -NoTypeInformation
The additional information can be extracted from $obj.Link
with a second -match
and then added to the custom object via Add-Member
:
if ( $obj.Link -match '\?Feed(?:Default)?_(.*?)&' ) {
$obj | Add-Member –Type "NoteProperty" –Name "LinkValue" –Value $matches[1]
}
Also, since your input files are probably HTML files you should consider using the InternetExplorer
COM object, which will give you far better control over the extracted tags than processing the files line-by-line.
$ie = New-Object -COM "InternetExplorer.Application"
$ie.Visible = $false
$ie.Navigate("file:///$input_path")
while ( $ie.Busy ) { Start-Sleep -Milliseconds 100 }
$ie.document.getElementsByTagName("a") | % {
$_.innerText -match 'Date:\s*([\w\.]+)\s*([\w\:]+)\s*Item:\s*(.*)'
$obj = New-Object psobject -Property @{
"Link" = $_.href
"Date" = $matches[1]
"Time" = $matches[2]
"Item" = $matches[3]
}
if ( $obj.Link -match '\?Feed(?:Default)?_(.*?)&' ) {
$obj | Add-Member –Type "NoteProperty" –Name "LinkValue" –Value $matches[1]
}
$obj
}