Search code examples
powershellcsvoledbpowershell-4.0

Converting CSV to custom object using a Microsoft.ACE.OLEDB.12.0 OleDbConnection


I am trying to read in a large CSV with millions of rows for testing. I know that I can treat the CSV as a database using the provider Microsoft.ACE.OLEDB.12.0

Using a small data set I am able to read the row contents positionally using .GetValue(int). I am having a tough time finding a better was to read the data (assuming there even is one.). If I know the column names before hand this is easy. However if I didn't know them I would have to read in the first line of the file to get that data which seems silly.

@"
id,first_name,last_name,email,ip_address
1,Edward,Richards,[email protected],201.133.112.30
2,Jimmy,Scott,[email protected],103.231.149.144
3,Marilyn,Williams,[email protected],52.180.157.43
4,Frank,Morales,[email protected],218.175.165.205
5,Chris,Watson,[email protected],75.251.1.149
6,Albert,Ross,[email protected],89.56.133.54
7,Diane,Daniels,[email protected],197.156.129.45
8,Nancy,Carter,[email protected],75.162.65.142
9,John,Kennedy,[email protected],85.35.177.235
10,Bonnie,Bradley,[email protected],255.67.106.193
"@ | Set-Content .\test.csv 

$conn = New-Object System.Data.OleDb.OleDbConnection("Provider=Microsoft.ACE.OLEDB.12.0;Data Source='C:\Users\Matt';Extended Properties='Text;HDR=Yes;FMT=Delimited';")
$cmd=$conn.CreateCommand()
$cmd.CommandText="Select * from test.csv where first_name like '%n%'"
$conn.open()
$data = $cmd.ExecuteReader()

$data | ForEach-Object{
    [pscustomobject]@{
        id=$_.GetValue(0) 
        first_name=$_.GetValue(1) 
        last_name=$_.GetValue(2)
        ip_address=$_.GetValue(4)
    }
}

$cmd.Dispose()
$conn.Dispose()

Is there a better way to deal with the output from $cmd.ExecuteReader()? Finding hard to get information for a CSV import. Most of the web deals with exporting to CSV using this provider from a SQL database. The logic here would be applied to a large CSV so that I don't need to read the whole thing in just to ignore most of the data.


Solution

  • I should have looked closer on TechNet for the OleDbDataReader Class. There are a few methods and properties that help understand the data returned from the SQL statement.

    • FieldCount: Gets the number of columns in the current row.

      So if nothing else you know how many columns your rows have.

    • Item[Int32]: Gets the value of the specified column in its native format given the column ordinal.

      Which I can use to pull back the data from each row. This appears to work the same as GetValue().

    • GetName(Int32): Gets the name of the specified column.

      So if you don't know what the column is named this is what you can use to get it from a given index.

    There are many other methods and some properties but those are enough to shed light if you are not sure what data is contained within a csv (assuming you don't want to manually verify before hand). So, knowing that, a more dynamic way to get the same information would be...

    $data | ForEach-Object{
    
        # Save the current row as its own object so that it can be used in other scopes
        $dataRow = $_
        # Blank hashtable that will be built into a "row" object
        $properties = @{}
    
        # For every field that exists we will add it name and value to the hashtable
        0..($dataRow.FieldCount - 1) | ForEach-Object{
            $properties.($dataRow.GetName($_)) = $dataRow.Item($_)
        }
    
        # Send the newly created object down the pipeline.
        [pscustomobject]$properties
    }
    
    $cmd.Dispose()
    $conn.Dispose()
    

    Only downside of this is that the columns will likely be output in not the same order as the originating CSV. That can be address by saving the row names in a separate variable and using a Select at the end of the pipe. This answer was mostly trying to make sense of the column names and values returned.