Search code examples
vb.netlinqsortingmergedistinct-values

Linq join on parameterized distinct key


I'm trying to LINQ two tables based on a dynamic key. User can change key via a combo box. Key may be money, string, double, int, etc. Currently I'm getting the data just fine, but without filtering out the doubles. I can filter the double in VB, but it's slooooow. I'd like to do it in the LINQ query right out of the gate.

Here's the data:

First Table:

 -------------------------------------------------------------
| AppleIndex  | AppleCost  | AppleColor  | AppleDescription   |
 ------------------------------------------------------------
|     1       |     3      | Red         | This is an apple   |
|     2       |     5      | Green       | This is an apple   |
|     3       |     4      | Pink        | This is an apple   |
|     4       |     2      | Yellow      | This is an apple   |
|     5       |     2      | Orange      | This is an apple   |
|     1       |     3      | Red         | This is a duplicate|
|     2       |     5      | Green       | This is a duplicate|
|     3       |     4      | Pink        | This is a duplicate|
|     4       |     2      | Yellow      | This is a duplicate|
|     5       |     2      | Orange      | This is a duplicate|
 -------------------------------------------------------------

Second Table:

 ------------------------------------------------------------
| OrangeIndex | OrangeCost | OrangeColor | OrangeDescription |
 ------------------------------------------------------------
|     1       |     1      | Orange      | This is an Orange |
|     2       |     3      | Orange      |                   |
|     3       |     2      | Orange      | This is an Orange |
|     4       |     3      | Orange      |                   |
|     5       |     2      | Orange      | This is an Orange |
 ------------------------------------------------------------

Currently, I'm using the following code to get too much data:

Dim Matches = From mRows In LinqMasterTable Join sRows In LinqSecondTable _
              On mRows(ThePrimaryKey) Equals sRows(TheForignKey) _
              Order By mRows(ThePrimaryKey) _
              Select mRows, sRows Distinct

Outcome:

 -------------------------------------------------------------------------
| 1  | 3 | Red    | This is an apple     | 1 | Orange | This is an Orange |
| 1  | 3 | Red    | This is an duplicate | 1 | Orange | This is an Orange |
| 2  | 5 | Green  | This is an apple     | 3 | Orange |                   |
| 2  | 5 | Green  | This is an duplicate | 3 | Orange |                   |
| 3  | 4 | Pink   | This is an apple     | 2 | Orange | This is an Orange |
| 3  | 4 | Pink   | This is an duplicate | 2 | Orange | This is an Orange |
| 4  | 2 | Yellow | This is an apple     | 3 | Orange |                   |
| 4  | 2 | Yellow | This is an duplicate | 3 | Orange |                   |
| 5  | 2 | Orange | This is an apple     | 2 | Orange | This is an Orange |
| 5  | 2 | Orange | This is an duplicate | 2 | Orange | This is an Orange |
 -------------------------------------------------------------------------

Desired Outcome:

 ------------------------------------------------------------------------
| 1 | 3 | Red    | This is an apple | 1 | 1 | Orange | This is an Orange |
| 2 | 5 | Green  | This is an apple | 2 | 3 | Orange |                   |
| 3 | 4 | Pink   | This is an apple | 3 | 2 | Orange | This is an Orange |
| 4 | 2 | Yellow | This is an apple | 4 | 3 | Orange |                   |
| 5 | 2 | Orange | This is an apple | 5 | 2 | Orange | This is an Orange |
 ------------------------------------------------------------------------

I have tried the following:

'Get the original Column Names into an Array List
'MasterTableColumns = GetColumns(qMasterDS, TheMasterTable) '(external code)

'Plug the Existing DataSet into a DataView:
Dim View As DataView = New DataView(qMasterTable)

'Sort by the Primary Key:
View.Sort = ThePrimaryKey

'Build a new table listing only one column:
Dim newListTable As DataTable = _
View.ToTable("UniqueData", True, ThePrimaryKey)

This returns a unique list, but no associated data:

 -------------
| AppleIndex  |
 -------------
|     1       | 
|     2       | 
|     3       |
|     4       |
|     5       |
 -------------

So I tried this instead:

'Build a new table with ALL the columns:
Dim newFullTable As DataTable = _
View.ToTable("UniqueData", True, _
     MasterTableColumns(0), _
     MasterTableColumns(1), _
     MasterTableColumns(2), _
     MasterTableColumns(3))

Unfortunately, it yields the following... with duplicates:

 -------------------------------------------------------------
| AppleIndex  | AppleCost  | AppleColor  | AppleDescription   |
 ------------------------------------------------------------
|     1       |     3      | Red         | This is an apple   |
|     2       |     5      | Green       | This is an apple   |
|     3       |     4      | Pink        | This is an apple   |
|     4       |     2      | Yellow      | This is an apple   |
|     5       |     2      | Orange      | This is an apple   |
|     1       |     3      | Red         | This is a duplicate|
|     2       |     5      | Green       | This is a duplicate|
|     3       |     4      | Pink        | This is a duplicate|
|     4       |     2      | Yellow      | This is a duplicate|
|     5       |     2      | Orange      | This is a duplicate|
 -------------------------------------------------------------

Any ideas?

~~~~~~~~~~~~ Update: ~~~~~~~~~~~~

Jeff M suggested the following code. (Thanks Jeff) However, it gives me a error. Does anyone know the syntax for making this work in VB? I've monkeyed with it a bit and can't seem to get it right.

Dim matches = _
    From mRows In (From row In LinqMasterTable _
        Group row By row(ThePrimaryKey) Into g() _
        Select g.First()) _
    Join sRows In LinqSecondTable _
    On mRows(ThePrimaryKey) Equals sRows(TheForignKey) _
    Order By mRows(ThePrimaryKey) _
    Select mRows, sRows

Error in Third row at "row(ThePrimaryKey)":

"Range variable name can be inferred only from a simple or qualified name with no arguments."


Solution

  • Declarations and Such:

    Private Sub LinqTwoTableInnerJoin(ByRef qMasterDS As DataSet, _
                                      ByRef qMasterTable As DataTable, _
                                      ByRef qSecondDS As DataSet, _
                                      ByRef qSecondTable As DataTable, _
                                      ByRef qPrimaryKey As String, _
                                      ByRef qForignKey As String, _
                                      ByVal qResultsName As String)
    
    Dim TheMasterTable As String = qMasterTable.TableName
    Dim TheSecondTable As String = qSecondTable.TableName
    Dim ThePrimaryKey As String = qPrimaryKey
    Dim TheForignKey As String = qForignKey
    Dim TheNewForignKey As String = ""
    
    MasterTableColumns = GetColumns(qMasterDS, TheMasterTable)
    SecondTableColumns = GetColumns(qSecondDS, TheSecondTable)
    
    Dim mColumnCount As Integer = MasterTableColumns.Count
    Dim sColumnCount As Integer = SecondTableColumns.Count
    
    Dim ColumnCount As Integer = mColumnCount + sColumnCount
    
    Dim LinqMasterTable = qMasterDS.Tables(TheMasterTable).AsEnumerable
    Dim LinqSecondTable = qSecondDS.Tables(TheSecondTable).AsEnumerable
    

    Get the Data and order it by the Selected Key:

    Dim Matches = From mRows In LinqMasterTable Join sRows In LinqSecondTable _
                 On mRows(ThePrimaryKey) Equals sRows(TheForignKey) _
                 Order By mRows(ThePrimaryKey) _
                 Select mRows, sRows
    

    Put the Results into a Dataset Table:

    ' Make sure the dataset is available and/or cleared:
    If dsResults.Tables(qResultsName) Is Nothing Then dsResults.Tables.Add(qResultsName)
    dsResults.Tables(qResultsName).Clear() : dsResults.Tables(qResultsName).Columns.Clear()
    
    'Adds Master Table Column Names
    For x = 0 To MasterTableColumns.Count - 1
        dsResults.Tables(qResultsName).Columns.Add(MasterTableColumns(x))
    Next
    
    'Rename Second Table Names if Needed:
    For x = 0 To SecondTableColumns.Count - 1
        With dsResults.Tables(qResultsName)
            For y = 0 To .Columns.Count - 1
                If SecondTableColumns(x) = .Columns(y).ColumnName Then
                    SecondTableColumns(x) = SecondTableColumns(x) & "_2"
                End If
            Next
        End With
    Next
    
    'Make sure that the Forign Key is a Unique Value
    If ForignKey1 = PrimaryKey Then
        TheNewForignKey = ForignKey1 & "_2"
    Else
        TheNewForignKey = ForignKey1
    End If
    
    'Adds Second Table Column Names
    For x = 0 To SecondTableColumns.Count - 1 
        dsResults.Tables(qResultsName).Columns.Add(SecondTableColumns(x))
    Next
    
    'Copy Results into the Dataset:
    For Each Match In Matches
    
        'Build an array for each row:
        Dim NewRow(ColumnCount - 1) As Object
    
        'Add the mRow Items:
        For x = 0 To MasterTableColumns.Count - 1
            NewRow(x) = Match.mRows.Item(x)
        Next
    
        'Add the srow Items:
        For x = 0 To SecondTableColumns.Count - 1
            Dim y As Integer = x + (MasterTableColumns.Count)
            NewRow(y) = Match.sRows.Item(x)
        Next
    
        'Add the array to dsResults as a Row:
        dsResults.Tables(qResultsName).Rows.Add(NewRow)
    
    Next
    

    Give the user an option to clean doubles or not:

    If chkUnique.Checked = True Then
        ReMoveDuplicates(dsResults.Tables(qResultsName), ThePrimaryKey)
    End If
    

    Remove the Duplicates if they so desire:

    Private Sub ReMoveDuplicates(ByRef SkipTable As DataTable, _
                             ByRef TableKey As String)
    
        'Make sure that there's data to work with:
        If SkipTable Is Nothing Then Exit Sub
        If TableKey Is Nothing Then Exit Sub
    
        'Create an ArrayList of rows to delete:
        Dim DeleteRows As New ArrayList()
    
        'Fill the Array with Row Number of the items equal 
        'to the item above them:
        For x = 1 To SkipTable.Rows.Count - 1
            Dim RowOne As DataRow = SkipTable.Rows(x - 1)
            Dim RowTwo As DataRow = SkipTable.Rows(x)
            If RowTwo.Item(TableKey) = RowOne.Item(TableKey) Then
                DeleteRows.Add(x)
            End If
        Next
    
        'If there are no hits, exit this sub:
        If DeleteRows.Count < 1 Or DeleteRows Is Nothing Then
            Exit Sub
        End If
    
        'Otherwise, remove the rows based on the row count value:
        For x = 0 To DeleteRows.Count - 1
    
            'Start at the END and count backwards so the duplicate 
            'item's row count value doesn't change with each deleted row
            Dim KillRow As Integer = DeleteRows((DeleteRows.Count - 1) - x)
    
            'Delete the row:
            SkipTable.Rows(KillRow).Delete()
    
        Next
    End Sub
    

    Then clean up any leftovers:

    If Not chkRetainKeys.Checked = True Then 'Removes Forign Key
        dsResults.Tables(qResultsName).Columns.Remove(TheNewForignKey)
    End If
    
    'Clear Arrays
    MasterTableColumns.Clear()
    SecondTableColumns.Clear()
    

    Final Analysis: Ran this against 2 Files with 4 columns, 65,535 rows, and with some doubles. Process time, roughly 1 second. In fact it took longer to load the fields into memory than it did to parse the data.