Search code examples
.netvb.netado.netgroup-bylinq-to-dataset

GroupBy in LINQ to DataSet


Background:

I'm importing data from a MySQL database into a SQL-Server database(for reports and later a SSAS-Cube). I want to normalize the data at the same time. I want to group repeating Ticket_IDs to one record in a table Contact with other useful informations and leave the rawdata in the sub-table ContactDetail(with foreignkey to Contact). Hence every record in Contact has a unique Ticket_ID.

I've decided to use strong typed datasets for the import. Now i'm wondering what's the best way to detect if i've already added a Ticket_ID. I could check for it in every loop(~100000 records) but i'm assuming that there is a better/faster way.

Simplified sample-data:

Ticket_ID    ID     fiContact
89442226     1      1
89442226     2      1
89442226     3      1
89442261     4      2
89442261     5      2
89442354     6      3
89442359     7      4
89442359     8      4
89442367     9      5
89442504     10     6

This should be the Contact-table

Ticket_ID    idContact
89442226     1
89442261     2
89442354     3
89442359     4
89442367     5
89442504     6

Question:

Is it possible with LINQ/LINQ-to-DataSet to group by Ticket_ID and get a list of ContactDetailRows for every ContactRow? I know there is a GroupBy-Extension, but i'm unsure how to use and if it does what i need(keep the ContactDetail-Rows, f.e. like a dicitonary with Ticket_ID as key and a List(of EmailRow) as value).

This is what i have(simplified):

    For Each srcEmail In src.email 'i want to group src.email by Ticket_ID'
        'so far i check for existence in every loop'
        Dim res = From c In dest.Contact
               Where c.Ticket_ID = srcEmail.ticket_id
        If Not res.Any Then
            'create new Contact
            Dim newContact = Me.dest.Contact.NewContactRow
            newContact.Ticket_ID = srcEmail.ticket_id
            ' ..... '
            dest.Contact.AddContactRow(newContact)
        End If
        'TODO: create ContactDetail row and add it to the DataTable '
    Next
  • src: typed DataSet(MySQL)
  • src.email: typed DataTable => into ContactDetail
  • dest: typed DataSet(SQL-Server)
  • dest.Contact typed DataTable
  • dest.ContactDetail typed DataTable with fk to Contact

I would prefer VB.NET because i'm yet not as familiar with LINQ and the Syntax is quite different in C#.

Edit:

Thanks to @Magnus i've get it going in the following way:

Dim emailsPerTicketID = src.email.ToLookup(Function(email) email.ticket_id)
For Each ticket In emailsPerTicketID
    'create new Contact
    Dim newContact = Me.dest.Contact.NewContactRow
    newContact.Ticket_ID = ticket.Key
    newContact.CreatedAt = ticket.First().modified_time
    ' ...... '
    dest.Contact.AddContactRow(newContact)
    'TODO: add now all EmailRows'
    For Each emailRow In ticket
       Dim newContactDetail = dest.ContactDetail.NewContactDetailRow
       newContactDetail.ContactRow = newContact
       newContactDetail.Interaction = emailRow.interaction
       ' .... '
       dest.ContactDetail.AddContactDetailRow(newContactDetail)
    Next
Next

I will have a look if this is faster than the iterating approach with a HashSet to detect if the contact was already created.


Solution

  • I think using Lookup(like a dictionary but with key/Collection instead) would be a good solution for you. something like this:

    var lookup = ds.Tables["src"].AsEnumerable().ToLookup(x => x.Field<int>("Ticket_ID"));
    foreach (var row in ds.Tables["dest"].AsEnumerable())
    {
        if(!lookup.Contains(row.Field<int>("Ticket_ID ")))
        {
            //create new Contact
        }
        else
        {
            //do other struff
        }
    }
    

    If you need any help translating any of the syntax to VB comment me.