Search code examples
sqldeduplicationpst

What's the best way to deduplicate with the info I have?


I need to find and remove duplicate files (.pst) and eventually get the unique emails. Currently, I am using Powershell to recursively go through folders to find only .pst files and then export specific metadata into a .csv file. It has been suggested to me to import the .csv into SQL to do comparisons (name, dates on the files, etc...). After that, I'm stuck.

What language or program would be best suited to get the files I need and delete the rest of them? I'm pretty much working in VB.Net (could attempt C#) and powershell.


Solution

  • I'll assume that you did import the .csv into an SQL database. Let's say the table name is psts.

    First, to find out how many records have the same email address,

    SELECT email, count(*)
    FROM psts
    GROUP BY email
    

    Next, you don't want to see the ones which have one value, so,

    SELECT email, count(*)
    FROM psts
    GROUP BY email
    HAVING count(*) >= 2
    

    to get a list of those records,

    SELECT out.email, count(*), out.<other fields you want to see>
    FROM psts out
      INNER JOIN psts pst_count ON out.email = pst_count.email
    GROUP BY out.email
    HAVING count(*) >= 2