Search code examples
csvawk

How to remove duplicates based on column1 and conditions in column3?


I'm trying to remove duplicates from a large csv file, based con values in column 1, but considering this:

Column 3 could be empty or have multiple values separated by ::: If there are more than one repeated value in column1, keep the record that has maximun number of elements inside the column 3. Remove the - between numbers in column3 in case it exists.

My input is:

H1,H2,H3,H4
a,2,8005:::+2287:::3426,2
b,4,1111:::+15-00:::01354,1
b,4,1111:::+1500,1
c,4,2208:::+6583,9
d,5,7761:::+993733:::+53426,4
d,5,7761:::+993-733:::+53-426:::87425,4
d,5,7761:::53-426,4

The output I'm trying to get is:

H1,H2,H3,H4
a,2,8005:::+2287:::3426,2
b,4,1111:::+1500:::01354,1
c,4,2208:::+6583,9
d,5,7761:::+993733:::+53426:::87425,4

My current script only removes duplicates without the other considerations, since I'm don't how to mix both scripts and how to add the condition to keep the record that has more elements in column 3.

awk -F, '{ gsub(/-/,"", $3); print } ' input.csv > input_without_hyphen.csv
awk -F',' -v OFS=',' '!a[$1]++' input_without_hyphen.csv > output.csv

Thanks for any help.


Solution

  • Assumptions:

    • the comma only shows up as a delimiter (ie, we don't have to worry about a comma showing up in the actual data)
    • if 2 rows are exactly the same and have the same number of elements in the 3rd column then we keep the first one we process
    • input order is to be maintained

    One awk idea:

    awk '
    BEGIN  { FS = OFS = "," }
    
    FNR==1 { print; next }              # print header as is
    
    NF!=4  { next }                     # skip lines that do not have 4 comma-delimited fields
    
           { key = $1                   # define key for this row
             if (! (key in counts))     # if we have not seen this key before then ...
                order[++ordnum] = key   # associate the key with the next ordering number
    
             gsub(/-/,"",$3)            # remove all hyphens from 3rd column
    
             n = split($3,a,/:::/)      # split 3rd column on ":::" delimiter, store results in array a[]; "n" == number of elements in array
    
             if (n > counts[key]) {     # if the number of elements (in 3rd column) is more than the previous count (for a row with the same key) then ...
                counts[key] = n         # save the new count and ...
                rows[key]   = $0        # save the current row
             }
           }
    
    END    { for (i=1; i<=ordnum; i++)  # iterate through our ordering numbers and ...
                print rows[order[i]]    # print the associated row to stdout
           }
    ' input.csv
    

    This generates:

    H1,H2,H3,H4
    a,2,8005:::+2287:::3426,2
    b,4,1111:::+1500:::01354,1
    c,4,2208:::+6583,9
    d,5,7761:::+993733:::+53426:::87425,4
    

    NOTE: while the data came out in the same order as OP's expected output, ordering isn't always guaranteed; if the row order is important then we would need to add more code