I'm trying to remove duplicates from a large csv file, based con values in column 1, but considering this:
Column 3 could be empty or have multiple values separated by :::
If there are more than one repeated value in column1, keep the record that has maximun number of elements inside the column 3.
Remove the -
between numbers in column3 in case it exists.
My input is:
H1,H2,H3,H4
a,2,8005:::+2287:::3426,2
b,4,1111:::+15-00:::01354,1
b,4,1111:::+1500,1
c,4,2208:::+6583,9
d,5,7761:::+993733:::+53426,4
d,5,7761:::+993-733:::+53-426:::87425,4
d,5,7761:::53-426,4
The output I'm trying to get is:
H1,H2,H3,H4
a,2,8005:::+2287:::3426,2
b,4,1111:::+1500:::01354,1
c,4,2208:::+6583,9
d,5,7761:::+993733:::+53426:::87425,4
My current script only removes duplicates without the other considerations, since I'm don't how to mix both scripts and how to add the condition to keep the record that has more elements in column 3.
awk -F, '{ gsub(/-/,"", $3); print } ' input.csv > input_without_hyphen.csv
awk -F',' -v OFS=',' '!a[$1]++' input_without_hyphen.csv > output.csv
Thanks for any help.
Assumptions:
One awk
idea:
awk '
BEGIN { FS = OFS = "," }
FNR==1 { print; next } # print header as is
NF!=4 { next } # skip lines that do not have 4 comma-delimited fields
{ key = $1 # define key for this row
if (! (key in counts)) # if we have not seen this key before then ...
order[++ordnum] = key # associate the key with the next ordering number
gsub(/-/,"",$3) # remove all hyphens from 3rd column
n = split($3,a,/:::/) # split 3rd column on ":::" delimiter, store results in array a[]; "n" == number of elements in array
if (n > counts[key]) { # if the number of elements (in 3rd column) is more than the previous count (for a row with the same key) then ...
counts[key] = n # save the new count and ...
rows[key] = $0 # save the current row
}
}
END { for (i=1; i<=ordnum; i++) # iterate through our ordering numbers and ...
print rows[order[i]] # print the associated row to stdout
}
' input.csv
This generates:
H1,H2,H3,H4
a,2,8005:::+2287:::3426,2
b,4,1111:::+1500:::01354,1
c,4,2208:::+6583,9
d,5,7761:::+993733:::+53426:::87425,4
NOTE: while the data came out in the same order as OP's expected output, ordering isn't always guaranteed; if the row order is important then we would need to add more code