Search code examples
csvsortingawksed

Replace duplicate values across columns of CSV file


I have a headerless CSV file that is sorted on the first column. When the 2nd and 3rd columns are identical, I want to "collapse" them into one - i.e. leave the last column empty. In other words, this:

0000001,11111,66666
0000002,12121,22222
0000003,33333,33333
0000004,74747,44444
0000005,12345,12345

...becomes this:

0000001,11111,66666
0000002,12121,22222
0000003,33333,
0000004,74747,44444
0000005,12345,

I've tried various permutations of grep and cut but can't get anything to work - the closest I've come is cut -c 8-19 file.csv, which just isolates the 2nd and 3rd columns. I have a feeling needing to do this across columns and needing to delete the value instead of the whole line makes this complicated enough to require awk or sed, and I don't know enough about either to know how to approach it.


Solution

  • Using sed

    $ sed -E 's/([^,]*,([^,]*),)\2/\1/' input_file
    0000001,11111,66666
    0000002,12121,22222
    0000003,33333,
    0000004,74747,44444
    0000005,12345,