Tags: bash, sorting, multiple-columns, uniq

Is it possible to find data that have duplicate values in one column but not others using bash?


I have a file with multiple columns and rows. I would like to find the rows whose value in column 4 appears in more than one row and print those rows to a new file.

My data file looks like this:

 RR2.out    -1752.142111    -1099486.696073  0.000000
 SS2.out    -1752.142111    -1099486.696073  0.000000
 RR1.out    -1752.141887    -1099486.555511  0.140562
 SS1.out    -1752.141887    -1099486.555511  0.140562
 RR4.out    -1752.140564    -1099485.725315  0.970758
 SS4.out    -1752.140564    -1099485.725315  0.970758
 RR3.out    -1752.140319    -1099485.571575  1.124498
 SS3.out    -1752.140319    -1099485.571575  1.124498
 SS5.out    -1752.138532    -1099484.450215  2.245858
 RR6.out    -1752.138493    -1099484.425742  2.270331
 SS6.out    -1752.138493    -1099484.425742  2.270331
 file Gibbs kcal rel
 file Gibbs kcal rel

If I just use uniq -d, I only get

file Gibbs kcal rel
file Gibbs kcal rel

because they are the only two lines that match completely. What I want to know is whether there is a way to find all rows that have duplicate values in column 4, rather than requiring the entire line to match.

I then use awk and read to read in the file names from column 1, so ideally I wouldn't have to write the data out to another file and read it back in, as I have found that this round trip can cause errors when the file names are read. A rough sketch of that loop is below.
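For context, the downstream loop is roughly like this (the input name duplicates.dat and the loop body are simplified placeholders for my actual script):

    while read -r name; do
        # do something with each .out file listed in column 1
        echo "$name"
    done < <(awk '{print $1}' duplicates.dat)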

In this example, I should get the following file as output:

 RR2.out    -1752.142111    -1099486.696073  0.000000
 SS2.out    -1752.142111    -1099486.696073  0.000000
 RR1.out    -1752.141887    -1099486.555511  0.140562
 SS1.out    -1752.141887    -1099486.555511  0.140562
 RR4.out    -1752.140564    -1099485.725315  0.970758
 SS4.out    -1752.140564    -1099485.725315  0.970758
 RR3.out    -1752.140319    -1099485.571575  1.124498
 SS3.out    -1752.140319    -1099485.571575  1.124498
 RR6.out    -1752.138493    -1099484.425742  2.270331
 SS6.out    -1752.138493    -1099484.425742  2.270331
 file Gibbs kcal rel
 file Gibbs kcal rel

Solution

  • uniq has the -f/--skip-fields option to ignore the first n fields of each line when comparing, and -D (--all-repeated) prints every line of each group of duplicates. Skipping the first three fields means the comparison uses only column 4 (the only remaining field here):

    uniq -D -f3
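
    Note that uniq only detects duplicates on adjacent lines, which happens to hold for the sample data. If the column-4 duplicates might not be adjacent, sort on that column first, or use a two-pass awk that keeps the original row order. The file names below (data.txt, duplicates.txt) are illustrative:

        # sort on column 4 so equal values become adjacent, then keep all duplicates
        sort -k4,4 data.txt | uniq -D -f3 > duplicates.txt

        # awk alternative: first pass counts column-4 values, second pass
        # prints only rows whose value occurs more than once (original order kept)
        awk 'NR==FNR {count[$4]++; next} count[$4] > 1' data.txt data.txt > duplicates.txt

    The awk form can also feed the file names straight into the existing read loop, avoiding the intermediate file:

        while read -r name; do
            echo "$name"    # placeholder for the real per-file processing
        done < <(awk 'NR==FNR {count[$4]++; next} count[$4] > 1 {print $1}' data.txt data.txt)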