I have a file with multiple columns and rows. I would like to find the rows whose value in column 4 is duplicated and print those rows to a new file.
My data file looks like this:
RR2.out -1752.142111 -1099486.696073 0.000000
SS2.out -1752.142111 -1099486.696073 0.000000
RR1.out -1752.141887 -1099486.555511 0.140562
SS1.out -1752.141887 -1099486.555511 0.140562
RR4.out -1752.140564 -1099485.725315 0.970758
SS4.out -1752.140564 -1099485.725315 0.970758
RR3.out -1752.140319 -1099485.571575 1.124498
SS3.out -1752.140319 -1099485.571575 1.124498
SS5.out -1752.138532 -1099484.450215 2.245858
RR6.out -1752.138493 -1099484.425742 2.270331
SS6.out -1752.138493 -1099484.425742 2.270331
file Gibbs kcal rel
file Gibbs kcal rel
If I just use uniq -d, I only get
file Gibbs kcal rel
file Gibbs kcal rel
because they are the only two lines that match completely. What I want to know is whether there is a way to find all rows that have duplicate values in column 4, rather than only rows where the entire line matches.
I then use awk and read to read in the file names from column 1, so ideally I wouldn't have to write the data out to another file and read it back in, as I have found that this round trip can cause errors when the file names are read.
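For context, a minimal sketch of that follow-up step, assuming the duplicate rows were first written to a hypothetical intermediate file duplicates.txt:

awk '{print $1}' duplicates.txt | while read -r name; do
    # process each .out file named in column 1
    echo "$name"
done

That intermediate duplicates.txt is exactly the extra file I would like to avoid.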
In this example, I should get the following file as output:
RR2.out -1752.142111 -1099486.696073 0.000000
SS2.out -1752.142111 -1099486.696073 0.000000
RR1.out -1752.141887 -1099486.555511 0.140562
SS1.out -1752.141887 -1099486.555511 0.140562
RR4.out -1752.140564 -1099485.725315 0.970758
SS4.out -1752.140564 -1099485.725315 0.970758
RR3.out -1752.140319 -1099485.571575 1.124498
SS3.out -1752.140319 -1099485.571575 1.124498
RR6.out -1752.138493 -1099484.425742 2.270331
SS6.out -1752.138493 -1099484.425742 2.270331
file Gibbs kcal rel
file Gibbs kcal rel
uniq has the -f/--skip-fields option to ignore the first n fields of each line, so skipping the first three fields makes uniq compare on column 4 alone. Combined with -D, which prints every line belonging to a group of duplicates rather than just one representative per group, that gives:

uniq -D -f3
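For example, with the sample data in data.txt (data.txt and newfile.txt are placeholder names):

uniq -D -f3 data.txt > newfile.txt

One caveat: uniq only compares adjacent lines. That is fine here, since the rows sharing a column-4 value already sit next to each other, but if they might not, sort on that column first:

sort -k4,4 data.txt | uniq -D -f3 > newfile.txt

An awk alternative that catches non-adjacent duplicates without sorting, by reading the file twice, would be roughly:

# First pass counts how often each column-4 value occurs; second pass
# prints the rows whose value occurs more than once, in original order.
awk 'NR==FNR {count[$4]++; next} count[$4] > 1' data.txt data.txt > newfile.txt

Either form can also feed your read loop directly via process substitution, so no intermediate file is needed:

while read -r name _; do
    echo "$name"    # the column-1 file name
done < <(uniq -D -f3 data.txt)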