I have a dataset that looks like this
pdbid ch spacegroup ph uniprotacc name
5TUE A P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
5TUE B P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
5TUF A P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
5TUF B P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
5TUI A P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
5TUI B P 21 21 21 A0A059WYP6 Tetracycline destructase Tet(50)
6J3M A F 41 3 2 A0A059ZFC5 Phosphopantetheine adenylyltransferase
6JNH A F 41 3 2 A0A059ZFC5 Phosphopantetheine adenylyltransferase
6JOG A F 41 3 2 5.6 A0A059ZFC5 Phosphopantetheine adenylyltransferase
4BRZ A P 1 21 1 7 A0A067XG63 HALOALKANE DEHALOGENASE
4BRZ B P 1 21 1 7 A0A067XG63 HALOALKANE DEHALOGENASE
4C6H A P 21 21 2 A0A067XG66 HALOALKANE DEHALOGENASE
I want to count every entry, based on column 5 (uniprotacc). Output should be something like this
pdbid ch spacegroup ph uniprotacc newval name
5TUE A P 21 21 21 A0A059WYP6 1 Tetracycline destructase Tet(50)
5TUE B P 21 21 21 A0A059WYP6 2 Tetracycline destructase Tet(50)
5TUF A P 21 21 21 A0A059WYP6 3 Tetracycline destructase Tet(50)
5TUF B P 21 21 21 A0A059WYP6 4 Tetracycline destructase Tet(50)
5TUI A P 21 21 21 A0A059WYP6 5 Tetracycline destructase Tet(50)
5TUI B P 21 21 21 A0A059WYP6 6 Tetracycline destructase Tet(50)
6J3M A F 41 3 2 A0A059ZFC5 1 Phosphopantetheine adenylyltransferase
6JNH A F 41 3 2 A0A059ZFC5 2 Phosphopantetheine adenylyltransferase
6JOG A F 41 3 2 5.6 A0A059ZFC5 3 Phosphopantetheine adenylyltransferase
4BRZ A P 1 21 1 7 A0A067XG63 1 HALOALKANE DEHALOGENASE
4BRZ B P 1 21 1 7 A0A067XG63 2 HALOALKANE DEHALOGENASE
4C6H A P 21 21 2 A0A067XG66 1 HALOALKANE DEHALOGENASE
I do not know, I think maybe awk or even libreoffice-calc can do the job easily. But any help is appreciated.
File is a tsv
meaning tab separated file.
Here is an awk
script solution.
BEGIN {FS = OFS = "\t"}
NR==1 {$NF = "newval" OFS $NF}
NR>1 {$NF = ++seen[$(NF - 1)] OFS $NF}
1
awk -f script.awk input.tsv
pdbid ch spacegroup ph uniprotacc newval name
5TUE A P 21 21 21 A0A059WYP6 1 Tetracycline destructase Tet(50)
5TUE B P 21 21 21 A0A059WYP6 2 Tetracycline destructase Tet(50)
5TUF A P 21 21 21 A0A059WYP6 3 Tetracycline destructase Tet(50)
5TUF B P 21 21 21 A0A059WYP6 4 Tetracycline destructase Tet(50)
5TUI A P 21 21 21 A0A059WYP6 5 Tetracycline destructase Tet(50)
5TUI B P 21 21 21 A0A059WYP6 6 Tetracycline destructase Tet(50)
6J3M A F 41 3 2 A0A059ZFC5 1 Phosphopantetheine adenylyltransferase
6JNH A F 41 3 2 A0A059ZFC5 2 Phosphopantetheine adenylyltransferase
6JOG A F 41 3 2 5.6 A0A059ZFC5 3 Phosphopantetheine adenylyltransferase
4BRZ A P 1 21 1 7 A0A067XG63 1 HALOALKANE DEHALOGENASE
4BRZ B P 1 21 1 7 A0A067XG63 2 HALOALKANE DEHALOGENASE
4C6H A P 21 21 2 A0A067XG66 1 HALOALKANE DEHALOGENASE
BEGIN { # pre processing
FS = "\t"; # assign input field separator to "\t" tab
OFS = "\t"; # assign ouput field separator to "\t" tab
}
NR==1 { # processing first line
# $NF is the last field in input line
$NF = "newval" OFS $NF; # prefix last field with "newval" and tab
}
NR>1 { # processing non first line
# $(NF - 1) is the befroe last field in input line. Such as A0A059WYP6
# seen[$(NF - 1)] is an array couning the occurance of $(NF - 1)
# ++seen[$(NF - 1)] is an incremented array couning the occurance of $(NF - 1)
$NF = ++seen[$(NF - 1)] OFS $NF; # prefix last field with an incremented array couning the occurance of $(NF - 1) and tab
}
{print} # print every processed line