Search code examples
awklibreoffice-calc

Count number of rows based in a pattern on a column


I have a dataset that looks like this

pdbid   ch  spacegroup  ph  uniprotacc  name
5TUE    A   P 21 21 21      A0A059WYP6  Tetracycline destructase Tet(50)
5TUE    B   P 21 21 21      A0A059WYP6  Tetracycline destructase Tet(50)
5TUF    A   P 21 21 21      A0A059WYP6  Tetracycline destructase Tet(50)
5TUF    B   P 21 21 21      A0A059WYP6  Tetracycline destructase Tet(50)
5TUI    A   P 21 21 21      A0A059WYP6  Tetracycline destructase Tet(50)
5TUI    B   P 21 21 21      A0A059WYP6  Tetracycline destructase Tet(50)
6J3M    A   F 41 3 2        A0A059ZFC5  Phosphopantetheine adenylyltransferase
6JNH    A   F 41 3 2        A0A059ZFC5  Phosphopantetheine adenylyltransferase
6JOG    A   F 41 3 2    5.6 A0A059ZFC5  Phosphopantetheine adenylyltransferase
4BRZ    A   P 1 21 1    7   A0A067XG63  HALOALKANE DEHALOGENASE
4BRZ    B   P 1 21 1    7   A0A067XG63  HALOALKANE DEHALOGENASE
4C6H    A   P 21 21 2       A0A067XG66  HALOALKANE DEHALOGENASE

I want to count every entry, based on column 5 (uniprotacc). Output should be something like this

pdbid   ch  spacegroup  ph  uniprotacc  newval  name
5TUE    A   P 21 21 21      A0A059WYP6  1   Tetracycline destructase Tet(50)
5TUE    B   P 21 21 21      A0A059WYP6  2   Tetracycline destructase Tet(50)
5TUF    A   P 21 21 21      A0A059WYP6  3   Tetracycline destructase Tet(50)
5TUF    B   P 21 21 21      A0A059WYP6  4   Tetracycline destructase Tet(50)
5TUI    A   P 21 21 21      A0A059WYP6  5   Tetracycline destructase Tet(50)
5TUI    B   P 21 21 21      A0A059WYP6  6   Tetracycline destructase Tet(50)
6J3M    A   F 41 3 2        A0A059ZFC5  1   Phosphopantetheine adenylyltransferase
6JNH    A   F 41 3 2        A0A059ZFC5  2   Phosphopantetheine adenylyltransferase
6JOG    A   F 41 3 2    5.6 A0A059ZFC5  3   Phosphopantetheine adenylyltransferase
4BRZ    A   P 1 21 1    7   A0A067XG63  1   HALOALKANE DEHALOGENASE
4BRZ    B   P 1 21 1    7   A0A067XG63  2   HALOALKANE DEHALOGENASE
4C6H    A   P 21 21 2       A0A067XG66  1   HALOALKANE DEHALOGENASE

I do not know, I think maybe awk or even libreoffice-calc can do the job easily. But any help is appreciated.

File is a tsv meaning tab separated file.


Solution

  • Here is an awk script solution.

    script.awk

    BEGIN {FS = OFS = "\t"}
    NR==1 {$NF = "newval" OFS $NF}
    NR>1 {$NF = ++seen[$(NF - 1)] OFS $NF}
    1
    

    running:

    awk -f script.awk input.tsv
    

    output:

    pdbid   ch      spacegroup              ph      uniprotacc      newval  name
    5TUE    A       P 21 21 21              A0A059WYP6      1       Tetracycline destructase Tet(50)
    5TUE    B       P 21 21 21              A0A059WYP6      2       Tetracycline destructase Tet(50)
    5TUF    A       P 21 21 21              A0A059WYP6      3       Tetracycline destructase Tet(50)
    5TUF    B       P 21 21 21              A0A059WYP6      4       Tetracycline destructase Tet(50)
    5TUI    A       P 21 21 21              A0A059WYP6      5       Tetracycline destructase Tet(50)
    5TUI    B       P 21 21 21              A0A059WYP6      6       Tetracycline destructase Tet(50)
    6J3M    A       F 41 3 2                A0A059ZFC5      1       Phosphopantetheine adenylyltransferase
    6JNH    A       F 41 3 2                A0A059ZFC5      2       Phosphopantetheine adenylyltransferase
    6JOG    A       F 41 3 2        5.6     A0A059ZFC5      3       Phosphopantetheine adenylyltransferase
    4BRZ    A       P 1 21 1        7       A0A067XG63      1       HALOALKANE DEHALOGENASE
    4BRZ    B       P 1 21 1        7       A0A067XG63      2       HALOALKANE DEHALOGENASE
    4C6H    A       P 21 21 2               A0A067XG66      1       HALOALKANE DEHALOGENASE
    

    script explanation:

    BEGIN { # pre processing 
        FS = "\t"; # assign input field separator to "\t" tab
        OFS = "\t"; # assign ouput field separator to "\t" tab
    }
    NR==1 { # processing first line
        # $NF is the last field in input line
        $NF = "newval" OFS $NF; # prefix last field with "newval" and tab
    }
    NR>1 { # processing non first line
        # $(NF - 1) is the befroe last field in input line. Such as A0A059WYP6
        # seen[$(NF - 1)] is an array couning the occurance of $(NF - 1)
        # ++seen[$(NF - 1)] is an incremented array couning the occurance of $(NF - 1)
        $NF = ++seen[$(NF - 1)] OFS $NF; # prefix last field with an incremented array couning the occurance of $(NF - 1) and tab
    }
    {print} # print every processed line