Tags: bash, awk, sed, large-files, uniq

Count repeats of the first column, print all lines with their count


I want:

$ cat file
ABCDEFG, XXX
ABCDEFG, YYY
ABCDEFG, ZZZ
AAAAAAA, XZY
BBBBBBB, XYZ
CCCCCCC, YXZ
DDDDDDD, YZX
CDEFGHI, ZYX
CDEFGHI, XZY

$ cat file | magic
3 ABCDEFG, XXX
3 ABCDEFG, YYY
3 ABCDEFG, ZZZ
1 AAAAAAA, XZY
1 BBBBBBB, XYZ
1 CCCCCCC, YXZ
1 DDDDDDD, YZX
2 CDEFGHI, ZYX
2 CDEFGHI, XZY

So: a pre-sorted file goes in; repeats are identified in the first column; the number of lines in each repeat group is counted; and that count is printed in front of every line of the group, including whatever is in column 2, which can be anything and is not relevant to the count. Two problems:

1) Get the effect of uniq -c, but without deleting the duplicates.

After searching online, my really "hacky" sed -e solution was this:

cat file | cut -d',' -f1 | uniq -c | sed -E -e 's/([0-9][0-9]*) (.*)/echo $(yes \1 \2 | head -\1)/;e' | sed -E 's/ ([0-9])/;\1/g' | tr ';' '\n'

I was surprised that things like head -\1 work at all, but great. Still, I feel there should be a much simpler solution to the problem.
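For the record, here is roughly what that pipeline does, stage by stage (the e flag, which executes the pattern space as a shell command, is a GNU sed extension):

cut -d',' -f1 file |    # keep only the first column
uniq -c |               # one counted line per group, e.g. "      3 ABCDEFG"
sed -E -e 's/([0-9][0-9]*) (.*)/echo $(yes \1 \2 | head -\1)/;e' |
                        # rewrite each group line into a shell command that
                        # repeats "N KEY" N times, then execute it (the e flag);
                        # $(...) joins the repeats onto one long line
sed -E 's/ ([0-9])/;\1/g' |   # put a ';' in front of every repeated count
tr ';' '\n'             # and break that long line apart again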

2) The above gets rid of the second column. I could run the counting code first and then paste the result next to the original file (sketched below), but the file is massive and I want this to be as fast as possible.
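That two-step variant would look something like this (untested sketch; the temporary counts file is exactly the extra pass over a huge file that I would like to avoid):

cut -d',' -f1 file | uniq -c |
awk '{ for(i=0;i<$1;i++) print $1 }' > counts   # expand: one count per input line
paste -d' ' counts file                         # glue counts onto the untouched lines
rm counts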

Any suggestions?


Solution

  • One in awk. Pretty tired, so not fully tested; I hope it works, good night:

    $ awk -F, '
    $1!=p {                 # first column changed: flush the buffered group
        for(i=1;i<c;i++)
            print c-1,a[i]  # c-1 lines were buffered, so c-1 is the count
        c=1
    }
    {
        a[c++]=$0           # buffer the whole line, untouched
        p=$1                # remember the current key
    }
    END {                   # flush the last group
        for(i=1;i<c;i++)
            print c-1,a[i]
    }' file
    
    

    Output:

    3 ABCDEFG, XXX
    3 ABCDEFG, YYY
    3 ABCDEFG, ZZZ
    1 AAAAAAA, XZY
    1 BBBBBBB, XYZ
    1 CCCCCCC, YXZ
    1 DDDDDDD, YZX
    2 CDEFGHI, ZYX
    2 CDEFGHI, XZY
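
    If reading the file twice is acceptable, a common two-pass awk idiom (a sketch, not benchmarked against your data) gives the same output without buffering groups, and it does not even need the input pre-sorted; the trade-off is a second pass of I/O plus one counter per distinct key in memory:

    $ awk -F, '
    NR==FNR { cnt[$1]++; next }    # pass 1: count the lines per key
    { print cnt[$1], $0 }          # pass 2: prefix each line with its count
    ' file file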