Search code examples
perlunixawknawk

How to print lines from a file that have repeated more than six times


I have a file containing the data shown below. The first comma-delimited field may be repeated any number of times, and I want to print only the lines after the sixth repetition of any value of this field

For example, there are eight fields with 1111111 as the first field, and I want to print only the seventh and eighth of these records

Input file:

1111111,aaaaaaaa,14
1111111,bbbbbbbb,14
1111111,cccccccc,14
1111111,dddddddd,14
1111111,eeeeeeee,14
1111111,ffffffff,14
1111111,gggggggg,14
1111111,hhhhhhhh,14
2222222,aaaaaaaa,14
2222222,bbbbbbbb,14
2222222,cccccccc,14
2222222,dddddddd,14
2222222,eeeeeeee,14
2222222,ffffffff,14
2222222,gggggggg,14
3333333,aaaaaaaa,14
3333333,bbbbbbbb,14
3333333,cccccccc,14
3333333,dddddddd,14
3333333,eeeeeeee,14
3333333,ffffffff,14
3333333,gggggggg,14
3333333,hhhhhhhh,14

Output:

1111111,gggggggg,14
1111111,hhhhhhhh,14
2222222,gggggggg,14
3333333,gggggggg,14
3333333,hhhhhhhh,14

What I have tried is to transponse the 2nd and 3rd fields with respect to 1st, so that I can use nawk on the field of $7 or $8

#!/usr/bin/ksh awk -F"," '{ a[$1]; b[$1]=b[$1]","$2 c[$1]=c[$1]","$3} END{ for(i in a){ print i","b[i]","c[i]} } ' file > output.txt

Solution

  • If your records are unordered

    i.e. you may have "1111111" items distributed randomly throughout your input:

    $ awk -F, '++a[$1] > 6' input.txt
    1111111,gggggggg,14
    1111111,hhhhhhhh,14
    2222222,gggggggg,14
    3333333,gggggggg,14
    3333333,hhhhhhhh,14
    

    How does this work?

    As you know, awk's -F option sets the delimiter. If it's not a special character, there's no pressing need to quote it.

    Awk scripts consist of a series of blocks of condition { action; }. If the condition is missing, action is applied to every line. If the action is missing, it is implied to be print;. So an awk script that consists of simply a condition will print every input line for which that condition evaluates to true.

    In this case, our condition also has elements of an action. That it, it increments elements of an associative array where the keys are your first field. The increment happens regardless of whether the condition evaluates to true. Also, putting ++ ahead rather than following the variable causes the increment to happen before the evaluation rather than after it. (I'm talking about the difference between ++var and var++.) And if the resultant incremented array element is greater than 6, the condition evaluates to true, causing the line to print.

    This is functionally equivalent to the perl solutions in other answers, but because of the nature awk scripts is even tighter and (arguably) simpler. And of course, it's likely to be faster. (In my informal test just now, the awk script above executed more than twice as fast as an equivalent perl script from another answer, processing 250000 lines of input in 0.23s of user time vs 0.61 seconds in perl.)

    If your records are ordered

    i.e. all your "1111111" lines are together:

    $ awk -F, '$1!=f{c=0;f=$1} ++c>6' input.txt
    1111111,gggggggg,14
    1111111,hhhhhhhh,14
    2222222,gggggggg,14
    3333333,gggggggg,14
    3333333,hhhhhhhh,14
    

    How does this work?

    • If we're on a different $1 than last time (which is also true on the first line), we reset our counter and save $1 to a variable for future comparisons.
    • Then we increment the counter and print the line (implicitly) if the counter goes above 6.

    This has the advantage of not eating memory with an array, but is only appropriate if your goal is to match sequential sets of lines with common $1 rather than handle matching lines that may be randomly distributed throughout your input.