I have a file containing the data shown below. The first comma-delimited field may be repeated any number of times, and I want to print only the lines after the sixth repetition of any value of this field
For example, there are eight fields with 1111111
as the first field, and I want to print only the seventh and eighth of these records
Input file:
1111111,aaaaaaaa,14
1111111,bbbbbbbb,14
1111111,cccccccc,14
1111111,dddddddd,14
1111111,eeeeeeee,14
1111111,ffffffff,14
1111111,gggggggg,14
1111111,hhhhhhhh,14
2222222,aaaaaaaa,14
2222222,bbbbbbbb,14
2222222,cccccccc,14
2222222,dddddddd,14
2222222,eeeeeeee,14
2222222,ffffffff,14
2222222,gggggggg,14
3333333,aaaaaaaa,14
3333333,bbbbbbbb,14
3333333,cccccccc,14
3333333,dddddddd,14
3333333,eeeeeeee,14
3333333,ffffffff,14
3333333,gggggggg,14
3333333,hhhhhhhh,14
Output:
1111111,gggggggg,14
1111111,hhhhhhhh,14
2222222,gggggggg,14
3333333,gggggggg,14
3333333,hhhhhhhh,14
What I have tried is to transponse the 2nd and 3rd fields with respect to 1st, so that I can use nawk
on the field of $7
or $8
#!/usr/bin/ksh awk -F"," '{ a[$1]; b[$1]=b[$1]","$2 c[$1]=c[$1]","$3} END{ for(i in a){ print i","b[i]","c[i]} } ' file > output.txt
i.e. you may have "1111111" items distributed randomly throughout your input:
$ awk -F, '++a[$1] > 6' input.txt
1111111,gggggggg,14
1111111,hhhhhhhh,14
2222222,gggggggg,14
3333333,gggggggg,14
3333333,hhhhhhhh,14
How does this work?
As you know, awk's -F
option sets the delimiter. If it's not a special character, there's no pressing need to quote it.
Awk scripts consist of a series of blocks of condition { action; }
. If the condition is missing, action is applied to every line. If the action is missing, it is implied to be print;
. So an awk script that consists of simply a condition will print every input line for which that condition evaluates to true.
In this case, our condition also has elements of an action. That it, it increments elements of an associative array where the keys are your first field. The increment happens regardless of whether the condition evaluates to true. Also, putting ++
ahead rather than following the variable causes the increment to happen before the evaluation rather than after it. (I'm talking about the difference between ++var
and var++
.) And if the resultant incremented array element is greater than 6, the condition evaluates to true, causing the line to print.
This is functionally equivalent to the perl
solutions in other answers, but because of the nature awk scripts is even tighter and (arguably) simpler. And of course, it's likely to be faster. (In my informal test just now, the awk script above executed more than twice as fast as an equivalent perl script from another answer, processing 250000 lines of input in 0.23s of user time vs 0.61 seconds in perl.)
i.e. all your "1111111" lines are together:
$ awk -F, '$1!=f{c=0;f=$1} ++c>6' input.txt
1111111,gggggggg,14
1111111,hhhhhhhh,14
2222222,gggggggg,14
3333333,gggggggg,14
3333333,hhhhhhhh,14
How does this work?
This has the advantage of not eating memory with an array, but is only appropriate if your goal is to match sequential sets of lines with common $1 rather than handle matching lines that may be randomly distributed throughout your input.