Tags: grep, duplicate-data, line-breaks, no-match

Using grep with a pattern file: print single and duplicate entries


Let me start off by saying I don't want to print only the duplicate lines nor do I want to remove them.

I am trying to use grep with a pattern file to parse a large data file.

The Pattern file for example may look like this:

1243
1234
1234
1234
1354
1356
1356
1677

etc. with more single and duplicate entries.

The Input data file might look like this:

aatta   1243    qqqqqq
yyyyy   1234    vvvvvv
ttttt   1555    bbbbbb
ppppp   1354    pppppp
yyyyy   3333    zzzzzz
qqqqq   1677    eeeeee
iiiii   4444    iiiiii

etc. for 27000 lines.

When I use

grep -f 'Patternfile.txt' 'Inputfile.txt' > 'Outputfile.txt'

I get an output file that resembles this:

aatta   1243    qqqqqq
yyyyy   1234    vvvvvv
ppppp   1354    pppppp

How can I get it to also report the duplicates so I end up with something like this?

aatta   1243    qqqqqq
yyyyy   1234    vvvvvv
yyyyy   1234    vvvvvv
yyyyy   1234    vvvvvv
ppppp   1354    pppppp


qqqqq   1677    eeeeee

Additionally, I would like to print a blank line whenever a query in the pattern file does not match a substring in the input file.

Thank you!


Solution

  • One solution, not with grep (which prints each matching input line only once, no matter how many patterns in the pattern file match it), but with Perl:

    With patternfile.txt and inputfile.txt containing the data from your original post, the following script.pl should do the job. I assume that the string to match is the second column; otherwise the script should be modified to use a regexp instead, but matching on the column this way is faster:

    use warnings;
    use strict;
    
    ## Check arguments.
    die qq[Usage: perl $0 <pattern-file> <input-file>\n] unless @ARGV == 2;
    
    ## Open input files.
    open my $pattern_fh, qq[<], shift @ARGV or die qq[Cannot open pattern file\n];
    open my $input_fh, qq[<], shift @ARGV or die qq[Cannot open input file\n];
    
    ## Hashes to save the patterns and the input lines keyed by their second column.
    my (%pattern, %input);
    
    ## Read each pattern and save how many times it appears in the file,
    ## together with the line number of its first appearance.
    while ( <$pattern_fh> ) {
        chomp;
        if ( exists $pattern{ $_ } ) {
            $pattern{ $_ }->[1]++;
        }
        else {
            $pattern{ $_ } = [ $., 1 ];
        }
    }
    
    ## Read the data file and save each line in another hash, keyed by its second field.
    while ( <$input_fh> ) {
        chomp;
        my @f = split;
        $input{ $f[1] } = $_;
    }
    
    ## For each pattern (in order of first appearance), search for it in the data file.
    ## If it appears, print the matching line as many times as the pattern occurred;
    ## otherwise print that many blank lines.
    for my $p ( sort { $pattern{ $a }->[0] <=> $pattern{ $b }->[0] } keys %pattern ) {
        if ( $input{ $p } ) {
            printf qq[%s\n], $input{ $p } for ( 1 .. $pattern{ $p }->[1] );
        }
        else {
             # Old behaviour: a single blank line per missing pattern.
             # printf qq[\n];
    
             # New requirement: one blank line per occurrence of the missing pattern.
             printf qq[\n] for ( 1 .. $pattern{ $p }->[1] );
        }
    }
    

    Run it like:

    perl script.pl patternfile.txt inputfile.txt
    

    And it gives the following output:

    aatta   1243    qqqqqq
    yyyyy   1234    vvvvvv
    yyyyy   1234    vvvvvv
    yyyyy   1234    vvvvvv
    ppppp   1354    pppppp
    
    
    qqqqq   1677    eeeeee
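
    For reference, the same idea can be condensed into a Perl one-liner. This is only a sketch under the same assumption (the key to match is the second column of the input file; the // operator needs Perl 5.10+), reusing the file names from above:

    perl -lane 'if (@ARGV) { $seen{$F[1]} = $_; next } print $seen{$_} // ""' inputfile.txt patternfile.txt

    While @ARGV is still non-empty, the input file is being read, so each line is stored under its second field; once the pattern file is being read, every pattern line (duplicates included) prints the stored line, or a blank line if there is no match. Unlike the script, this prints results in the exact order of the pattern file instead of grouping duplicates by first appearance; for the sample data the output is identical.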