perl, printing, no-duplicates

How to print without duplicates with perl?


My assignment is a little more in-depth than the title suggests, but the title is my main question. Here is the assignment:

Write a perl script that will grep for all occurrences of the regular expression in all regular files in the file/directory list as well as all regular files under the directories in the file/directory list. If a file is not a TEXT file then the file should first be operated on by the unix command strings (no switches) and the resulting lines searched. If the -l switch is given only the file name of the files containing the regular expression should be printed, one per line. A file name should occur a maximum of one time in this case. If the -l switch is not given then all matching lines should be printed, each preceded on the same line by the file name and a colon. An example invocation from the command line:

plgrep 'ba+d' file1 dir1 dir2 file2 file3 dir3

Here is my code:

#!/usr/bin/perl -w

use Getopt::Long;
my $fname = 0;
GetOptions ('l' => \$fname);

$pat = shift @ARGV;
while (<>) {
    if (/$pat/) {
        $fname ? print "$ARGV\n" : print "$ARGV:$_";
    }
}

So far that code does everything it's supposed to, except that it doesn't handle non-text files and it prints duplicate file names when the -l switch is used. Here is my output after entering the following on the command line: plgrep 'ba+d' file1 file2

  • file1:My dog is bad.
  • file1:My dog is very baaaaaad.
  • file2:I am bad at the guitar.
  • file2:Even though I am bad at the guitar, it is still fun to play!

Which is PERFECT! But when I use the -l switch to print only the file names, this is what I get after entering the following on the command line: plgrep -l 'ba+d' file1 file2

  • file1
  • file1
  • file2
  • file2

How do I get rid of those duplicates so it only prints:

  • file1
  • file2

I have tried:

$pat = shift @ARGV;
while (<>) {
    if (/$pat/) {
        $seen{$ARGV}++;
        $fname ? print "$ARGV\n" unless ($seen{$ARGV} > 1); : print "$ARGV:$_";
    }
}

But when I try to run it without the -l switch I only get:

  • file1:My dog is bad.
  • file2:I am bad at the guitar.

I also tried:

$fname ? print "$ARGV\n" unless ($ARGV > 1) : print "$ARGV:$_";

But I keep getting this error: syntax error at plgrep line 17, near ""$ARGV\n" unless"

If someone could help me out with my duplicates issue as well as the italicized part of the assignment (running strings on non-text files), I would truly appreciate it. I don't even know where to start on that part.


Solution

  • If you're printing only file names, you can exit the loop (using the last command) after the first match, since at that point you already know the file matches. Because the rest of the file is never scanned, the name cannot be printed more than once.

    Edited to add: To do it this way, you'll also need to stop reading the files with <> and instead take the names from @ARGV and open each file yourself.

    If you want to continue using <>, you'll need to watch $ARGV to see when it changes (indicating that you've started reading a new file) and keep a flag that records whether the current file has produced a match yet. However, a straightforward version of this approach reads every file in its entirety, which is less efficient than reading only enough of each file to know that it contains at least one match (i.e., skipping to the next file after the first match), so I would recommend switching to open instead.
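
As a rough illustration of the open-based approach, here is a minimal sketch. The helper name grep_file and the variable $names_only are made up for this example and do not come from the question. It only handles plain file names given on the command line; descending into directories and running strings on non-text files are not covered here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;

# Hypothetical helper (name chosen for this sketch).
# Returns the output lines for one file: just the file name when
# $names_only is set, otherwise every matching line prefixed with "name:".
sub grep_file {
    my ($pat, $file, $names_only) = @_;
    my @out;
    open(my $fh, '<', $file) or do {
        warn "plgrep: cannot open $file: $!\n";
        return @out;
    };
    while (my $line = <$fh>) {
        next unless $line =~ /$pat/;
        if ($names_only) {
            push @out, "$file\n";
            last;    # the first match is enough; skip the rest of this file
        }
        push @out, "$file:$line";
    }
    close($fh);
    return @out;
}

# Run the command-line part only when the file is executed directly.
unless (caller) {
    my $names_only = 0;
    GetOptions('l' => \$names_only);
    my $pat = shift @ARGV;
    print grep_file($pat, $_, $names_only) for @ARGV;
}
```

Because each file gets its own loop, last leaves only that file's loop and the script moves on to the next name in @ARGV. With -l, each name is therefore printed at most once and no %seen hash is needed.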