
How can I read, analyze, and then "un-read" and reread the beginning of an input stream in Perl?


I'm reading and processing a stream of input in Perl from a regular filehandle, which may be STDIN. However, I need to analyze a significant portion of the input in order to detect which of four different but extremely similar formats it is encoded in (different ASCII encodings of FASTQ quality scores). Once I've decided which format the data is in, I need to go back and parse those lines a second time to actually read the data.

So I need to read the first 500 lines or so of the stream twice. Or, to look at it another way, I need to read the first 500 lines, and then "put them back" so I can read them again. Since I may be reading from STDIN, I can't just seek back to the beginning. And the files are huge, so I can't just read everything into memory (although reading those first 500 lines into memory is ok). What's the best way to do this?

Alternatively, can I duplicate the input stream somehow?

Edit: Wait a minute. I just realized that I can't process the input as one big stream anymore, because I have to detect each file's format independently. So I can't use ARGV. The rest of the question still stands, though.


Solution

  • As you said, if the filehandle might be STDIN, you can't use seek to rewind it. But it's still pretty simple. I wouldn't bother with a module:

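    my @lines;

    # Buffer up to the first 500 lines in memory.
    while (<$file>) {
      push @lines, $_;
      last if @lines == 500;
    }

    ... # examine @lines to determine format

    # Replay the buffered lines first, then continue reading from the handle.
    while (defined( $_ = @lines ? shift @lines : <$file> )) {
      ... # process line
    }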
    

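    Remember that you need an explicit defined in this case, because the special case that adds an implicit defined to some while loops doesn't apply to this more complex expression. Without it, a final line consisting of just "0" with no trailing newline would evaluate as false and end the loop early.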
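
    If the same buffering is needed for several files, the approach above could be wrapped in a small helper. The following is only a sketch under stated assumptions: the name peek_lines, the limit of 3, and the in-memory test data are hypothetical and chosen for illustration; the format-detection step is left out because it depends on the data.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): read up to $limit
# lines from $fh, and return them together with an iterator that replays
# those lines before continuing with the rest of $fh.
sub peek_lines {
    my ($fh, $limit) = @_;

    my @buffer;
    while (defined(my $line = <$fh>)) {
        push @buffer, $line;
        last if @buffer >= $limit;
    }

    # Replay from a copy so the caller can inspect @buffer without
    # affecting what the iterator returns.
    my @replay = @buffer;
    my $next = sub {
        # scalar() forces a single-line read even if called in list context.
        return @replay ? shift @replay : scalar <$fh>;
    };

    return (\@buffer, $next);
}

# Demonstration on an in-memory filehandle, which stands in for STDIN here.
my $data = join '', map { "line $_\n" } 1 .. 5;
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my ($head, $next) = peek_lines($fh, 3);

# ... examine @$head here to determine the format ...

# Second pass: the explicit defined() is still required.
my @seen;
while (defined(my $line = $next->())) {
    push @seen, $line;
}
```

    Because the iterator is a closure over both the replay buffer and the filehandle, each input file gets its own independent buffer, which fits the case where every file's format has to be detected separately.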