Opening filehandles for use with TabularMSA in skbio


Hey there skbio team.

I need to allow either DNA or RNA MSAs. When I do the following and leave out the alignment_fh.close(), skbio reads the 'non header' line in the except block, which makes me think I need to close the file first so that the read starts at the beginning. But if I add alignment_fh.close(), I cannot get it to read the file at all. I've tried opening it via a variety of methods, but I believe TabularMSA.read() should allow files OR file handles. Thoughts? Thank you!

try:
    aln = skbio.TabularMSA.read(alignment_fh, constructor=skbio.RNA)
except:
    alignment_fh.close()
    aln = skbio.TabularMSA.read(alignment_fh, constructor=skbio.DNA)

Solution

  • I've tried opening it via a variety of methods, but I believe TabularMSA.read() should allow files OR file handles.

    You're correct: scikit-bio generally supports reading and writing files using open file handles or file paths.

    The issue you're running into is that your first TabularMSA.read() call reads the entire contents of the open file handle, so that when the second TabularMSA.read() call is hit within the except block, the file pointer is already at the end of the open file handle -- this is why you're getting an error message hinting that the file is empty.

    This behavior is intentional; when scikit-bio is given an open file handle, it will read from or write to the file but won't attempt to manage the handle's file pointer (that type of management is up to the caller of the code).
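The file-pointer behavior described above is ordinary Python file-object behavior, so it can be seen without scikit-bio. The following is a minimal sketch that uses an in-memory io.StringIO as a stand-in for an open alignment file; the FASTA-like contents and the name fake_file are made up for illustration.

```python
import io

# In-memory stand-in for an open alignment file (contents are illustrative only).
fake_file = io.StringIO(">seq1\nACGU\n>seq2\nACGA\n")

first_read = fake_file.read()    # consumes everything; pointer is now at the end
second_read = fake_file.read()   # nothing left to read, so this is an empty string

fake_file.seek(0)                # move the pointer back to the start
third_read = fake_file.read()    # the full contents are available again

print(repr(second_read))
print(third_read == first_read)
```

The second read comes back empty for the same reason the second TabularMSA.read() call sees what looks like an empty file: nothing has moved the pointer back to the start.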

    When you instead give scikit-bio a file path (i.e. a string containing the path to a file on disk or accessible at some URI), it handles opening and closing the file handle for you, so that's often the easier way to go.

    You can use file paths or file handles to accomplish your goal. In the following examples, suppose aln_filepath is a str pointing to your alignment file on disk (e.g. "/path/to/my/alignment.fasta").

    • With file paths: You can simply pass the file path to both TabularMSA.read() calls; no open() or close() calls are necessary on your part.

      try:
          aln = skbio.TabularMSA.read(aln_filepath, constructor=skbio.RNA)
      except ValueError:
          aln = skbio.TabularMSA.read(aln_filepath, constructor=skbio.DNA)
      
    • With file handles: You'll need to open a file handle and reset the file pointer within your except block before reading a second time.

      with open(aln_filepath, 'r') as aln_filehandle:
          try:
              aln = skbio.TabularMSA.read(aln_filehandle, constructor=skbio.RNA)
          except ValueError:
              aln_filehandle.seek(0)  # reset file pointer to beginning of file
              aln = skbio.TabularMSA.read(aln_filehandle, constructor=skbio.DNA)
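
If more than two sequence types might need to be tried, the rewind-and-retry pattern from the file-handle example can be factored into a small helper. This is only a sketch: read_with_fallback and the two stand-in readers are hypothetical names, not part of scikit-bio, and the stand-ins merely mimic a reader that raises ValueError on unexpected characters.

```python
import io


def read_with_fallback(filehandle, readers):
    """Try each reader in order, rewinding the handle before every attempt.

    Hypothetical helper (not part of scikit-bio). Each reader is a callable
    that takes an open file handle and raises ValueError if it cannot parse it.
    """
    last_error = None
    for reader in readers:
        filehandle.seek(0)  # reset file pointer so each reader starts at the top
        try:
            return reader(filehandle)
        except ValueError as err:
            last_error = err  # remember the failure and try the next reader
    raise ValueError("none of the readers could parse the input") from last_error


def _make_strict_reader(label, alphabet):
    """Build a stand-in reader that accepts only characters in `alphabet`."""
    def reader(filehandle):
        sequences = [line.strip() for line in filehandle if not line.startswith(">")]
        for seq in sequences:
            if set(seq) - set(alphabet):
                raise ValueError(f"invalid {label} characters in {seq!r}")
        return label, sequences
    return reader


read_as_rna = _make_strict_reader("RNA", "ACGU-")
read_as_dna = _make_strict_reader("DNA", "ACGT-")

handle = io.StringIO(">seq1\nACGT\n>seq2\nAC-T\n")
kind, seqs = read_with_fallback(handle, [read_as_rna, read_as_dna])
print(kind, seqs)
```

With scikit-bio, each reader could be a thin wrapper such as lambda fh: skbio.TabularMSA.read(fh, constructor=skbio.RNA), using the same call as in the examples above.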
      

    Note: In both the file-path and file-handle examples, I've used except ValueError instead of a "catch-all" except statement. I recommend catching specific error types (e.g. ValueError) rather than any exception, because the code could fail in ways you aren't expecting and a catch-all would hide those failures. For example, with a "catch-all" except statement, users won't be able to interrupt the first read with Ctrl-C, because the resulting KeyboardInterrupt will be caught and the program will carry on into the except block.
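
The difference between a bare except and except ValueError can be checked directly. The sketch below raises KeyboardInterrupt by hand to simulate Ctrl-C; the function names are hypothetical and exist only for this demonstration.

```python
def run_with_bare_except(action):
    """Run `action`, swallowing anything it raises (the pattern to avoid)."""
    try:
        action()
    except:  # noqa: E722 - deliberately broad, for demonstration
        return "swallowed"
    return "ok"


def run_with_value_error_only(action):
    """Run `action`, handling only ValueError; other exceptions propagate."""
    try:
        action()
    except ValueError:
        return "handled ValueError"
    return "ok"


def simulate_ctrl_c():
    raise KeyboardInterrupt  # stands in for the user pressing Ctrl-C


print(run_with_bare_except(simulate_ctrl_c))

try:
    run_with_value_error_only(simulate_ctrl_c)
    interrupt_propagated = False
except KeyboardInterrupt:
    interrupt_propagated = True
print(interrupt_propagated)
```

The bare except absorbs the simulated interrupt, while the ValueError-only handler lets it propagate to the caller, which is what allows Ctrl-C to stop a program.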