
Python Conditionals Error with code


I previously asked a question while getting started with this code. The program needs to take two or three command-line parameters:

-s: an optional parameter, or switch, indicating that the user wants the spliced gene sequence (introns removed). The user does not have to provide it (meaning they want the entire gene sequence), but if they do provide it, it must be the first parameter

input file (with the genes)

output file (which the program will create to store the FASTA output)

The input file contains lines like this:

NM_001003443 chr11 + 5925152 592608098 2 5925152,5925652, 5925404,5926898,

I then needed to write several conditionals to check that all of the input is correct. The program should exit in any of these cases:

  • The user specifies an input file name that does not end with .genes
  • The user specifies an output name that does not end with either .fa or .fasta
  • The user provides fewer than two, or more than three, parameters
  • The user's first parameter starts with a dash, but is not '-s'
  • The input file violates any of the following:

    • The first line should start with a '#' symbol
    • Every line should have exactly ten columns (columns separated by one or more spaces)
    • Column 2 (counting from 0) should be either a + or - symbol
    • Column 8 should be a tab-separated list of integers
    • Column 9 should be a tab-separated list of integers, with exactly the same integers as column 8.

I have written code for this, but there is an error somewhere in it that I have not been able to locate. Could someone help me look through my code and see where the error is? I would really appreciate it!

All the if statements are indented in my actual code, but I had trouble pasting it here...

import sys

p = '(NM_\d+)\s+(chr\d+)([(\+)|(-)])\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+,\d+,)s+(\d+,\d+,)'
e = '([(\+)|(-)])'
def getGenes(spliced, infile, outfile):
spliced = False
if '-s' in sys.argv:
    spliced = True
    sys.argv.remove('s')
    infile, outfile = sys.argv[1:]
if '.genes' not in infile:
    print('Incorrect input file type')
    sys.exit(1)
if '.fa' or '.fasta' not in outfile:
    print('Incorrect output file type')
    sys.exit(1)
if len(sys.argv[0]) < 2 or len(sys.argv[0]) > 3:
    print('Command line parameters missing')
    sys.exit(1)
if sys.argv[1] != '-s':
    print('Invalid parameter, if spliced, must be -s')
    sys.exit(1)
fp = open(infile, 'r')
wp = open(outfile, 'r')
FirstLine = fp.readline().strip()
if not FirstLine.startswith('#'):
    print ('First line does not start with #')
    sys.exit(1)
n = 1
for line in fp.readlines():
    n += 1
    cols = line.strip().split('')
    if len(cols) != 10:
        print('Lenth not equal to 10')
        sys.exit(1)
    if cols[2] != '+' or '-':
        print('Column 2 is not a + or - symbol')
        sys.exit(1)
    if cols[8] != '\t\d+':
        print('Column 8 is not a tab-separated list of integers')
        sys.exit(1)
    if cols[9] != '\t\d+' and len(cols[9]) != len(cols[8]):
        print('Column 9 in not a tab-separated list of integers with the exact same number of integers in column 8')
        sys.exit(1)

Solution

  • Remove this block:

    if sys.argv[1] != '-s':
        print('Invalid parameter, if spliced, must be -s')
        sys.exit(1)
    

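    When this check is reached, sys.argv[1] can never equal '-s', because if '-s' was present in argv, you meant to remove it a few lines earlier (note that the call as written removes 's' rather than '-s', which raises a ValueError unless 's' itself was passed as an argument):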

    if '-s' in sys.argv:
        spliced = True
        sys.argv.remove('s')
    

    Also, this line

    if len(sys.argv[0]) < 2 or len(sys.argv[0]) > 3:
    

    does not check anything useful and will trigger more often than not. sys.argv[0] is the name used to invoke the script, so the condition only passes when that name is exactly 2 or 3 characters long. That does not make sense. It looks like you wanted to check that both filenames, plus optionally the -s flag, are passed and nothing more.

    In that case, what you meant is:

    if not 3 <= len(sys.argv) <= 4: # len(sys.argv) - 1 is the number of parameters for the script, as sys.argv[0] is the scriptname itself
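
Putting the two points above together, here is a minimal sketch of how the parameter checks could be ordered so that the flag check happens before the flag is stripped. The function name parse_args and the choice to raise ValueError instead of calling sys.exit are assumptions made for illustration, not part of the original code; the suffix checks follow the requirements listed in the question.

```python
def parse_args(argv):
    """Return (spliced, infile, outfile); raise ValueError on bad input.

    Hypothetical helper: argv is expected to look like sys.argv,
    i.e. argv[0] is the script name.
    """
    args = argv[1:]  # drop the script name
    if not 2 <= len(args) <= 3:
        raise ValueError('Expected two or three parameters')

    spliced = False
    if args[0].startswith('-'):
        # Validate the switch while it is still in the list.
        if args[0] != '-s':
            raise ValueError('Invalid parameter, if spliced, must be -s')
        spliced = True
        args = args[1:]

    # Whatever remains must be exactly the two file names.
    if len(args) != 2:
        raise ValueError('Expected an input file and an output file')

    infile, outfile = args
    if not infile.endswith('.genes'):
        raise ValueError('Incorrect input file type')
    # str.endswith accepts a tuple of allowed suffixes.
    if not outfile.endswith(('.fa', '.fasta')):
        raise ValueError('Incorrect output file type')
    return spliced, infile, outfile
```

A caller could wrap parse_args(sys.argv) in a try/except ValueError block, print the message, and call sys.exit(1) to keep the exit behaviour described in the question.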
    

    If you need more help, you have to be more precise about the observed misbehaviour.

    Edit:

    if cols[8] != '\t\d+':
    

    won't work the way you'd like. It compares the value in cols[8] to the literal string '\t\d+' rather than treating it as a pattern. You might want to learn about the re module. The next if line has the same problem.
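
As a rough sketch of what a pattern-based check with the re module might look like: the helper names below are hypothetical, and the separator is an assumption. The requirements mention tab-separated lists, but the sample line shows comma-separated integers with a trailing comma (for example 5925152,5925652,), so the pattern follows the sample and would need adjusting for a different separator. The comparison between the two columns checks only that they hold the same number of integers, which is what the error message in the question describes.

```python
import re

# Assumption: integers separated by commas, with an optional trailing comma,
# as in the sample line from the question.
INT_LIST = re.compile(r'\d+(,\d+)*,?')


def parse_int_list(field):
    """Return the integers in field, or None if it is not a valid list."""
    # fullmatch requires the whole field to fit the pattern.
    if INT_LIST.fullmatch(field) is None:
        return None
    return [int(x) for x in field.split(',') if x]


def lists_match(first, second):
    """True if both fields are valid lists with the same number of integers."""
    a = parse_int_list(first)
    b = parse_int_list(second)
    return a is not None and b is not None and len(a) == len(b)
```

Inside the loop, the two comparisons against '\t\d+' could then become a single test such as `if not lists_match(cols[8], cols[9]):`.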