Tags: python, nested-lists, dna-sequence

Processing a sub-list of variable size within a larger list


I'm a biological engineering PhD student trying to teach myself Python so I can automate part of my research, but I've run into a problem with processing sub-lists within a bigger list that I can't seem to solve.

Basically, I'm trying to write a small script that will process a CSV file containing a list of plasmid sequences that I'm building using various DNA assembly methods, and then spit out the primer sequences that I need to order to build each plasmid.

Here's the scenario that I'm dealing with:

When I want to build a plasmid, I have to enter the full sequence of that plasmid into my Excel spreadsheet. I have to choose between two DNA assembly methods, called "Gibson" and "iPCR". Each "iPCR" assembly requires only one line in the list, because I just put the full sequence of the plasmid I'm trying to build in one cell, so I already know how to process those. "Gibson" assemblies, on the other hand, require me to split the full DNA sequence into smaller chunks, so I sometimes need 2-5 lines in the spreadsheet to fully describe one plasmid.

So I end up with a spreadsheet that looks something like this:

Construct.....Strategy.....Name

1.....Gibson.....P(OmpC)-cI::P(cI)-LacZ controller
1.....Gibson.....P(OmpC)-cI::P(cI)-LacZ controller
1.....Gibson.....P(OmpC)-cI::P(cI)-LacZ controller
2.....iPCR.......P(cpcG2)-K1F controller with K1F pos. feedback
3.....Gibson.....P(cpcG2)-K1F controller with swapped promoter positions
3.....Gibson.....P(cpcG2)-K1F controller with swapped promoter positions
4.....iPCR.......P(cpcG2)-K1F controller with stronger K1F RBS library

I think the list at this length is representative enough.

So the problem I'm running into is, I'd like to be able to run through the list and process the Gibsons, but I can't seem to get the code to work the way I want. Here's the code I've written so far:

#import BioPython Tools
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

#import csv tools
import csv
import sys
import os

with open('constructs-to-make.csv', 'rU') as constructs:
    construct_list = csv.reader(constructs, delimiter=',')
    construct_list.next()
    construct_number = 1
    primer_list = []
    temp_list = []
    counter = 2

    for row in construct_list:
        print('Current row is row number ' + str(counter))
        print('Current construct number is ' + str(construct_number))
        print('Current assembly type is ' + row[1])
        if row[1] == "Gibson": #here, we process the Gibson assemblies first
            print('Current construct number is: #' + row[0] + ' on row ' + str(counter) + ', which is a Gibson assembly')
##            print(int(row[0]))
##            print(row[3])
            if int(row[0]) == construct_number:
                print('Adding DNA sequence from row ' + str(counter) + ' for construct number ' + row[0])
                temp_list.append(str(row[3]))
                counter += 1
            if int(row[0]) > construct_number:
                print('Current construct number is ' + str(row[0]) + ', which is greater than the current construct number, ' + str(construct_number))
                print('Therefore, going to work on construct number ' + str(construct_number))
                for part in temp_list: #process the primer design work here
                    print('test')
##                    print(part)
                construct_number += 1
                temp_list = []
                print('Adding DNA from row #' + str(counter) + ' from construct number ' + str(construct_number))
                temp_list.append(str(row[3])) #append the sequence only, as in the branch above
                print('Next construct number is number ' + str(construct_number))
                counter += 1
##            counter += 1
        if str(row[1]) == "iPCR":
            print('Current construct number is: ' + row[0] + ' on row ' + str(counter) + ', which is an iPCR assembly.')
            #process the primer design work here
            #get first 60 nucleotides from the sequence
            sequence = row[3]
            fw_primer = sequence[:60] #slices are 0-based, so [:60] gives the first 60 nucleotides
            print('Sequence of forward primer:')
            print(fw_primer)
            last_sixty = sequence[-60:]
##            print(last_sixty)
            re_primer = Seq(last_sixty).reverse_complement()
            print('Sequence of reverse primer:')
            print(re_primer)
            #ending code: add 1 to counter and construct number
            counter += 1
            construct_number += 1
##            if int(row[0]) == construct_number:
##        else:
##            counter += 1
##            construct_number += 1
##    print(temp_list)

##        for row in temp_list:
##    print(temp_list)        
##    print(temp_list[-1])
#                fw_primer = temp_list[counter - 1].

(I know the code probably looks noob - I've never done any programming class beyond introductory Java.)

The problem with this code is that if I have n "constructs" (a.k.a. plasmids) that I'm trying to build by "Gibson" assembly, it will process the first n-1 plasmids, but not the last one. I can't think of a better way to write this code, but for the workflow I'm trying to implement, knowing how to process n "things" in a list, where each "thing" spans a variable number of rows, would come in really handy.

I'd really appreciate anybody's help here! Thanks a lot!


Solution

  • Just some general coding help with Python. If you haven't read PEP 8, do so.

    To keep the code clear, it helps to assign a named variable to each field index you reference in a record/row.

    I would add something like this for any field referenced:

    construct_idx = 0
    

    Also, I would recommend using string formatting; it's cleaner.

    So:

    print('Current construct number is: #{} on row {}, which is a Gibson assembly'.format(row[construct_idx], counter))
    

    Instead of:

    print('Current construct number is: #' + row[0] + ' on row ' + str(counter) + ', which is a Gibson assembly')
    

    If you're creating a csv reader object, naming the variable "*_list" can be misleading. Calling it "*_reader" is more intuitive.

    construct_reader = csv.reader(constructs, delimiter=',')
    

    Instead of:

    construct_list = csv.reader(constructs, delimiter=',')
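
As for the actual question (the last Gibson construct never gets processed): in the posted loop, a Gibson group is only handled when a later Gibson row with a higher construct number shows up, so the final group has nothing to trigger it. One way around this is to let itertools.groupby collect the rows of each construct before any primer design happens. The following is a minimal sketch, not a drop-in replacement. It assumes that rows for the same construct are adjacent and that the sequence sits in the fourth column (the question's code reads row[3]). The sample data, the "Sequence" header, the group_constructs helper and the plain-Python reverse_complement (a stand-in for Biopython's Seq.reverse_complement) are made up for illustration.

```python
import csv
import io
from itertools import groupby

# Named column positions. Index 3 for the sequence is an assumption taken
# from the question's code, which reads row[3].
CONSTRUCT_IDX = 0
STRATEGY_IDX = 1
SEQUENCE_IDX = 3

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Hypothetical plain-Python stand-in for Seq.reverse_complement()."""
    return sequence.translate(COMPLEMENT)[::-1]


def group_constructs(rows):
    """Yield (construct_number, strategy, [sequences]) for each construct.

    Assumes rows that belong to one construct are adjacent, as in the
    spreadsheet shown in the question.
    """
    for number, group in groupby(rows, key=lambda row: row[CONSTRUCT_IDX]):
        group = list(group)
        sequences = [row[SEQUENCE_IDX] for row in group]
        yield int(number), group[0][STRATEGY_IDX], sequences


# Made-up sample data standing in for constructs-to-make.csv.
SAMPLE_CSV = """Construct,Strategy,Name,Sequence
1,Gibson,demo A,AAAACCCC
1,Gibson,demo A,GGGGTTTT
2,iPCR,demo B,ACGTACGTAC
3,Gibson,demo C,TTTTAAAA
3,Gibson,demo C,CCCCGGGG
3,Gibson,demo C,ATATATAT
"""

construct_reader = csv.reader(io.StringIO(SAMPLE_CSV))
next(construct_reader)  # skip the header row
constructs = list(group_constructs(construct_reader))

for number, strategy, sequences in constructs:
    print('Construct #{} ({}) has {} part(s)'.format(
        number, strategy, len(sequences)))
```

Each tuple holds all the parts of one construct, so the Gibson and iPCR branches can each work on a complete list of sequences, including the last construct in the file.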