python design-patterns sequence matching genome

Create a list of organisms based on pattern matching of a sequence to a genome


I have a dataframe with two columns: the first holds the names of organisms and the second holds their sequences, each a string of letters. I am trying to write an algorithm that checks whether an organism's sequence occurs in the string of a larger genome, which is also made up of letters. If it does, I want to add the name of the organism to a list. For example, if the flu sequence is in the genome below, I want flu to be added to the list.

import pandas as pd

dict_1={'organisms':['flu', 'cold', 'stomach bug'], 'seq_list':['HTIDIJEKODKDMRM', 
'AGGTTTEFGFGEERDDTER', 'EGHDGGEDCGRDSGRDCFD']}
df=pd.DataFrame(dict_1)

     organisms             seq_list
0          flu      HTIDIJEKODKDMRM
1         cold  AGGTTTEFGFGEERDDTER
2  stomach bug  EGHDGGEDCGRDSGRDCFD

genome='TLTPSRDMEDHTIDIJEKODKDMRM'

The first function finds the indices of any matches, where p is the organism's sequence and t is the genome. The second portion is the one I am having trouble with. I am trying to use a for loop to search each entry in the df, but when I get a match I am not sure how to reference the first column in the df to add the name to the empty list. Thank you for your help!

def naive(p, t):
    occurences = []
    for i in range(len(t) - len(p) + 1):
        match = True
        for j in range(len(p)):
            if t[i+j] != p[j]:
                match = False
                break
        if match:
            occurences.append(i)
    return occurences


Organisms_that_matched = []
for x in df:
   matches=naive(genome, x)
   if len(matches) > 0:
      #add name of organism to Organisms_that_matched list

Solution

  • I'm not sure whether you are learning about different ways to traverse a list and apply custom logic to it, but you can use a list comprehension:

    import pandas as pd
    
    dict_1 = {
        'organisms': ['flu', 'cold', 'stomach bug'],
        'seq_list':  ['HTIDIJEKODKDMRM', 'AGGTTTEFGFGEERDDTER', 'EGHDGGEDCGRDSGRDCFD']}
    df = pd.DataFrame(dict_1)
    genome = 'TLTPSRDMEDHTIDIJEKODKDMRM'
    
    organisms_that_matched = [dict_1['organisms'][index] for index, x in enumerate(dict_1['seq_list']) if x in genome]
    
    print(organisms_that_matched)
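
  • If you would rather keep the explicit loop and the naive function from the question, the sketch below shows one possible way to do it. Two things in the original loop get in the way: iterating with `for x in df` yields the column names rather than the rows, and naive expects the pattern (the organism's sequence) first and the text (the genome) second. Pairing the two columns with zip gives the name and the sequence together. The variable names name, seq, mask and organisms_via_mask are only illustrative, and the boolean-mask variant at the end is an alternative I am assuming you might find useful, not something the question asked for.

```python
import pandas as pd


def naive(p, t):
    """Return every index in text t at which pattern p starts."""
    occurrences = []
    for i in range(len(t) - len(p) + 1):
        # Compare the slice of t starting at i with the whole pattern.
        if t[i:i + len(p)] == p:
            occurrences.append(i)
    return occurrences


df = pd.DataFrame({
    'organisms': ['flu', 'cold', 'stomach bug'],
    'seq_list': ['HTIDIJEKODKDMRM', 'AGGTTTEFGFGEERDDTER', 'EGHDGGEDCGRDSGRDCFD'],
})
genome = 'TLTPSRDMEDHTIDIJEKODKDMRM'

# Walk both columns in step so each sequence stays paired with its name.
organisms_that_matched = []
for name, seq in zip(df['organisms'], df['seq_list']):
    if naive(seq, genome):  # a non-empty list means at least one match
        organisms_that_matched.append(name)

# Alternative: build a boolean mask and select the matching names.
mask = df['seq_list'].apply(lambda seq: seq in genome)
organisms_via_mask = df.loc[mask, 'organisms'].tolist()

print(organisms_that_matched)  # ['flu']
```

Both approaches give the same list for this data. The loop is worth keeping if you need the match positions that naive returns; if you only need a yes/no answer, the `in` operator is simpler.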