Search code examples
pythonexcelloopspandasfasta

Add list to Excel column


I have two files:

  • 1 file is a fasta file containing sequence names and their sequences.
  • the other file is an excel file, it has one 'Hcolumn' of the sequence names and an empty column next it with the header 'Hsequence' and one 'Kcolumn' of sequence names and an empty column next to it with the header 'Ksequence'.

I want to loop through both files, if a sequence name in the fasta file is also in either the 'Hcolumn' or 'Kcolumn', then add the sequence to the corresponding sequence column in the excel file, which is either the 'Hsequence' or the 'Ksequence' column. My idea here is to get the location of the vh value, and add its seq value to the cell on its right (the next row).

Update with examples

fasta file (ignoring the > sign):

headerA
AAAGGCCT
headerB
ATCCTTTG
headerC
GGGGTCCCAAT

excel sheet

Hcolumn   Hsequence     Kcolumn    Ksequence 
headerA                 headerB
headerC                 headerE
headerD                 headerF

desired outcome:

Hcolumn   Hsequence     Kcolumn    Ksequence 
headerA   AAAGGCCT      headerB    ATCCTTTG
headerC   GGGGTCCCAAT   headerE
headerD                 headerF

my new code: (my code below only deals with one column, i don't yet know how to say if it's in columnA OR in columnB)

for line in f:         
    seq = f.readline()
    for vh in data["Hcolumn"]:
        vh = vh.rstrip()
        if (title in vh): 
            #print(">"+title, seq)
            data["HSequence"]=data["Hcolumn"].apply(lambda vh: seq)

        else:
            data["HSequence"]=data["Hcolumn"].apply(lambda vh: '')
            break               
return data.to_csv('Fullseqfile.txt', sep='\t')

the sequence column returned null, can someone help thanks


Solution

  • In [1]: import pandas as pd
    
    In [2]: !cat data/question1/file.fasta
    headerA
    AAAGGCCT
    headerB
    ATCCTTTG
    headerC
    GGGGTCCCAAT
    
    In [3]: xls=pd.read_excel('file.xls')
    
    In [4]: xls
    Out[4]: 
       Hcolumn  Hsequence  Kcolumn  Ksequence
    0  headerA        NaN  headerB        NaN
    1  headerC        NaN  headerE        NaN
    2  headerD        NaN  headerF        NaN
    
    In [5]: fh = open('file.fasta')
       ...: fasta_dic={}
       ...: for line in fh:
       ...:     if line.startswith('h'):
       ...:         seq_header = line.strip('\n')
       ...:         fasta_dic[seq_header] = ''
       ...:     else:
       ...:         fasta_dic[seq_header] = line.strip('\n')
       ...:         
    
    In [6]: def fill_seq(x):
       ...:     if x in fasta_dic.keys():
       ...:         return fasta_dic[x]
       ...:     else:
       ...:         return ''
       ...:     
    
    In [7]: xls['Hsequence'] = xls['Hcolumn'].apply(fill_seq)
       ...: xls['Ksequence'] = xls['Kcolumn'].apply(fill_seq)
       ...: 
    
    In [8]: xls
    Out[8]: 
       Hcolumn    Hsequence  Kcolumn Ksequence
    0  headerA     AAAGGCCT  headerB  ATCCTTTG
    1  headerC  GGGGTCCCAAT  headerE          
    2  headerD               headerF 
    
    1. Builds a dictionary fasta_dic with the sequences names as key and the sequences as value.

    2. The function fill_seqchecks if the input x it's in the dictionary you defined before, if it finds the value it will return the sequence.

    3. You apply the function fill_seq to the H/K sequence columns using the values from H/K sequence as input.

    After this you can keep working with your dataframe or export it to a xls file.