
Finding names in a text file (Text A) using a list in another text file (Text B) and assign values next to the names in Text A (Python)


I am a newbie in Python and I need your help, please.

I have 2 different text files. Let's say they are Text_A.txt and Text_B.txt.

Text_A.txt contains a list of names as follows (they are tab-delimited):

Sequence_1 Sequence_2 Sequence_3 Sequence_4 Sequence_5 Sequence_6 Sequence_7 Sequence_8

and Text_B.txt contains a list of names as follows (one sequence name per line):

Sequence_1
Sequence_2
Sequence_3
Sequence_4
Sequence_5
Sequence_6
Sequence_7
Sequence_8
Sequence_9
Sequence_10
Sequence_11

What I would like to do is assign "1" next to the sequence names in Text_B.txt if the names are in Text_A.txt. And assign "0" next to the sequence names in Text_B.txt if the names are not in Text_A.txt.

So the expected output for the example above is something like the following (each name and its value on its own line):

Sequence_1;1
Sequence_2;1
Sequence_3;1
Sequence_4;1
Sequence_5;1
Sequence_6;1
Sequence_7;1
Sequence_8;1
Sequence_9;0
Sequence_10;0
Sequence_11;0

I would like the output in .txt format.

How should I do this using Python?

Your help is really needed here as I have more than 3000 and 6000 names in Text_A.txt and Text_B.txt files respectively.

Thank you so much!


Solution

  • You may do the following

    # read each file assuming that your sequence of strings 
    # is the first line respectively
    with open('Text_A.txt', 'r') as f:
        seqA = f.readline()
    with open('Text_B.txt', 'r') as f:
        seqB = f.readline()
    
    # remove end-of-line character
    seqA = seqA.strip('\n')
    seqB = seqB.strip('\n')
    
    # so far, seqA and seqB are strings. split them now on tabs
    seqA = seqA.split('\t')
    seqB = seqB.split('\t')
    
    # now, seqA and seqB are list of strings
    # since you want to use seqA as a lookup, you should make a set out of seqA
    seqA = set( seqA )
    
    # now iterate over each item in seqB and check if it is present in seqA
    # store result in a list
    out = []
    for item in seqB:
        is_present = 1 if item in seqA else 0
        # use ';' as separator to match the expected output
        out.append('{item};{is_present}'.format(item=item, is_present=is_present))
    
    # write result to file, one "name;value" pair per line
    with open('output.txt','w') as f:
        f.write( '\n'.join( out ) + '\n' )
    

    If your sequences contain several million entries, you should think about a more advanced approach. For a few thousand names, as in the question, the set lookup above is more than fast enough.
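
The code above reads only the first line of each file and splits it on tabs, while the question describes Text_B.txt as having one name per line. A minimal sketch that tolerates either layout is given here, assuming that the names themselves never contain spaces or tabs. The function names `load_names` and `mark_presence` and the `sep` parameter are made up for illustration.

```python
def load_names(path):
    """Return every whitespace-separated name found in the file."""
    with open(path, 'r') as f:
        # split() without arguments splits on tabs, spaces and newlines,
        # so tab-delimited and one-name-per-line files both work
        return f.read().split()


def mark_presence(path_a, path_b, path_out, sep=';'):
    """For each name in path_b write 'name;1' if it occurs in path_a, else 'name;0'."""
    # a set gives fast membership tests
    lookup = set(load_names(path_a))
    with open(path_out, 'w') as out:
        for name in load_names(path_b):
            flag = 1 if name in lookup else 0
            out.write('{}{}{}\n'.format(name, sep, flag))
```

Calling `mark_presence('Text_A.txt', 'Text_B.txt', 'output.txt')` should then write one `name;value` pair per line to output.txt, in the order the names appear in Text_B.txt.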