I am new to Python and I need your help, please.
I have 2 different text files. Let's say they are Text_A.txt and Text_B.txt.
Text_A.txt contains a list of names as follows (they are tab delimited):
Sequence_1 Sequence_2 Sequence_3 Sequence_4 Sequence_5 Sequence_6 Sequence_7 Sequence_8
and Text_B.txt contains a list of names as follows (each sequence name is written on its own line):
Sequence_1
Sequence_2
Sequence_3
Sequence_4
Sequence_5
Sequence_6
Sequence_7
Sequence_8
Sequence_9
Sequence_10
Sequence_11
What I would like to do is assign "1" next to each sequence name in Text_B.txt if the name is in Text_A.txt, and assign "0" next to it if the name is not in Text_A.txt.
So the expected output using the example above is something like the following (each name and its corresponding value should be written on its own line):
Sequence_1;1
Sequence_2;1
Sequence_3;1
Sequence_4;1
Sequence_5;1
Sequence_6;1
Sequence_7;1
Sequence_8;1
Sequence_9;0
Sequence_10;0
Sequence_11;0
I would like the output in .txt format.
How should I do this using Python?
I would really appreciate your help, as I have more than 3000 and 6000 names in Text_A.txt and Text_B.txt, respectively.
Thank you so much!
You can do the following:
# read each file: Text_A.txt holds tab-delimited names on its first line,
# Text_B.txt holds one name per line
with open('Text_A.txt', 'r') as f:
    seqA = f.readline()
with open('Text_B.txt', 'r') as f:
    seqB = f.read()
# remove end-of-line character from seqA
seqA = seqA.strip('\n')
# so far, seqA and seqB are strings. split seqA on tabs, and split seqB
# on any whitespace so that it works for one name per line as well as tabs
seqA = seqA.split('\t')
seqB = seqB.split()
# now, seqA and seqB are lists of strings
# since you want to use seqA as a lookup, you should make a set out of seqA
seqA = set( seqA )
# now iterate over each item in seqB and check if it is present in seqA
# store result in a list
out = []
for item in seqB:
    is_present = 1 if item in seqA else 0
    out.append('{item};{is_present}\n'.format(item=item, is_present=is_present))
# write result to file (each entry already ends with a newline)
with open('output.txt', 'w') as f:
    f.write(''.join(out))
If your sequences contain several million entries, you should think about a more advanced approach.
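One possible direction for larger inputs, offered only as a rough sketch: keep the lookup set built from Text_A.txt in memory, but stream Text_B.txt line by line and write each result immediately instead of collecting everything in a list first. The function name `mark_presence` and its parameters are hypothetical, and the sketch assumes that names never contain whitespace.

```python
def mark_presence(path_a, path_b, path_out):
    """Write 'name;1' or 'name;0' for every name found in path_b."""
    # Build the lookup set from every whitespace-separated name in file A
    # (assumption: names themselves contain no spaces or tabs).
    with open(path_a) as f:
        known = set(f.read().split())

    # Stream file B one line at a time so that only the set built from
    # file A has to be held in memory.
    with open(path_b) as src, open(path_out, 'w') as dst:
        for line in src:
            for name in line.split():
                flag = 1 if name in known else 0
                dst.write('{};{}\n'.format(name, flag))
```

Calling `mark_presence('Text_A.txt', 'Text_B.txt', 'output.txt')` would process the files from the question. Memory use then depends mainly on the number of names in Text_A.txt rather than on the size of Text_B.txt.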