I have three text files:
fileA:
13 abc
123 def
234 ghi
1234 jkl
12 mno
fileB:
12 abc
12 def
34 qwe
43 rty
45 mno
fileC:
12 abc
34 sdg
43 yui
54 poi
54 def
I would like to find which values in the 2nd column match between the files. The following code works if the 2nd column is already sorted, but if it is not sorted, how do I sort the 2nd column and compare the files?
fileA = open("A.txt",'r')
fileB = open("B.txt",'r')
fileC = open("C.txt",'r')
listA1 = []
for line1 in fileA:
    listA = line1.split('\t')
    listA1.append(listA)
listB1 = []
for line1 in fileB:
    listB = line1.split('\t')
    listB1.append(listB)
listC1 = []
for line1 in fileC:
    listC = line1.split('\t')
    listC1.append(listC)
for key1 in listA1:
    for key2 in listB1:
        for key3 in listC1:
            if key1[1] == key2[1] and key2[1] == key3[1] and key3[1] == key1[1]:
                print "Common between three files:",key1[1]
print "Common between file1 and file2 files:"
for key1 in listA1:
    for key2 in listB1:
        if key1[1] == key2[1]:
            print key1[1]
print "Common between file1 and file3 files:"
for key1 in listA1:
    for key2 in listC1:
        if key1[1] == key2[1]:
            print key1[1]
If you just want to sort A1, B1, and C1 by the second column, this is easy:
import operator
listA1.sort(key=operator.itemgetter(1))
If you don't understand itemgetter, this is the same:
listA1.sort(key=lambda element: element[1])
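As a quick check that the two sort keys behave the same, here is a minimal sketch. It is written for Python 3, inlines the rows of fileA rather than reading A.txt, and uses `sorted` (which returns a new list) instead of the in-place `list.sort`; all of those are assumptions made to keep the example self-contained.

```python
import operator

# Rows from fileA, already split into [number, label] pairs (inlined for illustration).
rows = [["13", "abc"], ["123", "def"], ["234", "ghi"], ["1234", "jkl"], ["12", "mno"]]

# Deliberately scramble the order so the sort has something to do.
shuffled = [rows[4], rows[1], rows[0], rows[3], rows[2]]

by_itemgetter = sorted(shuffled, key=operator.itemgetter(1))
by_lambda = sorted(shuffled, key=lambda element: element[1])

# Both keys pick out the second column, so the results are identical.
print(by_itemgetter == by_lambda)
print([row[1] for row in by_itemgetter])
```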
However, I think a better solution is to just use a set:
setA1 = set(element[1] for element in listA1)
setB1 = set(element[1] for element in listB1)
setC1 = set(element[1] for element in listC1)
Or, more simply, don't build the lists in the first place; do this:
setA1 = set()
for line1 in fileA:
    # strip the trailing newline so the last column compares cleanly
    listA = line1.rstrip('\n').split('\t')
    setA1.add(listA[1])
Either way:
print "Common between file1 and file2 files:"
for key in setA1 & setB1:
    print key
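To make the set approach concrete, here is a small sketch using the second-column values from the three sample files in the question. The values are typed inline rather than read from disk, and the code is written for Python 3; both are assumptions for the sake of a runnable example.

```python
# Second-column values from the three sample files (inlined for illustration).
setA1 = {"abc", "def", "ghi", "jkl", "mno"}
setB1 = {"abc", "def", "qwe", "rty", "mno"}
setC1 = {"abc", "sdg", "yui", "poi", "def"}

common_ab = setA1 & setB1
common_ac = setA1 & setC1
common_abc = setA1 & setB1 & setC1

# Sets are unordered, so sort before printing to get stable output.
print(sorted(common_ab))   # ['abc', 'def', 'mno']
print(sorted(common_ac))   # ['abc', 'def']
print(sorted(common_abc))  # ['abc', 'def']
```

Note that no sorting of the input is needed at all: set membership does not depend on the order in which the lines were read.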
To simplify it further, you probably want to refactor the repeated stuff into functions first:
def read_file(path):
    with open(path) as f:
        result = set()
        for line in f:
            # strip the trailing newline before splitting on tabs
            columns = line.rstrip('\n').split('\t')
            result.add(columns[1])
        return result
setA1 = read_file('A.txt')
setB1 = read_file('B.txt')
setC1 = read_file('C.txt')
And then you can find further opportunities. For example:
import csv
def read_file(path):
    with open(path) as f:
        # the files are tab-separated; csv.reader defaults to commas
        return set(row[1] for row in csv.reader(f, delimiter='\t'))
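One detail worth spelling out for the csv-based reader: `csv.reader` splits on commas unless told otherwise, so tab-separated input needs `delimiter='\t'`. The sketch below assumes Python 3, writes a hypothetical temporary file in place of A.txt, and skips rows with fewer than two fields; that last guard is my own assumption, not part of the code above.

```python
import csv
import os
import tempfile

def read_file(path):
    # newline='' is what the csv module documentation recommends in Python 3.
    with open(path, newline='') as f:
        reader = csv.reader(f, delimiter='\t')
        # Skip blank or short rows instead of raising IndexError (an assumption).
        return set(row[1] for row in reader if len(row) > 1)

# Hypothetical temporary file standing in for A.txt.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("13\tabc\n123\tdef\n12\tmno")  # last line has no trailing newline
    tmp_path = tmp.name

values = read_file(tmp_path)
os.remove(tmp_path)
print(sorted(values))
```

Because the csv module removes line endings itself, the last column comes back without a trailing newline whether or not the file ends with one.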
As John Clements points out, you don't even really need all three of them to be sets, just A1, so you could instead do this:
import csv
def read_file(path):
    with open(path) as f:
        # the files are tab-separated; csv.reader defaults to commas
        for row in csv.reader(f, delimiter='\t'):
            yield row[1]
setA1 = set(read_file('A.txt'))
iterB1 = read_file('B.txt')
iterC1 = read_file('C.txt')
The only other change you need is that you have to call intersection instead of using the & operator, so:
for key in setA1.intersection(iterB1):
    print key
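The difference between the method and the operator can be seen in a short sketch. It assumes Python 3 and uses a hypothetical helper, `second_column`, that works on in-memory lines instead of files so that it runs without A.txt and B.txt.

```python
def second_column(lines):
    # Hypothetical helper: yield the second tab-separated field of each line.
    for line in lines:
        columns = line.rstrip('\n').split('\t')
        if len(columns) > 1:
            yield columns[1]

linesA = ["13\tabc\n", "123\tdef\n", "234\tghi\n", "1234\tjkl\n", "12\tmno\n"]
linesB = ["12\tabc\n", "12\tdef\n", "34\tqwe\n", "43\trty\n", "45\tmno\n"]

setA1 = set(second_column(linesA))
iterB1 = second_column(linesB)  # a generator, not a set

# set.intersection accepts any iterable, including a generator.
common = setA1.intersection(iterB1)
print(sorted(common))

# The & operator, by contrast, requires both operands to be sets.
try:
    setA1 & second_column(linesB)
except TypeError as exc:
    print("TypeError:", exc)
```

Keep in mind that a generator can only be iterated once, so each comparison needs a fresh call to the reading function.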
I'm not sure this last change is actually an improvement. But in Python 3.3, where the only thing you need to do is change the return set(…) into yield from (…), I probably would do it this way. (Even if the files are huge and have tons of duplicates, so there was a performance cost to it, I'd just stick unique_everseen from the itertools recipes around the read_file calls.)
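For reference, here is roughly what that combination could look like in Python 3.3 or later. This is a sketch under several assumptions: `unique_everseen` is a simplified version of the recipe in the itertools documentation (without its `key` argument), `read_columns` is a hypothetical variant of read_file that takes an open file object, and the input is in-memory text with duplicates added for illustration.

```python
import csv
import io

def unique_everseen(iterable):
    # Simplified from the itertools documentation recipe (no key argument):
    # yield each element the first time it appears, preserving order.
    seen = set()
    for element in iterable:
        if element not in seen:
            seen.add(element)
            yield element

def read_columns(f):
    # Hypothetical variant of read_file that takes an open file object,
    # so the example can run on in-memory text instead of a real file.
    yield from (row[1] for row in csv.reader(f, delimiter='\t') if len(row) > 1)

# Made-up tab-separated input containing duplicate second-column values.
textB = io.StringIO("12\tabc\n12\tdef\n34\tabc\n43\tdef\n45\tmno\n")

uniqueB = list(unique_everseen(read_columns(textB)))
print(uniqueB)  # ['abc', 'def', 'mno']
```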