Python Text File Compare and Concatenate


I need help with concatenating two text files based on common strings.

My first txt file looks like this:

Hello abc
Wonders xyz
World abc

And my second txt file looks like this:

abc A
xyz B
abc C

I want my output file to be:

Hello abc A
Wonders xyz B
World abc C

My code looks something like this:

a = open("file1","r")
b = open("file2","r")
c = open("output","w")

for line in b:
  chk = line.split(" ")

  for line_new in a:
     chk_new = line_new.split(" ")

     if (chk_new[0] == chk[1]):
        c.write(chk[0])
        c.write(chk_new[0])
        c.write(chk_new[1])

But when I use this code, I get the output as:

Hello abc A
Wonders xyz B
Hello abc C

The third line does not match. What should I do to get the correct output?


Solution

  • I'm afraid you are mistaken: your code does not produce the output you say it does.

    Partly because a file object can only be read through once, unless you move the read cursor back to the beginning of the file with file.seek(0).

    Partly because the last element of each split line still ends with a newline character, so you end up comparing, for example, "abc" with "abc\n", and that comparison is never true.

    Hence the output file will be completely empty.

    So how do you solve the problem? Reading a file more than once is overly complicated, so don't do that. I suggest something along these lines:

    # open all the files simultaneously
    with open('file1', 'r') as f1, open('file2', 'r') as f2, open('output', 'w') as outf:
        lines_left = True
    
        while lines_left:
            f1_line = f1.readline().rstrip()
    
            # check if there's more to read
            if len(f1_line) != 0:
    
                f1_line_tokens = f1_line.split(' ')
    
                # the second file's line is not stripped: its trailing newline ends the output line
                f2_line_tokens = f2.readline().split(' ')
    
                if f1_line_tokens[1] == f2_line_tokens[0]:
                    outf.write(f1_line + ' ' + f2_line_tokens[1])
            else:
                lines_left = False
    

    I've tested it on your example input and it produces the correct output (where file1 is the first example file and file2 is the second). For huge files (millions of lines), this version should be considerably faster than aaron's answer. In other cases the performance difference will be negligible.
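
For anyone who wants to see the two pitfalls and the line-by-line pairing in isolation, here is a minimal sketch. It makes a few assumptions that are not in the original code: io.StringIO objects stand in for the real files so nothing touches the disk, merge_lines is a hypothetical helper name, and zip is used in place of the explicit readline loop. It also calls split() without an argument, which drops the trailing newline but collapses repeated spaces as well, so treat it as an illustration rather than a drop-in replacement.

```python
import io

# Assumption: in-memory stand-ins for file1 and file2 from the question.
FILE1 = "Hello abc\nWonders xyz\nWorld abc\n"
FILE2 = "abc A\nxyz B\nabc C\n"


def merge_lines(lines1, lines2):
    """Pair the lines of both inputs by position and join matching ones.

    Hypothetical helper: a pair is kept only when the second word of the
    first line equals the first word of the second line.
    """
    merged = []
    for line1, line2 in zip(lines1, lines2):
        tokens1 = line1.split()   # split() with no argument also drops "\n"
        tokens2 = line2.split()
        if len(tokens1) < 2 or len(tokens2) < 2:
            continue              # skip blank or malformed lines
        if tokens1[1] == tokens2[0]:
            merged.append(" ".join(tokens1 + tokens2[1:]))
    return merged


# Pitfall 1: a file object is exhausted after one full pass.
f = io.StringIO(FILE1)
first_pass = list(f)
second_pass = list(f)      # empty, because the cursor is at the end
f.seek(0)                  # rewind to the beginning
third_pass = list(f)       # full contents again

# Pitfall 2: split(" ") keeps the trailing newline on the last token.
last_token = "Hello abc\n".split(" ")[1]   # "abc\n", not "abc"

result = merge_lines(io.StringIO(FILE1), io.StringIO(FILE2))
print("\n".join(result))
```

With real files, the same helper could be called as merge_lines(f1, f2) inside a with block, since open file objects are iterated line by line just like the StringIO objects here. Because zip stops at the shorter input, any extra lines in the longer file are silently ignored.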