Search code examples
pythoniterationmatchingstring-matching

How to extract text in one file based from text in another file using Python


I'm new to Python. I searched other questions here and didn't find the exact situation I'm running into.

I need to be able to read in the contents of File-A and pull out the matching lines from File-B.

I know how to do this in PowerShell, but on large files it is very slow and I am trying to learn Python.

File-A contains just loan numbers - An 8 to 10 digit number per line File-A can contain 1 to thousands of lines

File-B can contain 1 to thousands of lines and has more data in it but each line will start with the same 8 to 10 digit loan number.

I need to read in File-A and find the matching line in File-B and write out those matching lines to a new file (all are text files)

Example contents of File-A - no spaces - 1 loan per line

272991
272992
272993

Example contents of File-B

272991~20210129~\\Serv1\LOC7\675309\867530\016618\272991.pdf~0
272992~20210129~\\Serv1\LOC7\675309\867530\016618\272992.pdf~0
272993~20210129~\\Serv1\LOC7\675309\867530\016618\272993.pdf~0

Is someone able to assist, point me in the right direction or better yet, provide a working solution?

Here is what I have tried so far but all it does is create the new PulledLoans.txt file with nothing in it

import os
# os.system('cls')
os.chdir('C:\\Temp\\')
print(os.getcwd())
# read file
loanFile = 'Loans.txt'
SourceFile = 'Orig.txt'
NewFile = 'PulledLoans.txt'

with open(loanFile, 'r') as content, open(SourceFile, 'r') as source:
    # print(content.readlines())
    for loan in content:
        # print(loan, end='')
        if loan in source:
            print('found loan')

with open(SourceFile) as dfile, open(loanFile) as ifile:
    lines = "\n".join(set(dfile.read().splitlines()) & set(ifile.read().splitlines()))
    print(lines)
    
with open(NewFile, 'w') as ofile:
    ofile.write(lines)

Solution

  • First, read everything from fileB into a dictionary where the keys are the identifier, and the values are the entire line

    file_b_data = dict()
    
    with open("fileB") as f_b:
        for line in f_b:
            line = line.strip() # Remove whitespace at start and end
            if not line:
                continue # If the line is blank, skip
    
            row = line.split("~") # Split by ~
            identifier = row[0]   # First element is the identifier
            file_b_data[identifier] = line # Set the value of the dictionary
    

    Next, read the lines from fileA and get the matching values from your dictionary

    with open("fileA") as f_a, open("outfile", "w") as f_w:
        for identifier in f_a:
            identifier = identifier.strip()
            if not identifier:
                continue
            if identifier in file_b_data: # Check that the identifier exists in previously read data
                out_line = file_b_data[identifier] + "\n" # Get the value from the dict
                f_w.write(out_line) # Write it to output file
    

    Alternatively, you could use the pandas module to read all of fielA and fileB into a dataframe, and then find the correct rows.

    import pandas as pd
    
    file_b_data = pd.read_csv("fileB.txt", sep="~", names=["identifier", "date", "path", "something"], index_col=0)
    

    Which gives us this dataframe:

    identifier date     path                                         something
    272991     20210129 \\Serv1\LOC7\675309\867530\016618\272991.pdf 0
    272992     20210129 \\Serv1\LOC7\675309\867530\016618\272992.pdf 0
    272993     20210129 \\Serv1\LOC7\675309\867530\016618\272993.pdf 0
    

    Same for fileA: (I removed 272992 to illustrate that it actually works)

    file_a_data = pd.read_csv("fileA.txt", names="identifier")
    

    giving us

       identifier
    0      272991
    1      272993
    

    Then, look for these indices in file_b_data:

    wanted_ids = file_a_data['identifiers']
    wanted_rows = file_b_data.loc[wanted_ids, :]
    wanted_rows.to_csv("out_file.txt", sep="~",header=None)
    

    which will write this file: (notice the 272992 row is missing because it wasn't in fileA)

    272991~20210129~\\Serv1\LOC7\675309\867530\016618\272991.pdf~0
    272993~20210129~\\Serv1\LOC7\675309\867530\016618\272993.pdf~0