Extract all lines from a text file that partial match keywords listed in another file


I have exhausted online searches trying to find out how to do this.

I have a tab-delimited file, searchfile.txt, with two columns and more than 200 rows. A sample:

A(H1N1)/SWINE/COTES-DARMOR/388/2009 X?  4.28144245
A(H1N2)/SWINE/SCOTLAND/410440/1994 X?   7.25878836
A(H1)/SWINE/ENGLAND/117316/1986 X?  3.305392038
A(H1)/SWINE/ENGLAND/438207/1994 X?  7.66078717

I have another file, keywords.txt, containing keywords that partially match the IDs in searchfile.txt:

ENGLAND/117316    
DARMOR/388   
438207

I want to extract all lines from searchfile.txt that contain any of the keywords in keywords.txt.

Using solutions from similar questions, I tried:

grep -F -f keywords.txt searchfile.txt > selected.txt 

grep -f keywords.txt searchfile.txt

awk 'FNR==NR {a[$0];next} ($NF in a)' keywords.txt searchfile.txt > result.txt

I also got part of the way there with this Python script:

infile = r"/path/to/searchfile.txt"

results = []
to_keep = ["ENGLAND/117316",
            "DARMOR/388",
            "438207"]

with open(infile) as f:
    f = f.readlines()

for line in f:
    for phrase in to_keep:
        if phrase in line:
            results.append(line)
            break

print(results)

And it outputs this in the terminal window:

[
    'A(H1N1)/SWINE/COTES-DARMOR/388/2009 X?\t4.28144245\n',   
    'A(H1)/SWINE/ENGLAND/117316/1986 X?\t3.305392038\n', 
    'A(H1)/SWINE/ENGLAND/438207/1994 X?\t7.66078717\n'
]

Is there a way to

a) modify this script to read the keywords from a file like keywords.txt and write the matching lines to another file? (My Python skills are not up to that.)

OR

b) use grep, awk, sed, etc. to do this?

I think the problem is that my keywords are not whole, separate words and only partially match the IDs in searchfile.txt.

Grateful for any help! Thanks.


Solution

  • This is fairly straightforward in Python. Assuming your files are named keywords.txt and input.txt and you want to write the matches to output.txt:

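    # 1
    with open('keywords.txt') as k:
        # Strip surrounding whitespace and skip blank lines: a trailing space
        # would prevent a match, and an empty keyword would match every line.
        keywords = [key.strip() for key in k if key.strip()]

    # 2
    with open('input.txt') as f, open('output.txt', 'w') as o:
        for line in f:
            if any(key in line for key in keywords):
                o.write(line)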
    

    This reads the keywords file and stores its lines in a list (#1). It then opens the input and output files, loops through the input line by line, and writes a line to the output whenever it contains any of the keywords (#2).
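
The same idea can be wrapped in reusable functions. This is only a sketch under a few assumptions: the function names (load_keywords, matching_lines, filter_file) and the id_only option are made up for illustration, and the file names come from the question. Keywords are stripped of surrounding whitespace because the sample keywords.txt in the question appears to have trailing spaces after some entries; if those spaces are really in the file, they would stop an otherwise valid keyword from matching. The optional id_only flag restricts the search to the text before the first tab, so that a purely numeric keyword such as 438207 cannot match the value column by accident.

```python
def load_keywords(path):
    """Return the non-empty, whitespace-stripped lines of the keywords file."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def matching_lines(lines, keywords, id_only=False):
    """Yield each line that contains at least one keyword as a substring.

    With id_only=True, only the text before the first tab is searched.
    """
    for line in lines:
        target = line.split("\t", 1)[0] if id_only else line
        if any(key in target for key in keywords):
            yield line


def filter_file(keywords_path, input_path, output_path, id_only=False):
    """Copy matching lines from input_path to output_path; return the count."""
    keywords = load_keywords(keywords_path)
    count = 0
    with open(input_path) as src, open(output_path, "w") as dst:
        for line in matching_lines(src, keywords, id_only):
            dst.write(line)
            count += 1
    return count
```

With the files from the question, a call might look like filter_file("keywords.txt", "searchfile.txt", "selected.txt"). Lines are written in the order they appear in the input, and each matching line is written once even if it contains several keywords.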