Tags: python, regex, python-2.7, nltk, pyscripter

Python program to perform keyword matches for contents present in two files


I have used nltk to obtain a list of tokenised keywords. The output is:

['Natural', 'Language', 'Processing', 'with', 'PythonNatural', 'Language', 'Processingwith', 'PythonNatural', 'Language', 'Processing', 'with', 'Python', 'Editor', ':', 'Production', 'Editor', ':', 'Copyeditor']

I have a text file, keyword.txt, which contains the following keywords:

Processing
Editor
Pyscripter
Language
Registry
Python

How can I match the keywords obtained from tokenization against my keyword.txt file, so that a third file is created containing only the matched keywords?

This is a program I have been working on, but it creates a union of the two files instead:

import os
with open(r'D:\file3.txt', 'w') as fout:
    keywords_seen = set()
    for filename in r'D:\File1.txt', r'D:\Keyword.txt':
        with open(filename) as fin:
            for line in fin:
                keyword = line.strip()
                if keyword not in keywords_seen:
                    fout.write(line + "\n")
                    keywords_seen.add(keyword)

Solution

  • How can I match the keywords obtained from tokenization against my keyword.txt file, so that a third file is created containing only the matched keywords?

    Here's a simple solution; adjust the filenames as needed.

    # these are the tokens (a set, so duplicates collapse and lookups are fast):
    tokens = set(['Natural', 'Language', 'Processing', 'with', 'PythonNatural', 'Language', 'Processingwith', 'PythonNatural', 'Language', 'Processing', 'with', 'Python', 'Editor', ':', 'Production', 'Editor', ':', 'Copyeditor'])
    
    # create a set containing the keywords (one whitespace-separated word each)
    with open('keyword.txt', 'r') as keywords:
        keyset = set(keywords.read().split())
    
    # write only the keywords that also occur among the tokens
    with open('matches.txt', 'w') as matches:
        for word in keyset:
            if word in tokens:
                matches.write(word + '\n')
    

    This will produce a file matches.txt containing the following words (the order may vary from run to run, because sets are unordered):

    Language
    Processing
    Python
    Editor
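
If the matches should come out in the same order as they appear in the keyword file, one possible variation is sketched below. The helper name `write_matches`, the shortened token list, and the temporary file paths are assumptions made for illustration; they are not part of the original answer. The tokens stay in a set for fast lookup, while the keywords are read as a list so their file order is preserved.

```python
import os
import tempfile


def write_matches(tokens, keyword_path, output_path):
    """Write every keyword that also occurs in tokens, in keyword-file order.

    Hypothetical helper for illustration; returns the list of matched keywords.
    """
    token_set = set(tokens)  # set gives fast membership tests
    with open(keyword_path) as fin:
        keywords = fin.read().split()  # a list keeps the file order

    matched = []
    seen = set()  # guards against a keyword being listed twice in the file
    for word in keywords:
        if word in token_set and word not in seen:
            matched.append(word)
            seen.add(word)

    with open(output_path, 'w') as fout:
        for word in matched:
            fout.write(word + '\n')
    return matched


# Demo with a shortened token list and a throwaway directory (both assumed).
tokens = ['Natural', 'Language', 'Processing', 'with', 'Python',
          'Editor', ':', 'Production', 'Copyeditor']
workdir = tempfile.mkdtemp()
keyword_path = os.path.join(workdir, 'keyword.txt')
output_path = os.path.join(workdir, 'matches.txt')

with open(keyword_path, 'w') as fout:
    fout.write('Processing\nEditor\nPyscripter\nLanguage\nRegistry\nPython\n')

matched = write_matches(tokens, keyword_path, output_path)
print(matched)
```

Matching here is exact and case-sensitive, so a fused token such as 'PythonNatural' does not count as a match for 'Python'.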