python, pandas, nlp

Save paragraphs containing keywords into txt file


I have an ongoing research project in which I need to keep only the paragraphs of each txt file that contain certain keywords. Is there a way to do that?

keywords=["cryptocurren","virtual curren","digital curren"]

txt sample

The widespread adoption of new technologies, including internet services, cryptocurrencies and payment systems, could require substantial expenditures to modify or adapt our existing products and services as we grow and develop our internet banking and mobile banking channel strategies in addition to remote connectivity solutions.

A significant natural disaster, such as a tornado, hurricane, earthquake, fire or flood, could have a material adverse impact on our ability to conduct business, and our insurance coverage may be insufficient to compensate for losses that may occur. Acts of terrorism, war, civil unrest, or pandemics could cause disruptions to our business or the economy as a whole. While we have established and regularly test disaster recovery procedures, the occurrence of any such event could have a material adverse effect on our business, operations and financial condition.

As the text above shows, only the first paragraph contains a keyword from the keyword list. Thus, I want the txt file to contain only the first paragraph.

Thank you in advance!

I hope to find a way to keep only the paragraphs of the txt file that contain the keywords.


Solution

  • You first have to split the text into paragraphs and then search each one for the keywords. I used a regex:

    import re
    
    data = """The widespread adoption of new technologies, including internet
    services, cryptocurrencies and payment systems, could require
    substantial expenditures to modify or adapt our existing products
    and services as we grow and develop our internet banking and
    mobile banking channel strategies in addition to remote 
    connectivity solutions.
    
    A significant natural disaster, such as a tornado, hurricane, 
    earthquake, fire or flood, could have a material adverse impact on 
    our ability to conduct business, and our insurance coverage may
    be insufficient to compensate for losses that may occur. Acts of 
    terrorism, war, civil unrest, or pandemics could cause disruptions
    to our business or the economy as a whole. While we have
    established and regularly test disaster recovery procedures, the 
    occurrence of any such event could have a material adverse effect 
    on our business, operations and financial condition."""
    
    keywords=["cryptocurren","virtual curren","digital curren"]
    # keywords = ["insurance"]
    # Each match is one paragraph: a run of lines with no blank line in between
    for match in re.finditer(r'(?s)((?:[^\n][\n]?)+)', data):
        print(match.start(), match.end())
        paragraph = match.group()
        # Print the paragraph only if it contains at least one keyword
        if any(word in paragraph for word in keywords):
            print(paragraph)
    

    Output:

    0 334
    The widespread adoption of new technologies, including internet
    services, cryptocurrencies and payment systems, could require
    substantial expenditures to modify or adapt our existing products
    and services as we grow and develop our internet banking and
    mobile banking channel strategies in addition to remote 
    connectivity solutions.
    
    335 905
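
The snippet above only prints the matching paragraphs, while the question asks for them to be saved back to a txt file. A minimal sketch of that last step could look like the following. It swaps the `finditer` loop for `re.split` on blank lines, and the function names (`contains_keyword`, `keep_keyword_paragraphs`, `filter_file`) are hypothetical helpers, not part of the original answer. It also assumes that paragraphs are separated by blank lines, that matching should be case-insensitive, and that the files are UTF-8 encoded; adjust these if your data differs.

```python
import re
from pathlib import Path

KEYWORDS = ["cryptocurren", "virtual curren", "digital curren"]


def contains_keyword(paragraph, keywords):
    """True if any keyword occurs in the paragraph (case-insensitive)."""
    # Collapse whitespace so a keyword split across a line break still matches.
    normalized = " ".join(paragraph.lower().split())
    return any(k.lower() in normalized for k in keywords)


def keep_keyword_paragraphs(text, keywords):
    """Return only the paragraphs of `text` that contain at least one keyword."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return [p for p in paragraphs if contains_keyword(p, keywords)]


def filter_file(src, dst, keywords):
    """Read `src`, keep the matching paragraphs, write them to `dst`."""
    text = Path(src).read_text(encoding="utf-8")
    kept = keep_keyword_paragraphs(text, keywords)
    # Join the kept paragraphs with a blank line, as in the input.
    Path(dst).write_text("\n\n".join(kept) + ("\n" if kept else ""),
                         encoding="utf-8")
    return len(kept)
```

A call such as `filter_file("report.txt", "report_filtered.txt", KEYWORDS)` (both file names are placeholders) writes the kept paragraphs to a new file and returns how many were kept. Writing to a separate output file rather than overwriting the input keeps the original text available if the keyword list changes later.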