Creating a list from a file, then checking and printing matching tokens from the list


I am trying to achieve a task where I have a file containing a sample conversation. I also have some action keywords; when one of them matches the start of a sentence, I want to print that whole sentence.

File:

Hey Salam Daniyal, can you hear me alright? hey hey Walikumassalam Joe, how are you? Yes for sure I can hear you. How is it going brother? All good bro all good, so I wanted to discuss something with you, yeah sure shoot. Okay so, I want a simple 4 page website for my business on WordPress. I need home page about us services and contact us. Can you make it? Yeah no problem send me the content and all the images I will start working on it and will show you some samples. Great, one more thing. Can we use "one page multipurpose" theme from themeForest? Yeah sure. Alright great! Sending you the images and content.

To achieve this I've written:

import re

# The with-block closes the file once it has been read
with open("FILEPATH", 'r') as textfile:
    filetext = textfile.read()
a = [i[0].strip() for i in re.findall(r"((\W\w+){1,}(?=(\,|\.|\!|\?)))", filetext)]

print(a)

Output:

['Salam Daniyal', 'can you hear me alright', 'hey hey Walikumassalam Joe', 'how are you', 'Yes for sure I can hear you', 'How is it going brother', 'All good bro all good', 'so I wanted to discuss something with you', 'yeah sure shoot', 'Okay so', 'I want a simple 4 page website for my business on WordPress', 'I need home page about us services and contact us', 'Can you make it', 'Yeah no problem send me the content and all the images I will start working on it and will show you some samples', 'Great', 'one more thing', 'theme from themeForest', 'Yeah sure', 'Alright great', 'Sending you the images and content']

Another approach is to use NLTK's sent_tokenize, which gives a more precise list of sentences.

from nltk.tokenize import sent_tokenize
  
textfile = open("FILEPATH", 'r')
filetext = textfile.read()

textfile.close()

print(sent_tokenize(filetext))

Output:

['Hey Salam Daniyal, can you hear me alright?', 'hey hey Walikumassalam Joe, how are you?', 'Yes for sure I can hear you.', 'How is it going brother?', 'All good bro all good, so I wanted to discuss something with you, yeah sure shoot.', 'Okay so, I want a simple 4 page website for my business on WordPress.', 'I need home page about us services and contact us.', 'Can you make it?', 'Yeah no problem send me the content and all the images I will start working on it and will show you some samples.', 'Great, one more thing.', 'Can we use "one page multipurpose" theme from themeForest?', 'Yeah sure.', 'Alright great!', 'Sending you the images and content.']

This one creates a list of whole sentences, which the regex doesn't. On the other hand, with the regex I can print an item of the list by its index, and with NLTK I can't.

In regex:
print(a[11])  # prints the item at index 11

Output:
I need home page about us services and contact us

In NLTK:
print(sent_tokenize(filetext[11]))

Output:
['a']

Which one is the better option for creating the list? And what approach should I take to match the action keywords? I have a list of action keywords that need to be matched against the list above, and I want to print the results: ActionKeywords = "I need", "can we", "I want a", "we need"

So, given the current action keywords, I want my code to print these sentences from the list, since they start with my action keywords:

'I need home page about us services and contact us'
'I want a simple 4 page website for my business on WordPress.'
'Can we use "one page multipurpose" theme from themeForest?'

Solution

  • If you have data like below

    a = ['Hey Salam Daniyal, can you hear me alright?', 'hey hey Walikumassalam Joe, how are you?', 'Yes for sure I can hear you.', 'How is it going brother?', 'All good bro all good, so I wanted to discuss something with you, yeah sure shoot.', 'Okay so, I want a simple 4 page website for my business on WordPress.', 'I need home page about us services and contact us.', 'Can you make it?', 'Yeah no problem send me the content and all the images I will start working on it and will show you some samples.', 'Great, one more thing.', 'Can we use "one page multipurpose" theme from themeForest?', 'Yeah sure.', 'Alright great!', 'Sending you the images and content.']
    

    and action keywords like so:

    action_keywords = ["I need" , "Can we", "I want a", "We need"]
    

    You can filter a using Python's built-in filter function, like below:

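    def extract(x):
        # Keep x if any action keyword occurs anywhere in it
        # (a case-sensitive substring test)
        for e in action_keywords:
            if e in x:
                return True
        return False
        
    # filter() is lazy, so list() is needed to materialize the matches
    ans = filter(extract, a)
    print(list(ans))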
    

    Outputs:

    ['Okay so, I want a simple 4 page website for my business on WordPress.', 'I need home page about us services and contact us.', 'Can we use "one page multipurpose" theme from themeForest?']
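
The in test above matches a keyword anywhere in a sentence, which is why the 'Okay so, ...' sentence is included. If a keyword should only count when it begins a sentence or a comma-separated clause, and case should be ignored (the question lists "can we" in lowercase), one possible variation is sketched below. The helper name starts_with_keyword and the choice to split on commas are assumptions made for illustration, not part of the original answer.

```python
import re

# A subset of the sentences from the list above.
sentences = [
    'All good bro all good, so I wanted to discuss something with you, yeah sure shoot.',
    'Okay so, I want a simple 4 page website for my business on WordPress.',
    'I need home page about us services and contact us.',
    'Can you make it?',
    'Can we use "one page multipurpose" theme from themeForest?',
]
action_keywords = ["I need", "can we", "I want a", "we need"]

# str.startswith accepts a tuple of prefixes; lowercase them so the
# comparison ignores case.
prefixes = tuple(k.lower() for k in action_keywords)


def starts_with_keyword(sentence):
    """Return True if any comma-separated clause begins with a keyword."""
    # Assumption: a clause is whatever sits between commas.
    clauses = re.split(r",\s*", sentence)
    return any(clause.lower().startswith(prefixes) for clause in clauses)


matches = [s for s in sentences if starts_with_keyword(s)]
for s in matches:
    print(s)
```

Because this is a plain prefix test, "I need" would also match a clause beginning with "I needed"; a word-boundary regex would be stricter if that matters for your data.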
    

    Please note that you can modify the logic inside extract based on the size of your data and other conditions, such as case sensitivity or matching only at the start of a sentence.
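
On the indexing problem from the question: sent_tokenize returns an ordinary Python list, so it can be indexed just like the regex result. The ['a'] output most likely came from indexing the string (filetext[11]) before tokenizing, instead of indexing the list afterwards. The sketch below illustrates the difference; the sentence list is hard-coded from the output shown in the question so that it runs without NLTK, and filetext is only an excerpt of the file.

```python
# An excerpt of the conversation stands in for the file contents.
filetext = "Hey Salam Daniyal, can you hear me alright? hey hey Walikumassalam Joe, how are you?"

# Indexing the string picks out a single character, so the tokenizer
# in the question only ever received that one character.
char = filetext[11]
print(char)

# Hard-coded stand-in for the first items of sent_tokenize(filetext)
# as shown in the question's output.
sentences = [
    'Hey Salam Daniyal, can you hear me alright?',
    'hey hey Walikumassalam Joe, how are you?',
    'Yes for sure I can hear you.',
    'How is it going brother?',
    'All good bro all good, so I wanted to discuss something with you, yeah sure shoot.',
    'Okay so, I want a simple 4 page website for my business on WordPress.',
    'I need home page about us services and contact us.',
]

# Index the list, not the string. With NLTK this would be written
# as sent_tokenize(filetext)[6].
print(sentences[6])
```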