I'm currently working on my first python project. The goal is to be able to summarise a webpage's information by searching for and printing sentences that contain a specific word from a word list I generate. For example, the following (large) list contains 'business key terms' I generated by using cewl on business websites;
business_list = ['business', 'marketing', 'market', 'price', 'management', 'terms', 'product', 'research', 'organisation', 'external', 'operations', 'organisations', 'tools', 'people', 'sales', 'growth', 'quality', 'resources', 'revenue', 'account', 'value', 'process', 'level', 'stakeholders', 'structure', 'company', 'accounts', 'development', 'personal', 'corporate', 'functions', 'products', 'activity', 'demand', 'share', 'services', 'communication', 'period', 'example', 'total', 'decision', 'companies', 'service', 'working', 'businesses', 'amount', 'number', 'scale', 'means', 'needs', 'customers', 'competition', 'brand', 'image', 'strategies', 'consumer', 'based', 'policy', 'increase', 'could', 'industry', 'manufacture', 'assets', 'social', 'sector', 'strategy', 'markets', 'information', 'benefits', 'selling', 'decisions', 'performance', 'training', 'customer', 'purchase', 'person', 'rates', 'examples', 'strategic', 'determine', 'matrix', 'focus', 'goals', 'individual', 'potential', 'managers', 'important', 'achieve', 'influence', 'impact', 'definition', 'employees', 'knowledge', 'economies', 'skills', 'buying', 'competitive', 'specific', 'ability', 'provide', 'activities', 'improve', 'productivity', 'action', 'power', 'capital', 'related', 'target', 'critical', 'stage', 'opportunities', 'section', 'system', 'review', 'effective', 'stock', 'technology', 'relationship', 'plans', 'opportunity', 'leader', 'niche', 'success', 'stages', 'manager', 'venture', 'trends', 'media', 'state', 'negotiation', 'network', 'successful', 'teams', 'offer', 'generate', 'contract', 'systems', 'manage', 'relevant', 'published', 'criteria', 'sellers', 'offers', 'seller', 'campaigns', 'economy', 'buyers', 'everyone', 'medium', 'valuable', 'model', 'enterprise', 'partnerships', 'buyer', 'compensation', 'partners', 'leaders', 'build', 'commission', 'engage', 'clients', 'partner', 'quota', 'focused', 'modern', 'career', 'executive', 'qualified', 'tactics', 'supplier', 'investors', 'entrepreneurs', 'financing', 'commercial', 'finances', 'entrepreneurial', 'entrepreneur', 'reports', 'interview', 'ansoff']
And the following program allows me to copy all the text from a URL i specify and organises it into a list, in which the elements are separated by sentence;
from bs4 import BeautifulSoup
import urllib.request as ul
url = input("Enter URL: ")
html = ul.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')
for script in soup(["script", "style"]):
script.decompose()
strips = list(soup.stripped_strings)
# Joining list to form single text
text = " ".join(strips)
text = text.lower()
# Replacing substitutes of '.'
for i in range(len(text)):
if text[i] in "?!:;":
text = text.replace(text[i], ".")
# Splitting text by sentences
sentences = text.split(".")
My current objective is for the program to print all sentences that contain one (or more) of the key terms above, however i've only been succesful with single words at a time;
# Word to search for in the text
word_search = input("Enter word: ")
word_search = word_search.lower()
sentences_with_word = []
for x in sentences:
if x.count(word_search)>0:
sentences_with_word.append(x)
# Separating sentences into separate lines
sentence_text = "\n\n".join(sentences_with_word)
print(sentence_text)
Could somebody demonstrate how this could be achieved for an entire list at once? Thanks.
Edit
As suggested by MachineLearner, here is an example of the output for a single word. If I use wikipedia's page on marketing for the URL and choose the word 'marketing' as the input for 'word_search', this is a segment of the output generated (although the entire output is almost 600 lines long);
marketing mix the marketing mix is a foundational tool used to guide decision making in marketing
the marketing mix represents the basic tools which marketers can use to bring their products or services to market
they are the foundation of managerial marketing and the marketing plan typically devotes a section to the marketing mix
the 4ps [ edit ] the traditional marketing mix refers to four broad levels of marketing decision
Use a double loop to check multiple words contained in a list:
for sentence in sentences:
for word in words:
if sentence.count(word) > 0:
output.append(sentence)
# Do not forget to break the second loop, else
# you'll end up with multiple times the same sentence
# in the output array if the sentence contains
# multiple words
break