I have two word lists - the first I call search words and the second I call key words. My goal is to count the frequency of search words occurring within 10 words of a key word. For example, if the word - acquire - is in the key words list, then I look for search-word occurrences within 10 words of acquire. "Within 10 words" means both directions: 10 words forward from the key word and 10 words backward.
Below are my search word and key word lists -
search_words = ['access control', 'Acronis', 'Adaware', 'AhnLab', 'AI Max Dev Labs', 'Alibaba Security',
'anti-adware', 'anti-keylogger', 'anti-malware', 'anti-ransomware', 'anti-rootkit', 'anti-spyware',
'anti-subversion', 'anti-tamper', 'anti-virus', 'Antiy', 'Avast', 'AVG', 'Avira', 'Baidu', 'Barracuda',
'Bitdefender', 'BullGuard', 'Carbon Black', 'Check Point', 'Cheetah Mobile', 'Cisco', 'Clario',
'Comodo', 'computer security', 'CrowdStrike', 'cryptography', 'Cybereason', 'cybersecurity',
'Cylance', 'data security', 'diagnostic program', 'Elastic', 'Emsisoft', 'encryption', 'Endgame', 'end point security',
'Ensilo', 'eScan', 'ESET', 'FireEye', 'firewall', 'Fortinet', 'F-Secure', 'G Data',
'Immunet', 'information security', 'Intego', 'intrusion detection system', 'K7', 'Kaspersky', 'log management software', 'Lookout',
'MacKeeper', 'Malwarebytes', 'McAfee', 'Microsoft', 'network security',
'NOD32', 'Norton', 'Palo Alto Networks', 'Panda Security', 'PC Matic', 'PocketBits',
'Qihoo', 'Quick Heal', 'records management', 'SafeDNS', 'Saint Security', 'sandbox', 'Sangfor',
'Securion', 'security event management', 'security information and event management',
'security information management', 'SentinelOne', 'Seqrite', 'Sophos',
'SparkCognition', 'steganography', 'Symantec', 'Tencent', 'Total AV', 'Total Defense',
'Trend Micro', 'Trustport', 'Vipre', 'Webroot', 'ZoneAlarm']
key_words = ['acquire', 'adopt', 'advance', 'agree', 'boost', 'capital resource',
'capitalize', 'change', 'commitment', 'complete', 'configure', 'design', 'develop', 'enhance', 'expand',
'expenditure', 'expense', 'implement', 'improve', 'increase', 'initiate', 'install',
'integrate', 'invest', 'lease',
'modernize', 'modify', 'move', 'obtain', 'plan', 'project', 'purchase', 'replace', 'spend',
'upgrade', 'use']
A small Example -
text_dict = {
    'ITEM7': ["Last year, from AVG we have acquired Alibaba Security. This year we are in the process "
              "of adopting Symantec. We believe these technologies will improve our access control. "
              "Moreover, we also integrated data security diagnostic program.",
              "We are planning to install end-point security, which will upgrade intrusion detection system."]
}
df = pd.DataFrame(text_dict)
My expected outcome is -
ITEM7 Frequency
Last year, from AVG we have acquired Alibaba S... 6
We are planning to install end-point security,... 2
For the first row in df, the words AVG and Alibaba Security from the search_words list appear around the word acquired, whose base form - acquire - is in the key_words list. Similarly, Symantec, access control, data security, and diagnostic program from the search_words list fall within 10 words of adopting, improve, and integrated from the key_words list. So the total number of search words is 6 (AVG + Alibaba Security + Symantec + access control + data security + diagnostic program), and therefore the value in the Frequency column of df is 6.
Please note that the words in key_words are in base form, so their variations (like adopted, adopting) should also be counted as key words.
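One lightweight way to treat inflections like adopted/adopting as hits for their base form is a simple prefix check (a minimal sketch, assuming prefix matching is acceptable for your data):

```python
# Hypothetical base-form list for illustration.
base_forms = ['acquire', 'adopt', 'improve', 'integrate', 'use']

def is_key_token(token, keys=base_forms):
    # A token counts as a key word if it starts with any base form,
    # so "adopting".startswith("adopt") is a hit.
    return any(token.startswith(k) for k in keys)

print([w for w in ['adopted', 'adopting', 'improves', 'firewall']
       if is_key_token(w)])
```

Trade-off: prefix matching also flags unrelated words that merely share a prefix (e.g. "useful" for "use"); a lemmatizer such as NLTK's WordNetLemmatizer avoids that at the cost of an extra dependency.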
You need to process each row of text by identifying occurrences of key_words and capturing a 10-word window around each one. Within this window, check for search_words, ensuring multi-word entries are matched as phrases. Count each unique search_word found within these windows only once, avoiding double-counting across the row, and store the result as a frequency count for the row, accurately reflecting the number of unique search_words near key_words.
import pandas as pd
import re
text_dict = {
'ITEM7': [
"Last year, from AVG we have acquired Alibaba Security. This year we are in the process "
"of adopting Symantec. We believe these technologies will improve our access control. "
"Moreover, we also integrated data security diagnostic program.",
"We are planning to install end-point security, which will upgrade intrusion detection system."
]
}
df = pd.DataFrame(text_dict)
search_words = [
'access control', 'Acronis', 'Adaware', 'AhnLab', 'AI Max Dev Labs', 'Alibaba Security',
'anti-adware', 'anti-keylogger', 'anti-malware', 'anti-ransomware', 'anti-rootkit', 'anti-spyware',
'anti-subversion', 'anti-tamper', 'anti-virus', 'Antiy', 'Avast', 'AVG', 'Avira', 'Baidu', 'Barracuda',
'Bitdefender', 'BullGuard', 'Carbon Black', 'Check Point', 'Cheetah Mobile', 'Cisco', 'Clario',
'Comodo', 'computer security', 'CrowdStrike', 'cryptography', 'Cybereason', 'cybersecurity',
'Cylance', 'data security', 'diagnostic program', 'Elastic', 'Emsisoft', 'encryption', 'Endgame', 'end point security',
'Ensilo', 'eScan', 'ESET', 'FireEye', 'firewall', 'Fortinet', 'F-Secure', 'G Data',
'Immunet', 'information security', 'Intego', 'intrusion detection system', 'K7', 'Kaspersky', 'log management software', 'Lookout',
'MacKeeper', 'Malwarebytes', 'McAfee', 'Microsoft', 'network security',
'NOD32', 'Norton', 'Palo Alto Networks', 'Panda Security', 'PC Matic', 'PocketBits',
'Qihoo', 'Quick Heal', 'records management', 'SafeDNS', 'Saint Security', 'sandbox', 'Sangfor',
'Securion', 'security event management', 'security information and event management',
'security information management', 'SentinelOne', 'Seqrite', 'Sophos',
'SparkCognition', 'steganography', 'Symantec', 'Tencent', 'Total AV', 'Total Defense',
'Trend Micro', 'Trustport', 'Vipre', 'Webroot', 'ZoneAlarm'
]
key_words = [
'acquire', 'adopt', 'advance', 'agree', 'boost', 'capital resource',
'capitalize', 'change', 'commitment', 'complete', 'configure', 'design', 'develop', 'enhance', 'expand',
'expenditure', 'expense', 'implement', 'improve', 'increase', 'initiate', 'install',
'integrate', 'invest', 'lease', 'modernize', 'modify', 'move', 'obtain', 'plan', 'project',
'purchase', 'replace', 'spend', 'upgrade', 'use'
]
def preprocess_text_no_lemmatization(text):
    # Lowercase word tokens only; hyphenated words like "end-point" split
    # into separate tokens, so they line up with 'end point security'.
    return re.findall(r'\b\w+\b', text.lower())

def calculate_final_frequency(row, search_phrases, key_phrases):
    tokens = preprocess_text_no_lemmatization(row)
    # Normalize hyphens to spaces so phrases like 'anti-virus' can match
    # the space-joined token stream produced by the tokenizer.
    search_phrases = [phrase.lower().replace('-', ' ') for phrase in search_phrases]
    key_phrases = [phrase.lower() for phrase in key_phrases]
    all_matches = set()
    token_len = len(tokens)
    for idx, token in enumerate(tokens):
        # Prefix check catches inflections (acquired, adopting); note that
        # multi-word key phrases such as 'capital resource' cannot match a
        # single token and would need their own phrase check.
        if any(token.startswith(key) for key in key_phrases):
            window_start = max(0, idx - 10)
            window_end = min(token_len, idx + 10 + 1)
            window_text = " ".join(tokens[window_start:window_end])
            for phrase in search_phrases:
                # Word-boundary search avoids substring false positives,
                # e.g. a short phrase firing inside a longer word.
                if re.search(r'\b' + re.escape(phrase) + r'\b', window_text):
                    all_matches.add(phrase)
    # The set guarantees each unique phrase is counted once per row.
    return len(all_matches)
df['Frequency'] = df['ITEM7'].apply(lambda x: calculate_final_frequency(x, search_words, key_words))
print(df)
Which returns
ITEM7 Frequency
0 Last year, from AVG we have acquired Alibaba S... 6
1 We are planning to install end-point security,... 2
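A note on the phrase check inside the window: a plain substring test (`phrase in window_text`) can fire inside longer words, whereas anchoring with `\b` requires a whole-word match. A minimal sketch of the safer check (the helper name `phrase_in` is hypothetical):

```python
import re

def phrase_in(phrase, text):
    # re.escape guards any special characters in the phrase; the \b anchors
    # require the phrase to start and end at word boundaries.
    return re.search(r'\b' + re.escape(phrase.lower()) + r'\b',
                     text.lower()) is not None

print(phrase_in('AVG', 'we deployed avg antivirus'))   # whole-word hit
print(phrase_in('AVG', 'avgas prices rose last year')) # no boundary match
```

Multi-word phrases work the same way, since the escaped space in the pattern matches the single spaces used to join the window tokens.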