Tags: python, pandas, nlp

Counting the Frequency of Some Words within some other Key Words in Text


I have two word lists: the first I call search words and the second I call key words. My goal is to count how often search words occur within 10 words of a key word. For example, if the word "acquire" is in the key words list, I want to look for words from the search words list within 10 words of "acquire". "Within 10 words" means up to 10 words before and up to 10 words after the key word, i.e. in both directions.
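
To make the window definition concrete, here is a minimal sketch, assuming the text has already been split into a list of word tokens. The helper name `window_around` and the sample tokens are made up for illustration, and the window size is set to 2 instead of 10 to keep the output short.

```python
def window_around(tokens, idx, size=10):
    """Return the tokens from `size` before to `size` after position idx."""
    start = max(0, idx - size)               # clamp at the start of the text
    end = min(len(tokens), idx + size + 1)   # slice end is exclusive
    return tokens[start:end]

tokens = "last year from avg we have acquired alibaba security".split()
idx = tokens.index("acquired")
print(window_around(tokens, idx, size=2))
# ['we', 'have', 'acquired', 'alibaba', 'security']
```

The key word itself is part of the slice, so a full window holds up to 21 tokens; near the start or end of a text it is simply shorter.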

Below are my search word and key word lists:

search_words = ['access control', 'Acronis', 'Adaware', 'AhnLab', 'AI Max Dev Labs', 'Alibaba Security',
 'anti-adware', 'anti-keylogger', 'anti-malware', 'anti-ransomware', 'anti-rootkit', 'anti-spyware',
 'anti-subversion', 'anti-tamper', 'anti-virus', 'Antiy', 'Avast', 'AVG', 'Avira', 'Baidu', 'Barracuda',
 'Bitdefender', 'BullGuard', 'Carbon Black', 'Check Point', 'Cheetah Mobile', 'Cisco', 'Clario',
 'Comodo', 'computer security', 'CrowdStrike', 'cryptography', 'Cybereason', 'cybersecurity',
 'Cylance', 'data security', 'diagnostic program', 'Elastic', 'Emsisoft', 'encryption', 'Endgame', 'end point security', 
 'Ensilo', 'eScan', 'ESET', 'FireEye', 'firewall', 'Fortinet', 'F-Secure', 'G Data',
 'Immunet', 'information security', 'Intego', 'intrusion detection system', 'K7', 'Kaspersky', 'log management software', 'Lookout', 
 'MacKeeper', 'Malwarebytes', 'McAfee', 'Microsoft', 'network security', 
 'NOD32', 'Norton', 'Palo Alto Networks', 'Panda Security', 'PC Matic', 'PocketBits',
 'Qihoo', 'Quick Heal', 'records management', 'SafeDNS', 'Saint Security', 'sandbox', 'Sangfor',
 'Securion', 'security event management', 'security information and event management', 
 'security information management', 'SentinelOne', 'Seqrite', 'Sophos',
 'SparkCognition', 'steganography', 'Symantec', 'Tencent', 'Total AV', 'Total Defense', 
 'Trend Micro', 'Trustport', 'Vipre', 'Webroot', 'ZoneAlarm']

key_words = ['acquire', 'adopt', 'advance', 'agree', 'boost', 'capital resource',
 'capitalize', 'change', 'commitment', 'complete', 'configure', 'design', 'develop', 'enhance', 'expand',
 'expenditure', 'expense', 'implement', 'improve', 'increase', 'initiate', 'install', 
 'integrate', 'invest', 'lease',
 'modernize', 'modify', 'move', 'obtain', 'plan', 'project', 'purchase', 'replace', 'spend',
  'upgrade', 'use']

A small example:

import pandas as pd

text_dict = {
    'ITEM7': [
        "Last year, from AVG we have acquired Alibaba Security. This year we are in the process "
        "of adopting Symantec. We believe these technologies will improve our access control. "
        "Moreover, we also integrated data security diagnostic program.",
        "We are planning to install end-point security, which will upgrade intrusion detection system."
    ]
}

df = pd.DataFrame(text_dict)

My expected outcome is:

                 ITEM7                          Frequency
Last year, from AVG we have acquired Alibaba S...   6
We are planning to install end-point security,...   2

For the first row in df, the words AVG and Alibaba Security are from the search_words list and occur near the word "acquired", whose base form, acquire, is in the key_words list. Similarly, Symantec, access control, data security and diagnostic program are from the search_words list, and they are within 10 words of "adopting", "improve" and "integrated", whose base forms are in the key_words list. So the total number of search words is 6 (AVG + Alibaba Security + Symantec + access control + data security + diagnostic program), and the value in the Frequency column of df is therefore 6.

Please note that the words in key_words are mostly in base form, so their variations (such as adopted or adopting) should also be counted as key words.


Solution

  • Process each row of text by locating occurrences of key_words and taking a window of 10 words on either side of each one. Within each window, check for search_words, making sure multi-word entries are matched as whole phrases. Count each distinct search_word found in these windows once per row, so that a term appearing in several overlapping windows is not double-counted. Store the result as the Frequency value for that row, i.e. the number of distinct search_words found near key_words.

    import re
    
    import pandas as pd
    
    text_dict = {
        'ITEM7': [
            "Last year, from AVG we have acquired Alibaba Security. This year we are in the process "
            "of adopting Symantec. We believe these technologies will improve our access control. "
            "Moreover, we also integrated data security diagnostic program.",
            "We are planning to install end-point security, which will upgrade intrusion detection system."
        ]
    }
    df = pd.DataFrame(text_dict)
    
    search_words = [
        'access control', 'Acronis', 'Adaware', 'AhnLab', 'AI Max Dev Labs', 'Alibaba Security',
        'anti-adware', 'anti-keylogger', 'anti-malware', 'anti-ransomware', 'anti-rootkit', 'anti-spyware',
        'anti-subversion', 'anti-tamper', 'anti-virus', 'Antiy', 'Avast', 'AVG', 'Avira', 'Baidu', 'Barracuda',
        'Bitdefender', 'BullGuard', 'Carbon Black', 'Check Point', 'Cheetah Mobile', 'Cisco', 'Clario',
        'Comodo', 'computer security', 'CrowdStrike', 'cryptography', 'Cybereason', 'cybersecurity',
        'Cylance', 'data security', 'diagnostic program', 'Elastic', 'Emsisoft', 'encryption', 'Endgame', 'end point security',
        'Ensilo', 'eScan', 'ESET', 'FireEye', 'firewall', 'Fortinet', 'F-Secure', 'G Data',
        'Immunet', 'information security', 'Intego', 'intrusion detection system', 'K7', 'Kaspersky', 'log management software', 'Lookout',
        'MacKeeper', 'Malwarebytes', 'McAfee', 'Microsoft', 'network security',
        'NOD32', 'Norton', 'Palo Alto Networks', 'Panda Security', 'PC Matic', 'PocketBits',
        'Qihoo', 'Quick Heal', 'records management', 'SafeDNS', 'Saint Security', 'sandbox', 'Sangfor',
        'Securion', 'security event management', 'security information and event management',
        'security information management', 'SentinelOne', 'Seqrite', 'Sophos',
        'SparkCognition', 'steganography', 'Symantec', 'Tencent', 'Total AV', 'Total Defense',
        'Trend Micro', 'Trustport', 'Vipre', 'Webroot', 'ZoneAlarm'
    ]
    
    key_words = [
        'acquire', 'adopt', 'advance', 'agree', 'boost', 'capital resource',
        'capitalize', 'change', 'commitment', 'complete', 'configure', 'design', 'develop', 'enhance', 'expand',
        'expenditure', 'expense', 'implement', 'improve', 'increase', 'initiate', 'install',
        'integrate', 'invest', 'lease', 'modernize', 'modify', 'move', 'obtain', 'plan', 'project',
        'purchase', 'replace', 'spend', 'upgrade', 'use'
    ]
    
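    def preprocess_text_no_lemmatization(text):
        # Lowercase and split into word tokens. Punctuation and hyphens act as
        # separators, so "end-point" becomes ["end", "point"].
        return re.findall(r'\b\w+\b', text.lower())
    
    def calculate_final_frequency(row, search_phrases, key_phrases, window=10):
        tokens = preprocess_text_no_lemmatization(row)
        # Tokenize the phrases the same way as the text so that hyphenated
        # entries such as 'anti-virus' can still match.
        search_norm = {p: " ".join(preprocess_text_no_lemmatization(p)) for p in search_phrases}
        key_norm = [k for k in map(preprocess_text_no_lemmatization, key_phrases) if k]
    
        def key_at(idx):
            # Return the number of tokens a key occupies at idx, or 0 if none.
            # The last word of the key is matched by prefix so that simple
            # inflections ('acquired', 'adopting') count. This is a heuristic:
            # it also accepts 'useful' for 'use' and misses 'integrating'.
            for key in key_norm:
                n = len(key)
                if (idx + n <= len(tokens)
                        and tokens[idx:idx + n - 1] == key[:-1]
                        and tokens[idx + n - 1].startswith(key[-1])):
                    return n
            return 0
    
        all_matches = set()
        for idx in range(len(tokens)):
            n = key_at(idx)
            if n:
                # Up to `window` tokens on each side of the key word(s).
                window_tokens = tokens[max(0, idx - window):idx + n + window]
                # Pad with spaces so phrases match only on whole-token
                # boundaries ('cisco' must not match inside 'francisco').
                window_text = " " + " ".join(window_tokens) + " "
                for phrase, norm in search_norm.items():
                    if f" {norm} " in window_text:
                        all_matches.add(phrase)
        return len(all_matches)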
    
    df['Frequency'] = df['ITEM7'].apply(lambda x: calculate_final_frequency(x, search_words, key_words))
    
    print(df)
    

    Which returns

                                                   ITEM7  Frequency
    0  Last year, from AVG we have acquired Alibaba S...          6
    1  We are planning to install end-point security,...          2
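
The `startswith` prefix test on key words is a heuristic: it accepts unrelated words such as "useful" or "planet" and misses forms such as "integrating" or "modified". One possible alternative, sketched below under the assumption that only regular English verb inflections matter, is to generate the expected forms of each base word and test for exact membership. The helper `inflections` is hypothetical and not part of any library, and the short `key_words` list here is only a sample. Irregular forms such as "spent" would still need to be listed by hand or handled with a stemmer or lemmatizer.

```python
VOWELS = "aeiou"

def inflections(base):
    """Generate regular inflected forms of a base word (over-generates slightly)."""
    forms = {base, base + "s", base + "ed", base + "ing"}
    if base.endswith("e"):
        # acquire -> acquired, acquiring
        forms |= {base + "d", base[:-1] + "ing"}
    if len(base) > 1 and base.endswith("y") and base[-2] not in VOWELS:
        # modify -> modifies, modified
        forms |= {base[:-1] + "ies", base[:-1] + "ied"}
    if (len(base) >= 3 and base[-1] not in VOWELS + "wxy"
            and base[-2] in VOWELS and base[-3] not in VOWELS):
        # plan -> planned, planning (final consonant doubled)
        forms |= {base + base[-1] + "ed", base + base[-1] + "ing"}
    return forms

key_words = ["acquire", "integrate", "modify", "plan", "use"]  # sample only
key_forms = set().union(*(inflections(k) for k in key_words))

for token in ["acquired", "integrating", "modified", "planning", "planet", "useful"]:
    print(token, token in key_forms)
# acquired True
# integrating True
# modified True
# planning True
# planet False
# useful False
```

For single-word keys, a membership test such as `token in key_forms` could then stand in for the prefix test. Some generated strings are not real words (for example "useing"), which is harmless because they never occur in the text.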