
Generating a score for a domain with a keyword


I have keywords in keyword.txt and domains in domain.txt. I want to score each domain based on the keywords it contains. For example, if the keyword apple is worth 30 points, every domain that contains apple should get 30 points and every other domain should get 0. How do I do that?

My code:

score_dict = {"apple":"30",
              "bananas":"50"}

def generate_score():
    with open("keyword.txt", "r") as file:
        keywords = file.read().splitlines()

    with open("domain.txt", 'r') as g:
        score = []
        for line in g:
            if any(keyword in line.strip() for keyword in keywords):
                score.append(line)

keyword.txt:

apple #if the domain contains the word apple, it will get a +30 score

domain.txt:

apple.com
redapple.com
paple.com

The output I want:

apple.com, 30
redapple.com, 30
paple.com, 0

Solution

  • Here is an iterative approach that loops over the lines in domain.txt and checks each one for a substring match against the keys of the dictionary:

    # scores stored as integers so matched and unmatched domains share one type
    score_dict = {"apple": 30,
                  "bananas": 50}

    results = []
    with open("domain.txt", "r") as g:
        for line in g:  # loop over the lines in domain.txt
            domain = line.strip()
            if not domain:  # skip blank lines
                continue
            hit = False
            for substring, score in score_dict.items():  # loop over score_dict
                if substring in domain:
                    hit = True
                    results.append({'domain': domain, 'substring': substring, 'score': score})
                    break  # stop at the first matching keyword
            if not hit:  # no keyword matched: score 0 and no substring
                results.append({'domain': domain, 'substring': None, 'score': 0})


    Output:

    [{'domain': 'apple.com', 'substring': 'apple', 'score': 30},
     {'domain': 'redapple.com', 'substring': 'apple', 'score': 30},
     {'domain': 'paple.com', 'substring': None, 'score': 0}]
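
To get the comma-separated lines shown in the question, the same matching loop could be wrapped in a function that returns (domain, score) pairs. This is only a minimal sketch: the name score_domains is hypothetical, the scores are assumed to be integers, and an in-memory list stands in for the lines of domain.txt.

```python
def score_domains(domains, score_dict):
    """Return (domain, score) pairs; the score is 0 when no keyword matches."""
    results = []
    for raw in domains:
        domain = raw.strip()
        if not domain:
            continue  # skip blank lines
        score = 0
        for keyword, points in score_dict.items():
            if keyword in domain:
                score = points
                break  # the first matching keyword wins
        results.append((domain, score))
    return results


# Hypothetical in-memory input standing in for the lines of domain.txt.
domains = ["apple.com\n", "redapple.com\n", "paple.com\n"]
score_dict = {"apple": 30, "bananas": 50}

scored = score_domains(domains, score_dict)
lines = [f"{domain}, {score}" for domain, score in scored]
print("\n".join(lines))
```

Passing an open file object instead of the list should work the same way, since the function only iterates over its first argument.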
    

    Note that this solution can be slow for large files. In that case you could vectorize the lookup using pandas with str.extract:

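    import re
    import pandas as pd

    score_dict = {"apple": 30,
                  "bananas": 50}

    # read domain.txt as a one-column pandas DataFrame
    df = pd.read_csv('domain.txt', names=['string'])
    # one regex alternation over all keywords; re.escape protects metacharacters
    pattern = '(' + '|'.join(re.escape(k) for k in score_dict) + ')'
    # extract the first matching keyword, map it to its score, default to 0
    df['score'] = (df['string'].str.extract(pattern, expand=False)
                               .map(score_dict).fillna(0).astype(int))
    df.to_csv('output.csv')  # save output to csv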
    

    Results:

             string  score
    0     apple.com     30
    1  redapple.com     30
    2     paple.com      0
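
If pandas is not available, the same single-pattern idea can be sketched with the standard re module. This is an illustration under assumptions, not part of the answer above: score_domain is a hypothetical helper, the scores are assumed to be integers, and bananas.org is a made-up extra domain added to exercise the second keyword.

```python
import re

score_dict = {"apple": 30, "bananas": 50}

# Build one alternation pattern from the keywords; re.escape guards against
# keywords that contain regex metacharacters such as "." or "+".
pattern = re.compile("|".join(re.escape(k) for k in score_dict))


def score_domain(domain):
    """Return the score of the first keyword found in domain, or 0."""
    match = pattern.search(domain)
    return score_dict[match.group(0)] if match else 0


# Hypothetical sample domains; bananas.org is not in the original domain.txt.
domains = ["apple.com", "redapple.com", "paple.com", "bananas.org"]
scores = {d: score_domain(d) for d in domains}
print(scores)
```

Compiling the pattern once means each domain is scanned by a single regex search rather than by one substring test per keyword.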