python, beautifulsoup, python-requests, screen-scraping, cpu-word

Code for counting word frequency on a website using Python doesn't output the right frequency


I'd like to count how often each word in a list appears on a specific web page. However, my code doesn't return the same counts that a manual Ctrl+F search in the browser gives. What am I doing wrong?

Here's my code:

import pandas as pd
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
import re

url='https://www.gov.uk/government/publications/specialist-quality-mark-tender-2016'
fr=[] 
wanted = ['tender','2020','date']    
for word in wanted:
    a=requests.get(url).text.count(word)
    dic={'phrase':word,
          'frequency':a,              
            }          
    fr.append(dic)  
    print('Frequency of',word, 'is:',a)
data=pd.DataFrame(fr)    

Solution

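  • Refer to the comments on your question to see why requests can be a poor fit for counting words in the visible text of a web page (what you actually see in the browser). In short, requests.get(url).text is the raw HTML source, so count also matches occurrences inside tags, attributes, scripts and metadata, and it never sees content that the browser adds by running JavaScript.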

    If you want to do this with Selenium, you could try:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    
    url = 'https://www.gov.uk/government/publications/specialist-quality-mark-tender-2016'
    
    # Selenium 4.6+ locates a matching chromedriver automatically
    driver = webdriver.Chrome()
    driver.get(url)
    # Rendered, visible text of the page; lowercased because str.count is case-sensitive
    text = driver.find_element(By.TAG_NAME, 'body').text.lower()
    driver.quit()
    
    fr = []
    wanted = ['tender', '2020', 'date']
    for word in wanted:
        freq = text.count(word)
        dic = {'phrase': word, 'frequency': freq}
        fr.append(dic)
        print('Frequency of', word, 'is:', freq)
    

    This gave me the same counts as a Ctrl+F search in the browser.

    You can try BeautifulSoup too (which you're already importing, by the way) by modifying your code a little:

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://www.gov.uk/government/publications/specialist-quality-mark-tender-2016'
    fr = []
    wanted = ['tender', '2020', 'date']
    html = requests.get(url).text  # fetch the page once, not once per word
    soup = BeautifulSoup(html, 'html.parser')
    # Text of all elements with the markup stripped; lowercased because str.count is case-sensitive
    text = soup.get_text().lower()
    for word in wanted:
        freq = text.count(word)
        dic = {'phrase': word, 'frequency': freq}
        fr.append(dic)
        print('Frequency of', word, 'is:', freq)
    

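    That gave me the same results, except for the word tender, which according to BeautifulSoup appears 12 times rather than 11. One likely reason is that get_text() returns the text of every element, including ones the browser does not display in the page body, such as <title>. Test both approaches for yourself and see which suits you.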
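
To see the difference between counting in the raw HTML and counting in the visible text without touching the network, here is a minimal sketch. It makes several assumptions: SAMPLE_HTML is a made-up page (not the GOV.UK page from the question), the names HIDDEN_TAGS, VisibleText, count_in_raw and count_in_visible are hypothetical, and it uses the standard library's html.parser in place of BeautifulSoup so that it runs with no extra installs.

```python
from html.parser import HTMLParser

# Hypothetical sample page for illustration only.
SAMPLE_HTML = """<html>
<head>
  <title>Tender 2016</title>
  <meta name="description" content="Tender documents">
  <script>var page = "tender";</script>
</head>
<body>
  <h1>Tender 2016</h1>
  <p class="tender-note">The tender closes on this date.</p>
</body>
</html>"""

# Assumed set of elements whose text is not shown in the page body.
HIDDEN_TAGS = {"script", "style", "title", "noscript"}


class VisibleText(HTMLParser):
    """Collect text nodes that are not inside a hidden element."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every hidden element.
        if self.hidden_depth == 0:
            self.parts.append(data)


def count_in_raw(html, word):
    """Case-insensitive substring count over the raw HTML source."""
    return html.lower().count(word.lower())


def count_in_visible(html, word):
    """Case-insensitive substring count over the collected visible text."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts).lower().count(word.lower())


for word in ["tender", "2016", "date"]:
    print("Frequency of", word, "raw:", count_in_raw(SAMPLE_HTML, word),
          "visible:", count_in_visible(SAMPLE_HTML, word))
```

On this sample, "tender" is counted 6 times in the raw HTML (title, meta attribute, script, class name, heading and paragraph) but only 2 times in the visible text. With BeautifulSoup, a comparable step would be to call decompose() on the script, style and title tags before calling get_text(). This is only an approximation of what a browser shows, since it ignores CSS-hidden elements and JavaScript-rendered content.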