
Python web scraping only collects 80 to 90% of intended data rows. Is there something wrong with my loop?


I'm trying to collect the 150 rows of data from the text that appears at the bottom of a given Showbuzzdaily.com web page (example), but my script only collects 132 rows.

I'm new to Python. Is there something I need to add to my loop to ensure all records are collected as intended?

To troubleshoot, I created a list (program_count) to verify the loss is happening in the code, before the CSV is generated; it shows only 132 items in the list rather than 150. Interestingly, the final row (#132) ends up being duplicated at the end of the CSV for some reason.

I've experienced similar issues scraping Google Trends (using pytrends), where only about 80% of the data I tried to scrape ended up in the CSV. So I suspect there's something wrong with my code, or that I'm overwhelming my target with requests.

Adding time.sleep(0.1) to the for and while loops in this code didn't produce different results.
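
For reference, one way to check whether the server is actually rejecting some requests (just a guess on my part) would be to inspect each response's status code right after the requests.get call, e.g.:

import requests

r = requests.get(address)  # same address the script below builds for each day
print(r.status_code)       # 200 means the page came back fine; 429 or 503 would suggest throttling
r.raise_for_status()       # raises an exception if the server rejected the request

Here's my full script: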

import time
import requests
import datetime
from bs4 import BeautifulSoup
import pandas as pd # import pandas module

from datetime import date, timedelta

# creates empty 'records' list
records = []

start_date = date(2021, 4, 12)
orig_start_date = start_date # Used for naming the CSV
end_date = date(2021, 4, 12)
delta = timedelta(days=1) # Defines delta as +1 day

print(str(start_date) + ' to ' + str(end_date)) # Visual reassurance

# begins while loop that will continue for each daily viewership report until end_date is reached
while start_date <= end_date: 
    start_weekday = start_date.strftime("%A") # define weekday name

    start_month_num = int(start_date.strftime("%m")) # define month number
    start_month_num = str(start_month_num) # convert to string so it is ready to be put into address

    start_month_day_num = int(start_date.strftime("%d")) # define day of the month
    start_month_day_num = str(start_month_day_num) # convert to string so it is ready to be put into address
    
    start_year = int(start_date.strftime("%Y")) # define year
    start_year = str(start_year) # convert to string so it is ready to be put into address

    #define address (URL)
    address = 'http://www.showbuzzdaily.com/articles/showbuzzdailys-top-150-'+start_weekday.lower()+'-cable-originals-network-finals-'+start_month_num+'-'+start_month_day_num+'-'+start_year+'.html'
    print(address) # print for visual reassurance

    # read the web page at the defined address (URL)
    r = requests.get(address)

    soup = BeautifulSoup(r.text, 'html.parser')

    # we're going to deal with results that appear within <td> tags
    results = soup.find_all('td')

    # reads the date text at the top of the web page so it can be inserted later to the CSV in the 'Date' column
    date_line = results[0].text.split(": ",1)[1] # reads the text after the colon and space (': '), which is where the date information is located
    weekday_name = date_line.split(' ')[0] # stores the weekday name
    month_name = date_line.split(' ',2)[1] # stores the month name
    day_month_num = date_line.split(' ',1)[1].split(' ')[1].split(',')[0] # stores the day of the month
    year = date_line.split(', ',1)[1] # stores the year

    # concatenates and stores the full date value
    mmmmm_d_yyyy = month_name+' '+day_month_num+', '+year

    del results[:10] # deletes the first 10 results, which contained the date information and column headers

    program_count = [] # empty list for program counting

    # (within the while loop) begins a for loop that appends data for each program in a daily viewership report
    for result in results:
        rank = results[0].text # stores P18-49 rank
        program = results[1].text # stores program name
        network = results[2].text # stores network name
        start_time = results[3].text # stores program's start time
        mins = results[4].text # stores program's duration in minutes
        p18_49 = results[5].text # stores program's P18-49 rating
        p2 = results[6].text # stores program's P2+ viewer count (in thousands)
        records.append((mmmmm_d_yyyy, weekday_name, rank, program, network, start_time, mins, p18_49, p2)) # appends the data to the 'records' list

        program_count.append(program) # adds each program name to the list.

        del results[:7] # deletes the first 7 results remaining, which contained the data for 1 row (1 program) which was just stored in 'records'
   
    print(len(program_count)) # Troubleshooting: prints the number of programs counted; should be 150.

    records.append((mmmmm_d_yyyy, weekday_name, rank, program, network, start_time, mins, p18_49, p2)) # appends the data to the 'records' list
    print(str(start_date)+' collected...') # Visual reassurance one page/day is finished being collected
    start_date += delta # at the end of while loop, advance one day


df = pd.DataFrame(records, columns=['Date','Weekday','P18-49 Rank','Program','Network','Start time','Mins','P18-49','P2+']) # Creates DataFrame using the columns listed
df.to_csv('showbuzz '+ str(orig_start_date) + ' to '+ str(end_date) + '.csv', index=False, encoding='utf-8') # generates the CSV file, using start and end dates in filename

Solution

  • It seems like you're making debugging a lot tougher on yourself by pulling all the table data (<td>) individually like that. After stepping through the code and making a couple of changes, my best guess is that the bug comes from deleting entries from results while iterating over it: each del results[:7] shifts the remaining entries under the iterator, so the loop skips ahead and runs out of items early (there's a short demonstration after the code below). The stray records.append(...) after the loop is also what duplicates row #132, since the loop variables still hold the last row's values. As a side note, you never actually use result inside the loop, which makes the loop variable pointless as written. Something like this ends up a little cleaner, and gets you all 150 results:

    results = soup.find_all('tr')
    
    # reads the date text at the top of the web page so it can be inserted later to the CSV in the 'Date' column
    date_line = results[0].select_one('td').text.split(": ", 1)[1] # Selects first td it finds under the first tr
    weekday_name = date_line.split(' ')[0]
    month_name = date_line.split(' ', 2)[1]
    day_month_num = date_line.split(' ', 1)[1].split(' ')[1].split(',')[0]
    year = date_line.split(', ', 1)[1]
    
    mmmmm_d_yyyy = month_name + ' ' + day_month_num + ', ' + year
    
    program_count = []  # empty list for program counting
    
    for result in results[2:]:
        children = result.find_all('td')
        rank = children[0].text  # stores P18-49 rank
        program = children[1].text  # stores program name
        network = children[2].text  # stores network name
        start_time = children[3].text  # stores program's start time
        mins = children[4].text  # stores program's duration in minutes
        p18_49 = children[5].text  # stores program's P18-49 rating
        p2 = children[6].text  # stores program's P2+ viewer count (in thousands)
        records.append((mmmmm_d_yyyy, weekday_name, rank, program, network, start_time, mins, p18_49, p2))
    
        program_count.append(program)  # adds each program name to the list.
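
    To see why the original loop stops at 132: the list iterator advances one position per pass while del results[:7] removes seven entries from the front, so each pass effectively consumes eight entries, and 150 rows x 7 cells = 1050 entries run out after 132 passes. A minimal demonstration:

    cells = list(range(150 * 7))  # stand-in for the 1050 <td> entries (150 rows x 7 columns)
    rows = 0
    for cell in cells:
        rows += 1
        del cells[:7]  # shrinks the list out from under the iterator
    print(rows)  # prints 132, matching the short row count in the question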
    

    You also shouldn't need a second list (program_count) to count the programs you've retrieved. Both lists grow by one entry per result no matter what, so instead of print(len(program_count)) you could have used print(len(records)), at least for a single-day run, since records accumulates across days. I'm assuming it was just for debugging purposes, though.
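
    If you do want a per-day count without the extra list, one small sketch (reusing the loop above) is to note how many rows records held before each day's page:

    day_start = len(records)  # remember where this day's rows begin
    for result in results[2:]:
        children = result.find_all('td')
        records.append((mmmmm_d_yyyy, weekday_name, *(c.text for c in children[:7])))
    print(len(records) - day_start)  # rows collected for this day; should print 150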