Specific HTML parsing with Python 3 and BeautifulSoup

I am trying to parse the data in the bottom-right table on the following page, the one labelled "Current schedule submissions":

dnedesign.us.to/tables/

I was able to parse it down to:

{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"14:30";s:7:"endTime";s:5:"16:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"14:30";s:7:"endTime";s:5:"15:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"16:30";s:7:"endTime";s:5:"18:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:0:"";s:7:"endTime";s:0:"";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:0:"";s:7:"endTime";s:0:"";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"12:30";s:7:"endTime";s:5:"16:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"12:30";s:7:"endTime";s:5:"16:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"12:30";s:7:"endTime";s:5:"14:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:7:"Tuesday";s:9:"startTime";s:5:"14:30";s:7:"endTime";s:5:"16:30";}

and here is the code that performs the parsing to get the above:

try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen

from bs4 import BeautifulSoup
url = 'http://dnedesign.us.to/tables/'
page = urlopen(url)
soup = BeautifulSoup(page, "html.parser")
for rows in soup.find_all('tr'):
    for td in rows.find_all('td'):      
        if 'a:' in td.text:
            print(td.text[4:])

I am trying to parse it down to the following:

Day:Tuesday    Starttime:14:30    Endtime:16:30
Day:Sunday     Starttime:12:30    Endtime:14:30
Day:Sunday     Starttime:12:30    Endtime:16:30
Day:Sunday     Starttime:12:30    Endtime:16:30
....
....

And so on for the rest of the table.

I am using Python 3.6.9 and Httpie 0.9.8 on Linux Mint Cinnamon 19.1. This is for my graduation project, so any help would be appreciated. Thanks, Neil M.

Solution

You can use a regular expression to pull the quoted values out of each table cell, since the data is well-formed. Take care with empty strings, which the pattern below captures as empty values:

try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen

import re
from bs4 import BeautifulSoup

url = 'http://dnedesign.us.to/tables/'
soup = BeautifulSoup(urlopen(url), "html.parser")
data = []

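for rows in soup.find_all('tr'):
    for td in rows.find_all('td'):
        if 'a:' in td.text:
            # Capture every quoted string; they alternate key, value, key, value...
            cols = re.findall(r"s:\d+:\"(.*?)\"", td.text)
            data.append({cols[x]: cols[x+1] for x in range(0, len(cols), 2)})

# Reverse the rows so the order matches the requested output.
for row in data[::-1]:
    # Build "Key:value" strings, capitalizing each run of letters
    # (so startTime becomes Starttime).
    row = {
        k: re.sub(
            r"[a-zA-Z]+", lambda x: x.group().capitalize(), "%s:%s" % (k, v)
        ) for k, v in row.items()
    }
    print("    ".join([row["Day"], row["startTime"], row["endTime"]]))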

Output:

Day:Tuesday    Starttime:14:30    Endtime:16:30
Day:Sunday    Starttime:12:30    Endtime:14:30
Day:Sunday    Starttime:12:30    Endtime:16:30
Day:Sunday    Starttime:12:30    Endtime:16:30
Day:Sunday    Starttime:    Endtime:
Day:Sunday    Starttime:    Endtime:
Day:Sunday    Starttime:16:30    Endtime:18:30
Day:Sunday    Starttime:14:30    Endtime:15:30
Day:Sunday    Starttime:14:30    Endtime:16:30

The second loop only formats each row to your specification. The real work happens in the first loop, which builds the data list: one dictionary of column name/value pairs per table row.
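
To see the pairing step in isolation, here is a minimal sketch that runs offline against two cells copied from the question instead of fetching the page. The helper name parse_cell is hypothetical, and zip over alternating slices is just an equivalent way to write the dictionary comprehension used above.

```python
import re

# Two cells copied from the question: one complete, one with empty times.
full_cell = '{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:7:"Tuesday";s:9:"startTime";s:5:"14:30";s:7:"endTime";s:5:"16:30";}'
empty_cell = '{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:0:"";s:7:"endTime";s:0:"";}'


def parse_cell(text):
    """Hypothetical helper: turn one serialized cell into a dict."""
    # Every quoted string in order: key, value, key, value, ...
    tokens = re.findall(r's:\d+:"(.*?)"', text)
    # Even positions are keys, odd positions are the matching values.
    return dict(zip(tokens[::2], tokens[1::2]))


full_row = parse_cell(full_cell)
empty_row = parse_cell(empty_cell)
print(full_row)
print(empty_row)
```

Because the capture group is non-greedy, an empty value such as s:0:"" still produces a token (an empty string), so keys and values stay aligned.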

As for your request to put the items into a class, you can create a Schedule instance for each row and populate its fields instead of storing a dictionary:

try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen

import re
from bs4 import BeautifulSoup


class Schedule:
    def __init__(self, day, start, end):
        self.day = day
        self.start = start
        self.end = end


url = 'http://dnedesign.us.to/tables/'
soup = BeautifulSoup(urlopen(url), "html.parser")
schedules = []

for rows in soup.find_all('tr'):
    for td in rows.find_all('td'):      
        if 'a:' in td.text:
            cols = re.findall(r"s:\d+:\"(.*?)\"", td.text)
            data = {cols[x]: cols[x+1] for x in range(0, len(cols), 2)}
            schedules.append(Schedule(data["Day"], data["startTime"], data["endTime"]))

for schedule in schedules:
    print(schedule.day, schedule.start, schedule.end)
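
The final loop prints the bare values separated by spaces. If you want the class to produce the line format from your question, one option is to give it a __str__ method. This is only a sketch: the __str__ method is an addition of mine, not part of the code above, and the instance is built from literal values rather than from the scraped page.

```python
class Schedule:
    def __init__(self, day, start, end):
        self.day = day
        self.start = start
        self.end = end

    def __str__(self):
        # Assumed format: the labels and four-space gaps from the question.
        return "Day:%s    Starttime:%s    Endtime:%s" % (
            self.day, self.start, self.end
        )


# Literal values taken from one row of the question's table.
line = str(Schedule("Tuesday", "14:30", "16:30"))
print(line)
```

With this in place, the last loop could call print(schedule) directly.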