python, list, web-scraping

How can I get my Python code to scrape the correct part of a website?


I am trying to get Python to scrape a page on the Mississippi state legislature's website. My goal is to scrape the page and write what I've scraped to a new CSV file. My command prompt doesn't give me any errors, but the only thing I scrape is a single " symbol. Here is what I have so far:

import requests
from bs4 import BeautifulSoup
import pandas as pd

list = ['http://www.legislature.ms.gov/legislation/all-measures/']

temp_dict = {}

for page in list:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')

    temp_dict = [item.text for item in soup.select('tbody')]

df = pd.DataFrame.from_dict(temp_dict, orient='index').transpose()
df.to_csv('3-New Bills.csv')

I believe the problem is with line 13:

    temp_dict = [item.text for item in soup.select('tbody')]

What should I replace 'tbody' with in this code to see all of the bills? Thank you so much for your help.


Solution

  • EDIT: Please see Sergey K's comment below for a more elegant solution.

    That table is loaded in a separate frame (an iframe), so it is not part of the HTML that requests receives for the main page; you have to scrape the frame's source for the data instead. The following code returns a dataframe with 3 columns (measure, short_title, author):

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
    
    # Send a browser-like User-Agent header with the request.
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
    
    # Fetch the XML document that holds the table data.
    r = requests.get('http://billstatus.ls.state.ms.us/2022/pdf/all_measures/allmsrs.xml', headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    
    # Collect one (measure, short title, author) tuple per msrgroup element.
    list_for_df = []
    for x in soup.select('msrgroup'):
        list_for_df.append((x.measure.text.strip(), x.shorttitle.text.strip(), x.author.text.strip()))
    
    df = pd.DataFrame(list_for_df, columns=['measure', 'short_title', 'author'])
    df  # displays the dataframe when run in a notebook
    

    Result:

         measure  short_title                                        author
    0    HB 1     Use of technology portals by those on probatio...  Bell (65th)
    1    HB 2     Youth court records; authorize judge to releas...  Bell (65th)
    2    HB 3     Sales tax; exempt retail sales of severe weath...  Bell (65th)
    3    HB 4     DPS; require to establish training component r...  Bell (65th)
    4    HB 5     Bonds; authorize issuance to assist City of Ja...  Bell (65th)
    ...  ...      ...                                                ...
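
One way to locate a frame's source URL, like the one requested above, is to list the `src` attributes of the frame tags in the original page. The sketch below does that with the standard library's `html.parser`. The sample HTML, the `FrameSourceFinder` class, and the example.com URLs are all made up for illustration; they are not the legislature site's real markup, so treat this as a starting point only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameSourceFinder(HTMLParser):
    """Hypothetical helper: collect the src of every <iframe> and <frame> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag in ("iframe", "frame"):
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page URL.
                self.sources.append(urljoin(self.base_url, src))


# Made-up page for illustration; the tbody is empty because the
# data is assumed to live in the framed document instead.
sample_html = """
<html><body>
  <h1>All measures</h1>
  <iframe src="/data/measures.xml"></iframe>
  <table><tbody></tbody></table>
</body></html>
"""

finder = FrameSourceFinder("http://example.com/legislation/")
finder.feed(sample_html)
print(finder.sources)
```

With a real page, you would feed it the text of the response instead of `sample_html` and then request whichever of the listed URLs holds the data.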
    

    You can add more data to that table, such as measurelink, authorlink, or action: whatever is available in the XML document's tags.
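
As a rough sketch of pulling extra tags and writing them to CSV, the example below takes a list of field names and builds one row per group. It swaps in the standard library's `xml.etree.ElementTree` and `csv` in place of BeautifulSoup and pandas so that it runs without extra installs. The sample XML, the `msrlist` root tag, and the `extract_rows` helper are assumptions for illustration; the real file's structure and tag case may differ, which is why tag names are compared in lowercase.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample data; not taken from the real allmsrs.xml file.
sample_xml = """
<msrlist>
  <msrgroup>
    <measure> HB 1 </measure>
    <shorttitle>Example title one</shorttitle>
    <author>Author A</author>
    <action>Referred to committee</action>
  </msrgroup>
  <msrgroup>
    <measure>HB 2</measure>
    <shorttitle>Example title two</shorttitle>
    <author>Author B</author>
  </msrgroup>
</msrlist>
"""


def extract_rows(xml_text, group_tag, fields):
    """Hypothetical helper: one dict per group element, '' for missing fields."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for group in root.iter():
        if group.tag.lower() != group_tag:
            continue
        # Map each lowercased child tag name to its stripped text.
        values = {child.tag.lower(): (child.text or "").strip() for child in group}
        rows.append({name: values.get(name, "") for name in fields})
    return rows


fields = ["measure", "shorttitle", "author", "action"]
rows = extract_rows(sample_xml, "msrgroup", fields)

# Write to an in-memory buffer here; open a file with newline="" to save to disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Unlike `x.action.text` in the BeautifulSoup version, which raises an error when a group has no such tag, this sketch fills missing fields with an empty string.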