python html python-requests

Is there a way to retrieve search results from a public website in Python?


Looking at something like this:

https://disclosures-clerk.house.gov/FinancialDisclosure

Using the 'Search' function in the box on the left, I'd like to select a year in the 'Filing Year' dropdown and then use Python to retrieve the PDFs linked from the results.

For instance, for 2024, I'd like to retrieve the PDFs linked from the 140 entries returned. Ideally, I'd also be able to filter the results based on 'Filing'. Is there any way to do this?


Solution

  • Try:

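    import requests
    from bs4 import BeautifulSoup

    # Form fields sent by the search box; an empty string leaves that field unfiltered.
    data = {
        "LastName": "",
        "FilingYear": "2022",  # <-- change year here
        "State": "",
        "District": "",
    }

    # Endpoint the search form posts to; it returns the results as HTML.
    api_url = (
        "https://disclosures-clerk.house.gov/FinancialDisclosure/ViewMemberSearchResult"
    )

    response = requests.post(api_url, data=data, timeout=30)
    response.raise_for_status()  # stop early on an HTTP error
    soup = BeautifulSoup(response.content, "html.parser")

    # Select every link whose href ends in ".pdf" and print its text and href.
    for a in soup.select('a[href$=".pdf"]'):
        print(a.text, a["href"])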
    

    Prints:

    
    ...
    
    Wittman, Hon.. Robert J.  public_disc/ptr-pdfs/2022/20021150.pdf
    Wittman, Hon.. Robert J.  public_disc/ptr-pdfs/2022/20021344.pdf
    Wittman, Hon.. Robert J.  public_disc/ptr-pdfs/2022/20021515.pdf
    Wittman, Hon.. Robert J.  public_disc/ptr-pdfs/2022/20021679.pdf
    Wittman, Hon.. Robert J.  public_disc/ptr-pdfs/2022/20021807.pdf
    Wittman, Hon.. Robert J.  public_disc/ptr-pdfs/2022/20022101.pdf
    Wittman, Hon.. Robert J.  public_disc/financial-pdfs/2022/30018513.pdf
    Womack, Hon.. Steve  public_disc/financial-pdfs/2022/10054531.pdf
    Womack, Hon.. Steve  public_disc/ptr-pdfs/2022/20022049.pdf
    Yakym, Hon.. Rudy III. public_disc/financial-pdfs/2022/10052905.pdf
    Yakym, Hon.. Rudy III. public_disc/ptr-pdfs/2022/20022181.pdf
    Yakym, Hon.. Rudy III. public_disc/financial-pdfs/2022/30018183.pdf
    Zinke, Hon.. Ryan K.  public_disc/financial-pdfs/2022/10053424.pdf
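
The hrefs printed above are relative, and the question also asks about downloading the PDFs and filtering them. Below is a minimal sketch of one way to do that, not a tested recipe for this site. It assumes the relative hrefs resolve against the site root (`BASE_URL`), and that the folder name in the path (`ptr-pdfs` vs. `financial-pdfs`, as seen in the output) is a good enough stand-in for the filing type. The helper names `absolute_pdf_url`, `filter_by_folder` and `download_pdf` are made up for this example. It uses only the standard library; `requests.get` would work just as well for the download step.

```python
from pathlib import Path
from urllib.parse import urljoin
from urllib.request import urlopen

# Assumption: the relative hrefs resolve against the site root.
BASE_URL = "https://disclosures-clerk.house.gov/"


def absolute_pdf_url(href, base=BASE_URL):
    """Turn a relative href from the results into a full URL."""
    return urljoin(base, href)


def filter_by_folder(hrefs, folder):
    """Keep only hrefs whose path contains the given folder, e.g. 'ptr-pdfs'."""
    return [h for h in hrefs if f"/{folder}/" in f"/{h}"]


def download_pdf(url, out_dir="pdfs"):
    """Save one PDF into out_dir (created if missing) and return the local path."""
    target = Path(out_dir) / url.rsplit("/", 1)[-1]
    target.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url, timeout=30) as resp:
        target.write_bytes(resp.read())
    return target


# Two hrefs copied from the output above.
hrefs = [
    "public_disc/ptr-pdfs/2022/20021150.pdf",
    "public_disc/financial-pdfs/2022/30018513.pdf",
]

for href in filter_by_folder(hrefs, "ptr-pdfs"):
    url = absolute_pdf_url(href)
    print(url)
    # download_pdf(url)  # uncomment to actually fetch the file
```

In practice you would collect `a["href"]` from the loop in the answer into the `hrefs` list instead of hard-coding it. If the site names its folders differently for other filing types, the folder filter will need adjusting.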