Extracting URLs from website using Pyquery and requests


I have this code:

from pyquery import PyQuery as pq
import requests

url = "https://www.mba.org/news-and-research/forecasts-and-commentary"
content = requests.get(url).content
doc = pq(content)

Latest_Report_MO = doc("#ContentPlaceholder_C012_Col01")

print(Latest_Report_MO)

I get this result:

<div id="ContentPlaceholder_C012_Col01" class="sf_colsIn grid__unit grid__unit--1-3-l" data-sf-element="Column 2" data-placeholder-label="Column 2">&#13; <div>&#13;
    <div class="sfContentBlock sf-Long-text"><a target="_blank" href="/docs/default-source/research-and-forecasts/historical-mortgage-origination-estimates.xlsx?sfvrsn=8c6933cb_5"/><a style="margin-bottom:20px;" href="/docs/default-source/research-and-forecasts/forecasts/2023/historical-mortgage-origination-estimates.xlsx?sfvrsn=a7595901_1"/><a href="/docs/default-source/research-and-forecasts/historical-mortgage-origination-estimates.xlsx?sfvrsn=8c6933cb_5"/><a href="/docs/default-source/research-and-forecasts/historical-mortgage-origination-estimates.xlsx?sfvrsn=8c6933cb_5"/><a href="/docs/default-source/research-and-forecasts/historical-mortgage-origination-estimates.xlsx?sfvrsn=8c6933cb_5"><img src="/images/default-source/research/20125-research-forecast-web-button-qoe.png?sfvrsn=e73fc287_0" alt="" sf-size="66661"/></a> <p>Historical record of single-family, one- to four-unit loan origination estimates. Last updated June 2023. </p></div>    &#13; </div>
    </div>

I am interested in the href="/docs/default-source/research-and-forecasts/historical-mortgage-origination-estimates.xlsx?sfvrsn=8c6933cb_5"

How do I use .attr() to extract this URL? Or is there another way to do it?


Solution

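  • Narrow the selector to the anchor you want, then read its href with .attr('href'). In the output above, only the first link carries target="_blank", so doc("#ContentPlaceholder_C012_Col01 .sfContentBlock a[target='_blank']") matches it. Indexing a PyQuery result returns a raw lxml element, so wrap it in pq() again before calling .attr():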

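    from pyquery import PyQuery as pq
    import requests
    
    # Download the page and parse it into a PyQuery document
    url = "https://www.mba.org/news-and-research/forecasts-and-commentary"
    content = requests.get(url).content
    doc = pq(content)
    
    # Match only the link that opens in a new tab inside the target column
    items = doc("#ContentPlaceholder_C012_Col01 .sfContentBlock a[target='_blank']")
    # items[0] is an lxml element; wrap it in pq() to get .attr() back
    print(pq(items[0]).attr('href'))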
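
The href printed above is site-relative (it starts with /docs/...), so it cannot be passed to requests.get() as it stands. A minimal sketch of one way to turn it into an absolute URL, using urljoin from the standard library. The variable names page_url, href and absolute_url are made up for illustration, and href is hard-coded here to the value from the question rather than read from the live page:

```python
from urllib.parse import urljoin

# The page that was scraped (same URL as in the question)
page_url = "https://www.mba.org/news-and-research/forecasts-and-commentary"

# The relative href from the question, hard-coded for illustration;
# in the real script this would be the value returned by .attr('href')
href = (
    "/docs/default-source/research-and-forecasts/"
    "historical-mortgage-origination-estimates.xlsx?sfvrsn=8c6933cb_5"
)

# urljoin resolves the relative path against the page URL,
# keeping the scheme and host and preserving the query string
absolute_url = urljoin(page_url, href)
print(absolute_url)
```

The resulting absolute_url can then be passed to requests.get() to download the spreadsheet.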