python, web-scraping, get, python-requests

Get method from requests library seems to return homepage rather than specific URL


I'm new to Python and to object-oriented programming in general. I'm trying to build a simple web scraper to create data frames from NBA contract data on basketball-reference.com. I had planned to use the requests library together with BeautifulSoup. However, the get method seems to be returning the site's homepage rather than the page at the URL I give.

I give the URL of a team's contracts page (https://www.basketball-reference.com/contracts/IND.html), but when I print the HTML it looks like it belongs to the homepage.

I haven't been able to find any documentation on the web about anyone else having this problem...

I'm using the Spyder IDE.

# Import library
import requests

# Assign the URL for contract scraping
url = 'https://www.basketball-reference.com/contracts/IND.html'

# Pull contracts page
page = requests.get(url)

# Check that correct page is being pulled
print(page.text)

This seems like it should be very straightforward, so I'm not understanding why the console is displaying html that clearly doesn't pertain to the page I'm trying to point to. I'm not getting any errors, just html from the homepage.


Solution

  • After running the code on repl.it and visiting the webpage myself, I can confirm you are pulling in the correct page's HTML. page.text contains the data tables and their accompanying info... and also the page's advertisements, the contact info, the social media buttons and links, the adblock detection scripts, and everything else on the webpage. Your issue isn't that you're getting the wrong page; it's that you're getting the entire page, not just the data.

    You'll want to pick out the exact bits you're interested in - maybe by selecting the table and its child elements? The table's HTML id is contracts - that should be a good place to start.

    (Try visiting the page in your browser, right-clicking anywhere on the page, and clicking "view page source" - that's what your program is pulling in. There's a LOT more to a webpage than most people realize!)

    As a word of warning, though, Sports Reference has a data use policy that precludes web crawlers / spiders on their site. I would recommend checking (and using) one of the free sites they link instead; you risk being IP banned otherwise.
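
As a rough sketch of the "select the table by its id" step suggested above, the snippet below pulls the rows out of a table with id contracts. It uses only the standard library's html.parser instead of BeautifulSoup, so it runs without extra installs. The TableExtractor class, the extract_table helper, and the sample HTML string are all made up for illustration; the real page's markup may differ, and the sketch assumes well-formed rows and no nested tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect cell text from the first <table> whose id matches table_id.

    Hypothetical helper: assumes explicit closing tags and no nested tables.
    """

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_table = False   # True while inside the target table
        self.done = False       # True once the target table has closed
        self.rows = []          # completed rows, each a list of cell strings
        self._row = None        # row currently being built
        self._cell = None       # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "table" and not self.in_table and not self.done:
            self.in_table = dict(attrs).get("id") == self.table_id
        elif self.in_table and tag == "tr":
            self._row = []
        elif self.in_table and tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if not self.in_table:
            return
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row, self._cell = None, None
        elif tag == "table":
            self.in_table, self.done = False, True

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def extract_table(html, table_id="contracts"):
    """Return the target table as a list of rows (lists of cell strings)."""
    parser = TableExtractor(table_id)
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up stand-in for page.text: one table surrounded by unrelated markup.
sample = """
<html><body>
<div id="header">Navigation, ads, scripts...</div>
<table id="contracts">
  <tr><th>Player</th><th>Salary</th></tr>
  <tr><td>Example Player</td><td>$1,000,000</td></tr>
</table>
<div id="footer">Contact info, social links...</div>
</body></html>
"""

rows = extract_table(sample)
for row in rows:
    print(row)
```

With a real response, the call would be extract_table(page.text). If BeautifulSoup is installed, soup.find("table", id="contracts") should reach the same element with far less code; the hand-rolled parser here is only meant to show what "picking out the table" amounts to.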