Tags: python, json, http, python-requests, screen-scraping

Python Requests Library - Scraping separate JSON and HTML responses from POST request


I'm new to web scraping, programming, and StackOverflow, so I'll try to phrase things as clearly as I can.

I'm using the Python requests library to try to scrape some info from a local movie theatre chain. When I look at the Chrome developer tools response/preview tabs in the network section, I can see what appears to be very clean and useful JSON:

However, when I try to use requests to obtain this same info, I instead get the entire page content (pages upon pages of HTML). Looking further down the waterfall in the Chrome developer tools, I can see there are two events called GetNowPlayingByCity: one contains the JSON info, while the other appears to be the HTML.

(Screenshots: the JSON response and the HTML response.)

How can I separate the two and only obtain the JSON response using the Python requests library?

I have already tried modifying the headers passed to requests.post (the Chrome developer tools indicate this is a POST request) to include "accept: application/json, text/plain, */*", but that made no difference in the response. As it stands, I can't parse any JSON from the response I get with requests.post, and I get the following error:

"json.decoder.JSONDecodeError: Expecting value: line 4 column 1 (char 3)"
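For reference, that error reproduces whenever json.loads is handed an HTML body instead of JSON; a minimal illustration (the sample bodies below are made up):

```python
import json

def try_parse_json(text):
    """Return the parsed JSON, or None if the body isn't valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

# An HTML body (what a full-page response looks like) fails to parse:
print(try_parse_json("<!DOCTYPE html>\n<html>...</html>"))  # None

# A JSON body parses cleanly:
print(try_parse_json('{"movies": [{"title": "Example"}]}'))
# {'movies': [{'title': 'Example'}]}
```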

I can always try to parse the full HTML, but it's so long and complex that I would much rather work with the friendly JSON info. Any help would be much appreciated!


Solution

  • This is probably because the JavaScript the page sends to your browser is making a separate request to an API to get the JSON info about the movies.

    You could either send the request directly to their API (see edit 2), parse the HTML with a library like Beautiful Soup, or use a dedicated scraping framework in Python. I've had great experiences with Scrapy; it is considerably faster than issuing requests one at a time, since it sends them concurrently.
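If you do fall back to parsing the HTML, Beautiful Soup keeps it manageable. A minimal sketch (the markup and class names here are invented for illustration):

```python
from bs4 import BeautifulSoup

# Stand-in HTML; a real page would come from response.text
html = """
<div class="now-playing">
  <div class="movie"><h3>Movie One</h3></div>
  <div class="movie"><h3>Movie Two</h3></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
titles = [tag.get_text() for tag in soup.select(".movie h3")]
print(titles)  # ['Movie One', 'Movie Two']
```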

    Edit:

    If the page uses dynamically loaded content, which I think is the case, you'd have to render it with Selenium instead of requests. PhantomJS has since been discontinued and removed from Selenium, so use a headless Chrome or Firefox instead. Here is an example:

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    url = "your url"

    # Run Chrome headlessly (no visible window)
    options = Options()
    options.add_argument("--headless=new")
    browser = webdriver.Chrome(options=options)
    browser.get(url)
    html = browser.page_source
    browser.quit()

    soup = BeautifulSoup(html, 'lxml')

    # Then parse the html code here
    

    Or you could load the dynamic content with Scrapy.

    I recommend the latter if you want to get into scraping; it takes a bit more time to learn, but it is the better solution.

    Edit 2:

    To make a request directly to their API, you can simply reproduce the request you see. In Google Chrome, click on the request in the Network panel and go to 'Headers':

    (Screenshot: the request info shown in the Headers tab.)

    After that, you simply reproduce the request using the requests library:

    import requests
    import json

    url = 'http://paste.the.url/?here='

    response = requests.get(url)

    content = response.content

    # In my case, content was a byte string
    # (it looks like b'data' instead of 'data' when you print it).
    # If that is your case too, decode it to a regular string:

    content_string = content.decode()

    content_json = json.loads(content_string)

    # Note: response.json() does the decode-and-parse in one step.

    # Do whatever you like with the data


    You can modify the URL as you see fit; for example, if it is something like http://api.movies.com/?page=1&movietype=3, you could change movietype=3 to movietype=2 to see a different type of movie, and so on.
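Rather than editing the query string by hand, you can rebuild it programmatically with the standard library; a sketch using the example URL above:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

url = "http://api.movies.com/?page=1&movietype=3"

parts = urlsplit(url)
params = dict(parse_qsl(parts.query))   # {'page': '1', 'movietype': '3'}
params["movietype"] = "2"               # switch to a different movie type
new_url = urlunsplit(parts._replace(query=urlencode(params)))
print(new_url)  # http://api.movies.com/?page=1&movietype=2
```

The same params dict can also be passed straight to requests.get(url, params=params), which handles the encoding for you.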