Tags: python, json, http, python-requests, screen-scraping

Python Requests Library - Scraping separate JSON and HTML responses from POST request


I'm new to web scraping, programming, and StackOverflow, so I'll try to phrase things as clearly as I can.

I'm using the Python requests library to try to scrape some info from a local movie theatre chain. When I look at the Chrome developer tools response/preview tabs in the network section, I can see what appears to be very clean and useful JSON:

However, when I try to use requests to obtain this same info, I instead get the entire page content (pages upon pages of HTML). Looking further down the waterfall in the Chrome developer tools, I can see there are two events called GetNowPlayingByCity: one contains the JSON info, while the other appears to be the HTML.

(Screenshots: the JSON response and the HTML response.)

How can I separate the two and only obtain the JSON response using the Python requests library?

I have already tried modifying the headers passed to requests.post (the Chrome developer tools indicate this is a POST request) to include "accept: application/json, text/plain, */*", but that made no difference in the response. As it stands, I can't parse any JSON from the response I get with requests.post, and I get the following error:

"json.decoder.JSONDecodeError: Expecting value: line 4 column 1 (char 3)"
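For reference, that error reproduces whenever json.loads is handed an HTML body instead of JSON; a minimal illustration (the sample bodies below are made up):

```python
import json

def try_parse_json(text):
    """Return the parsed JSON, or None if the body isn't valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

# An HTML body (what a full-page response looks like) fails to parse:
print(try_parse_json("<!DOCTYPE html>\n<html>...</html>"))  # None

# A JSON body parses cleanly:
print(try_parse_json('{"movies": [{"title": "Example"}]}'))
# {'movies': [{'title': 'Example'}]}
```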

I can always try to parse the full HTML, but it's so long and complex that I would much rather work with the friendly JSON info. Any help would be much appreciated!


Solution

  • This is probably because the JavaScript the page sends to your browser is making a separate request to an API to get the JSON info about the movies.

    You could either send the request directly to their API (see edit 2), parse the HTML with a library like Beautiful Soup, or use a dedicated scraping framework in Python. I've had great experiences with Scrapy; it is considerably faster than issuing requests one at a time, since it sends them concurrently.
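If you do fall back to parsing the HTML, Beautiful Soup keeps it manageable. A minimal sketch (the markup and class names here are invented for illustration):

```python
from bs4 import BeautifulSoup

# Stand-in HTML; a real page would come from response.text
html = """
<div class="now-playing">
  <div class="movie"><h3>Movie One</h3></div>
  <div class="movie"><h3>Movie Two</h3></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
titles = [tag.get_text() for tag in soup.select(".movie h3")]
print(titles)  # ['Movie One', 'Movie Two']
```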

    Edit:

    If the page uses dynamically loaded content, which I think is the case, you'd have to render it with Selenium instead of requests. PhantomJS has since been discontinued and removed from Selenium, so use a headless Chrome or Firefox instead. Here is an example:

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    url = "your url"

    # Run Chrome headlessly (no visible window)
    options = Options()
    options.add_argument("--headless=new")
    browser = webdriver.Chrome(options=options)
    browser.get(url)
    html = browser.page_source
    browser.quit()

    soup = BeautifulSoup(html, 'lxml')

    # Then parse the html code here
    

    Or you could load the dynamic content with Scrapy.

    I recommend the latter if you want to get into scraping; it takes a bit more time to learn, but it is the better solution.

    Edit 2:

    To make a request directly to their API, you can simply reproduce the request you see. In Google Chrome, click on the request in the Network panel and go to 'Headers':

    (Screenshot: the request info shown in the Headers tab.)

    After that, you simply reproduce the request using the requests library:

    import requests
    import json

    url = 'http://paste.the.url/?here='

    response = requests.get(url)

    content = response.content

    # In my case, content was a byte string
    # (it looks like b'data' instead of 'data' when you print it).
    # If that is your case too, decode it to a regular string:

    content_string = content.decode()

    content_json = json.loads(content_string)

    # Note: response.json() does the decode-and-parse in one step.

    # Do whatever you like with the data


    You can modify the URL as you see fit; for example, if it is something like http://api.movies.com/?page=1&movietype=3, you could change movietype=3 to movietype=2 to see a different type of movie, and so on.
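Rather than editing the query string by hand, you can rebuild it programmatically with the standard library; a sketch using the example URL above:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

url = "http://api.movies.com/?page=1&movietype=3"

parts = urlsplit(url)
params = dict(parse_qsl(parts.query))   # {'page': '1', 'movietype': '3'}
params["movietype"] = "2"               # switch to a different movie type
new_url = urlunsplit(parts._replace(query=urlencode(params)))
print(new_url)  # http://api.movies.com/?page=1&movietype=2
```

The same params dict can also be passed straight to requests.get(url, params=params), which handles the encoding for you.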