
How to extract a JSON object that was defined in an HTML page's JavaScript block using Python?


I am downloading HTML pages that have data defined in them in the following way:

... <script type= "text/javascript">    window.blog.data = {"activity":{"type":"read"}}; </script> ...

I would like to extract the JSON object defined in 'window.blog.data'. Is there a simpler way than parsing it manually? (I am looking into Beautiful Soup but can't seem to find a method that will return the exact object without parsing.)

Thanks

Edit: Would it be possible and more correct to do this with a python headless browser (e.g., Ghost.py)?


Solution

  • BeautifulSoup is an HTML parser; you also need a JavaScript parser here. Note that some JavaScript object literals are not valid JSON (though the literal in your example is also valid JSON).
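
    To illustrate the difference between a JavaScript object literal and JSON, here is a minimal sketch using only the standard json module. The two literals and the is_valid_json helper are made up for illustration and are not taken from the question:

```python
import json

# Both strings are valid JavaScript object literals,
# but only the first one is also valid JSON.
literals = [
    '{"activity": {"type": "read"}}',  # double-quoted keys and strings
    "{activity: {type: 'read'}}",      # unquoted keys, single-quoted string
]

def is_valid_json(text):
    """Return True if json.loads accepts the text, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True

results = [is_valid_json(text) for text in literals]
```

    The second literal is the kind of input for which the json-based approach breaks down and a JavaScript parser becomes necessary.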

    In simple cases you could:

    1. extract the <script> element's text using an HTML parser
    2. assume that the window.blog... assignment is on a single line or that there is no ';' inside the object, and extract the JavaScript object literal using simple string manipulation or a regex
    3. assume that the extracted string is valid JSON and parse it using the json module

    Example:

    #!/usr/bin/env python
    html = """<!doctype html>
    <title>extract javascript object as json</title>
    <script>
    // ..
    window.blog.data = {"activity":{"type":"read"}};
    // ..
    </script>
    <p>some other html here
    """
    import json
    import re
    from bs4 import BeautifulSoup  # $ pip install beautifulsoup4
    soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
    script = soup.find('script', string=re.compile(r'window\.blog\.data'))
    json_text = re.search(r'^\s*window\.blog\.data\s*=\s*({.*?})\s*;\s*$',
                          script.string, flags=re.DOTALL | re.MULTILINE).group(1)
    data = json.loads(json_text)
    assert data['activity']['type'] == 'read'
    

    If the assumptions are incorrect then the code fails.
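
    If the literal is valid JSON but may span several lines or contain ';' inside string values, one stdlib-only option (not part of the answer above) is json.JSONDecoder.raw_decode, which parses a single JSON value starting at a given index and ignores whatever follows it. A minimal sketch, assuming the assignment appears exactly once; the sample html, the "note" field, and window.blog.other are invented for illustration, and the regex is applied to the raw page rather than to the extracted script text to keep the example short:

```python
import json
import re

html = """<script>
window.blog.data = {"activity": {"type": "read"},
                    "note": "a; b"};
window.blog.other = 1;
</script>"""

# Find the assignment; the lookahead makes sure the match ends right at "{".
match = re.search(r'window\.blog\.data\s*=\s*(?=\{)', html)
if match is None:
    raise ValueError('window.blog.data assignment not found')

# raw_decode returns the parsed value and the index just past its end,
# so the trailing ';' and any later statements do not matter.
data, end = json.JSONDecoder().raw_decode(html, match.end())
```

    This still relies on the third assumption: a literal that is not valid JSON makes raw_decode raise json.JSONDecodeError.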

    To relax the second assumption, a javascript parser could be used instead of a regex e.g., slimit (suggested by @approximatenumber):

    from slimit import ast  # $ pip install slimit
    from slimit.parser import Parser as JavascriptParser
    from slimit.visitors import nodevisitor
    
    # html, json, and BeautifulSoup are reused from the previous example
    soup = BeautifulSoup(html, 'html.parser')
    tree = JavascriptParser().parse(soup.script.string)
    obj = next(node.right for node in nodevisitor.visit(tree)
               if (isinstance(node, ast.Assign) and
                   node.left.to_ecma() == 'window.blog.data'))
    # HACK: easy way to parse the javascript object literal
    data = json.loads(obj.to_ecma())  # NOTE: json format may be slightly different
    assert data['activity']['type'] == 'read'
    

    There is no need to treat the object literal (obj) as a JSON object. To get the necessary info, obj can be visited recursively like any other AST node. This would make it possible to support arbitrary JavaScript code (as long as slimit can parse it).