Tags: python, scrapy, parsel

How to extract raw html from a Scrapy selector?


I'm extracting JS data using response.xpath('//*').re_first() and later converting it to native Python data. The problem is that the extract/re methods don't seem to provide a way to avoid unquoting HTML entities, i.e.

original html:

{my_fields:['O&#39;Connor Park'], }

extract output:

{my_fields:['O'Connor Park'], }

Turning this output into JSON won't work.

What's the easiest way around it?


Solution

  • Short answer:

    • Scrapy/Parsel selectors' .re() and .re_first() methods replace HTML entities (except &lt; and &amp;)
    • instead, use .extract() or .extract_first() to get the raw HTML (or raw JavaScript) and apply Python's re module to the extracted string
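
A minimal sketch of the second bullet, using only the standard library. It assumes raw_script is the string that .extract_first() returned for the script's text node in the sample used in the long answer; the regex and the variable names are hypothetical and only illustrate the idea:

```python
import re

# Assumed to be the value returned by
# selector.xpath('//div/script/text()').extract_first()
raw_script = "\n        var i = {a:['O&#39;Connor Park']}\n    "

# Hypothetical pattern: capture the object literal assigned to `i`.
# re.DOTALL lets `.` match newlines in case the literal spans lines.
match = re.search(r"var\s+i\s*=\s*(\{.*\})", raw_script, re.DOTALL)
js_object = match.group(1) if match else None

# The entity is still intact because Python's re does no unescaping.
print(js_object)  # {a:['O&#39;Connor Park']}
```

Because the entity survives, you can decide yourself when (or whether) to decode it before converting the snippet to Python data.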

    Long answer:

    Let's look at an example input and various ways of extracting JavaScript data from HTML.

    Sample HTML:

    <html lang="en">
    <body>
    <div>
        <script type="text/javascript">
            var i = {a:['O&#39;Connor Park']}
        </script>
    </div>
    </body>
    </html>
    

    Using Scrapy's Selector, which uses the parsel library underneath, you have several ways of extracting the JavaScript snippet:

    >>> import scrapy
    >>> t = """<html lang="en">
    ... <body>
    ... <div>
    ...     <script type="text/javascript">
    ...         var i = {a:['O&#39;Connor Park']}
    ...     </script>
    ...     
    ... </div>
    ... </body>
    ... </html>
    ... """
    >>> selector = scrapy.Selector(text=t, type="html")
    >>> 
    >>> # extracting the <script> element as raw HTML
    >>> selector.xpath('//div/script').extract_first()
    u'<script type="text/javascript">\n        var i = {a:[\'O&#39;Connor Park\']}\n    </script>'
    >>> 
    >>> # only getting the text node inside the <script> element
    >>> selector.xpath('//div/script/text()').extract_first()
    u"\n        var i = {a:['O&#39;Connor Park']}\n    "
    >>> 
    

    Now, using .re() (or .re_first()), you get a different result:

    >>> # I'm using a very simple "catch-all" regex
    >>> # you are probably using a regex to extract
    >>> # that specific "O'Connor Park" string
    >>> selector.xpath('//div/script/text()').re_first('.+')
    u"        var i = {a:['O'Connor Park']}"
    >>> 
    >>> # .re() on the element itself, one needs to handle newlines
    >>> selector.xpath('//div/script').re_first('.+')
    u'<script type="text/javascript">'    # only first line extracted
    >>> import re
    >>> selector.xpath('//div/script').re_first(re.compile('.+', re.DOTALL))
    u'<script type="text/javascript">\n        var i = {a:[\'O\'Connor Park\']}\n    </script>'
    >>> 
    

    The HTML entity &#39; has been replaced by an apostrophe. This is due to a w3lib.html.replace_entities() call in the .re()/.re_first() implementation (see the extract_regex function in the parsel source code), which is not made when simply calling extract() or extract_first().