Tags: python, web-scraping, scrapy, scrapy-shell

Regex to access JSON data inside an HTML script tag with Scrapy


I'm new to Scrapy and still learning. I'm trying to access JSON data embedded in a page's HTML and load it into a Python dict so I can work with it later. I tried several things and all of them failed, so I would appreciate any help.

I found the response.css selector for the desired tag; its result looks like this in scrapy shell:

response.css('div.rich-snippet script').get()

'<script type="application/ld+json">{\n    some json data with newline chars \n  }\n    ]\n}</script>'

I need everything between the outer {}, so I tried a regex like this:

response.css('div.rich-snippet script').re(r'\{[^}]*\}')

This regex should pick up everything between the braces, but the JSON contains more of these symbols, and the response has other content before the JSON data, so it just returns an empty list. I tried other patterns, always with the same result, an empty list:

.re(r'<script>\{[^}]*\}</script>')
.re(r'<script>(.|\n)*?<\/script>')
...

So I tried something else: inside the spider, I passed the response directly to json.loads and saved the results to a file from the CLI. That doesn't work either; perhaps I'm parsing the tag wrong, or it's not even possible.

    import scrapy
    import json


    class SomeSpider(scrapy.Spider):
        name = 'test'
        start_urls = [
            'url'
        ]

        def parse(self, response, **kwargs):
            json_file = response.css('div.rich-snippet script').get()

            yield json.loads(json_file)

Yet again, an empty result.

Please help me understand. Thanks.


Solution

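  • Your CSS selector should specify that you only want the content inside the tag, that is, it should end with ::text. Call .get() on the result so that json.loads receives a string rather than a selector list. Your code becomes:

    
        def parse(self, response, **kwargs):
            # ::text selects only the script body, without the surrounding <script> tag
            json_file = response.css('div.rich-snippet script::text').get()
    
            yield json.loads(json_file)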
    

    You might also want to have a look at extruct: https://github.com/scrapinghub/extruct

    It might be a better fit for parsing ld+json.
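
To see why the whole tag fails and the inner text works, here is a minimal standard-library sketch. It does not use Scrapy. The raw_tag string and its JSON content are made up for illustration and only stand in for what the selector returns on the real page. It also shows a regex with a capture group that skips the attributes on the opening tag, which the patterns in the question did not account for.

```python
import json
import re

# Hypothetical stand-in for the string returned by
# response.css('div.rich-snippet script').get(); the JSON content is invented.
raw_tag = (
    '<script type="application/ld+json">{\n'
    '  "@type": "Product",\n'
    '  "offers": [{"price": "9.99"}]\n'
    '}</script>'
)

# Passing the whole tag to json.loads fails because of the surrounding markup.
try:
    json.loads(raw_tag)
    tag_parsed = True
except json.JSONDecodeError:
    tag_parsed = False

# [^>]* allows attributes such as type="application/ld+json" on the opening tag;
# (?s) lets '.' match newlines; the group captures only the script body.
SCRIPT_BODY = r'(?s)<script[^>]*>(.*?)</script>'
body = re.search(SCRIPT_BODY, raw_tag).group(1)

# The captured body is plain JSON, so it loads into a dict.
data = json.loads(body)

print(tag_parsed)
print(data["offers"][0]["price"])
```

With ::text the regex is unnecessary, since the selector already returns only the script body. If you do prefer a regex, a pattern like this one could probably be passed to .re_first() on the selector instead, though I have not tested that against your page.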