Tags: python, python-3.x, jupyter-notebook, scrapy

Scrapy doesn't save the JSON file in a Jupyter Notebook


I have a script in a Jupyter Notebook that scrapes a URL and should save the result to a JSON file, but it doesn't, even though the log says it does. I am using Google Drive to save the files and it is correctly mounted.

Here is the code. I suspect the problem is in FEEDS, because the log shows that all the rows I want to capture are being selected correctly.

Thank you very much for your help.

import scrapy
from scrapy.crawler import CrawlerProcess


class DatosSpider(scrapy.Spider):
    name = 'spider_datos'
    start_urls = ['URL']
    custom_settings = {
        'FEEDS': { 'data.json': { 'format': 'json', 'overwrite': True}}
    }

    def parse(self, response):
        events = response.xpath('//*[@id="PageContent"]/div[3]/table/tbody/tr')
        for event in events:
            dato1 = event.xpath('.//td[1]/text()').get()
            dato2 = event.xpath('.//td[2]/text()').get()

        datos = {
            'Dato 1': dato1.strip() if dato1 else None,
            'Dato 2': dato2.strip() if dato2 else None,
        }

        yield datos
        
process = CrawlerProcess()
process.crawl(DatosSpider)
process.start()

Solution

  • The following code is tested and works (although why use Scrapy for a single piece of data from that page?):

    import scrapy
    from scrapy.crawler import CrawlerProcess
    
    
    class DatosSpider(scrapy.Spider):
        name = 'spider_datos'
        start_urls = ['https://geoinfo.nmt.edu/nmtso/events/home.cfml']
        custom_settings = {
            # 'encoding' avoids \u-escaped characters in the file; 'overwrite'
            # replaces data.json on each run instead of appending to it
            'FEEDS': { 'data.json': { 'format': 'json', 'encoding': 'utf-8', 'overwrite': True}}
        }
    
        def parse(self, response):
            events = response.xpath('//*[@id="PageContent"]/div[3]/table/tbody/tr')
            for event in events:
                dato1 = event.xpath('.//td[1]/text()').get()
                dato2 = event.xpath('.//td[2]/text()').get()
    
                # build and yield one item per table row; in the question's code
                # this block sat outside the for loop, so only the last row was yielded
                datos = {
                    'Dato 1': dato1.strip() if dato1 else None,
                    'Dato 2': dato2.strip() if dato2 else None,
                }

                yield datos
            
    process = CrawlerProcess()
    process.crawl(DatosSpider)
    process.start()
    

    The result is a JSON file with one object per table row, each looking like this:

    [
        {"Dato 1": "2022-11-30 15:40:32.0", "Dato 2": "32.640"}
    ]
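
    Two notebook-specific caveats are worth checking here. First, a relative FEEDS key such as 'data.json' is resolved against the notebook's current working directory, not the mounted Drive, so the file may well exist but not where you are looking; an absolute path into the mount avoids that (the /content/drive/MyDrive/ prefix below is the usual Colab mount point and is an assumption about your setup). Second, CrawlerProcess.start() starts Twisted's reactor, which cannot be restarted within the same kernel, so re-running the cell raises ReactorNotRestartable; running the spider as a separate process sidesteps this. A minimal sketch, assuming the spider above is saved as spider_datos.py (a hypothetical file name):

    # In the spider, point the feed at the mounted Drive instead of the cwd:
    # 'FEEDS': {'/content/drive/MyDrive/data.json': {'format': 'json', 'overwrite': True}}

    import subprocess
    import sys

    # each run gets a fresh interpreter, and therefore a fresh Twisted reactor
    subprocess.run([sys.executable, 'spider_datos.py'], check=True)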
    

    As this looks very much like an X-Y problem, and you may in fact be after all the data in that table, why not scrape it with three lines of code?

    import pandas as pd

    # read_html needs an HTML parser installed (lxml or html5lib) and returns
    # one DataFrame per <table> on the page; [0] selects the first table
    df = pd.read_html('https://geoinfo.nmt.edu/nmtso/events/home.cfml')[0]
    print(df)
    

    Result in terminal:

               Date+Time (UTC)  Latitude  Longitude (WGS84)  Depth (km)  Magnitude   RMS  STD (km)  #Stations  Unnamed: 8
    0    2023-11-16 13:35:53.0    36.843          -104.925        5.00       2.52  0.63      2.89          8         NaN
    1    2023-11-13 20:57:06.0    35.599          -107.487        5.00       2.13  0.49      4.77          8         NaN
    2    2023-11-13 16:40:57.0    34.565          -106.833        5.00       2.53  0.45      4.03         11         NaN
    3    2023-11-12 11:31:58.0    32.264          -104.468        6.75       2.34  0.46      2.22         20         NaN
    4    2023-11-11 14:01:21.0    32.304          -104.497        7.43       2.54  0.46      1.86         26         NaN
    ..                     ...       ...               ...         ...        ...   ...       ...        ...         ...
    170  2022-12-04 08:08:57.0    33.990          -106.880        5.00       2.40  0.40      1.41         11         NaN
    171  2022-12-01 07:50:24.0    34.010          -106.920        5.00       3.50  0.50      2.24         16         NaN
    172  2022-12-01 07:41:50.0    34.000          -106.920        5.00       2.90  0.60      1.41         18         NaN
    173  2022-11-30 16:34:43.0    32.640          -104.420        5.00       2.10  0.40      1.41         19         NaN
    174  2022-11-30 15:40:32.0    32.640          -104.440        5.00       2.10  0.50      1.41         16         NaN

    [175 rows x 9 columns]
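
    And if the goal is still the JSON file from the question, pandas can write it directly. A minimal sketch (data.json mirrors the FEEDS example above; orient='records' produces the same list-of-objects layout as Scrapy's JSON feed export):

    # write the scraped table as a JSON array of row objects
    df.to_json('data.json', orient='records', force_ascii=False)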