Tags: python, python-3.x, jupyter-notebook, scrapy

Scrapy doesn't save the JSON file in a Jupyter Notebook


I have a script in a Jupyter Notebook that scrapes a URL and should save the result to a JSON file, but it doesn't, even though the log says it does. I am using Google Drive to save the files and it is correctly mounted.

Here is the code. I suspect the problem is in FEEDS, because the log shows that all the rows I want to capture are being selected correctly.

Thank you very much for your help.

import scrapy
from scrapy.crawler import CrawlerProcess


class DatosSpider(scrapy.Spider):
    name = 'spider_datos'
    start_urls = ['URL']
    custom_settings = {
        'FEEDS': { 'data.json': { 'format': 'json', 'overwrite': True}}
    }

    def parse(self, response):
        events = response.xpath('//*[@id="PageContent"]/div[3]/table/tbody/tr')
        for event in events:
            dato1 = event.xpath('.//td[1]/text()').get()
            dato2 = event.xpath('.//td[2]/text()').get()

        datos = {
            'Dato 1': dato1.strip() if dato1 else None,
            'Dato 2': dato2.strip() if dato2 else None,
        }

        yield datos
        
process = CrawlerProcess()
process.crawl(DatosSpider)
process.start()

Solution

  • The following code is tested and works (although why use Scrapy for a single piece of data from that page?):

    import scrapy
    from scrapy.crawler import CrawlerProcess
    
    
    class DatosSpider(scrapy.Spider):
        name = 'spider_datos'
        start_urls = ['https://geoinfo.nmt.edu/nmtso/events/home.cfml']
        custom_settings = {
            # 'encoding' avoids \u-escaped characters in the file; 'overwrite'
            # replaces data.json on each run instead of appending to it
            'FEEDS': { 'data.json': { 'format': 'json', 'encoding': 'utf-8', 'overwrite': True}}
        }
    
        def parse(self, response):
            events = response.xpath('//*[@id="PageContent"]/div[3]/table/tbody/tr')
            for event in events:
                dato1 = event.xpath('.//td[1]/text()').get()
                dato2 = event.xpath('.//td[2]/text()').get()
    
                # build and yield one item per table row; in the question's code
                # this block sat outside the for loop, so only the last row was yielded
                datos = {
                    'Dato 1': dato1.strip() if dato1 else None,
                    'Dato 2': dato2.strip() if dato2 else None,
                }

                yield datos
            
    process = CrawlerProcess()
    process.crawl(DatosSpider)
    process.start()
    

    The result is a JSON file with one object per table row, each looking like this:

    [
        {"Dato 1": "2022-11-30 15:40:32.0", "Dato 2": "32.640"}
    ]
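
    Two notebook-specific caveats are worth checking here. First, a relative FEEDS key such as 'data.json' is resolved against the notebook's current working directory, not the mounted Drive, so the file may well exist but not where you are looking; an absolute path into the mount avoids that (the /content/drive/MyDrive/ prefix below is the usual Colab mount point and is an assumption about your setup). Second, CrawlerProcess.start() starts Twisted's reactor, which cannot be restarted within the same kernel, so re-running the cell raises ReactorNotRestartable; running the spider as a separate process sidesteps this. A minimal sketch, assuming the spider above is saved as spider_datos.py (a hypothetical file name):

    # In the spider, point the feed at the mounted Drive instead of the cwd:
    # 'FEEDS': {'/content/drive/MyDrive/data.json': {'format': 'json', 'overwrite': True}}

    import subprocess
    import sys

    # each run gets a fresh interpreter, and therefore a fresh Twisted reactor
    subprocess.run([sys.executable, 'spider_datos.py'], check=True)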
    

    As this looks very much like an X-Y problem, and you may in fact be after all the data in that table, why not scrape it with three lines of code?

    import pandas as pd

    # read_html needs an HTML parser installed (lxml or html5lib) and returns
    # one DataFrame per <table> on the page; [0] selects the first table
    df = pd.read_html('https://geoinfo.nmt.edu/nmtso/events/home.cfml')[0]
    print(df)
    

    Result in terminal:

               Date+Time (UTC)  Latitude  Longitude (WGS84)  Depth (km)  Magnitude   RMS  STD (km)  #Stations  Unnamed: 8
    0    2023-11-16 13:35:53.0    36.843          -104.925        5.00       2.52  0.63      2.89          8         NaN
    1    2023-11-13 20:57:06.0    35.599          -107.487        5.00       2.13  0.49      4.77          8         NaN
    2    2023-11-13 16:40:57.0    34.565          -106.833        5.00       2.53  0.45      4.03         11         NaN
    3    2023-11-12 11:31:58.0    32.264          -104.468        6.75       2.34  0.46      2.22         20         NaN
    4    2023-11-11 14:01:21.0    32.304          -104.497        7.43       2.54  0.46      1.86         26         NaN
    ..                     ...       ...               ...         ...        ...   ...       ...        ...         ...
    170  2022-12-04 08:08:57.0    33.990          -106.880        5.00       2.40  0.40      1.41         11         NaN
    171  2022-12-01 07:50:24.0    34.010          -106.920        5.00       3.50  0.50      2.24         16         NaN
    172  2022-12-01 07:41:50.0    34.000          -106.920        5.00       2.90  0.60      1.41         18         NaN
    173  2022-11-30 16:34:43.0    32.640          -104.420        5.00       2.10  0.40      1.41         19         NaN
    174  2022-11-30 15:40:32.0    32.640          -104.440        5.00       2.10  0.50      1.41         16         NaN

    [175 rows x 9 columns]
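
    And if the goal is still the JSON file from the question, pandas can write it directly. A minimal sketch (data.json mirrors the FEEDS example above; orient='records' produces the same list-of-objects layout as Scrapy's JSON feed export):

    # write the scraped table as a JSON array of row objects
    df.to_json('data.json', orient='records', force_ascii=False)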