
Scraping news title from a page with bs4 in python


I was trying to scrape the "entry-title" of the latest news item on the site "https://www.abafg.it/category/avvisi/", but my code prints [] instead. What am I doing wrong?

The output my code returns instead of the "entry-title" from the page I want to scrape

I tried to select the class "entry-title" so that I could save each item's title, the link it leads to, and its publication date.


Solution

  • The entry-title class is not on the link's a tag but on the h2 wrapped around it. You can use

    names = [h.a for h in soup.find_all('h2', class_='entry-title')]
    

    But I think using CSS selectors would be better here

    names = soup.select('h2.entry-title > a[href]')
    

    This selects any a tag that has an href attribute and whose direct parent is an h2 of class entry-title.


    Then,

    for a in names: print(a.get_text().strip(), a.get('href'))
    

    will print

    AVVISO LEZIONI DI SCULTURA : PROF.BORRELLI https://www.abafg.it/avviso-lezioni-di-scultura-prof-borrelli/
    ORARIO DELLE LEZIONI A.A.2022/2023 IN VIGORE DAL 21 NOVEMBRE 2022 https://www.abafg.it/orario-delle-lezioni-a-a-2022-2023-in-vigore-dal-21-novembre-2022/
    PROROGA BANDO AFFIDAMENTI INTERNI D.D. N. 3 DEL 4.11.2022 https://www.abafg.it/proroga-bando-affidamenti-interni-d-d-n-3-del-4-11-2022/
    D.D. n.  7 del 15.11.2022 DECRETO GRADUATORIA PROVVISORIA ABPR19 https://www.abafg.it/d-d-n-7-del-15-11-2022-decreto-graduatoria-provvisoria-abpr19/
    D.D. n. 5 DEL 10.11.2022 DECRETO DI NOMINA COMMISSIONE ABPR19 https://www.abafg.it/d-d-n-5-del-10-11-2022-decreto-di-nomina-commissione-abpr19/
    RIAPERTURA BANDO AFFIDAMENTI INTERNI D.D. N. 3 DEL 4.11.2022 https://www.abafg.it/riapertura-bando-affidamenti-interni-d-d-n-4-del-4-11-2022/
    D.D.81 del 26.10.2022 GRADUATORIA DEFINITIVA ABST48 STORIA DELLE ARTI APPLICATE https://www.abafg.it/d-d-81-del-26-10-2022-graduatoria-definitiva-abst48-storia-delle-arti-applicate/
    AVVISO PRESENTAZIONE DOMANDE CULTORE DELLA MATERIA A.A.22.23-SCADENZA 11.11.2022 https://www.abafg.it/avviso-presentazione-domande-cultore-della-materia-a-a-22-23-scadenza-11-11-2022/
    D.D. N.78 DEL 19/10/2022 BANDO GRADUATORIE D’ISTITUTO-SCADENZA 9/11/2022. https://www.abafg.it/d-d-n-78-bando-graduatorie-distituto-scadenza-9-11-2022/
    ORARIO PROVVISIORIO DELLE LEZIONI A.A. 2022/2023: TRIENNIO E BIENNIO https://www.abafg.it/orario-provvisiorio-delle-lezioni-a-a-2022-2023-triennio-e-biennio/
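
    To see why a search can come back as [], here is a minimal sketch on a hypothetical HTML fragment (the markup and the example.com URLs are made up; only the h2.entry-title > a structure mirrors what is described above). It assumes the original attempt looked for the class on the a tag, which the question does not show.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the class sits on the h2, not on the a inside it.
html = """
<article>
  <h2 class="entry-title"><a href="https://example.com/first-post/"> First post </a></h2>
</article>
<article>
  <h2 class="entry-title"><a href="https://example.com/second-post/">Second post</a></h2>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

# Looking for the class on the a tag matches nothing, hence the empty list.
wrong = soup.find_all("a", class_="entry-title")

# The two approaches from the answer find the same links.
via_find_all = [h.a for h in soup.find_all("h2", class_="entry-title")]
via_select = soup.select("h2.entry-title > a[href]")

# Collect (title, link) pairs; strip() removes padding around the text.
rows = [(a.get_text().strip(), a.get("href")) for a in via_select]

print(wrong)
print(rows)
```

    On a live page you would build soup from the downloaded HTML instead of the inline string.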
    


    Added EDIT: to save the printed text to a file, you could first combine it into one string with .join

    asText = '\n'.join([f'{a.get_text().strip()} {a.get("href")}' for a in names])
    

    and then you could save it with

    with open('./resources/titles.txt', 'w', encoding='utf-8') as f: 
        f.write(asText)
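
    Note that open() in 'w' mode raises FileNotFoundError if the ./resources folder does not exist yet. A minimal sketch of one way around that, using pathlib to create the folder first; save_titles and the sample text are hypothetical names and data used only for illustration, and the temporary directory just keeps the demo from touching your project folder.

```python
import tempfile
from pathlib import Path


def save_titles(as_text, path):
    """Write as_text to path, creating any missing parent folders first."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(as_text, encoding="utf-8")
    return target


# Made-up stand-in for the joined string built from the scraped links.
sample_text = (
    "First post https://example.com/first-post/\n"
    "Second post https://example.com/second-post/"
)

with tempfile.TemporaryDirectory() as tmp:
    # "resources" does not exist inside tmp until save_titles creates it.
    saved = save_titles(sample_text, Path(tmp) / "resources" / "titles.txt")
    round_trip = saved.read_text(encoding="utf-8")

print(round_trip)
```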
    

    If you want something easier to read, I suggest using pandas

    import pandas  # to_markdown() also needs the tabulate package installed

    asDF = pandas.DataFrame([{
        'title': a.get_text().strip(), 'link': a.get('href')
    } for a in names])
    asText = asDF.to_markdown(index=False)
    

    and now asText looks like

    | title                                                                            | link                                                                                                   |
    |:---------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------|
    | ORARIO DELLE LEZIONI A.A.2022/2023 IN VIGORE DAL 21 NOVEMBRE 2022                | https://www.abafg.it/orario-delle-lezioni-a-a-2022-2023-in-vigore-dal-21-novembre-2022/                |
    | PROROGA BANDO AFFIDAMENTI INTERNI D.D. N. 3 DEL 4.11.2022                        | https://www.abafg.it/proroga-bando-affidamenti-interni-d-d-n-3-del-4-11-2022/                          |
    | D.D. n.  7 del 15.11.2022 DECRETO GRADUATORIA PROVVISORIA ABPR19                 | https://www.abafg.it/d-d-n-7-del-15-11-2022-decreto-graduatoria-provvisoria-abpr19/                    |
    | D.D. n. 5 DEL 10.11.2022 DECRETO DI NOMINA COMMISSIONE ABPR19                    | https://www.abafg.it/d-d-n-5-del-10-11-2022-decreto-di-nomina-commissione-abpr19/                      |
    | RIAPERTURA BANDO AFFIDAMENTI INTERNI D.D. N. 3 DEL 4.11.2022                     | https://www.abafg.it/riapertura-bando-affidamenti-interni-d-d-n-4-del-4-11-2022/                       |
    | D.D.81 del 26.10.2022 GRADUATORIA DEFINITIVA ABST48 STORIA DELLE ARTI APPLICATE  | https://www.abafg.it/d-d-81-del-26-10-2022-graduatoria-definitiva-abst48-storia-delle-arti-applicate/  |
    | AVVISO PRESENTAZIONE DOMANDE CULTORE DELLA MATERIA A.A.22.23-SCADENZA 11.11.2022 | https://www.abafg.it/avviso-presentazione-domande-cultore-della-materia-a-a-22-23-scadenza-11-11-2022/ |
    | D.D. N.78 DEL 19/10/2022 BANDO GRADUATORIE D’ISTITUTO-SCADENZA 9/11/2022.        | https://www.abafg.it/d-d-n-78-bando-graduatorie-distituto-scadenza-9-11-2022/                          |
    | ORARIO PROVVISIORIO DELLE LEZIONI A.A. 2022/2023: TRIENNIO E BIENNIO             | https://www.abafg.it/orario-provvisiorio-delle-lezioni-a-a-2022-2023-triennio-e-biennio/               |
    | GRADUATORIA DEFINITIVA  ABST47 STILE,STORIA DELL’ARTE E DEL COSTUME              | https://www.abafg.it/graduatoria-definitiva-abst47-stilestoria-dellarte-e-del-costume/                 |
    

    And then, instead of TXT, you could also save it as CSV with

    asDF.to_csv('./resources/titles.csv', index=False)
    

    so that you can open it as a spreadsheet.
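
    If pandas is not installed, the standard library csv module can produce the same kind of file. This is an alternative to the to_csv call above, not what the answer itself uses; the rows below are made-up sample data, and rows_to_csv is a hypothetical helper name.

```python
import csv
import io

# Made-up sample rows in the same title/link shape as the DataFrame above.
rows = [
    {"title": "First post", "link": "https://example.com/first-post/"},
    {"title": "Second, with a comma", "link": "https://example.com/second-post/"},
]


def rows_to_csv(rows):
    """Return rows as CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=["title", "link"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text)
```

    The writer quotes the title that contains a comma, so it still loads as a single cell. To write straight to disk instead, pass a file opened with newline='' and encoding='utf-8' to csv.DictWriter.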