How to scrape keywords that change every time?


I am trying to scrape a value from an XML document with BeautifulSoup but am unsure how to do so.

Each XML document contains a "Central Index Key" whose value differs from document to document. How can I log the Central Index Key for every XML document I scrape?

A sample is below. I want to log the string "0001773427" in this example:

<SEC-DOCUMENT>0001104659-22-079974.txt : 20220715
<SEC-HEADER>0001104659-22-079974.hdr.sgml : 20220715
<ACCEPTANCE-DATETIME>20220715060341
ACCESSION NUMBER:       0001104659-22-079974
CONFORMED SUBMISSION TYPE:  8-K
PUBLIC DOCUMENT COUNT:      14
CONFORMED PERIOD OF REPORT: 20220714
ITEM INFORMATION:       Departure of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers: Compensatory Arrangements of Certain Officers
ITEM INFORMATION:       Financial Statements and Exhibits
FILED AS OF DATE:       20220715
DATE AS OF CHANGE:      20220715

FILER:

    COMPANY DATA:   
        COMPANY CONFORMED NAME:         SpringWorks Therapeutics, Inc.
        CENTRAL INDEX KEY:          0001773427
        STANDARD INDUSTRIAL CLASSIFICATION: BIOLOGICAL PRODUCTS (NO DIAGNOSTIC SUBSTANCES) [2836]
        IRS NUMBER:             000000000
        STATE OF INCORPORATION:         DE
        FISCAL YEAR END:            1231

    FILING VALUES:
        FORM TYPE:      8-K
        SEC ACT:        1934 Act
        SEC FILE NUMBER:    001-39044
        FILM NUMBER:        221084206

    BUSINESS ADDRESS:   
        STREET 1:       100 WASHINGTON BOULEVARD
        CITY:           STAMFORD
        STATE:          CT
        ZIP:            06902
        BUSINESS PHONE:     203-883-9490

    MAIL ADDRESS:   
        STREET 1:       100 WASHINGTON BOULEVARD
        CITY:           STAMFORD
        STATE:          CT
        ZIP:            06902
</SEC-HEADER>

Solution

    from bs4 import BeautifulSoup
    import requests

    # SEC.gov expects a User-Agent header that identifies the requester
    headers = {'User-Agent': 'Sample Company Name AdminContact@<sample company domain>.com'}
    r = requests.get('https://www.sec.gov/Archives/edgar/data/1773427/000110465922079974/0001104659-22-079974.txt', headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    # Print the first line of the document text that contains the label
    print([x.strip().replace('\t', ' ') for x in soup.text.splitlines() if 'CENTRAL INDEX KEY:' in x ][0])
    

    This will print:

    CENTRAL INDEX KEY:   0001773427
    

    If you only want the key:

    print([x.replace('\t', ' ') for x in soup.text.splitlines() if 'CENTRAL INDEX KEY:' in x ][0].split(':')[1].strip())
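
    Because the label "CENTRAL INDEX KEY:" stays fixed and only the digits after it change, a regular expression is another way to pull the value out, and it makes it easy to record one result per document. The following is a minimal sketch rather than part of the original answer: the names `CIK_PATTERN`, `extract_ciks`, `documents` and `cik_log` are hypothetical, and the `documents` dict stands in for text you have already downloaded (for example `r.text` from the request above). It assumes the key is always a run of digits on the same line as the label.

```python
import re

# The label is constant; only the digits that follow it change per document.
CIK_PATTERN = re.compile(r"CENTRAL INDEX KEY:\s*(\d+)")


def extract_ciks(text):
    """Return every Central Index Key found in the text, in order of appearance."""
    return CIK_PATTERN.findall(text)


# Hypothetical stand-in for downloaded filings: file name -> raw text.
documents = {
    "0001104659-22-079974.txt": (
        "FILER:\n"
        "\tCOMPANY DATA:\t\n"
        "\t\tCOMPANY CONFORMED NAME:\t\t\tSpringWorks Therapeutics, Inc.\n"
        "\t\tCENTRAL INDEX KEY:\t\t\t0001773427\n"
    ),
}

# Record the keys found in each document.
cik_log = {name: extract_ciks(text) for name, text in documents.items()}
print(cik_log)
```

    `findall` returns a list, so a document that names more than one filer yields every key it contains, and a document without the label yields an empty list instead of raising an IndexError the way `[0]` would.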