python xml web-scraping beautifulsoup rss

How do I return the first link in a non-list output?


I am trying to return only the first URL that appears when scraping "https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent&CIK=&type=8-k&company=&dateb=&owner=include&%C2%A0%20Istart=0&count=40&output=atom". However, although the scrape appears to produce a list, indexing it does not behave as I expect: [0] returns "h", [1] returns "t", and so on.

For example, outputting print(link[0]) does not return the first link, but returns h h h h

How can I return only the first URL that is listed in the XML file?

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Sample Company Name AdminContact@<sample company domain>.com'}

xml_text = requests.get('https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent&CIK=&type=8-k&company=&dateb=&owner=include&%C2%A0%20Istart=0&count=40&output=atom', headers=headers).text.lower()

soup = BeautifulSoup(xml_text, 'xml')

for e in soup.select('entry'):
    link = e.link['href']
    print(link)

Solution

  • For example, outputting print(link[0]) does not return the first link, but returns h h h h

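    This is expected: link is never a list. On each pass through the loop it holds a single URL string, so link[0] is the first character of that string, an "h". If print(link[0]) runs inside the loop, it prints one "h" per entry, which is the h h h h you are seeing.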

    If you want to collect all of the links in a list, change this code

    for e in soup.select('entry'):
        link = e.link['href']
        print(link)
    

    to something like

    links = [e.link['href'] for e in soup.select('entry')]
    

    Then you can access the first link with your index notation, e.g.

    print(links[0])
    # https://www.sec.gov/archives/edgar/data/1718405/000171840522000045/0001718405-22-000045-index.htm
    

    Alternatively, you could do something like:

    link = None
    for e in soup.select('entry'):
        link = e.link['href']
        break
    
    print(link)
    

    which would begin to walk the parsed XML but break after the first entry.
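
A further option, sketched here under some assumptions, is to ask BeautifulSoup for only the first match instead of looping at all: find() returns the first matching tag, or None when nothing matches. To keep the sketch runnable without a network request, it parses a small made-up feed (SAMPLE_FEED, with example.com URLs) rather than the real SEC response, and it uses the built-in 'html.parser' so that lxml is not required; with the question's setup you would pass your own soup, and the 'xml' parser should behave the same way here. The helper name first_entry_link is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the downloaded feed text
# (xml_text in the question); the URLs are placeholders.
SAMPLE_FEED = """<feed>
<entry><link rel="alternate" href="https://example.com/first-index.htm"/></entry>
<entry><link rel="alternate" href="https://example.com/second-index.htm"/></entry>
</feed>"""


def first_entry_link(soup):
    """Return the href of the first <entry>'s <link>, or None if there is none."""
    entry = soup.find("entry")      # first <entry> only, no loop needed
    if entry is None or entry.link is None:
        return None                 # empty feed, or an entry without a <link>
    return entry.link.get("href")   # .get() gives None if href is missing


soup = BeautifulSoup(SAMPLE_FEED, "html.parser")
print(first_entry_link(soup))
```

Looking up the first entry before the link matters for an Atom feed, which can also carry link elements outside of any entry. Returning None for an empty feed avoids the TypeError that e.link['href'] would raise when no link is present.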