Tags: python, csv, web-scraping, terminal, macos-sierra

Python: script is not writing the links from variable


My script below...

I feel like I'm missing one line of code to make this work properly. I'm using Reddit as a test source to scrape sports links.

# import libraries
import bs4
from urllib2 import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.reddit.com/r/BoxingStreams/comments/6w2vdu/mayweather_vs_mcgregor_archive_footage/'

# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# html parsing
page_soup = soup(page_html, "html.parser")

hyperli = page_soup.findAll("form")


filename = "sportstreams.csv"
f = open(filename, "w")

headers = "Sport Links"

f.write(headers)

for containli in hyperli:
    link = containli.a["href"] 

    print(link)

    f.write(str(link)+'\n')

f.close() 

Everything works, except that it only grabs the link from the first row [0]. If I don't use ["href"], it adds all the <a href> links, but it also adds the word None to the CSV file. I hoped that using ["href"] would add just the http links and avoid adding the word None.

What am I missing here?


Solution

  • As explained in the Beautiful Soup documentation, in the section "Navigating using tag names":

    Using a tag name as an attribute will give you only the first tag by that name
    ...
    If you need to get all the <a> tags, or anything more complicated than the first tag with a certain name, you’ll need to use one of the methods described in Searching the tree, such as find_all():

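    In your loop, containli.a is therefore only the first <a> tag inside each form, which is why the other links are skipped. In your case, you could use page_soup.select("form a[href]") to find all the links in forms that have href attributes.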

    links = page_soup.select("form a[href]")
    for link in links:
        href = link["href"]
        print(href)
        f.write(href + "\n")
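
    To make the difference concrete, here is a minimal, self-contained sketch that compares the two approaches on a small inline HTML string instead of the live Reddit page. The sample markup, the example.com URLs, and the write_links helper are made up for illustration, and the sketch assumes Python 3 rather than the Python 2 urllib2 setup in the question.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: two forms, four <a> tags,
# one of which has no href attribute.
SAMPLE_HTML = """
<form><a href="http://example.com/one">one</a><a href="http://example.com/two">two</a></form>
<form><a name="anchor-only">no href</a><a href="http://example.com/three">three</a></form>
"""

page_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Attribute-style access (form.a) returns only the first <a> in each form.
# .get("href") yields None when that first tag has no href attribute.
first_only = [form.a.get("href") for form in page_soup.find_all("form")]

# The CSS selector matches every <a> inside a form that has an href attribute.
all_hrefs = [a["href"] for a in page_soup.select("form a[href]")]


def write_links(out, hrefs):
    """Write a one-column CSV: a header row, then one link per row."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Sport Links"])
    for href in hrefs:
        writer.writerow([href])


# Write to an in-memory buffer so the sketch does not touch the disk.
buffer = io.StringIO()
write_links(buffer, all_hrefs)

print(first_only)
print(buffer.getvalue())
```

    To write the real file, the same helper could be called as write_links(f, all_hrefs) inside a with open("sportstreams.csv", "w", newline="") as f: block, which also closes the file automatically. Note that the script in the question writes the header without a trailing newline, so the first link ends up on the same line as "Sport Links"; writing the header as its own row avoids that.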