Tags: python, beautifulsoup, urllib2

Loop through a python dataframe with 10 urls and extract contents from them (BeautifulSoup)


I have a CSV file called 'df.csv' with one column: a header row followed by 10 URLs.

Col
"http://www.cnn.com"
"http://www.fark.com"
etc 
etc

This is my ERROR code

import bs4 as bs
import pandas as pd
import urllib2

df_link = pd.read_csv('df.csv')
for link in df_link:
    x = urllib2.urlopen(link[0])
    new = x.read()
    # Code does not even get past here as far as I checked
    soup = bs.BeautifulSoup(new, "lxml")
    for text in soup.find_all('a', href=True):
        text.append(text.get('href'))

I am getting an error which says

ValueError: unknown url type: C

I also get other variations of this error. The issue is that the loop never gets past

x = urllib2.urlopen(link[0])

On the other hand, this is the WORKING CODE for a single URL:

url = "http://www.cnn.com"
x = urllib2.urlopen(url)
new = x.read()
soup = bs.BeautifulSoup(new, "lxml")
links = []
for link in soup.find_all('a', href=True):
    links.append(link.get('href'))

Solution

  • Fixed answer

    I didn't realize you were using pandas, so what I said wasn't very helpful.

    The way to do this with pandas is to iterate over the rows and extract the URL from each one. The following should work without you having to get rid of the header:

    import bs4 as bs
    import pandas as pd
    import urllib2

    df_link = pd.read_csv('df.csv')

    links = []
    for row in df_link.iterrows():
        # row is an (index, Series) pair; pull the URL out of the Series
        url = row[1]['Col']
        x = urllib2.urlopen(url)
        new = x.read()
        soup = bs.BeautifulSoup(new, "lxml")
        for a in soup.find_all('a', href=True):
            links.append(a.get('href'))
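    Not part of the original answer, but if all you need is the URL from each row, iterating the column Series directly is even simpler than iterrows. A minimal sketch (using a hypothetical two-row frame in place of df.csv, with the network fetch omitted):

    ```python
    import pandas as pd

    # Hypothetical stand-in for pd.read_csv('df.csv')
    df_link = pd.DataFrame({"Col": ["http://www.cnn.com", "http://www.fark.com"]})

    # Iterating the column (a Series) yields the URL strings themselves,
    # so there is no (index, row) tuple to unpack.
    urls = [url for url in df_link["Col"]]
    ```

    Each item in urls can then be passed straight to urlopen.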
    

    Original misleading answer below

    It looks like the header of your CSV file is not being treated separately, so in the first iteration through df_link, link is the column name "Col" — which means link[0] is just the character "C", and that isn't a valid URL (hence "unknown url type: C").
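    The pitfall can be reproduced without any network access: iterating a DataFrame directly walks over its column names, not its rows. A quick sketch with a hypothetical two-row frame:

    ```python
    import pandas as pd

    # Hypothetical stand-in for pd.read_csv('df.csv')
    df_link = pd.DataFrame({"Col": ["http://www.cnn.com", "http://www.fark.com"]})

    # Iterating the DataFrame itself yields column NAMES...
    cols = [c for c in df_link]

    # ...so in the question's loop, link is the string "Col"
    # and link[0] is the character "C" -- exactly what urllib2
    # complains about with "ValueError: unknown url type: C".
    first = next(iter(df_link))
    first_char = first[0]
    ```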