I have a CSV file called 'df.csv' with one column: a header row followed by 10 URLs.
Col
"http://www.cnn.com"
"http://www.fark.com"
etc
etc
This is the code that raises the ERROR:
import bs4 as bs
import pandas as pd
import urllib2
df_link = pd.read_csv('df.csv')
for link in df_link:
    x = urllib2.urlopen(link[0])
    new = x.read()
    # Code does not even get past here as far as I checked
    soup = bs.BeautifulSoup(new,"lxml")
    for text in soup.find_all('a',href = True):
        text.append((text.get('href')))
I am getting an error which says
ValueError: unknown url type: C
I also get other variations of this error.
The issue is, it is not even getting past
x = urllib2.urlopen(link[0])
On the other hand, this is the WORKING CODE:
url = "http://www.cnn.com"
links = []  # collects every href found on the page
x = urllib2.urlopen(url)
new = x.read()
soup = bs.BeautifulSoup(new,"lxml")
for link in soup.find_all('a',href = True):
    links.append((link.get('href')))
I didn't realize you were using pandas, so what I said wasn't very helpful.
The way you want to do this using pandas is to iterate over the rows and extract the info from them. The following should work without having to get rid of the header:
import bs4 as bs
import pandas as pd
import urllib2
df_link = pd.read_csv('df.csv')
links = []  # collects the hrefs found on every page
# iterrows() yields (index, row) pairs; the header only supplies the column name
for index, row in df_link.iterrows():
    url = row['Col']
    x = urllib2.urlopen(url)
    new = x.read()
    soup = bs.BeautifulSoup(new,"lxml")
    # append to the links list, not to the tag itself
    for tag in soup.find_all('a',href = True):
        links.append(tag.get('href'))
It looks like iterating over df_link directly gives you the column labels rather than the rows, so in the first iteration link is "Col" and link[0] is "C", which isn't a valid URL. That matches the "unknown url type: C" in your error.
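To see why the original loop never reaches a real URL, here is a minimal sketch that compares iterating over the DataFrame itself with iterating over its column. It assumes Python 3 and uses an in-memory string (csv_text, a stand-in I made up for df.csv) so it runs without the file or a network connection:

```python
import io
import pandas as pd

# Stand-in for df.csv: one header line followed by quoted URLs.
csv_text = 'Col\n"http://www.cnn.com"\n"http://www.fark.com"\n'
df_link = pd.read_csv(io.StringIO(csv_text))

# Iterating over the DataFrame itself walks the column labels.
labels = [link for link in df_link]
first_chars = [link[0] for link in df_link]

# Iterating over the column walks the cell values, i.e. the URLs.
urls = [url for url in df_link['Col']]

print(labels)       # ['Col']
print(first_chars)  # ['C']
print(urls)         # ['http://www.cnn.com', 'http://www.fark.com']
```

If you only need the URLs, looping over df_link['Col'] like this is a simpler alternative to iterrows().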