Cannot read the URLs in a txt file
I want to read and open the URL addresses in a txt file one by one, and extract each page's title with a regex from the source of those URLs. Error messages:
Traceback (most recent call last):
  File "Mypy.py", line 14, in <module>
    UrlsOpen = urllib2.urlopen(listSplit)
  File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 420, in open
    req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
Mypy.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import requests
import urllib2
import threading
UrlListFile = open("Url.txt","r")
UrlListRead = UrlListFile.read()
UrlListFile.close()
listSplit = UrlListRead.split('\r\n')
UrlsOpen = urllib2.urlopen(listSplit)
ReadSource = UrlsOpen.read().decode('utf-8')
regex = '<title.*?>(.+?)</title>'
comp = re.compile(regex)
links = re.findall(comp,ReadSource)
for i in links:
    SaveDataFiles = open("SaveDataMyFile.txt","w")
    SaveDataFiles.write(i)
    SaveDataFiles.close()
When you call urllib2.urlopen(listSplit), listSplit is a list, but urlopen() needs a string or a Request object. The simple fix is to iterate over listSplit and pass each URL to urlopen() individually instead of handing it the entire list.
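In outline, the fix is just this (it mirrors the full script below):

for url in listSplit:
    UrlsOpen = urllib2.urlopen(url)
    ReadSource = UrlsOpen.read().decode('utf-8')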
Also, re.findall() returns a separate list of matches for each ReadSource searched. You can handle this a couple of ways:
I chose to handle it by making a list of lists:
websites = [[link, link], [link], [link, link, link]]
and iterating over both lists. This lets you do something specific with each website's list of titles (put them in different files, etc.), as in the sketch below.
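For instance, a minimal sketch of the one-file-per-website idea (the file naming scheme here is my own assumption, not part of the original code):

# hypothetical example: write each website's titles to its own file
for index, website in enumerate(websites):
    # "titles_site_N.txt" is an illustrative name, not from the question
    with open("titles_site_%d.txt" % index, "w") as out:
        for link in website:
            out.write(link.encode('utf-8') + "\n")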
You could also flatten the websites list so it contains just the links rather than nested lists:
links = [link, link, link, link]
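A minimal sketch of that flattening with a nested list comprehension (the sample data is illustrative):

# flatten a list of lists into a single list of titles
websites = [["title1", "title2"], ["title3"], ["title4", "title5"]]
links = [link for website in websites for link in website]
print(links)  # ['title1', 'title2', 'title3', 'title4', 'title5']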
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import urllib2
from pprint import pprint
UrlListFile = open("Url.txt", "r")
UrlListRead = UrlListFile.read()
UrlListFile.close()
listSplit = UrlListRead.splitlines()
pprint(listSplit)
regex = '<title.*?>(.+?)</title>'
comp = re.compile(regex)
websites = []
for url in listSplit:
    UrlsOpen = urllib2.urlopen(url)
    ReadSource = UrlsOpen.read().decode('utf-8')
    websites.append(re.findall(comp, ReadSource))
with open("SaveDataMyFile.txt", "w") as SaveDataFiles:
for website in websites:
for link in website:
pprint(link)
SaveDataFiles.write(link.encode('utf-8'))
SaveDataFiles.close()