python python-3.x python-3.3

Crawler only loads one title


I asked a few questions here earlier, and someone gave me this code. I need help, though, because it only returns one result from my websites.txt.

Crawler.py

import urllib.request
import re

regex = "<title>(.+?)</title>"
pattern = re.compile(regex)
txtfl = open('websites.txt')
webpgsinfile = txtfl.readlines()
urls = webpgsinfile
htmlfile = urllib.request.urlopen(urls[i])
htmltext = htmlfile.read().decode('utf8')
titles = re.findall(pattern,htmltext)

if len(titles) > 0:
    print(titles[0])
    i+=1

The websites.txt

http://youtube.com
http://bigsolutions.com.br

Solution

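    import re
    from urllib.request import urlopen

    def get_page(url, encoding='utf-8'):
        # Download the page and decode it, skipping bytes that are invalid in `encoding`.
        return urlopen(url).read().decode(encoding, errors='ignore')

    # Non-greedy (.*?) so only the first <title>...</title> pair is captured.
    def get_title(txt, reg=re.compile('<title>(.*?)</title>', re.IGNORECASE | re.DOTALL)):
        match = reg.search(txt)
        if match is None:
            return ''
        else:
            return match.group(1).strip()

    def main():
        # Read every line of websites.txt as a URL.
        with open('websites.txt') as inf:
            urls = [line.strip() for line in inf]
        # Fetch each non-empty URL in turn; the script in the question fetched only one.
        titles = [get_title(get_page(url)) for url in urls if url]
        print(titles)

    if __name__ == "__main__":
        main()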
    

    results in

    ["LimeCD - Lime's Code Library", 'YouTube', 'Big Solutions - Aqui nós pensamos grande!']
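
The script in the question, as posted, calls urlopen once on urls[i] and never iterates over the list, which is why only one title comes back. If you also want one unreachable site not to stop the whole run, something like the sketch below might help. The names fetch, extract_title, collect_titles, the fetch_page parameter, and the 10-second timeout are my own assumptions for illustration, not part of the code above.

```python
import re
from urllib.request import urlopen

# Allow attributes on <title> and stop at the first closing tag.
TITLE_RE = re.compile(r'<title[^>]*>(.*?)</title>', re.IGNORECASE | re.DOTALL)


def fetch(url, encoding='utf-8', timeout=10):
    """Download a page and decode it, ignoring undecodable bytes."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode(encoding, errors='ignore')


def extract_title(html):
    """Return the text of the first <title> element, or '' if there is none."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else ''


def collect_titles(urls, fetch_page=fetch):
    """Return (url, title) pairs; a URL that fails to load gets an empty title."""
    results = []
    for raw in urls:
        url = raw.strip()
        if not url:
            continue  # skip blank lines from the file
        try:
            html = fetch_page(url)
        except (OSError, ValueError) as exc:
            # URLError and HTTPError are subclasses of OSError;
            # ValueError covers malformed URLs.
            print('could not load {}: {}'.format(url, exc))
            results.append((url, ''))
            continue
        results.append((url, extract_title(html)))
    return results
```

You would call it as collect_titles(open('websites.txt')) and loop over the returned pairs. Passing the fetcher in as a parameter is only there so the loop can be tried without network access.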