python html parsing beautifulsoup urllib2

Parsing is unstable, returning different results at random


Here's the code:

# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup

with open('/users/Rachael/Desktop/CheckTitle.csv', 'r') as readcsv:
    for row in readcsv.readlines():
        try:
            openitem = urllib2.urlopen(row).read()
            soup = BeautifulSoup(openitem, 'lxml')
            print soup.head.find('title').get_text()

        except urllib2.URLError:
            print 'passed'
            pass

I'm getting the following results:

(a):

passed
贝贝网京外裁员10%:团队要保持狼性和危机感_新浪财经_新浪网
垂直电商贝贝网被曝裁员 回应称只是10%人员优化_新浪财经_新浪网

(b):

passed
Traceback (most recent call last):
  File "C:/Users/Rachael/PycharmProjects/untitled1/GetTitle.py", line 10, in <module>
    print soup.head.find('title').get_text()
AttributeError: 'NoneType' object has no attribute 'find'

(c):

passed
贝贝网京外裁员10%:团队要保持狼性和危机感_新浪财经_新浪网
Traceback (most recent call last):
  File "C:/Users/Rachael/PycharmProjects/untitled1/GetTitle.py", line 10, in <module>
    print soup.head.find('title').get_text()
AttributeError: 'NoneType' object has no attribute 'find'

I'm getting these three types of results randomly.

If I use soup.title, soup.title.text, or soup.title.string instead, I get the same or a similar error.

Please help!

I found this very hard to describe, so if this is a duplicate in any way, please give me a link to the similar posts.

Thanks!!


Solution

  • AttributeError: 'NoneType' object has no attribute 'find' means that soup.head is None, i.e. BeautifulSoup found no <head> element in the response it parsed. To confirm, print soup.head or soup.title on its own, without calling .find() or .get_text(); for the failing URL it should print None.
    Answer: Either the page has no actual title tag, or one of the sites listed in your file has some kind of bot protection and may be returning a different page to your script.
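
One way to keep the loop from crashing is to check for the missing tag before reading its text. The following is only a minimal sketch: the helper name extract_title and the sample HTML strings are hypothetical, not from the original post, and it assumes the built-in 'html.parser' in place of 'lxml' (either should work here).

```python
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the page title as stripped text, or None if there is no usable <title>."""
    # 'html.parser' ships with Python; the question used 'lxml' instead.
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.title  # None when the document contains no <title> tag
    if title is None:
        return None
    text = title.get_text().strip()
    return text or None  # treat an empty or whitespace-only title as missing


# Hypothetical sample pages standing in for real responses.
print(extract_title('<html><head><title> Example </title></head></html>'))
print(extract_title('<html><body>Access denied</body></html>'))
print(extract_title(''))
```

In the original loop, this would mean calling extract_title(openitem) and printing a placeholder when it returns None, rather than chaining soup.head.find('title').get_text(). Since readlines() keeps the trailing newline on each row, passing row.strip() to urlopen is probably also worth doing.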