python python-2.7 python-3.x beautifulsoup urllib2

How to rectify a TypeError in Python? Beautiful Soup string to tag error?


Here is a simple snippet that scrapes a Wikipedia page and prints each part of its contents separately, for example the cast in one variable, the production in another, and so on. Inside the first div, named "bodyContent", there is another div named "mw-content-text". My problem is retrieving the introductory paragraphs that come before the first "h2" tag. I have a code snippet for this, but I am unable to convert between a BeautifulSoup Tag and a string, and I get this error: TypeError: unsupported operand type(s) for +: 'Tag' and 'str'

import urllib
from bs4 import BeautifulSoup

url ="https://en.wikipedia.org/wiki/Deadpool_(film)"
htmlfile = urllib.urlopen(url)
htmltext = htmlfile.read()

soup = BeautifulSoup(htmltext,"lxml")
#print soup.prettify()
movie_title = soup.find('h1',{'id':'firstHeading'})
print movie_title.text
movie_info = soup.find_all('p')
#print movie_info[0].text 
#print movie_info[1].text
# I don't want to index paragraphs like this, because we don't know how many
# intro paragraphs there will be; we have to scrape every paragraph before the h2 tag.

This is where the problem arises. I want to iterate by adding .next_sibling each time, and use a try/except block to check whether the following holds:

"resultant_next_url.name == 'p' "

def findNextSibling(base_url):
    tag_addition = 'next_sibling'
    next_url = base_url+'.'+tag_addition
    return next_url

And finally I want to use it like this:

base_url = movie_info[0]
resultant_url = findNextSibling(base_url)
print resultant_url.text
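
The TypeError comes from base_url+'.'+tag_addition: base_url is a bs4 Tag, and gluing '.next_sibling' onto it with + does not turn into an attribute lookup. A minimal sketch of the difference follows. It assumes a small inline HTML string in place of the Wikipedia page and the built-in "html.parser" in place of "lxml"; both are illustrative choices and not part of the original snippet. It also uses print() as a function so it runs under Python 3.

```python
from bs4 import BeautifulSoup

# Tiny stand-in document (an assumption); the real page would come from urlopen().
html = "<div><p>First paragraph.</p><p>Second paragraph.</p><h2>Plot</h2></div>"
soup = BeautifulSoup(html, "html.parser")
first_p = soup.find("p")

# Adding a str to a Tag is what raises the TypeError from the question.
try:
    first_p + ".next_sibling"
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Looking up an attribute by its name is done with getattr(), not string building.
attr_name = "next_sibling"
second_p = getattr(first_p, attr_name)
print(second_p.name)
print(second_p.text)
```

When the attribute name is fixed, as it is here, writing first_p.next_sibling directly is simpler than getattr().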

Solution

  • I finally found the answer; this solves the problem:

    import urllib
    from bs4 import BeautifulSoup
    
    url ="https://en.wikipedia.org/wiki/Deadpool_(film)"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    
    soup = BeautifulSoup(htmltext,"lxml")
    #print soup.prettify()
    movie_title = soup.find('h1',{'id':'firstHeading'})
    print movie_title.text
    
    movie_info = soup.find_all('p')
    # print movie_info[0].text
    # print movie_info[1].text
    
    def findNextSibling(resultant_url):
        # next_sibling is an attribute of the Tag itself, so read it directly
        # instead of building the attribute name from strings
        return resultant_url.next_sibling
    
    resultant_url = movie_info[0]
    resultant_url = findNextSibling(resultant_url)
    print resultant_url.text
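
The original goal was to collect every intro paragraph before the first h2, not just one sibling. A rough sketch of that loop follows. It assumes the intro paragraphs and the h2 are siblings inside the same container, which may not match the live Wikipedia markup. The helper name intro_paragraphs and the sample HTML are hypothetical, and "html.parser" is used instead of "lxml" only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup, Tag

def intro_paragraphs(first_paragraph):
    """Collect the text of first_paragraph and the sibling <p> tags after it,
    stopping at the first <h2> sibling."""
    texts = [first_paragraph.get_text()]
    for sibling in first_paragraph.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip whitespace strings between tags
        if sibling.name == "h2":
            break  # the intro ends at the first section heading
        if sibling.name == "p":
            texts.append(sibling.get_text())
    return texts

# Hypothetical sample standing in for the downloaded page.
sample_html = """
<div id="mw-content-text">
  <p>Intro one.</p>
  <p>Intro two.</p>
  <h2>Plot</h2>
  <p>Plot details.</p>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
content = soup.find("div", {"id": "mw-content-text"})
intro = intro_paragraphs(content.find("p"))
print(intro)
```

Skipping non-Tag siblings matters because next_sibling often returns the newline text between two tags rather than the next element, in which case checking .name == 'p' on it would not match.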