Here is a simple snippet to scrape a Wikipedia page and to print each of its contents separately, e.g. the cast in one variable, the production in another, and so on. Inside the first div with id "bodyContent" there is another div with id "mw-content-text". My problem is to retrieve the data of the first paragraphs, the ones before the "h2" tag. I have a code snippet to work this out, but I am unable to combine a BeautifulSoup Tag with a string, and the error is TypeError: unsupported operand type(s) for +: 'Tag' and 'str'.
import urllib
from bs4 import BeautifulSoup
url ="https://en.wikipedia.org/wiki/Deadpool_(film)"
htmlfile = urllib.urlopen(url)  # Python 2; in Python 3 use urllib.request.urlopen(url)
htmltext = htmlfile.read()
soup = BeautifulSoup(htmltext,"lxml")
#print soup.prettify()
movie_title = soup.find('h1',{'id':'firstHeading'})
print movie_title.text
movie_info = soup.find_all('p')
#print movie_info[0].text
#print movie_info[1].text
'''I don't want to do it like this, because we don't know how many
intro paragraphs there will be; we have to scrape all the paragraphs just before that h2 tag.'''
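One note on that snippet: soup.find_all('p') collects every paragraph on the whole page, not just the article body. Since the question already names the container div, the search can be scoped to it first; a short sketch (the id is taken from the question and assumed to match the live page):

import urllib
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Deadpool_(film)"
soup = BeautifulSoup(urllib.urlopen(url).read(), "lxml")
# Restrict the paragraph search to the article body div named above
content = soup.find('div', {'id': 'mw-content-text'})
movie_info = content.find_all('p')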
Here the problem arises: I want to iterate by appending .next_sibling, and to use a try-except block to check whether
resultant_next_url.name == 'p'
def findNextSibling(base_url):
    tag_addition = 'next_sibling'
    next_url = base_url + '.' + tag_addition  # fails here: base_url is a bs4 Tag, and Tag + str is unsupported
    return next_url
And finally to use it like this:
base_url = movie_info[0]
resultant_url = findNextSibling(base_url)
print resultant_url.text
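The concatenation fails because base_url is a bs4 Tag object, not a string, so Tag + str has no meaning; attribute access can't be built by gluing text together. If the attribute name must be kept in a string, getattr is the standard way to look it up. A minimal sketch of that fix:

def findNextSibling(tag):
    tag_addition = 'next_sibling'
    # getattr resolves an attribute whose name is stored in a string,
    # so this is equivalent to writing tag.next_sibling
    return getattr(tag, tag_addition)

With this version, findNextSibling(movie_info[0]) returns the sibling node itself rather than raising the TypeError.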
Finally found the answer; this solves the problem:
import urllib
from bs4 import BeautifulSoup
url ="https://en.wikipedia.org/wiki/Deadpool_(film)"
htmlfile = urllib.urlopen(url)
htmltext = htmlfile.read()
soup = BeautifulSoup(htmltext,"lxml")
#print soup.prettify()
movie_title = soup.find('h1',{'id':'firstHeading'})
print movie_title.text
movie_info = soup.find_all('p')
# print movie_info[0].text
# print movie_info[1].text
def findNextSibling(resultant_url):
    # plain attribute access instead of string concatenation
    return resultant_url.next_sibling
resultant_url = movie_info[0]
resultant_url = findNextSibling(resultant_url)
print resultant_url.text
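For completeness, the original goal, grabbing every intro paragraph before the first h2, needs a loop rather than a single .next_sibling step. On the live page the paragraphs are separated by whitespace text nodes (NavigableString), so the walk has to skip those; the container id below is taken from the question and get_intro_paragraphs is my own helper name, so treat this as a sketch under those assumptions:

import urllib
from bs4 import BeautifulSoup, NavigableString

url = "https://en.wikipedia.org/wiki/Deadpool_(film)"
soup = BeautifulSoup(urllib.urlopen(url).read(), "lxml")

def get_intro_paragraphs(first_p):
    # Collect first_p and every following <p> sibling, stopping at the
    # first non-<p> tag (the h2 that opens the next section).
    paragraphs = [first_p]
    for node in first_p.next_siblings:
        if isinstance(node, NavigableString):
            continue  # skip the newline text nodes between tags
        if node.name != 'p':
            break
        paragraphs.append(node)
    return paragraphs

first_p = soup.find('div', {'id': 'mw-content-text'}).find('p')
for p in get_intro_paragraphs(first_p):
    print p.text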