
How to get the content of one section of a Wikipedia page using BeautifulSoup


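I am trying to scrape one part of a Wikipedia page using Python: the section that an internal link (the part of the URL after #) points to. The problem is that my code works on the content of the whole page rather than on that section. How can I scrape just the paragraph that the internal link points to? Thanks in advance.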

import requests
from bs4 import BeautifulSoup as bs

base_link='https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D8%AA%D9%87%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D8%B5%D8%A8%D8%A7%D8%AA'
sub_link="#الأسباب"
total=base_link+sub_link
r=requests.get(total)
soup = bs(r.text, 'html.parser')
results=soup.find('p')  # returns the first <p> of the whole page
print(results)

Solution

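  • This happens because the part after # is not a sub-link that the server can return on its own. It is an anchor (a URL fragment): it is not sent with the request, so the response is always the entire page.

    Request the entire page, then find the element whose id matches the anchor.

    Something like this: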

    from bs4 import BeautifulSoup as soup
    import requests
    
    base_link='https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D8%AA%D9%87%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D8%B5%D8%A8%D8%A7%D8%AA'
    anchor_id="الأسباب"
    r=requests.get(base_link)
    page = soup(r.text, 'html.parser')
    # Look the anchor up by id alone, whichever tag carries it
    target = page.find(id=anchor_id)
    # First <p> that follows the anchor in document order
    result = target.find_next('p')
    print(result.text)
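
If every paragraph of the section is wanted rather than only the first one, the same idea can be extended. The following is a minimal sketch, not a definitive implementation: split_anchor and section_paragraphs are hypothetical helper names, and the sketch assumes that a section ends at the next heading tag (h1 to h6), which may not hold for every page layout.

```python
from urllib.parse import urldefrag, unquote

from bs4 import BeautifulSoup

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def split_anchor(url):
    """Split a URL into (url without fragment, decoded anchor id)."""
    base, fragment = urldefrag(url)
    return base, unquote(fragment)


def section_paragraphs(html, anchor_id):
    """Return the text of each <p> between the anchor and the next heading.

    Assumption: the section ends at the next h1..h6 tag.
    """
    page = BeautifulSoup(html, "html.parser")
    target = page.find(id=anchor_id)
    if target is None:
        return []  # no element with that id on the page
    paragraphs = []
    # Walk every tag that comes after the anchor in document order
    for element in target.find_all_next(True):
        if element.name in HEADING_TAGS:
            break  # the next section starts here
        if element.name == "p":
            paragraphs.append(element.get_text(strip=True))
    return paragraphs
```

To use it on a live page, call split_anchor on the full link, fetch the returned base URL with requests.get, and pass the response text together with the anchor id to section_paragraphs.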