python beautifulsoup multilingual arabic

How to get Beautiful Soup to scrape pages in Arabic from a multilingual website where pages in different languages have the same URL


I am trying to scrape pages from this website. The pages in Arabic and French have the same URL. I tried the following code:

    headers = {'Accept-Language': "lang=\"AR-DZ"}
    r = requests.get("http://www.mae.gov.dz/news_article/6396.aspx",headers)
    soup = BeautifulSoup(r.content,"lxml")
    print(soup.getText)

I get the following error message:

<bound method Tag.get_text of <html><head><title>Request Rejected</title></head><body>The requested URL was rejected. Please consult with your administrator.<br/><br/>Your support ID is: 12750291427324767866<br/><br/><a href="javascript:history.back();">[Go Back]</a></body></html>>

When I remove the header, Beautiful Soup scrapes the page in French.

My goal is to scrape the statements and speeches in Arabic in order to build a corpus. Any help is appreciated.


Solution

  • First: in "lang=\"AR-DZ" there is an opening " before AR-DZ but no closing " after it. You should rather use "lang=AR-DZ".
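
Apart from the quoting, note that requests.get() takes params as its second positional argument, so requests.get(url, headers) sends the dict as a query string, not as HTTP headers. The sketch below prepares both variants without sending anything, so the difference is visible. The value "ar-DZ" is an assumption about which language tag fits, and it is not confirmed here that this server honours Accept-Language at all.

```python
import requests

URL = "http://www.mae.gov.dz/news_article/6396.aspx"

# Assumed language tag; Accept-Language takes plain tags, not "lang=...".
headers = {"Accept-Language": "ar-DZ"}

# Build the requests without sending them, to inspect what would go out.
# Passing the dict as params mirrors what requests.get(url, headers) does.
as_params = requests.Request("GET", URL, params=headers).prepare()
# Passing it with the headers keyword is what was intended.
as_headers = requests.Request("GET", URL, headers=headers).prepare()

print(as_params.url)                          # dict ended up in the query string
print(as_headers.url)                         # URL left unchanged
print(as_headers.headers["Accept-Language"])  # value sent as a real header
```

In the real call this corresponds to requests.get(url, headers=headers). The unexpected query string is one possible reason the site rejected the original request.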


    Normally, to change the language of this page in a browser you have to click the link with the URL http://www.mae.gov.dz/select_language.aspx?language=ar&file=default_ar.aspx, which has language=ar, so you can do the same in code.

    Use Session() to remember cookies and first call its get() with this URL. That sets the correct language in the cookies.

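    import requests
    from bs4 import BeautifulSoup

    # Optional headers, left disabled because they are not needed here.
    #headers = {'User-Agent': 'Mozilla/5.0'}
    #headers = {'Accept-Language': "lang=AR-DZ"}

    # A Session keeps cookies between requests.
    s = requests.Session()

    # Request the language-selection URL first so the server stores
    # the Arabic language choice in this session's cookies.
    url = 'http://www.mae.gov.dz/select_language.aspx?language=ar&file=default_ar.aspx'
    r = s.get(url)

    # The same session then fetches the article with those cookies.
    url = 'http://www.mae.gov.dz/news_article/6396.aspx'
    r = s.get(url)

    soup = BeautifulSoup(r.content, "lxml")
    # get_text has to be called; without parentheses only the bound method is printed.
    print(soup.get_text())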
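
Since the goal is an Arabic corpus, it may help to check that a fetched page really came back in Arabic before saving it. Below is a minimal sketch of such a check. The helper name arabic_ratio and the 0.5 threshold are hypothetical choices for illustration, not part of requests or Beautiful Soup, and only the basic Arabic Unicode block (U+0600 to U+06FF) is counted.

```python
def arabic_ratio(text):
    """Return the share of alphabetic characters in the basic Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


def looks_arabic(text, threshold=0.5):
    """Hypothetical check: treat the page as Arabic if most letters are Arabic."""
    return arabic_ratio(text) >= threshold


samples = ["وزارة الشؤون الخارجية", "Ministère des Affaires étrangères"]
for sample in samples:
    print(looks_arabic(sample), sample)
```

With the session code, the text to test would be the extracted page text, for example looks_arabic(soup.get_text()). A page that fails the check probably means the language cookie was not set.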