I am trying to scrape pages from this website. The Arabic and French versions of each page have the same URL. I tried the following code:
headers = {'Accept-Language': "lang=\"AR-DZ"}
r = requests.get("http://www.mae.gov.dz/news_article/6396.aspx",headers)
soup = BeautifulSoup(r.content,"lxml")
print(soup.getText)
I get the following error message:
<bound method Tag.get_text of <html><head><title>Request Rejected</title></head><body>The requested URL was rejected. Please consult with your administrator.<br/><br/>Your support ID is: 12750291427324767866<br/><br/><a href="javascript:history.back();">[Go Back]</a></body></html>>
When I remove the header, BeautifulSoup scrapes the page in French.
My goal is to scrape the statements and speeches in Arabic in order to build a corpus. Any help appreciated.
First: in "lang=\"AR-DZ" you have an opening " before AR-DZ but no closing " after AR-DZ. You should rather use "lang=AR-DZ".
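To make the quote problem visible, here is a small sketch that uses only the standard library. It prints the value from the question and counts the stray quote. It also shows roughly what happens to the dict in the original call: `requests.get(url, headers)` passes it positionally, and the second positional parameter of `requests.get` is `params`, so the dict goes into the query string instead of the request headers. The `ar-DZ` value at the end is my assumption of a conventional language tag; nothing here confirms that the site honors it.

```python
from urllib.parse import urlencode

# The value from the question: the escaped quote before AR-DZ is never closed.
bad_value = "lang=\"AR-DZ"
print(bad_value)             # the printed text contains one stray double quote
print(bad_value.count('"'))  # 1

# requests.get(url, headers) binds the dict to `params`, so it is URL-encoded
# into the query string. urlencode gives an approximation of that result.
query = urlencode({"Accept-Language": bad_value})
print(query)

# Assumption: a conventional language tag; the site may or may not honor it.
headers = {"Accept-Language": "ar-DZ"}
# Intended call (needs the third-party requests package and network access):
# r = requests.get(url, headers=headers)
```

Passing the dict with the `headers=` keyword is what actually sends it as HTTP headers.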
Normally, to change the language on this page in a browser, you have to click a link with the URL http://www.mae.gov.dz/select_language.aspx?language=ar&file=default_ar.aspx, which contains language=ar, so you can do the same in code. Use Session() to remember cookies, and first send a GET request to this URL with that session. It will set the correct language in the cookies.
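import requests
from bs4 import BeautifulSoup

# Optional headers, left disabled here
#headers = {'User-Agent': 'Mozilla/5.0'}
#headers = {'Accept-Language': "lang=AR-DZ"}

# A Session keeps cookies between requests
s = requests.Session()

# Visit the language-selection link first so the Arabic choice is stored in a cookie
url = 'http://www.mae.gov.dz/select_language.aspx?language=ar&file=default_ar.aspx'
r = s.get(url)#, headers=headers)

# Request the article with the same session
url = 'http://www.mae.gov.dz/news_article/6396.aspx'
r = s.get(url)#, headers=headers)

soup = BeautifulSoup(r.content, "lxml")
# Call the method; print(soup.getText) only prints the bound method object
print(soup.get_text())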
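If you need to fetch many articles for the corpus, the same two-step idea can be wrapped in a small helper. This is only a sketch: `fetch_arabic_html` and `LANGUAGE_URL` are names I made up, the 30-second timeout is arbitrary, and I am assuming the site keeps the language choice in a cookie as described above.

```python
LANGUAGE_URL = ("http://www.mae.gov.dz/select_language.aspx"
                "?language=ar&file=default_ar.aspx")


def fetch_arabic_html(session, article_url, language_url=LANGUAGE_URL):
    """Return the raw bytes of article_url after selecting Arabic.

    `session` is anything with a requests.Session-like get() method.
    """
    # Step 1: visit the language link so the session stores the cookie.
    session.get(language_url, timeout=30).raise_for_status()
    # Step 2: request the article with the same session (same cookies).
    response = session.get(article_url, timeout=30)
    response.raise_for_status()
    return response.content


def main():
    # Third-party imports are kept here so the helper above has no dependencies.
    import requests
    from bs4 import BeautifulSoup

    with requests.Session() as s:
        html = fetch_arabic_html(s, "http://www.mae.gov.dz/news_article/6396.aspx")
    soup = BeautifulSoup(html, "lxml")
    print(soup.get_text())

# Call main() to run against the live site.
```

For a long list of articles you would probably visit the language link once and then reuse the session, rather than repeating step 1 for every page.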