python html web-scraping beautifulsoup python-requests-html

Why is text inside HTML tags getting translated when requested while Web Scraping?


I am learning a little about web scraping and am currently working on a small project. With this code I store the page's HTML in the soup variable:

import requests
from bs4 import BeautifulSoup
source = requests.get(URL)
soup = BeautifulSoup(source.text, 'html.parser')

The problem is that when I inspect the page in my browser, the element looks like this:

<a ...>The Godfather</a>

but when I fetch it in my program, only the text inside the tag (The Godfather) is translated into my native language (Кум):

<a ...>Кум</a>

I don't want it to be translated. My browser is set entirely to English, and I have no idea why this is happening. Any help would be much appreciated!


Solution

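  • Try specifying the Accept-Language HTTP header in your request. Your browser sends this header based on its language settings, but requests does not send it by default, so the server has to guess the language (most likely from your IP address's location):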

    import requests
    from bs4 import BeautifulSoup
    
    
    url = "https://www.imdb.com/search/title/?groups=top_100&sort=user_rating,desc"
    
    headers = {"Accept-Language": "en-US,en;q=0.5"}
    soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
    
    
    for h3 in soup.select("h3"):
        print(h3.get_text(strip=True, separator=" "))
    

    Prints:

    1. The Shawshank Redemption (1994)
    2. The Godfather (1972)
    3. The Dark Knight (2008)
    4. The Lord of the Rings: The Return of the King (2003)
    5. Schindler's List (1993)
    6. The Godfather Part II (1974)
    
    ...
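
The header value "en-US,en;q=0.5" used above is a comma-separated list of language tags, each with an optional q weight between 0 and 1 (a tag without a weight counts as 1). As a minimal sketch of how such a value is put together, the helper below builds it from (tag, weight) pairs. The name build_accept_language is hypothetical and not part of requests or BeautifulSoup; it only illustrates the format.

```python
def build_accept_language(preferences):
    """Build an Accept-Language value from (language_tag, weight) pairs.

    Hypothetical helper for illustration only. A weight of 1.0 is the
    default and is omitted, as in "en-US,en;q=0.5".
    """
    parts = []
    for tag, weight in preferences:
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"weight for {tag!r} must be between 0 and 1")
        if weight == 1.0:
            # Full preference: the q parameter can be left out
            parts.append(tag)
        else:
            # Up to three decimals, with trailing zeros trimmed
            q = f"{weight:.3f}".rstrip("0").rstrip(".")
            parts.append(f"{tag};q={q}")
    return ",".join(parts)


# Prefer US English, then fall back to any English variant
headers = {"Accept-Language": build_accept_language([("en-US", 1.0), ("en", 0.5)])}
print(headers)  # {'Accept-Language': 'en-US,en;q=0.5'}
```

The resulting headers dict can be passed to requests.get(url, headers=headers) exactly as in the answer above. Whether a given site honors the header is up to the server, so it is worth checking the returned text.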