python beautifulsoup python-requests screen-scraping google-search

Extract a specific paragraph while scraping a Google search result

I'm currently working on web scraping and I need to extract the description of a city from a Google search result.

Let's say that I want a description of the city of Madrid. I searched for it and the results page showed a description panel.

This is the source code for the target div:

<div jscontroller="GCSbhd" class="kno-rdesc" jsaction="seM7Qe:c0XUbe;Iigoee:c0XUbe;rcuQ6b:npT2md">
    <h3 class="Uo8X3b OhScic zsYMMe">Description</h3>
    <span>Située au centre de l'Espagne, Madrid, sa capitale, est une ville dotée d'élégants boulevards et de vastes parcs très bien entretenus comme le Retiro. Elle est réputée pour ses riches collections d'œuvres d'art européennes, avec notamment celles du musée du Prado, réalisées par Goya, Velázquez et d'autres maîtres espagnols. Au cœur de la vieille Madrid des Habsbourgs se trouve la Plaza&nbsp;Mayor, bordée de portiques, et, à proximité, le Palais royal baroque et son Armurerie, qui comporte des armes historiques.
        <span>
            <span class="eHaQD"> ―&nbsp;Google
            </span>
        </span>
    </span>
</div>

I tried scraping the page, selecting the <h3> tag, and then taking its sibling, but the result is None. This is the code I used:

import requests
from bs4 import BeautifulSoup
url_PresMadrid = "https://www.google.com/search?q=madrid"
req_PresPadrid = requests.get(url_PresMadrid)
soup_PresMadrid = BeautifulSoup(req_PresPadrid.content, 'html.parser')
target_div_PresMadrid = soup_PresMadrid.find('h3', {'class': 'Uo8X3b OhScic zsYMMe'})
print(target_div_PresMadrid)

I even tried selecting the only parent <div> whose class doesn't change, but that code returns None as well. This is the code for it:

import requests
from bs4 import BeautifulSoup
url_PresMadrid = "https://www.google.com/search?q=madrid"
req_PresPadrid = requests.get(url_PresMadrid)
soup_PresMadrid = BeautifulSoup(req_PresPadrid.content, 'html.parser')
target_div_PresMadrid = soup_PresMadrid.find('div', {'class': 'liYKde g VjDLd'})
print(target_div_PresMadrid)

Can anyone help me understand the mechanics of the search engine so that I can extract that paragraph?

Solution

If you disable JavaScript in your browser, you'll see that the paragraph you want is actually under the class BNeawe s3v9rd AP7Wnd:

<div class="BNeawe s3v9rd AP7Wnd">
 Madrid, Spain's central capital, is a city of elegant boulevards and expansive, manicured parks such as the Buen Retiro. It’s renowned for its rich repositories of European art, including the Prado Museum’s works by Goya, Velázquez and other Spanish masters. The heart of old Hapsburg Madrid is the portico-lined Plaza Mayor, and nearby is the baroque Royal Palace and Armory, displaying historic weaponry.
</div>

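The requests library doesn't execute JavaScript, so the HTML it receives is the same markup you see with JavaScript disabled, not the markup shown in your browser's inspector. That is why the classes from your snippet are not found, and why you need to select the class BNeawe s3v9rd AP7Wnd instead.

Several elements on the page share that class, but find() returns only the first match, which here is the description, so you are fine to use it: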
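To make the difference concrete, here is a small offline sketch. It assumes two hand-shortened copies of the markup quoted in this thread (RENDERED_HTML and STATIC_HTML, with the text cut down for brevity) and a hypothetical helper has_element(); none of these names come from Google or from the libraries. It only illustrates why a selector taken from the browser's inspector finds nothing in the HTML that requests downloads.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup seen in the browser (JavaScript enabled).
RENDERED_HTML = """
<div class="kno-rdesc">
    <h3 class="Uo8X3b OhScic zsYMMe">Description</h3>
    <span>Madrid description as rendered in the browser</span>
</div>
"""

# Shortened copy of the markup served without JavaScript.
STATIC_HTML = """
<div class="BNeawe s3v9rd AP7Wnd">Madrid description as served to requests</div>
"""


def has_element(html, tag, class_name):
    """Return True if the markup contains `tag` carrying the class `class_name`."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find(tag, class_=class_name) is not None


# The inspector's class exists only in the rendered variant.
print(has_element(RENDERED_HTML, "div", "kno-rdesc"))  # True
print(has_element(STATIC_HTML, "div", "kno-rdesc"))    # False
# The static variant has to be queried with its own class names.
print(has_element(STATIC_HTML, "div", "BNeawe"))       # True
```

In practice you could run the same check against req_PresPadrid.content to see which of the two variants you actually received.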

import requests
from bs4 import BeautifulSoup

# requests downloads the page without running any JavaScript
url_PresMadrid = "https://www.google.com/search?q=madrid"
req_PresPadrid = requests.get(url_PresMadrid)
soup_PresMadrid = BeautifulSoup(req_PresPadrid.content, "html.parser")
# find() returns the first div whose class attribute is exactly this string
target_div_PresMadrid = soup_PresMadrid.find("div", {"class": "BNeawe s3v9rd AP7Wnd"})
print(target_div_PresMadrid.text)

Output:

Madrid, Spain's central capital, is a city of elegant boulevards and expansive, manicured parks such as the Buen Retiro. It’s renowned for its rich repositories of European art, including the Prado Museum’s works by Goya, Velázquez and other Spanish masters. The heart of old Hapsburg Madrid is the portico-lined Plaza Mayor, and nearby is the baroque Royal Palace and Armory, displaying historic weaponry.
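Since find() returns None when nothing matches, calling .text on the result raises an AttributeError whenever the page layout differs from the one above. A slightly more defensive variant might look like the sketch below. The function name extract_description and the SAMPLE_HTML string are made up for illustration, and the sketch runs on local markup rather than a live request. It uses select_one() with a CSS selector, which matches the three classes in any order, whereas the class string passed to find() has to match the attribute exactly.

```python
from bs4 import BeautifulSoup


def extract_description(html):
    """Return the text of the first div carrying all three classes, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Like find(), select_one() returns only the first matching element.
    node = soup.select_one("div.BNeawe.s3v9rd.AP7Wnd")
    if node is None:
        return None
    return node.get_text(strip=True)


# Made-up markup with two elements sharing the same classes.
SAMPLE_HTML = """
<div class="BNeawe s3v9rd AP7Wnd">First block: the description.</div>
<div class="BNeawe s3v9rd AP7Wnd">Second block: something else.</div>
"""

print(extract_description(SAMPLE_HTML))        # First block: the description.
print(extract_description("<p>no match</p>"))  # None
```

With a live page you would pass req_PresPadrid.content to the function and handle the None case, for example by printing a message instead of crashing.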