
How to extract the title from an <a> tag within a DIV using Python?


I am new to Python and want to extract the title attribute of every <a> tag placed inside certain DIVs. There could be no titles at all, or as many as 100.

Each child DIV, <div class="Shl zI7 iyn Hsu">, contains an <a> tag with a title attribute.

This is the start of the main DIV that contains all the child DIVs:

<div class="Eqh F6l Jea k1A zI7 iyn Hsu"><div class="Shl zI7 iyn Hsu"><a data-test-id="search-guide" 
href="" title="Search for &quot;living room colors&quot;"><div class="Jea Lfz XiG fZz gjz qDf zI7 iyn 
Hsu" style="white-space: nowrap; background-color: rgb(162, 152, 139);"><div class="tBJ dyH iFc MF7 
erh tg7 IZT mWe">Living</div></div></a>

In the above example, I want to get only living room colors, not the whole value of title=. I guess I could use a regex for that later, but first I have a problem getting the title out of the parsed HTML.

I have tried the following Python:

import requests
from bs4 import BeautifulSoup

url = "https://www.pinterest.com/search/pins/?q=room%20color"
get_url = requests.get(url)
get_text = get_url.text
soup = BeautifulSoup(get_text, "html.parser")
DivTitle = soup.select('a.Shl.zI7.iyn.Hsu')[0].text.strip()
print(DivTitle)

I get: IndexError: list index out of range

When I search for the above keyword, more than one title (suggested keyword) appears in the search results.

I appreciate your help.

EDITED: OK, I got this working when I paste the HTML into my code, but I am trying to make it work by fetching the HTML from the URL instead.

Here is the part that I used:

import requests
from bs4 import BeautifulSoup
vgm_url = 'https://www.pinterest.com/search/pins/?q=skin%20care'
html_text = requests.get(vgm_url).text
soup = BeautifulSoup(html_text, 'html.parser')

But I get no output, and no error either.


Solution

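  • Your selector is wrong: the DIV has the classes you are selecting on, and the A is a child of that DIV, so a.Shl.zI7.iyn.Hsu matches nothing and [0] raises the IndexError. title is an attribute of the A element, so read it with a['title'] rather than .text.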

    from bs4 import BeautifulSoup
    
    data = '''\
    <html>
      <head>
        <meta name="generator"
        content="HTML Tidy for HTML5 (experimental) for Windows https://github.com/w3c/tidy-html5/tree/c63cc39" />
        <title></title>
      </head>
      <body>
        <div class="Eqh F6l Jea k1A zI7 iyn Hsu">
          <div class="Shl zI7 iyn Hsu">
            <a data-test-id="search-guide" href="" title="Search for &quot;living room colors&quot;">
              <div class="Jea Lfz XiG fZz gjz qDf zI7 iyn Hsu" style="white-space: nowrap; background-color: rgb(162, 152, 139);">
                <div class="tBJ dyH iFc MF7 erh tg7 IZT mWe">Living</div>
              </div>
            </a>
          </div>
        </div>
      </body>
    </html>
    '''
    
    soup = BeautifulSoup(data, 'html.parser')
    
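    # Select the first <a> nested inside a DIV that carries these classes.
    a = soup.select('div.Shl.zI7.iyn.Hsu a')[0]
    
    # title is an attribute, so index the tag instead of reading .text.
    print(a['title'])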
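
The question also asks for every title (anywhere from 0 to 100) and for only the quoted phrase, so here is a minimal sketch of one way to do that. The helper name extract_search_titles, the regex, and the second "bedroom colors" entry in the sample are made up for illustration and are not part of the original answer. Looping over the result of select() avoids the IndexError when nothing matches.

```python
import re
from bs4 import BeautifulSoup


def extract_search_titles(html):
    """Hypothetical helper: return the quoted phrase from each matching <a title>."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    # a[title] skips anchors that have no title attribute at all.
    for a in soup.select("div.Shl.zI7.iyn.Hsu a[title]"):
        title = a["title"]  # e.g. Search for "living room colors"
        # Keep only the text between the first and last double quote;
        # fall back to the full title if there are no quotes.
        match = re.search(r'"(.*)"', title)
        titles.append(match.group(1) if match else title)
    return titles


# Made-up sample modelled on the markup in the question.
sample = '''
<div class="Eqh F6l Jea k1A zI7 iyn Hsu">
  <div class="Shl zI7 iyn Hsu"><a href="" title="Search for &quot;living room colors&quot;">Living</a></div>
  <div class="Shl zI7 iyn Hsu"><a href="" title="Search for &quot;bedroom colors&quot;">Bedroom</a></div>
</div>
'''

print(extract_search_titles(sample))           # ['living room colors', 'bedroom colors']
print(extract_search_titles("<html></html>"))  # []
```

As for the EDIT: requests only downloads the initial HTML and does not run JavaScript. If the page builds these DIVs in the browser, which seems likely here, the downloaded text will not contain them and the function returns an empty list. A quick check is to test whether 'search-guide' appears in html_text before parsing.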