Beautiful Soup: getting the text from a tag inside another tag


I am trying to parse a web page using Beautiful Soup (for the first time in my life) and I am running into a strange error. There is a tag within a tag in the HTML structure, and I keep getting this error:

AttributeError: 'NoneType' object has no attribute 'text'

The HTML is structured as follows: the whole grid of items on the page sits inside a div with class "properties-previews", which contains a div with class "preview" for each item. Each "preview" has two child divs: "preview-media" for the photo and "preview-content" for the text info I need to parse. "preview-content" contains an [a] tag holding two [span] tags with the price and the area of the item, and an [h2] tag with the territory, which I also need.

<div class="properties-previews">
    <div class="preview">
        <div class="preview-media"></div>
        <div class="preview-content">
            <a href="/properties/1042-us-highway-1-hancock-me-04634/1330428"
               class="preview__link">
                <span class="preview__price">$89,900</span>
                <span class="preview__size">1 ac</span>
                <div class="preview__subtitle">
                    <h2 class="-g-truncated preview__subterritory">Hancock County
                    </h2>
                    <span class="preview__extended">-- sq ft</span>
                </div>
            </a>
        </div>
    </div>
</div>

So I am trying to extract $89,900 from preview_price, 1 ac from preview_size, and Hancock County from preview_subtitle. My Python code so far looks something like this (I have omitted all imports and requests):

landplots = soup.find_all('div', class_ = 'properties-previews')

for l in landplots:
  plot_price = l.find('span', {"class": 'preview_price'})
  plot_square = l.find('span', {"class": 'preview_size'})
  plot_county = l.find('h2', class_ = '-g-truncated preview__subterritory').text
  plot_location = l.find('span', class_ = 'preview__locality -g-truncated').text

  print(plot_price).text
  print(plot_county)

What am I doing wrong? My understanding is that once a tag sits inside another tag, some special syntax is needed to get at its text, but the error saying there is no text at all (on both prints) confuses me a lot. Please help!


Solution

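  • The AttributeError means that find() returned None, which it does whenever nothing matches. The code in the question looks for preview_price and preview_size (one underscore), while the markup uses preview__price and preview__size (two underscores). The plot_location lookup fails the same way if no tag has the class attribute "preview__locality -g-truncated"; none appears in the markup shown. In addition, print(plot_price).text reads .text from the return value of print(), which is always None. Nesting needs no special syntax, because find() searches all descendants. Each value sits in a text node, so once the right tag is found you can call .find_next(text=True) on it to extract the data (newer Beautiful Soup versions prefer the spelling string=True):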

    html='''
    <div class="properties-previews">
     <div class="preview">
      <div class="preview-content">
       <a class="preview__link" href="/properties/1042-us-highway-1-hancock-me-04634/1330428">    
        <span class="preview__price">
         $89,900
        </span>
        <span class="preview__size">
         1 ac
        </span>
        <div class="preview__subtitle">
         <h2 class="-g-truncated preview__subterritory">
          Hancock County
         </h2>
         <span class="preview__extended">
          -- sq ft
         </span>
        </div>
       </a>
      </div>
     </div>
    </div>
    '''
    
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, 'html.parser')
    #print(soup.prettify())
    
    # Each listing keeps its text fields inside its own "preview-content" div.
    landplots = soup.find_all('div', class_='preview-content')
    
    for l in landplots:
        # Class names use a double underscore, exactly as in the markup.
        plot_price = l.find('span', class_='preview__price').find_next(text=True).get_text(strip=True)
        plot_square = l.find('span', class_='preview__size').find_next(text=True).get_text(strip=True)
        plot_county = l.find('h2', class_='-g-truncated preview__subterritory').find_next(text=True).get_text(strip=True)
    
        print(plot_price)
        print(plot_square)
    

    Output:

    $89,900
    1 ac
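
To make the cause of the error concrete, here is a minimal sketch on a trimmed copy of the question's markup (the href is shortened to "#" and the variable names are only illustrative). It shows that a class name with one underscore matches nothing, so find() returns None, while the correct name is found even though the tag sits several levels down:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (href shortened for brevity).
html = """
<div class="properties-previews">
  <div class="preview">
    <div class="preview-content">
      <a class="preview__link" href="#">
        <span class="preview__price">$89,900</span>
        <div class="preview__subtitle">
          <h2 class="-g-truncated preview__subterritory">Hancock County</h2>
        </div>
      </a>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
grid = soup.find("div", class_="properties-previews")

# One underscore: no such class exists in the markup, so find() returns None.
# Calling .text on that None is what raises the AttributeError.
missing = grid.find("span", class_="preview_price")
print(missing)

# Two underscores: the span is found although it is nested several levels
# below `grid`, because find() searches all descendants.
price = grid.find("span", class_="preview__price").get_text(strip=True)

# A single class name is enough to match a tag that carries several classes.
county = grid.find("h2", class_="preview__subterritory").get_text(strip=True)

print(price, county)
```

Searching by one class name, as in the last lookup, also avoids depending on the order in which the classes are written in the HTML.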
    

    Update: the same code also works against the live page's HTML DOM:

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://www.landsearch.com/industrial/united-states/p1'
    res = requests.get(url)
    
    # The 'lxml' parser requires the lxml package to be installed.
    soup = BeautifulSoup(res.content, 'lxml')
    
    # Each listing keeps its text fields inside its own "preview-content" div.
    landplots = soup.find_all('div', class_='preview-content')
    
    for l in landplots:
        plot_price = l.find('span', class_='preview__price').find_next(text=True).get_text(strip=True)
        plot_square = l.find('span', class_='preview__size').find_next(text=True).get_text(strip=True)
        plot_county = l.find('h2', class_='-g-truncated preview__subterritory').find_next(text=True).get_text(strip=True)
    
        print(plot_price)
        print(plot_square)
    

    Output:

    $89,900
    1 ac    
    $995,000
    2.32 ac 
    $85,000 
    0.93 ac 
    $888,000
    11 ac   
    $599,000
    21.6 ac 
    $225,000
    3.72 ac 
    $100,000
    6.5 ac  
    $75,000
    4.48 ac
    $749,000
    8.2 ac
    $225,000
    84.5 ac
    $225,000
    84.5 ac
    $275,000
    29 ac
    $275,000
    29 ac
    $40,000
    0.22 ac
    $2,330,000
    2.8 ac
    $535,000
    3.71 ac
    $169,900
    34 ac
    $499,000
    1 ac
    $299,000
    2.53 ac
    $299,000
    2.53 ac
    $299,000
    2.53 ac
    $799,000
    2 ac
    $199,000
    0.79 ac
    $997,600
    3.27 ac
    $699,000
    1.71 ac
    $529,000
    1 ac
    $499,900
    1 ac
    $50,000
    1.14 ac
    $250,000
    55 ac
    $50,000
    1.14 ac
    $11,000,000
    31.4 ac
    $1,200,000
    1.68 ac
    $94,900
    85 ac
    $896,000
    2.38 ac
    $189,000
    1 ac
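
On a live page, a listing may lack one of these tags, and the chained calls above would then fail with the same AttributeError. The following is one possible way to guard against that, not part of the original answer: the helper name text_or_none is hypothetical, and the second listing in the sample markup is invented to show the missing-tag case. It uses select_one() with CSS selectors, which match a class regardless of what other classes the tag carries.

```python
from bs4 import BeautifulSoup

# Sample markup: the first card mirrors the question; the second card is
# invented for illustration and deliberately has no price or county.
html = """
<div class="preview-content">
  <a class="preview__link" href="#">
    <span class="preview__price">$89,900</span>
    <span class="preview__size">1 ac</span>
    <h2 class="-g-truncated preview__subterritory">Hancock County</h2>
  </a>
</div>
<div class="preview-content">
  <a class="preview__link" href="#">
    <span class="preview__size">2.32 ac</span>
  </a>
</div>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first tag matching a CSS selector, or None."""
    tag = parent.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else None


soup = BeautifulSoup(html, "html.parser")

# Build one dict per listing; missing fields become None instead of raising.
rows = [
    {
        "price": text_or_none(card, "span.preview__price"),
        "size": text_or_none(card, "span.preview__size"),
        "county": text_or_none(card, "h2.preview__subterritory"),
    }
    for card in soup.select("div.preview-content")
]

for row in rows:
    print(row)
```

Collecting the values into dictionaries rather than printing them directly also makes it easier to filter out incomplete listings or write the results to a file later.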