python selenium web-scraping beautifulsoup craigslist

BeautifulSoup & Craigslist - Trouble getting data with identical attributes and structure


I am having trouble scraping the HTML below because all of the information is stored in elements with nothing to distinguish them: the same tags, with no identifying attributes.

I want to retrieve the b tag contained within the span tag whose text starts with 'VIN: ', the b tag contained within the span tag whose text starts with 'odometer: ', and so on.

</p>
</div>
<p class="attrgroup">
<span><b>2001 PORSCHE 911</b></span>
<br/>
</p>
<p class="attrgroup">
<span>VIN: <b>WP0CA29961S653221</b></span>
<br/>
<span>fuel: <b>gas</b></span>
<br/>
<span>odometer: <b>46000</b></span>
<br/>
<span>paint color: <b>silver</b></span>
<br/>
<span>size: <b>sub-compact</b></span>
<br/>
<span>title status: <b>clean</b></span>
<br/>
<span>transmission: <b>manual</b></span>
<br/>
<span>type: <b>convertible</b></span>
<br/>
</p>
</div>

I have tried the following variations, to no avail:

all = soup.find_all('section',{'class':'body'})
for i in all:
    print(i.find_all('span'))

&

all = soup.find_all('section',{'class':'body'})
for i in all:
    print(i.find_all('b'))

&

all = soup.find_all('section',{'class':'body'})
for i in all:
    print(i.find_all('p',{'class':'attrgroup'}))

The fields are dynamic, so the structure can change. For example, another listing may not have the odometer information, or the fuel option, so breaking this into a list and getting specific information by index will not be consistent.

How do I successfully do this?


Solution

  • Try something like this:

    from bs4 import BeautifulSoup
    
    html = """
    </p>
    </div>
    <p class="attrgroup">
    <span><b>2001 PORSCHE 911</b></span>
    <br/>
    </p>
    <p class="attrgroup">
    <span>VIN: <b>WP0CA29961S653221</b></span>
    <br/>
    <span>fuel: <b>gas</b></span>
    <br/>
    <span>odometer: <b>46000</b></span>
    <br/>
    <span>paint color: <b>silver</b></span>
    <br/>
    <span>size: <b>sub-compact</b></span>
    <br/>
    <span>title status: <b>clean</b></span>
    <br/>
    <span>transmission: <b>manual</b></span>
    <br/>
    <span>type: <b>convertible</b></span>
    <br/>
    </p>
    </div>
    """
    soup = BeautifulSoup(html, 'html.parser')
    prefixes = ("VIN", "odometer")
    for n in soup.find_all('p', attrs={'class': 'attrgroup'}):
        for x in n.find_all('span'):
            # Keep only spans whose text starts with one of the wanted labels
            if x.text.startswith(prefixes):
                print(x.find('b').text)
    

    Result:

    WP0CA29961S653221
    46000
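
A possible variation, sketched under the assumption that every labelled field is a span whose label text sits directly before its b tag (as in the HTML above): rather than filtering by prefix, collect all fields into a dict keyed by label. The helper name `parse_attrs` is made up for this example and is not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question
html = """
<p class="attrgroup">
<span><b>2001 PORSCHE 911</b></span>
<br/>
</p>
<p class="attrgroup">
<span>VIN: <b>WP0CA29961S653221</b></span>
<br/>
<span>fuel: <b>gas</b></span>
<br/>
<span>odometer: <b>46000</b></span>
<br/>
</p>
"""


def parse_attrs(soup):
    """Hypothetical helper: map each 'label: <b>value</b>' span to a dict entry."""
    attrs = {}
    for group in soup.find_all('p', attrs={'class': 'attrgroup'}):
        for span in group.find_all('span'):
            bold = span.find('b')
            if bold is None:
                continue
            # The label is the text node right before the <b> tag, e.g. "VIN: ".
            # Spans without one (such as the listing title) are skipped.
            label = bold.previous_sibling
            if not isinstance(label, str):
                continue
            key = label.strip().rstrip(':').strip()
            if key:
                attrs[key] = bold.get_text(strip=True)
    return attrs


attrs = parse_attrs(BeautifulSoup(html, 'html.parser'))
print(attrs.get('VIN'))
print(attrs.get('odometer'))
print(attrs.get('paint color'))  # field absent from this shortened HTML
```

A field that a listing does not have is simply absent from the dict, so `attrs.get(...)` returns `None` for it and nothing depends on the position of a field within the list.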