python html web-scraping beautifulsoup attributes

Scrape a sub-attribute with bs4 in Python?


I'm trying to scrape the ids on a website, but I can't figure out how to specify the entry I want to work with. I've narrowed it down to a specific class, but I'm not sure how to target the number stored under "id" inside the data-preview attribute. Here's what I've narrowed the variable soup down to:

<li class="Li FnPreviewItem" data-preview='{ "type" : "animation", "id" : "288857982", "staticUrl" : "www.website.com/image.png",  }'>
<div class="Li Inner FnImage">
<span class="Image" style="background-image:url(www.website.com/image.png);"></span>
</div>
<div class="ImgPreview FnPreviewImage MdNonDisp">
<span class="Image FnPreview" style="background-image:url(www.website.com/image.png);">
</span></div>
</li>

Here is the relevant snippet of what I have so far:

from pathlib import Path
from bs4 import BeautifulSoup
import requests
import re

url = "www.website.com/image.png"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

elsoupo = soup.find(attrs={"class": "a fancy title for this class"})
print(elsoupo)

I just started working with Python, so hopefully I'm wording this in a way that makes sense.

I tried to narrow it down with a second attribute filter that would match any number, but I just get None back.

elsoupoNum = elsoupo.find(attrs={"id":"^[-+]?[0-9]+$"})

print(elsoupoNum)

Solution

  • data-preview is an attribute of the li element, and its value is an (ill-formed) JSON string. I corrected it for simplicity by removing the trailing comma before the closing brace; you may want to check this against the real page.

    code

    from bs4 import BeautifulSoup
    import json
    
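    # Sample markup; the trailing comma in data-preview has been removed so it is valid JSON.
    # Named html rather than str to avoid shadowing the built-in str type.
    html = '''
    <li class="Li FnPreviewItem" data-preview='{ "type" : "animation", "id" : "288857982", "staticUrl" : "www.website.com/image.png"  }'>
    <div class="Li Inner FnImage">
    <span class="Image" style="background-image:url(www.website.com/image.png);"></span>
    </div>
    <div class="ImgPreview FnPreviewImage MdNonDisp">
    <span class="Image FnPreview" style="background-image:url(www.website.com/image.png);">
    </span></div>
    </li>
    '''
    
    soup = BeautifulSoup(html, 'html.parser')
    # Select the first <li> that carries a data-preview attribute.
    li = soup.select_one('li[data-preview]')
    # The attribute value is a plain string that holds JSON.
    data = li.attrs['data-preview']
    print(data)
    # Parse the JSON string into a dict and read its "id" key.
    j = json.loads(data)
    print(j['id'])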
    

    output

    { "type" : "animation", "id" : "288857982", "staticUrl" : "www.website.com/image.png"  }
    288857982
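
If the real page still has the trailing comma in data-preview, json.loads will raise an error on the raw value. The following is a minimal sketch of one way to handle that and collect the id from every matching li, not a definitive implementation. The helper name parse_preview, the ul wrapper, and the second li (with id 288857983) are hypothetical and added only for illustration. It also assumes the trailing comma is the only defect in the JSON.

```python
import json
import re

from bs4 import BeautifulSoup

# Markup modeled on the question, with the original trailing comma left in.
# The second <li> and the <ul> wrapper are hypothetical, for illustration only.
html = '''
<ul>
<li class="Li FnPreviewItem" data-preview='{ "type" : "animation", "id" : "288857982", "staticUrl" : "www.website.com/image.png",  }'></li>
<li class="Li FnPreviewItem" data-preview='{ "type" : "animation", "id" : "288857983", "staticUrl" : "www.website.com/image2.png",  }'></li>
</ul>
'''


def parse_preview(raw):
    """Parse a data-preview value, tolerating a trailing comma before } or ]."""
    # Assumption: a trailing comma is the only defect. This regex would also
    # rewrite a string value that happens to contain ", }", so it is a sketch,
    # not a general JSON repair tool.
    cleaned = re.sub(r',\s*([}\]])', r'\1', raw)
    return json.loads(cleaned)


soup = BeautifulSoup(html, 'html.parser')

# select() returns every <li> that has a data-preview attribute.
ids = [parse_preview(li['data-preview'])['id']
       for li in soup.select('li[data-preview]')]
print(ids)  # ['288857982', '288857983']
```

This also shows why the attempt in the question returned None: "id" is a key inside the JSON text of data-preview, not an HTML attribute of any tag, so an attrs={"id": ...} filter has nothing to match, even with a compiled regular expression.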