python html beautifulsoup screen-scraping

Can I scrape a "value" attribute with BeautifulSoup from an img Tag?


I've been testing my understanding of web scraping and have been unable to pull the values of specific attributes from an img tag. I can narrow the search down to the right enclosing tags, but when I try to read the value of "alt" (img alt="what_i_want") I get None. In some other variations of the code, I only get a single item back. From what I understand, the value I'm trying to grab is not technically text or a string, so BS doesn't really have anything to grab. Is this correct?

I'm trying to grab the "EVGA" and other brand names listed within each container:

[<a class="item-brand" href="https://www.newegg.com/EVGA/BrandStore/ID-1402">
    <img alt="EVGA" src="//c1.neweggimages.com/Brandimage_70x28//Brand1402.gif" title="EVGA" />
</a>]

What I've got so far:

webpage = requests.get('https://www.newegg.com/p/pl?Submit=StoreIM&Depa=1&Category=38')
content = webpage.content
soup = BeautifulSoup(content, 'lxml')

containers = soup.find_all("div", class_="item-container")

brand = []

for container in containers:
    cont_brand = container.find_all("div",{"class":"item-info"})
for name_brand in cont_brand:
    brand.append(name_brand.find("img").get("alt"))
print(brand) 

This actually returns ['ASUS'], which is somewhere in the middle of the list of containers I can identify. I can't find anything in the HTML that would single this one out over the others. Another version of the code returned the last value, ['ASRock'], but again I can't find a reason for just that one. I assume it has something to do with how BS4's find() works? Most other variations that use find_all() raise a NoneType error, which I think I understand based on the BS documentation. I've tried swapping in 'html.parser' with no change. I'm currently looking into Selenium to see if there is an answer there.

Any help would be greatly appreciated.


Solution

  • Your first for loop does visit every container, but the second for loop sits outside it. It therefore runs only once, after the first loop has finished, when cont_brand holds just the matches from the last container. The second loop should be nested inside the first one.

    Try this:

    import requests
    from bs4 import BeautifulSoup

    webpage = requests.get('https://www.newegg.com/p/pl?Submit=StoreIM&Depa=1&Category=38')
    content = webpage.content
    soup = BeautifulSoup(content, 'lxml')
    
    containers = soup.find_all("div", class_="item-container")
    
    brand = []
    
    for container in containers:
        cont_brand = container.find_all("div",{"class":"item-info"})
        for name_brand in cont_brand:
            brand.append(name_brand.find("img").get("alt"))
    print(brand)
    

    Output:

    ['EVGA', 'MSI', 'ASUS', 'MSI', 'Sapphire Tech', 'EVGA', 'GIGABYTE', 'XFX', 'ASUS', 'ASRock', 'EVGA', 'ASUS', 'EVGA', 'GIGABYTE', 'GIGABYTE', 'GIGABYTE', 'EVGA', 'EVGA', 'MSI', 'ASRock', 'EVGA', 'XFX', 'Sapphire Tech', 'ASRock', 'GIGABYTE', 'ASUS', 'MSI', 'MSI', 'MSI', 'MSI', 'MSI', 'EVGA', 'GIGABYTE', 'EVGA', 'ASUS', 'GIGABYTE']
    

    If you have BeautifulSoup 4.7.1 or above, you can use a CSS selector instead:

    import requests
    from bs4 import BeautifulSoup

    webpage = requests.get('https://www.newegg.com/p/pl?Submit=StoreIM&Depa=1&Category=38')
    content = webpage.content
    soup = BeautifulSoup(content, 'lxml')
    
    brand = []
    
    for name_brand in soup.select(".item-container .item-info"):
        brand.append(name_brand.find_next('img').get("alt"))
    print(brand)
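
The loop placement issue can be reproduced offline without hitting the live page. The following is a minimal sketch, not the site's real markup: SAMPLE_HTML is a made-up stand-in with three containers, one of which has no brand image, and 'html.parser' is used only so the example needs nothing beyond bs4. It compares the misplaced inner loop, the nested version with a guard for a missing img, and a CSS selector that targets the img tags directly.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the live listing page.
SAMPLE_HTML = """
<div class="item-container">
  <div class="item-info">
    <a class="item-brand" href="#"><img alt="EVGA" src="evga.gif" title="EVGA" /></a>
  </div>
</div>
<div class="item-container">
  <div class="item-info">
    <a class="item-title" href="#">Unbranded card</a>
  </div>
</div>
<div class="item-container">
  <div class="item-info">
    <a class="item-brand" href="#"><img alt="ASUS" src="asus.gif" title="ASUS" /></a>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
containers = soup.find_all("div", class_="item-container")

# Misplaced inner loop: the comprehension runs once, after the outer loop
# has finished, so cont_brand only holds the matches from the last container.
for container in containers:
    cont_brand = container.find_all("div", class_="item-info")
last_only = [info.find("img").get("alt") for info in cont_brand]

# Nested loops, with a guard for containers that have no brand image.
nested = []
for container in containers:
    for info in container.find_all("div", class_="item-info"):
        img = info.find("img")
        if img is not None:
            nested.append(img.get("alt"))

# CSS selector that matches only img tags carrying an alt attribute.
selected = [img["alt"] for img in soup.select(".item-container .item-info img[alt]")]

print(last_only)  # ['ASUS']
print(nested)     # ['EVGA', 'ASUS']
print(selected)   # ['EVGA', 'ASUS']
```

This also shows that an attribute such as alt is readable like any other value, through either tag.get("alt") or tag["alt"]. One thing to keep in mind: find_next('img') searches everything that comes later in the document, not just the descendants of the current tag, so for a container without a brand image it would pick up the image from a following container. Selecting the img directly, or using find() with a None check, avoids that.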