
Lstrip and Rstrip won't work, need help removing text from an output in Python 3


The output is part of a list. When I check its type with type(), it returns <class 'bs4.element.Tag'>.

I am trying to remove everything to the left of "href" and everything to the right of "<img". I have tried lstrip and rstrip, but they do not work because each output in my list is unique. Even though each output is unique, they all share the same format, with "href" and "<img".

Here is an example of one of the outputs in my list:

<a class="BlogList-item-image-link" href="/new-blog/nova-approval">
<img alt="Nova Approval" data-image="https://static1.squarespace.com/static/54ceeff4e4b0d9096117315a/5a3ff7e48165f5d70b78414a/5a504ba90d9297f9a55e4ab6/1516062801655/7P1A5814+cropped.jpg" data-image-dimensions="2432x2688" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://static1.squarespace.com/static/54ceeff4e4b0d9096117315a/5a3ff7e48165f5d70b78414a/5a504ba90d9297f9a55e4ab6/1516062801655/7P1A5814+cropped.jpg"/>
</a>

Solution

  • lstrip and rstrip won't be the answer: they are string methods, and each item in your list is a bs4 Tag object, not a string.

    The bs4 docs cover this case. Because each output is a Tag, you can read the href attribute directly from the object.

    from bs4 import BeautifulSoup

    # The markup from the question, stored as a string
    html = '''<a class="BlogList-item-image-link" href="/new-blog/nova-approval">
    <img alt="Nova Approval" data-image="https://static1.squarespace.com/static/54ceeff4e4b0d9096117315a/5a3ff7e48165f5d70b78414a/5a504ba90d9297f9a55e4ab6/1516062801655/7P1A5814+cropped.jpg" data-image-dimensions="2432x2688" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://static1.squarespace.com/static/54ceeff4e4b0d9096117315a/5a3ff7e48165f5d70b78414a/5a504ba90d9297f9a55e4ab6/1516062801655/7P1A5814+cropped.jpg"/>
    </a>'''

    # Pass the markup itself (not the literal string 'html') and name the parser
    soup = BeautifulSoup(html, 'html.parser')

    links = soup.find_all('a')  # All of the anchor tags in a list

    for link in links:
        print(link.get('href'))  # prints /new-blog/nova-approval
    

    This prints the href value of every anchor tag in the HTML.
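Since the items in the question's list are already bs4 Tag objects, there may be no need to re-parse anything: the href can be read from each item directly. The following is a minimal sketch under that assumption. The `outputs` list is a hypothetical stand-in for the asker's list, the second link (`/new-blog/second-post`) is made up for illustration, and `extract_hrefs` is a helper name introduced here, not part of bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the list from the question: two anchor tags
# in the same format. The second href is invented for illustration.
html = (
    '<a class="BlogList-item-image-link" href="/new-blog/nova-approval">'
    '<img alt="Nova Approval"/></a>'
    '<a class="BlogList-item-image-link" href="/new-blog/second-post">'
    '<img alt="Second Post"/></a>'
)
outputs = BeautifulSoup(html, "html.parser").find_all("a")


def extract_hrefs(tags):
    """Return the href of each Tag, skipping tags that have no href."""
    # Tag.get returns None when the attribute is missing, so filter those out.
    return [tag.get("href") for tag in tags if tag.get("href") is not None]


hrefs = extract_hrefs(outputs)
print(hrefs)
```

Using `tag.get("href")` rather than `tag["href"]` avoids a KeyError on any anchor that lacks the attribute, which seems the safer choice when scraping pages whose markup is not fully known.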