Tags: python, web-scraping, beautifulsoup

Beautifulsoup extracting


I have a few simple BeautifulSoup questions (1-3 go together and 4-6 go together). Suppose I have HTML with the following structure:

    <meta property="tall"/>
    <meta property="wide" content="spiral"/>
    <meta name="red"/>
    <meta name="tall"/>

  1. How can I find all of the instances of property?
  2. How can I then extract "tall" and "wide"?
  3. How can I then extract property?
  4. How can I find all of the instances of "tall"?
  5. How can I then extract name and property?
  6. How can I then extract "tall"?

    What I can easily do is extract all instances of meta:

    soup1.find_all("meta")
    

    But after that, I have to go through each element of the resulting list to get things like property and name. I would rather skip this step and get all instances of property and name directly, if possible.

  7. Finally, suppose I fetch a page with requests.get, but the site only loads more content when you click a button at the bottom. How can I get that extra content as well?


Solution

  • I'm not an expert at using BeautifulSoup, but I gave it a try and here's what I came up with, which is hopefully enough to get you started. Just be aware that there might be more elegant solutions.

    Boilerplate:

    from bs4 import BeautifulSoup
    import re
    
    a = """<meta property="tall"/>
    <meta property="wide" content="spiral"/>
    <meta name="red"/>
    <meta name="tall"/>"""
    
    # Name the parser explicitly; otherwise bs4 guesses one and emits a warning
    soup = BeautifulSoup(a, "html.parser")
    

    Questions:

    I.

    # True matches any <meta> tag that has a property attribute, whatever its value
    p = soup.find_all('meta', attrs={"property": True})
    >> [<meta property="tall"/>, <meta content="spiral" property="wide"/>]
    

    II.

    ex = [tag['property'] for tag in p]
    >> ['tall', 'wide']
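
    A CSS attribute selector is another way to handle I and II. The following is only a minimal sketch, assuming the sample HTML from the question and the built-in html.parser; the variable names are illustrative:

```python
from bs4 import BeautifulSoup

html = """<meta property="tall"/>
<meta property="wide" content="spiral"/>
<meta name="red"/>
<meta name="tall"/>"""

soup = BeautifulSoup(html, "html.parser")

# "meta[property]" matches every <meta> tag that carries a property
# attribute, regardless of the attribute's value.
with_property = soup.select("meta[property]")

# Read the attribute value from each matched tag.
property_values = [tag["property"] for tag in with_property]
print(property_values)
```

    The selector only tests for the presence of the attribute, so the values are read from the matched tags in a second step.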
    

    III. I'm not sure what you mean, maybe it's covered already?

    IV.

    alltall = soup.find_all('meta', attrs={'name': 'tall'})
    alltall += soup.find_all('meta', attrs={'property': 'tall'})
    >> [<meta name="tall"/>, <meta property="tall"/>]
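
    If the attribute names are not known in advance, find_all also accepts a function as a filter, so IV can probably be done in one pass. This is a sketch under the assumption that "instances of tall" means any meta tag with some attribute whose value is exactly "tall"; has_tall is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

html = """<meta property="tall"/>
<meta property="wide" content="spiral"/>
<meta name="red"/>
<meta name="tall"/>"""

soup = BeautifulSoup(html, "html.parser")


def has_tall(tag):
    """Return True for <meta> tags with any attribute equal to "tall"."""
    return tag.name == "meta" and "tall" in tag.attrs.values()


# find_all calls has_tall on every tag and keeps those that return True.
all_tall = soup.find_all(has_tall)
print(all_tall)
```

    Unlike the two-call version, this returns the tags in document order.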
    

    V./VI. I spent some time searching but did not find an elegant way to do it this way around. Maybe I'm overlooking something.
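
    One possible approach to V and VI is to loop over the attributes of every meta tag and keep the (name, value) pairs whose value is "tall". This is a sketch rather than a definitive answer, and it assumes that is what the questions are asking for; the variable names are illustrative:

```python
from bs4 import BeautifulSoup

html = """<meta property="tall"/>
<meta property="wide" content="spiral"/>
<meta name="red"/>
<meta name="tall"/>"""

soup = BeautifulSoup(html, "html.parser")

# tag.attrs is a dict of attribute name -> value for a single tag.
pairs = [
    (name, value)
    for tag in soup.find_all("meta")
    for name, value in tag.attrs.items()
    if value == "tall"
]

attr_names = [name for name, _ in pairs]     # V: which attributes hold "tall"
attr_values = [value for _, value in pairs]  # VI: the matched values themselves
print(attr_names, attr_values)
```

    This still visits each tag, but the loop is folded into a single comprehension, so no separate indexing step is needed.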