
Python Beautiful Soup Extracting HTML Meta Data


I am getting some odd behavior that I do not quite understand. I am hoping someone can explain what is going on.

Consider this metadata:

<meta property="og:title" content="This is the Tesla Semi truck">
<meta name="twitter:title" content="This is the Tesla Semi truck">

This line successfully finds ALL "og" properties and returns a list.

opengraphs = doc.html.head.findAll(property=re.compile(r'^og'))

However, this line fails to do the same thing for the twitter cards.

twitterCards = doc.html.head.findAll(name=re.compile(r'^twitter'))

Why does the first line find all the "og" (Open Graph) tags, while the second line fails to find the Twitter cards?


Solution

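  • The problem is the name= keyword, which has a special meaning in findAll: it matches the tag name (in your code that is meta), not an attribute called name. Your call therefore searches for tags whose tag name starts with "twitter", and there are none.

    To match the name attribute instead, pass the tag name "meta" and a dictionary with the "name" key.

    Example with several different items: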

    from bs4 import BeautifulSoup
    import re
    
    data='''
    <meta property="og:title" content="This is the Tesla Semi truck">
    <meta property="twitter:title" content="This is the Tesla Semi truck">
    <meta name="twitter:title" content="This is the Tesla Semi truck">
    '''
    
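    # Name the parser explicitly; bs4 warns when it has to guess one
    head = BeautifulSoup(data, 'html.parser')
    
    # property= is not a reserved argument, so it is treated as an attribute filter
    print(head.findAll(property=re.compile(r'^og'))) # OK
    print(head.findAll(property=re.compile(r'^tw'))) # OK
    
    # name= filters on the tag name, not on the name attribute
    print(head.findAll(name=re.compile(r'^meta'))) # OK, matches every <meta> tag
    print(head.findAll(name=re.compile(r'^tw')))   # empty, no tag name starts with "tw"
    
    # Pass the attribute filter as a dictionary to match the name attribute
    print(head.findAll('meta', {'name': re.compile(r'^tw')})) # OK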
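
The same lookup can be written in a couple of other ways. The sketch below is only an illustration, not the one required approach: it uses find_all (the current spelling of findAll) with the attrs= keyword, and a CSS attribute selector through select(). The variable names (html, soup, by_attrs, by_css, twitter_cards) are made up for this example, and the markup is the two tags from the question.

```python
import re
from bs4 import BeautifulSoup

# The two <meta> tags from the question, wrapped in a <head> for illustration.
html = '''
<head>
<meta property="og:title" content="This is the Tesla Semi truck">
<meta name="twitter:title" content="This is the Tesla Semi truck">
</head>
'''

soup = BeautifulSoup(html, "html.parser")

# Option 1: attrs= takes a dictionary of attribute filters, so the
# "name" key here refers to the HTML attribute, not the tag name.
by_attrs = soup.find_all("meta", attrs={"name": re.compile(r"^twitter")})

# Option 2: a CSS selector; [name^="twitter"] means
# "the name attribute starts with twitter".
by_css = soup.select('meta[name^="twitter"]')

# Collect the matches into a plain dict of name -> content.
twitter_cards = {tag["name"]: tag["content"] for tag in by_attrs}
print(twitter_cards)
```

Both options avoid the reserved name= keyword entirely, which is why they reach the attribute. The regex version is handy when the pattern is more complex than a prefix; the selector version is shorter for simple prefix matches.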