beautifulsoup python-3.7

Extract text from specific sections in HTML with Python


I'm trying to write a program that shows you the lyrics of a song, but I'm stuck on this error:

AttributeError: 'NoneType' object has no attribute 'text'

Here's the code:

def get_lyrics(url):
    lyrics_html = requests.get(url)
    soup = BeautifulSoup(lyrics_html.content, "html.parser")
    lyrics = soup.find('div', {"class": "lyrics"})
    return lyrics.text

This is the site where I get the lyrics. I can't work out what's wrong. For example, these are the lyrics of the song I'm searching for: click. You can see for yourself that on the page the lyrics sit in a div with class "lyrics". All lyrics pages on this site are built this way. Can someone help me, please? Thanks.


Solution

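  • The site returns two versions of the page (probably to confuse scrapers and bots): one where the lyrics are in tags whose class begins with "Lyrics__Container...", and one where they are in a tag with class "lyrics". Your code only looks for class "lyrics", so when the other version is served, find() returns None and .text raises the AttributeError. If no tag with a class starting with Lyrics__Container is found, the lyrics are inside the tag with class lyrics.

    This should always print the lyrics: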

    import requests
    from bs4 import BeautifulSoup
    
    
    url = 'https://genius.com/Luis-sal-ciao-mi-chiamo-luis-lyrics'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    
    text = soup.select_one('div[class^="Lyrics__Container"], .lyrics').get_text(strip=True, separator='\n')
    print(text)
    

    Prints:

    [Intro]
    Ah, mhh (ehi)
    Ho la bocca piena
    Va bene
    [Verse]
    Ciao, mi chiamo Luis (eh, eh-eh)
    Ciao, mi chiamo Luis (eh, eh-eh)
    Ciao, Ciao mi chiamo Luis (eh, eh-eh)
    Ciao, mi chiamo Luis
    Si, si, si Sal
    A a a a Si si si si si si
    Proprio così mi chiamo io
    Ciao mi chiamo Luis Aah
    
    ... and so on.
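
The one-liner above would still raise the same AttributeError if neither selector matched, because select_one() returns None in that case, just like find(). A minimal sketch of a guarded version follows. The helper name `extract_lyrics` and the three HTML snippets are made up for illustration (they only imitate the two layouts described above), so the example runs without any network access:

```python
from bs4 import BeautifulSoup


def extract_lyrics(html):
    """Return the lyrics text, or None if neither layout is present.

    Hypothetical helper, not part of the original answer.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Try the "Lyrics__Container..." layout first, then fall back to ".lyrics".
    containers = soup.select('div[class^="Lyrics__Container"]') or soup.select(".lyrics")
    if not containers:
        # Nothing matched: report it instead of calling a method on None.
        return None
    return "\n".join(c.get_text(strip=True, separator="\n") for c in containers)


# Made-up snippets standing in for the two page versions and a page without lyrics.
new_layout = '<div class="Lyrics__Container-abc">[Intro]<br/>Line one</div>'
old_layout = '<div class="lyrics"><p>[Intro]<br/>Line one</p></div>'
no_lyrics = '<div class="other">nothing here</div>'

print(extract_lyrics(new_layout))
print(extract_lyrics(old_layout))
print(extract_lyrics(no_lyrics))  # None
```

Returning None (or raising an error with a clear message) lets the caller decide what to do when a page has an unexpected layout.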
    

    EDIT: Updated version:

    import requests
    from bs4 import BeautifulSoup
    
    
    url = 'https://genius.com/Avicii-the-nights-lyrics'
    soup = BeautifulSoup(requests.get(url).content, 'lxml')
    
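    def get_text(elements):
        # Collect the text of each container; join with newlines so the last
        # line of one container does not run into the first line of the next.
        parts = []
        for c in elements:
            # Unwrap <a> and <span> so their text stays on the same line.
            for t in c.select('a, span'):
                t.unwrap()
            if c:
                # Merge the adjacent text nodes left behind by unwrap().
                c.smooth()
                parts.append(c.get_text(strip=True, separator='\n'))
        return '\n'.join(parts)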
    
    
    cs = soup.select('div[class^="Lyrics__Container"]')
    if cs:
        text = get_text(cs)
    else:
        text = get_text(soup.select('.lyrics'))
    
    print(text)
    

    Prints:

    [Verse 1]
    (Hey)
    Once upon a younger year
    When all our shadows disappeared
    The animals inside came out to play (Hey)
    Hey, went face to face with all our fears
    Learned our lessons through the tears
    Made memories we knew would never fade
    [Pre-Chorus]
    One day my father he told me
    Son, don't let it slip away
    
    ...etc.
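
The updated version unwraps the a and span tags before extracting text, presumably because words wrapped in such tags would otherwise end up on lines of their own. The sketch below shows the effect on a made-up fragment (not taken from the real page); the function names are hypothetical, and Tag.smooth() assumes Beautiful Soup 4.8.0 or later:

```python
from bs4 import BeautifulSoup

# Made-up fragment: one lyric line with a word wrapped in <a>, then a second line.
html = '<div class="lyrics">Once upon a <a href="#">younger</a> year<br/>Next line</div>'


def naive_text(html):
    # Each text node is stripped and joined with "\n", so the <a> splits the line.
    div = BeautifulSoup(html, "html.parser").select_one(".lyrics")
    return div.get_text(strip=True, separator="\n")


def unwrapped_text(html):
    div = BeautifulSoup(html, "html.parser").select_one(".lyrics")
    for tag in div.select("a, span"):
        tag.unwrap()   # replace the tag with its contents
    div.smooth()       # merge the now-adjacent text nodes into one
    return div.get_text(strip=True, separator="\n")


print(naive_text(html).split("\n"))
# ['Once upon a', 'younger', 'year', 'Next line']
print(unwrapped_text(html).split("\n"))
# ['Once upon a younger year', 'Next line']
```

Calling unwrap() alone is not enough here: the three text pieces stay separate nodes until smooth() merges them.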