Tags: python, web-scraping, beautifulsoup, python-requests, botdetect

How to bypass bot detection and scrape a website using Python


The problem

I was new to web scraping and was trying to write a scraper that takes a playlist link and extracts the list of songs and their artists.

The site kept rejecting my connection because it thought I was a bot, so I used UserAgent from fake_useragent to generate a fake User-Agent string and get past the filter.

That partly worked. The request went through, but the response was incomplete: when I opened the page in a browser I could see the contents of the playlist, but in the HTML fetched with requests the playlist area was just a big blank space.

Maybe I have to wait for the page to load? Or is there a stronger bot filter?

My code

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

ua = UserAgent()

melon_site="http://kko.to/IU8zwNmjM"

headers = {'User-Agent' : ua.random}
result = requests.get(melon_site, headers = headers)


print(result.status_code)
src = result.content
soup = BeautifulSoup(src,'html.parser')
print(soup)

Link of website

playlist link

html I get when using requests

html with blank space where the playlist was supposed to be


Solution

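  • The playlist rows are missing from the HTML of the page you requested. They appear to be served from a separate URL (the one used in the code below), so request that URL directly to get the content you want.

    The following attempt should fetch the artist names and their song names.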

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://www.melon.com/mymusic/playlist/mymusicplaylistview_listSong.htm?plylstSeq=473505374'
    
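    # A plain browser-like User-Agent is enough for this endpoint
    r = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})
    soup = BeautifulSoup(r.text,"html.parser")
    # Each song is a table row that contains an element with id "artistName"
    for item in soup.select("tr:has(#artistName)"):
        # Both values are read from the "title" attribute of the matching links
        artist_name = item.select_one("#artistName > a[href*='goArtistDetail']")['title']
        song = item.select_one("a[href*='playSong']")['title']
        print(artist_name,song)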
    

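    The output looks like this (the Korean text is part of each title attribute, roughly "go to page" and "play, new window"):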

    Martin Garrix - 페이지 이동 Used To Love (feat. Dean Lewis) 재생 - 새 창
    Post Malone - 페이지 이동 Circles 재생 - 새 창
    Marshmello - 페이지 이동 Here With Me 재생 - 새 창
    Coldplay - 페이지 이동 Cry Cry Cry 재생 - 새 창
    

    Note: BeautifulSoup 4.7.0 or later is required, because the :has() pseudo-class selector used in the script is only supported from that version on.
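
The title attributes printed by the script carry extra link text ("- 페이지 이동" after the artist, "재생 - 새 창" after the song). If only the bare names are wanted, a small post-processing step could strip it. This is a minimal sketch and not part of the original answer: the helper names clean_artist and clean_song are hypothetical, and the suffix patterns are assumptions taken only from the sample output shown in the answer, so they may need adjusting if the site words its titles differently.

```python
import re

# Assumed suffixes, based only on the sample output in the answer:
#   "- 페이지 이동" (roughly "go to page") follows the artist name
#   "재생 - 새 창"  (roughly "play, new window") follows the song title
ARTIST_SUFFIX = re.compile(r"\s*-\s*페이지 이동\s*$")
SONG_SUFFIX = re.compile(r"\s*재생\s*-\s*새 창\s*$")


def clean_artist(title: str) -> str:
    """Return the artist name without the trailing link text (hypothetical helper)."""
    return ARTIST_SUFFIX.sub("", title).strip()


def clean_song(title: str) -> str:
    """Return the song title without the trailing link text (hypothetical helper)."""
    return SONG_SUFFIX.sub("", title).strip()


if __name__ == "__main__":
    # Two title strings copied from the sample output
    print(clean_artist("Post Malone - 페이지 이동"), "|", clean_song("Circles 재생 - 새 창"))
```

In the scraping loop, the two values would be passed through these helpers before printing, for example print(clean_artist(artist_name), clean_song(song)). Strings that do not end with the assumed suffix are returned unchanged apart from surrounding whitespace.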