Search code examples
pythonpython-3.xurlbeautifulsoupurllib

url not working over code but manual search works


I am trying to take a string input in my python code to be converted and implemented into a URL to search the string on the website. The website I am using is songbpm.com and what I want is to search a song and I receive the speed of the song. Finding the relevant information within the HTML is not the problem, I have already finished this and my url creation is working, which is here:

import urllib.request
import urllib.parse

song = input("")
fin = ""
for i in song:
    if i == "(":
        tempone = song
        song = tempone.split("(")[0] + tempone.split(") ")[1]

previous = ""
for i in song:
    if i.isalpha():
        temp = fin
        fin = temp + i
    else:
        if previous.isalpha():
            temp = fin
            fin = temp + "-"
    previous = i


songencoded = urllib.parse.quote(song, safe='')
print('https://songbpm.com/'+ fin.lower() + '?q=' + songencoded)

response = urllib.request.urlopen('https://songbpm.com/'+ fin.lower() + '?q=' + songencoded)
text = str(response.read()).split('\\n')

The urls, which are returned are identical to the url when I manually enter the search input on the website, however, when I run this code, it always reads the html data for the no results redirect.

Also, if I paste the computer-generated URL into the browser, it redirects to the no results page, however, after searching the same string by hand in the browser, the computer-generated url works as well (when retrying).

What I have also observed is that after manually opening a certain URL, I can run the code with the same search query and it works - it seems as if searches are cached for a certain amount of time if a user, not a code opens it.

How do I tackle this issue of the code, although generating the exact URL, not being able to open webpages similar to the user.


Solution

  • The site has a few extra requirements to make a suitable request. Firstly it uses cookies, so a cookiejar is needed. This can be loaded by first requesting the homepage without making a search. This also then gives you the value for _csrf which is needed when submitting the request form. Lastly, the POST request can be generated from your input search by using urlencode() to build q correctly:

    from operator import itemgetter
    from bs4 import BeautifulSoup
    import http.cookiejar
    import urllib.request
    import urllib.parse
    
    
    song = input('Enter song: ')
    
    cookie_jar = http.cookiejar.CookieJar()
    cookie_processor = urllib.request.HTTPCookieProcessor(cookie_jar)
    opener = urllib.request.build_opener(cookie_processor)
    
    with opener.open('https://songbpm.com') as response:
        html_1 = response.read().decode('utf-8')
    
    soup_1 = BeautifulSoup(html_1, 'html.parser')    
    data = urllib.parse.urlencode({'q' : song, '_csrf' : soup_1.input['value']}).encode('ascii')
    
    with opener.open('https://songbpm.com/searches', data) as response:
        html_2 = response.read().decode('utf-8')
    
    soup_2 = BeautifulSoup(html_2, 'html.parser')
    
    for a in soup_2.find_all('a', {'class' : 'media'}):
        print(', '.join(itemgetter(0, 1, 4)([p.get_text(strip=True) for p in a.find_all('p')])))
    

    Which would give you the following results:

    Enter song: clean bandit - solo
    Clean Bandit, Solo (feat. Demi Lovato), 105
    Clean Bandit, Solo (feat. Demi Lovato) - Acoustic, 0
    Clean Bandit, Solo (feat. Demi Lovato) - Ofenbach Remix, 121
    Clean Bandit, Solo (feat. Demi Lovato) - Sofi Tukker Remix, 127
    Clean Bandit, Solo (feat. Demi Lovato) - Wideboys Remix, 122
    

    Using beautifulsoup makes it easy to extract all the details. itemgetter() is just a quick way to get certain items from a given list.