python, python-requests, screen-scraping

Organize names (found by scraping the internet) to use them as input in requests


I used a crawler to gather names of famous artists, singers, musicians, and groups. Many names in my list have the surname before the first name, with a comma in between. Here is a sample from my list:

Aalegra, Snoh
Beach Boys
Groove Coverage
Night Verses
Gang Of Youths
Marcy Playground
Fito Blanko
Lowery, Clint
Josh Garrels
Pausini, Laura
Moses, Joe
Julian Trono
Meg Donnelly
Jack Gray
Jola, Marion
Pink Floyd
Judd, Wynonna
Bo Bruce

I have a function that fetches the HTML of a Wikipedia page and extracts some information from the infobox on the right (such as the origin of a group, or the date and place of birth of a person), but when the string is in "surname, name" form, Wikipedia obviously doesn't find the page.

Any ideas?

Should I change all the strings that have this problem, or drop requests and try Selenium? I don't know the shortest and easiest way.

My function is below:

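from random import randint
from time import sleep

import requests
from bs4 import BeautifulSoup


def get_other_info(artist):
    r = requests.get('https://en.wikipedia.org/wiki/' + artist).text
    sleep(randint(2, 15))  # random pause between requests
    obj = BeautifulSoup(r, 'html.parser')
    # Defaults, so the return works even when a row (or the whole table) is missing
    orig, gen, yr = None, [], None
    table = obj.find('table', class_='infobox vcard plainlist')
    if table is None:  # no infobox, e.g. the page was not found
        return [orig, gen, yr]
    for t in table.select('th'):
        if t.text == 'Origin' or t.text == 'Born':
            orig = t.find_next_siblings('td')[0].text
        elif t.text == 'Genres':
            gen = [i.text for i in t.find_next_siblings('td')[0].find_all('li')]
        elif t.text == 'Years active':
            yr = t.find_next_siblings('td')[0].text
    return [orig, gen, yr]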

Solution

  • You could use a function like this one:

    def searchstring(s):
        """Returns Wikipedia-friendly version of input string s."""
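        if ',' in s:
            # Split on the first comma only and trim stray spaces,
            # so 'Surname,Name' and 'Surname , Name' also work.
            last, first = (part.strip() for part in s.split(',', 1))
            return first + ' ' + last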
        else:
            return s
    
    names = ['Aalegra, Snoh', 'Beach Boys', 'Lowery, Clint', 'Josh Garrels']
    
    for name in names:
        print(searchstring(name))

    Output:
    
    Snoh Aalegra
    Beach Boys
    Clint Lowery
    Josh Garrels
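
To feed the cleaned names into the request from the question, each name also has to become a valid URL path. Below is a minimal sketch, assuming that replacing spaces with underscores yields a usable article title. `normalize_name` and `wikipedia_url` are hypothetical helper names introduced here for illustration; they are not part of the original answer.

```python
from urllib.parse import quote

WIKI_BASE = 'https://en.wikipedia.org/wiki/'


def normalize_name(s):
    """Turn 'Surname, Name' into 'Name Surname'; leave other strings unchanged."""
    if ',' in s:
        # Split on the first comma only and trim surrounding spaces.
        last, first = (part.strip() for part in s.split(',', 1))
        return first + ' ' + last
    return s.strip()


def wikipedia_url(name):
    """Build an article URL for a (possibly 'Surname, Name') artist string."""
    # Assumption: spaces are written as underscores in the article title.
    title = normalize_name(name).replace(' ', '_')
    # quote() percent-encodes characters that are not safe in a URL path.
    return WIKI_BASE + quote(title)


for name in ['Aalegra, Snoh', 'Beach Boys', 'Pausini, Laura']:
    print(wikipedia_url(name))
```

The resulting URL can be passed to `requests.get` in `get_other_info`. Checking the response's `status_code` before parsing would likely catch names that still do not match an article.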