I used a crawler to gather the names of famous artists, singers, musicians, and groups. Many names in my list have the surname before the first name, separated by a comma. Here is a sample from my list:
Aalegra, Snoh
Beach Boys
Groove Coverage
Night Verses
Gang Of Youths
Marcy Playground
Fito Blanko
Lowery, Clint
Josh Garrels
Pausini, Laura
Moses, Joe
Julian Trono
Meg Donnelly
Jack Gray
Jola, Marion
Pink Floyd
Judd, Wynonna
Bo Bruce
I have a function that fetches the HTML of a Wikipedia page and extracts some information from the infobox table on the right (such as the origin of a group, or the date and place of birth of a person), but when the string is "surname, name", Wikipedia doesn't find the page.
Any ideas?
Should I change all the strings that have this problem, or drop requests and try Selenium? I don't know the shortest and easiest way.
My function is below:
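import requests
from random import randint
from time import sleep
from bs4 import BeautifulSoup

def get_other_info(artist):
    r = requests.get('https://en.wikipedia.org/wiki/' + artist).text
    sleep(randint(2, 15))  # random pause between requests
    obj = BeautifulSoup(r, 'html.parser')
    table = obj.find('table', class_='infobox vcard plainlist')
    orig = gen = yr = None  # defaults, in case a row is missing
    if table is None:  # no infobox found, e.g. the page does not exist
        return [orig, gen, yr]
    for t in table.select('th'):
        if t.text == 'Origin' or t.text == 'Born':
            orig = t.find_next_siblings('td')[0].text
        elif t.text == 'Genres':
            gen = [i.text for i in t.find_next_siblings('td')[0].find_all('li')]
        elif t.text == 'Years active':
            yr = t.find_next_siblings('td')[0].text
    return [orig, gen, yr]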
You could use a function like this one:
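def searchstring(s):
    """Returns Wikipedia-friendly version of input string s."""
    if ',' in s:
        # Split on the first comma only and strip stray whitespace,
        # so 'Last,First' and 'Last , First' are handled too.
        last, first = (part.strip() for part in s.split(',', 1))
        return first + ' ' + last
    else:
        return s

names = ['Aalegra, Snoh', 'Beach Boys', 'Lowery, Clint', 'Josh Garrels']
for name in names:
    print(searchstring(name))

This prints: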
Snoh Aalegra
Beach Boys
Clint Lowery
Josh Garrels
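To plug this into the request, the reordered name still has to become a page title in the URL. The following is only a sketch: `wikipedia_url` is a hypothetical helper name, not something from the question, and it assumes the page title is simply the name with spaces replaced by underscores and percent-encoded with `urllib.parse.quote`. It repeats `searchstring` so that it runs on its own.

```python
from urllib.parse import quote

def searchstring(s):
    """Returns Wikipedia-friendly version of input string s."""
    if ',' in s:
        last, first = (part.strip() for part in s.split(',', 1))
        return first + ' ' + last
    return s

def wikipedia_url(artist):
    # Hypothetical helper: builds the article URL from a raw list entry.
    # Assumes the title is the reordered name with underscores for spaces.
    title = searchstring(artist).replace(' ', '_')
    return 'https://en.wikipedia.org/wiki/' + quote(title)

print(wikipedia_url('Pausini, Laura'))
# prints https://en.wikipedia.org/wiki/Laura_Pausini
```

In get_other_info you would then call requests.get(wikipedia_url(artist)) instead of concatenating the raw string. Some names may still land on a disambiguation page or on no page at all, so it is worth checking the response before parsing.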