I am scraping some data from a website with Python.
I want to do two things:
I want to skip the first two words, "Dubai" and "UAE", which appear in every scraping result.
I want to save the last two entries in two different variables, stripped of the extra whitespace.
try:
    area = soup.find('div', 'location')
    area_result = str(area.get_text().strip().encode("utf-8"))
    print "Area: ", area_result
except StandardError as e:
    area_result = "Error was {0}".format(e)
    print area_result
area_result contains the following data:
'UAE \xe2\x80\xaa>\xe2\x80\xaa\n \n Dubai \xe2\x80\xaa>\xe2\x80\xaa\n \n Business Bay \xe2\x80\xaa>\xe2\x80\xaa\n \n Executive Towers \n \n\n\n \n\n\n\t \n\t \n\t \n\t\n\n\n \n ;\n \n \n \n 1.4 km from Burj Khalifa Tower'
I want the above result to be displayed as follows (note the > between Executive Towers and 1.4 km):
Executive Towers > 1.4 km from Burj Khalifa Tower
import string

def cleanup(s, remove=('\n', '\t')):
    newString = ''
    for c in s:
        # Remove special characters defined above.
        # Then we remove anything that is not printable (for instance \xe2)
        # Finally we remove duplicates within the string matching certain characters.
        if c in remove:
            continue
        elif c not in string.printable:
            continue
        elif len(newString) > 0 and c == newString[-1] and c in ('\n', ' ', ',', '.'):
            continue
        newString += c
    return newString
You could throw something like that in there to clean up the string.
The net result is:
>>> s = 'UAE \xe2\x80\xaa>\xe2\x80\xaa\n \n Dubai \xe2\x80\xaa>\xe2\x80\xaa\n \n Business Bay \xe2\x80\xaa>\xe2\x80\xaa\n \n Executive Towers \n \n\n\n \n\n\n\t \n\t \n\t \n\t\n\n\n \n ;\n \n \n \n 1.4 km from Burj Khalifa Tower'
>>> cleanup(s)
'UAE > Dubai > Business Bay > Executive Towers ; 1.4 km from Burj Khalifa Tower'
Here's a good SO reference to the string library.
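If you only need ASCII output, a shorter variant of the same idea is possible. This is just a sketch, not a drop-in replacement: cleanup_compact is a name I made up, it assumes the input is the UTF-8 encoded byte string from the question (or plain text), and unlike cleanup() it does not collapse repeated ',' or '.' characters.

```python
def cleanup_compact(raw):
    """Drop non-ASCII characters and collapse every run of whitespace."""
    # Keep only ASCII. This discards the invisible \xe2\x80\xaa marks,
    # but also any other non-ASCII text, so it is lossy by design.
    if isinstance(raw, bytes):
        text = raw.decode("ascii", "ignore")
    else:
        text = raw.encode("ascii", "ignore").decode("ascii")
    # split() without arguments splits on any run of whitespace, so joining
    # with a single space removes newlines, tabs and repeated spaces.
    return " ".join(text.split())


sample = (b'UAE \xe2\x80\xaa>\xe2\x80\xaa\n \n Dubai \xe2\x80\xaa>\xe2\x80\xaa\n \n '
          b'Business Bay \xe2\x80\xaa>\xe2\x80\xaa\n \n Executive Towers \n \n\t \n ;\n \n '
          b'1.4 km from Burj Khalifa Tower')
compact = cleanup_compact(sample)
# compact == 'UAE > Dubai > Business Bay > Executive Towers ; 1.4 km from Burj Khalifa Tower'
```

Because " ".join(text.split()) also trims the ends, there is no need for a separate strip() afterwards.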
Going back to the question, I see that you don't want the leading blocks (the ones separated by >) to be present. Quite simply, do:
area_result = cleanup(area_result).split('>')[3].replace(';', '>').strip()
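Since the question also asks for the last two values in two separate variables, here is one way that could look. Treat it as a sketch under assumptions: last_two_parts is a hypothetical helper, it assumes '>' and ';' are the only separators in the scraped text, and it assumes the input is either the UTF-8 byte string shown in the question or already-decoded text.

```python
import re


def last_two_parts(raw):
    """Return the last two separator-delimited entries, whitespace-normalised."""
    # Accept the UTF-8 bytes from the question or already-decoded text.
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    # Remove invisible directional marks such as U+202A (the \xe2\x80\xaa bytes).
    text = re.sub(u"[\u200e\u200f\u202a-\u202e]", u"", text)
    # Split on either separator, then collapse the whitespace inside each piece.
    parts = [" ".join(piece.split()) for piece in re.split(u"[>;]", text)]
    # Discard pieces that were nothing but whitespace.
    parts = [piece for piece in parts if piece]
    return parts[-2], parts[-1]


raw = (b'UAE \xe2\x80\xaa>\xe2\x80\xaa\n \n Dubai \xe2\x80\xaa>\xe2\x80\xaa\n \n '
       b'Business Bay \xe2\x80\xaa>\xe2\x80\xaa\n \n Executive Towers \n \n\t \n ;\n \n '
       b'1.4 km from Burj Khalifa Tower')
building, distance = last_two_parts(raw)
# building == 'Executive Towers'
# distance == '1.4 km from Burj Khalifa Tower'
```

Taking the entries from the end rather than by a fixed index like [3] means it should keep working if a listing has more or fewer breadcrumb levels, as long as at least two entries are present.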