python web-scraping strip

python strip function not working properly


I am scraping some data from a website with Python.

I want to do two things:

  1. Skip the first two words, "Dubai" and "UAE", which appear in every scraping result.

  2. Save the last two items in two separate variables, using strip to remove the extra spaces.

    try:
        area = soup.find('div', 'location')
        area_result = str(area.get_text().strip().encode("utf-8"))
        print "Area: ", area_result
    except StandardError as e:
        area_result = "Error was {0}".format(e)
        print area_result
    

area_result contains the following data:

'UAE \xe2\x80\xaa>\xe2\x80\xaa\n            \n                Dubai \xe2\x80\xaa>\xe2\x80\xaa\n            \n                Business Bay \xe2\x80\xaa>\xe2\x80\xaa\n            \n                Executive Towers \n            \n\n\n        \n\n\n\t    \n\t        \n\t    \n\t\n\n\n        \n        ;\n        \n            \n                \n                    1.4 km from Burj Khalifa Tower'

I want the above result to be displayed as follows (note the > between Executive Towers and 1.4 km):

Executive Towers > 1.4 km from Burj Khalifa Tower

Solution

  • import string
    def cleanup(s, remove=('\n', '\t')):
        newString = ''
        for c in s:
            # Skip the special characters listed in `remove`.
            if c in remove:
                continue
            # Skip anything that is not printable (for instance \xe2).
            if c not in string.printable:
                continue
            # Skip a character that repeats the previous one, for certain characters only.
            if newString and c == newString[-1] and c in ('\n', ' ', ',', '.'):
                continue
            newString += c
        return newString
    

    Something like this can be dropped in to clean up the scraped string.
    The net result is:

    >>> s = 'UAE \xe2\x80\xaa>\xe2\x80\xaa\n            \n                Dubai \xe2\x80\xaa>\xe2\x80\xaa\n            \n                Business Bay \xe2\x80\xaa>\xe2\x80\xaa\n            \n                Executive Towers \n            \n\n\n        \n\n\n\t    \n\t        \n\t    \n\t\n\n\n        \n        ;\n        \n            \n                \n                    1.4 km from Burj Khalifa Tower'
    >>> cleanup(s)
    'UAE > Dubai > Business Bay > Executive Towers ; 1.4 km from Burj Khalifa Tower'
    

    Here's a good SO reference to the string library.
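As a small illustration of why cleanup checks string.printable and still needs a separate remove tuple: the stray \xe2\x80\xaa characters are not in string.printable, but newline and tab are, so they would survive the printable check on its own. The following is only a sketch, written to run on Python 3, and the sample string is a shortened fragment modelled on the data above rather than the full scraped text.

```python
import string

# Shortened sample modelled on the scraped text (illustrative fragment only).
sample = 'UAE \xe2\x80\xaa>\xe2\x80\xaa\n\tDubai'

# Whitespace such as '\n' and '\t' counts as printable ...
newline_is_printable = '\n' in string.printable
tab_is_printable = '\t' in string.printable

# ... while the stray non-ASCII characters do not.
stray_is_printable = '\xe2' in string.printable

# Filtering on string.printable alone drops the stray characters
# but keeps the newline and the tab.
printable_only = ''.join(c for c in sample if c in string.printable)
print(repr(printable_only))
```

This is why the function removes '\n' and '\t' explicitly before applying the printable check.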

    Going back to the question, I see that the user doesn't want the leading blocks (the parts separated by >) to be present. To keep only the last block, simply do:

    area_result = cleanup(area_result).split('>')[3].replace(';', '>').strip()
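
To cover the second part of the question (keeping the last two pieces in separate variables), the steps above can be put together roughly as follows. This is a sketch under a few assumptions: it targets Python 3, cleanup is repeated so the block runs on its own, the sample string is the question's data with some of the whitespace runs shortened, the names building and distance are made up for illustration, and it relies on a single ';' separator being present as in the sample. It also takes the last '>'-separated block with [-1] instead of index [3], which assumes the wanted name is always the final one.

```python
import string


def cleanup(s, remove=('\n', '\t')):
    """Drop newlines/tabs and non-printable characters, collapse repeats."""
    new_string = ''
    for c in s:
        if c in remove:
            continue
        if c not in string.printable:
            continue
        if new_string and c == new_string[-1] and c in ('\n', ' ', ',', '.'):
            continue
        new_string += c
    return new_string


# Sample modelled on the question's data (whitespace runs shortened).
s = (
    'UAE \xe2\x80\xaa>\xe2\x80\xaa\n            \n'
    '                Dubai \xe2\x80\xaa>\xe2\x80\xaa\n            \n'
    '                Business Bay \xe2\x80\xaa>\xe2\x80\xaa\n            \n'
    '                Executive Towers \n            \n\n\n\t    \n\t\n'
    '        ;\n        \n'
    '                    1.4 km from Burj Khalifa Tower'
)

cleaned = cleanup(s)

# Split once on ';' to separate the location trail from the distance text.
trail, _, distance = cleaned.partition(';')

# Keep only the last '>'-separated block and strip the surrounding spaces.
building = trail.split('>')[-1].strip()
distance = distance.strip()

area_result = '{0} > {1}'.format(building, distance)
print(area_result)
```

With the two values held separately, each one can be stored or formatted on its own, and strip() takes care of the spaces left around the separators.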