python beautifulsoup google-search-api

collect text from web pages of a given site


There is a site that I frequently visit to read its "Best Advice" entries. Here is how I can easily extract the text that I want from a single page:

import urllib2
from bs4 import BeautifulSoup  

mylist=list()

myurl='http://www.apartmenttherapy.com/carols-east-side-cottage-house-tour-194787'
s=urllib2.urlopen(myurl)
soup =  BeautifulSoup(s)

hello = soup.find(text='Best Advice: ')
mylist.append(hello.next)

But how do I collect the text snippets from all the pages?


I can find all of the site's pages with this simple Google query:

site:http://www.apartmenttherapy.com

Does Google Search have an API that can be used from Python? I am looking for a simple, one-time solution to this problem, so I would prefer not to install too many packages to get it done.


Solution

  • You may want to read the BeautifulSoup documentation first, and also learn to use your browser's web developer tools to inspect network traffic.

    Once you have done that, you will see that the list of house tours can be fetched with a GET request to http://www.apartmenttherapy.com/search?page=1&q=House+Tour&type=all

    Given that, we can iterate from page 1 to X to fetch every house tour index page.

    Each index page gives you exactly 15 URLs to add to a list.

    Once you have the complete list of URLs, you can scrape each one to get its "Best Advice" text.

    The following code does the job:

    from __future__ import print_function  # same code runs on Python 2 and 3

    import random
    import time

    import requests
    from bs4 import BeautifulSoup

    # Step 1: build the list of all URLs to scrape
    url_list = []
    last_page = 1  # only the first index page in this example; raise it to fetch more

    for page_index in range(1, last_page + 1):

        # get the index page
        html = requests.get("http://www.apartmenttherapy.com/search?page=" + str(page_index) + "&q=House+Tour&type=all").content

        # each teaser link on the index page points to one house tour
        soup = BeautifulSoup(html, "html.parser")
        for teaser in soup.find_all("a", {"class": "SimpleTeaser"}):
            url_list.append(teaser["href"])

        # sleep a little between requests to limit the load on the server
        time.sleep(random.random() / 2.)

    # show the list of URLs
    print(url_list)

    # Step 2: visit each URL and collect its advice
    mylist = []
    for url in url_list:

        # get the house tour page
        html = requests.get(url).content

        # find the "Best Advice: " label; find() returns None if the page has none
        hello = BeautifulSoup(html, "html.parser").find(text="Best Advice: ")

        # show which page is being processed
        print("advice for", url, "\n=>", end=" ")

        # the advice is the node right after the label; skip pages without one
        if hello is not None:
            mylist.append(hello.next_element)

        # sleep a little between requests to limit the load on the server
        time.sleep(random.random() / 2.)

    # show the list of advice
    print(mylist)
    

    The output is:

    ['http://www.apartmenttherapy.com/house-tour-a-charming-comfy-california-cottage-228229', 'http://www.apartmenttherapy.com/christinas-olmay-oh-my-house-tour-house-tour-191725', 'http://www.apartmenttherapy.com/house-tour-a-rustic-refined-ranch-house-227896', 'http://www.apartmenttherapy.com/caseys-grown-up-playhouse-house-tour-215962', 'http://www.apartmenttherapy.com/allison-and-lukes-comfortable-and-eclectic-apartment-house-tour-193440', 'http://www.apartmenttherapy.com/melissas-eclectic-austin-bungalow-house-tour-206846', 'http://www.apartmenttherapy.com/kates-house-tour-house-tour-197080', 'http://www.apartmenttherapy.com/house-tour-a-1940s-art-deco-apartment-in-australia-230294', 'http://www.apartmenttherapy.com/house-tour-an-art-filled-mid-city-new-orleans-house-227667', 'http://www.apartmenttherapy.com/jeremys-light-and-heavy-home-house-tour-201203', 'http://www.apartmenttherapy.com/mikes-cabinet-of-curiosities-house-tour-201878', 'http://www.apartmenttherapy.com/house-tour-a-family-dream-home-in-illinois-227791', 'http://www.apartmenttherapy.com/stephanies-greenwhich-gemhouse-96295', 'http://www.apartmenttherapy.com/masha-and-colins-worldly-abode-house-tour-203518', 'http://www.apartmenttherapy.com/tims-desert-light-box-house-tour-196764']
    advice for http://www.apartmenttherapy.com/house-tour-a-charming-comfy-california-cottage-228229 
    => advice for http://www.apartmenttherapy.com/christinas-olmay-oh-my-house-tour-house-tour-191725 
    => advice for http://www.apartmenttherapy.com/house-tour-a-rustic-refined-ranch-house-227896 
    => advice for http://www.apartmenttherapy.com/caseys-grown-up-playhouse-house-tour-215962 
    => advice for http://www.apartmenttherapy.com/allison-and-lukes-comfortable-and-eclectic-apartment-house-tour-193440 
    => advice for http://www.apartmenttherapy.com/melissas-eclectic-austin-bungalow-house-tour-206846 
    => advice for http://www.apartmenttherapy.com/kates-house-tour-house-tour-197080 
    => advice for http://www.apartmenttherapy.com/house-tour-a-1940s-art-deco-apartment-in-australia-230294 
    => advice for http://www.apartmenttherapy.com/house-tour-an-art-filled-mid-city-new-orleans-house-227667 
    => advice for http://www.apartmenttherapy.com/jeremys-light-and-heavy-home-house-tour-201203 
    => advice for http://www.apartmenttherapy.com/mikes-cabinet-of-curiosities-house-tour-201878 
    => advice for http://www.apartmenttherapy.com/house-tour-a-family-dream-home-in-illinois-227791 
    => advice for http://www.apartmenttherapy.com/stephanies-greenwhich-gemhouse-96295 
    => advice for http://www.apartmenttherapy.com/masha-and-colins-worldly-abode-house-tour-203518 
    => advice for http://www.apartmenttherapy.com/tims-desert-light-box-house-tour-196764 
    => [u"If you make a bad design choice or purchase, don't be afraid to change it. Try and try again until you love it.\n\t", u" Sisal rugs. They clean up easily and they're very understated. Start with very light colors and add colors later.\n", u"Bring in what you love, add dimension and texture to your walls. Decorate as an individual and not to please your neighbor or the masses. Trends are fun but I love elements of timeless interiors. Include things from any/every decade as well as mixing styles. I'm convinced it's the hardest way to decorate without looking like you are living in a flea market stall. Scale, color, texture, and contrast are what I focus on. For me it takes some toying around, and I always consider how one item affects the next. Consider space and let things stand out by limiting what surrounds them.", u'You don\u2019t need to invest in \u201cdecor\u201d and nothing needs to match. Just decorate with the special things (books, cards, trinkets, jars, etc.) that you\u2019ve collected over the years, and be organized. I honestly think half the battle of having good home design is keeping a neat house. The other half is just displaying stuff that is special to you. Stuff that has a story and/or reminds you of people, ideas, and places that you love. One more piece of advice - the best place to buy picture frames is Goodwill. Pick a frame in decent condition, and just paint it to complement your palette. One last piece of advice\u2014 decor need not be pricey. I ALWAYS shop consignment and thrift, and then I repaint and customize as I see fit.\n', u'From my sister \u2014 to use the second bedroom as my room, as it is dark and quiet, both of which I need in order to sleep.\n', u'Collect things that you love in your travels throughout life. 
I tend to purchase ceramics when travelling, sometimes a collection of bowls\u2026 not so easy transporting in the suitcase, but no breakages yet!\n\t', u'Keep things authentic to the character of your home and to the character of your family. Then, you can never go wrong!\n\t', u'Contemporary architecture does not require contemporary furnishings.\n']
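
Two details of the answer can be illustrated offline: going "from page 1 to X" means generating one search URL per index page, and the collected strings in the output still carry trailing "\n" and "\t" characters. The following is only a minimal sketch of both, assuming Python 3 and the standard library. The names `SEARCH_URL`, `index_page_urls` and `clean_advice` are hypothetical helpers introduced here for illustration; they are not part of the original answer, and nothing in the sketch contacts the site.

```python
from urllib.parse import urlencode

# Search endpoint taken from the answer above.
SEARCH_URL = "http://www.apartmenttherapy.com/search"


def index_page_urls(first_page, last_page):
    """Return the search URL of every index page, first_page to last_page inclusive."""
    urls = []
    for page in range(first_page, last_page + 1):
        # urlencode turns the space in "House Tour" into "+",
        # which reproduces the query string used in the answer.
        query = urlencode({"page": page, "q": "House Tour", "type": "all"})
        urls.append(SEARCH_URL + "?" + query)
    return urls


def clean_advice(snippets):
    """Collapse runs of whitespace in each snippet and drop snippets that end up empty."""
    cleaned = []
    for snippet in snippets:
        # split() with no argument splits on any whitespace, including \n and \t
        text = " ".join(str(snippet).split())
        if text:
            cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    for url in index_page_urls(1, 2):
        print(url)

    # Two snippets copied from the output above, plus a whitespace-only entry.
    sample = [
        "Keep things authentic to the character of your home and to the "
        "character of your family. Then, you can never go wrong!\n\t",
        "Contemporary architecture does not require contemporary furnishings.\n",
        "\n\t",
    ]
    for advice in clean_advice(sample):
        print(advice)
```

In the scraping loop, each URL returned by `index_page_urls` would take the place of the hand-built search URL, and `clean_advice(mylist)` would be called once after the loop finishes.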