
Using multiple web pages in a web scraper


I've been working on some Python code to get links to social media accounts from government websites, for research into how easily municipalities can be contacted. I've managed to adapt some code to work in Python 2.7, which prints all links to Facebook, Twitter, LinkedIn and Google+ present on a given input website. The issue I'm currently running into is that I'm not looking for links on just one web page, but on a list of about 200 websites that I have in an Excel file. I have no experience with importing these sorts of lists into Python, so I was wondering if anybody could take a look at the code and suggest a proper way to set all these web pages as the base_url, if possible.

import cookielib
import mechanize

base_url = "http://www.amsterdam.nl"

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.set_handle_redirect(True)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent',
              'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
page = br.open(base_url, timeout=10)

links = {}
for link in br.links():
    if link.url.find('facebook')>=0 or link.url.find('twitter')>=0 or link.url.find('linkedin')>=0 or link.url.find('plus.google')>=0:
        links[link.url] = {'count': 1, 'texts': [link.text]}

# printing
for link, data in links.iteritems():
    print "%s - %s - %s - %d" % (base_url, link, ",".join(data['texts']), data['count'])

Solution

  • You mentioned that you have an Excel file with the list of all the websites, right? You can export the Excel file as a CSV file, which you can then read values from in your Python code.

    The documentation for Python's csv module has more information on reading CSV files.

    If you would rather skip the export and work directly with the Excel file, a library such as xlrd can read it for you.
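
    For instance, a rough sketch with xlrd (assuming the workbook is called municipalities.xls and the URLs sit in the first column of the first sheet):

    import xlrd

    # open the workbook and grab the first sheet
    book = xlrd.open_workbook('municipalities.xls')
    sheet = book.sheet_by_index(0)
    # read every cell in the first column into a list of URL strings
    links = [sheet.cell_value(row, 0) for row in range(sheet.nrows)]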

    For the CSV route, you can do something along these lines:

    import csv

    links = []

    with open('urls.csv', 'r') as csv_file:
        csv_reader = csv.reader(csv_file)
        # Simple example where only a single column of URLs is present;
        # take the first cell of each non-empty row so we get URL strings, not rows
        links = [row[0].strip() for row in csv_reader if row]
    

    Now links is a list of all the URLs. You can then loop over the list inside a function which fetches the page and scrapes the data.

    def extract_social_links(links):
        for base_url in links:
            br = mechanize.Browser()
            cj = cookielib.LWPCookieJar()
            br.set_cookiejar(cj)
            br.set_handle_robots(False)
            br.set_handle_equiv(False)
            br.set_handle_redirect(True)
            br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
            br.addheaders = [('User-agent',
              'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
            page = br.open(base_url, timeout=10)

            # collect the social media links found on this page
            # (renamed from links so it does not shadow the function argument)
            social_links = {}
            for page_link in br.links():
                if page_link.url.find('facebook')>=0 or page_link.url.find('twitter')>=0 or page_link.url.find('linkedin')>=0 or page_link.url.find('plus.google')>=0:
                    social_links[page_link.url] = {'count': 1, 'texts': [page_link.text]}

            # printing
            for link, data in social_links.iteritems():
                print "%s - %s - %s - %d" % (base_url, link, ",".join(data['texts']), data['count'])
    

    As an aside, you should probably split your if conditions to make them more readable.
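
    For example, one way to do that is to keep the keywords in a tuple and use any(); a small sketch using the names from your original snippet:

    SOCIAL_KEYWORDS = ('facebook', 'twitter', 'linkedin', 'plus.google')

    if any(keyword in link.url for keyword in SOCIAL_KEYWORDS):
        links[link.url] = {'count': 1, 'texts': [link.text]}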