python · pandas · twitter · geolocation · twython

Download Twitter user geolocation


I have a list of more than 500K Twitter usernames. I was able to develop a program that uses twython and the API secret keys. The program and inputs are too large to post here, so they are uploaded on GitHub:

Twitter_User_Geolocation

The program runs fine for around 150 usernames, but not more than that. This limitation makes it impossible to scrape the geolocations of the 500K+ usernames.

I am looking for help in bypassing the API, perhaps by using a web-scraping technique or any better alternative, to get the geolocations of these usernames.

Any help is appreciated :)


Solution

  • What I would do is scrape twitter.com/ instead of using the Twitter API.

    The main reason is that the frontend is not rate limited (or at least far less limited), and even if you need to hit Twitter many times per second, you can play with the User-Agent and proxies to avoid being spotted.

    So for me, scraping is the easiest way to bypass the API limit.
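
    For example, here is a rough sketch of that trick (assuming the same urllib2 setup as the script below; the User-Agent strings and the proxy address are just placeholders):

    import random
    import urllib2
    
    # placeholder User-Agent strings; in practice rotate through a larger, realistic list
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Mozilla/5.0 (X11; Linux x86_64)',
    ]
    
    def fetch(url, proxy=None):
        # pick a random User-Agent for each request
        request = urllib2.Request(url, headers={'User-Agent': random.choice(USER_AGENTS)})
        if proxy:
            # e.g. proxy = 'http://1.2.3.4:8080' -- route the request through a proxy
            opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy}))
        else:
            opener = urllib2.build_opener()
        return opener.open(request, timeout=10).read()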

    Moreover, what you need to crawl is really easy to access. I wrote a quick-and-dirty script that parses your CSV file and outputs the location of each user.

    I will make a PR on your repo for fun, but here is the code:

    #!/usr/bin/env python
    
    import urllib2
    from bs4 import BeautifulSoup
    
    with open('00_Trump_05_May_2016.csv', 'r') as csv:
        next(csv)  # skip the header row
        for line in csv:
            line = line.strip()
    
            # the last column holds the permalink; its third path segment is the user slug
            permalink = line.split(',')[-1].strip()
            username  = line.split(',')[0]
            userid    = permalink.split('/')[3]
    
            page_url = 'http://twitter.com/{0}'.format(userid)
    
            try:
                page = urllib2.urlopen(page_url)
            except urllib2.HTTPError:
                print 'ERROR: username {} not found'.format(username)
                continue  # skip profiles that cannot be fetched
    
            content = page.read()
            html = BeautifulSoup(content, 'html.parser')
    
            # the location element may be absent when a user has not set one
            nodes    = html.select('.ProfileHeaderCard-locationText')
            location = nodes[0].text.strip() if nodes else ''
    
            print 'username {0} ({1}) located in {2}'.format(username, userid, location)
    

    Output:

    username cenkuygur (cenkuygur) located in Los Angeles
    username ilovetrumptards (ilovetrumptards) located in 
    username MorganCarlston hanifzk (MorganCarlston) located in 
    username mitchellvii (mitchellvii) located in Charlotte, NC
    username MissConception0 (MissConception0) located in #UniteBlue in Semi-Red State
    username HalloweenBlogs (HalloweenBlogs) located in Los Angeles, California
    username bengreenman (bengreenman) located in Fiction and Non-Fiction Both
    ...
    

    Obviously you should update this code to make it more robust, but the basics are done.
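
    One way to harden the fetch step, for instance, is to retry transient failures with a timeout and a short back-off (a rough sketch, assuming the same urllib2 call; the retry count and delay are arbitrary):

    import time
    import urllib2
    
    def fetch_profile(page_url, retries=3, delay=2):
        """Fetch a profile page, retrying transient failures with a short back-off."""
        for attempt in range(retries):
            try:
                return urllib2.urlopen(page_url, timeout=10).read()
            except urllib2.HTTPError as err:
                if err.code == 404:
                    return None  # profile does not exist, no point retrying
                time.sleep(delay * (attempt + 1))
            except urllib2.URLError:
                time.sleep(delay * (attempt + 1))
        return None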

    PS: I parse the 'permalink' field because it stores a well-formatted slug that can be used to reach the profile page. It's pretty dirty, but quick, and it works.


    About the Google API, I would definitely use some kind of cache / database to avoid making too many Google calls.

    In Python, without a database you can just keep a dict like:

    {
       "San Francisco": [x.y, z.a],
       "Paris": [b.c, d.e],
    }
    

    And for each location to geocode, I would first check whether the key exists in this dict; if it does, just take the value from there, otherwise call the Google API and then save the result in the dict.
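
    A minimal sketch of that lookup pattern (assuming the Google Geocoding JSON endpoint and an API key; swap geocode_with_google for whatever geocoder you actually use):

    import json
    import urllib
    import urllib2
    
    geo_cache = {}
    
    def geocode_with_google(location, api_key='YOUR_KEY'):
        # call the Google Geocoding API and return [lat, lng], or None if nothing was found
        url = 'https://maps.googleapis.com/maps/api/geocode/json?' + urllib.urlencode(
            {'address': location, 'key': api_key})
        data = json.load(urllib2.urlopen(url))
        if not data.get('results'):
            return None
        point = data['results'][0]['geometry']['location']
        return [point['lat'], point['lng']]
    
    def get_coordinates(location):
        """Return [lat, lng] for a location string, caching results to avoid repeat calls."""
        if location in geo_cache:
            return geo_cache[location]  # cache hit: no API call
        geo_cache[location] = geocode_with_google(location)
        return geo_cache[location]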


    I think with these two approaches you will be able to get your data.