python xml dynamic web-scraping xrange

How to make rp and xrange dynamic?


Hey guys, many thanks for taking the time to look at my problem. I have been working on this code for about a week (I am new to coding, and have been using Python for a week as well). Currently the loop only works if the x in xrange(x) and in 'rp' : 'x' matches the number of rows available from this XML. The XML updates throughout the day, so I was wondering if anyone can offer a solution to make x dynamic?

import mechanize
import urllib
import json
import re
from sched import scheduler
from time import time, sleep

s = scheduler(time, sleep)

def run_periodically(start, end, interval, func):
    event_time = start
    while event_time < end:
        s.enterabs(event_time, 0, func, ())
        event_time += interval
    s.run()

def getData():
    post_url = "urlofinterest_xml"
    browser = mechanize.Browser()
    browser.set_handle_robots(False)
    browser.addheaders = [('User-agent', 'Firefox')]

    ###### These are the parameters you've got from checking with the aforementioned tools
    parameters = {'page' : '1',
                  'rp' : '8',
                  'sortname' : 'roi',
                  'sortorder' : 'desc'
                 }
    ##### Encode the parameters
    data = urllib.urlencode(parameters)
    trans_array = browser.open(post_url,data).read().decode('UTF-8')

    xmlload1 = json.loads(trans_array)
    pattern1 = re.compile('>&nbsp;&nbsp;(.*)<')
    pattern2 = re.compile('/control/profile/view/(.*)\' title=')
    pattern3 = re.compile('<span style=\'font-size:12px;\'>(.*)<\/span>')
    pattern4 = re.compile('title=\'Naps posted: (.*) Winners:')
    pattern5 = re.compile('Winners: (.*)\'><img src=')


    for i in xrange(8):
        user_delimiter = xmlload1['rows'][i]['cell']['username']
        selection_delimiter = xmlload1['rows'][i]['cell']['race_horse']

        username_delimiter_results = re.findall(pattern1, user_delimiter)[0]
        userid_delimiter_results = int(re.findall(pattern2, user_delimiter)[0])
        user_selection = re.findall(pattern3, selection_delimiter)[0]
        user_numberofselections = float(re.findall(pattern4, user_delimiter)[0])
        user_numberofwinners = float(re.findall(pattern5, user_delimiter)[0])

        strikeratecalc1 = user_numberofwinners/user_numberofselections
        strikeratecalc2 = strikeratecalc1*100

        print "user id = ",userid_delimiter_results
        print "username = ",username_delimiter_results
        print "user selection = ",user_selection
        print "best price available as decimal = ",xmlload1['rows'][i]['cell']['tws.best_price']
        print "race time = ",xmlload1['rows'][i]['cell']['race_time']
        print "race meeting = ",xmlload1['rows'][i]['cell']['race_meeting']
        print "ROI = ",xmlload1['rows'][i]['cell']['roi']
        print "number of selections = ",user_numberofselections
        print "number of winners = ",user_numberofwinners
        print "Strike rate = ",strikeratecalc2,"%"
        print ""


getData()


run_periodically(time()+5, time()+1000000, 15, getData)

Kind regards AEA


Solution

  • First, I'm going to talk about how you iterate over your results. Based on your code, xmlload1['rows'] is a list of dicts, so instead of hard-coding the number of rows, you can iterate over the list directly. To make this clearer, I'm going to set up some sample data:

    xmlload1 = {
       "rows": [{"cell": {"username": "one", "race_horse":"b"}}, {"cell": {"username": "two", "race_horse": "c"}}]
    }
    

    So, given the data above, you can just iterate through rows in a for loop, like this:

    for row in xmlload1['rows']:
        cell = row["cell"]
        print cell["username"]
        print cell["race_horse"]
    

    On each iteration, row takes on the value of the next element in the iterable (the list in xmlload1['rows']), and cell is that row's "cell" dict. This works with anything that supports iteration (lists, tuples, dicts, generators, etc.).

    Notice how I haven't used any magic numbers anywhere, so xmlload1['rows'] could be arbitrarily long and it would still work.
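
As a rough sketch of how this applies to the loop in the question, the function below walks every row and computes the strike rate, however many rows the response contains. The two regular expressions are copied from the question; the name strike_rates, the sample HTML (the real markup may differ), and the guard against zero selections are assumptions added for illustration. It avoids print so that it runs on Python 2 or 3.

```python
import re

# Patterns copied from the question.
NAPS_RE = re.compile(r"title='Naps posted: (.*) Winners:")
WINNERS_RE = re.compile(r"Winners: (.*)'><img src=")


def strike_rates(payload):
    """Return one strike rate (in percent) per row, for any number of rows."""
    rates = []
    for row in payload.get('rows', []):
        username_html = row['cell']['username']
        selections = float(NAPS_RE.findall(username_html)[0])
        winners = float(WINNERS_RE.findall(username_html)[0])
        # Assumed guard: a user with no selections gets a rate of 0.
        rates.append(100.0 * winners / selections if selections else 0.0)
    return rates


# Hypothetical sample data shaped like the question's response.
sample = {'rows': [
    {'cell': {'username': "<a href='/control/profile/view/1' "
                          "title='Naps posted: 10 Winners: 4'>"
                          "<img src='a.png'>&nbsp;&nbsp;one</a>"}},
    {'cell': {'username': "<a href='/control/profile/view/2' "
                          "title='Naps posted: 8 Winners: 2'>"
                          "<img src='b.png'>&nbsp;&nbsp;two</a>"}},
]}

rates = strike_rates(sample)  # [40.0, 25.0]
```

Because the loop is driven by the list itself, the same call works whether the response holds two rows or two hundred.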

    You can make the request parameters dynamic by wrapping the request in a function, like this:

    def get_data(rp=8, page=1):
        # browser and post_url are assumed to be set up as in the question
        parameters = {'page': str(page),
                      'rp': str(rp),
                      'sortname': 'roi',
                      'sortorder': 'desc'}
        data = urllib.urlencode(parameters)
        trans_array = browser.open(post_url, data).read().decode('UTF-8')
        return json.loads(trans_array)
    

    Now you can call get_data(rp=5) to request 5 rows, get_data(rp=8) to request 8 rows, or get_data(rp=8, page=3) to request the third page. You can add further arguments in the same way, or even pass the whole parameters dict to the function directly.
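
If the total number of rows is not known in advance, one possible approach is to keep requesting pages until a short or empty page comes back. The sketch below assumes the server honours the page and rp parameters and returns fewer than rp rows on the last page, which the question does not confirm. The names fetch_all_rows, fake_fetch and max_pages are hypothetical; fake_fetch is a stand-in so the example runs without a network connection, and in practice you would pass get_data instead.

```python
def fetch_all_rows(fetch_page, rp=8, max_pages=100):
    """Collect rows page by page until a short or empty page is returned.

    fetch_page is any callable that accepts rp and page keyword arguments
    and returns the decoded JSON dict (for example, get_data).
    """
    rows = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(rp=rp, page=page).get('rows', [])
        rows.extend(batch)
        if len(batch) < rp:  # short page: assume there is nothing left
            break
    return rows


def fake_fetch(rp=8, page=1):
    """Stand-in for the real request, serving 19 hypothetical rows."""
    data = [{'cell': {'username': 'user%d' % n}} for n in range(19)]
    start = (page - 1) * rp
    return {'rows': data[start:start + rp]}


all_rows = fetch_all_rows(fake_fetch, rp=8)  # pages of 8, 8 and 3 rows
```

The max_pages cap is only a safety net in case the server keeps returning full pages.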