python performance web-services discogs-api

Python performance of httplib (Discogs API)


I wrote a short program that uses the Discogs API from Python, but it is so slow that it is not usable for real web applications. Here is the Python code and the profile results (only the time-consuming spots are shown):

# -*- coding: utf-8 -*-

import profile
import discogs_client as discogs

def main():
    discogs.user_agent = 'Mozilla/5.0'
    # Dump the released albums into a file. You could also print them to the console.
    f=open('DiscogsTestResult.txt', 'w+')

    # Use another band if you like,
    # but if you pick "beatles" you will wait an hour (because of the number of releases)!
    artist = discogs.Artist('Faust')
    print >> f, artist
    print  >> f," "

    artistReleases = artist.releases
    for r in artistReleases:
        print >> f, r.data
        print >> f,"---------------------------------------------"

    # Close the output file so everything is flushed to disk.
    f.close()


print 'Performance Analysis of Discogs API'
print '=' * 80
profile.run('print main(); print')

And here is the output of Python's profile module:

Performance Analysis of Discogs API
================================================================================
   82807 function calls (282219 primitive calls) in 177.544 seconds
   Ordered by: standard name
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      188  121.013    0.644  121.013    0.644 :0(connect)
      206   52.080    0.253   52.080    0.253 :0(recv)
        1    0.036    0.036  177.494  177.494 <string>:1(<module>)
      188    0.013    0.000  175.234    0.932 adapters.py:261(send)
      376    0.005    0.000    0.083    0.000 adapters.py:94(init_poolmanager)
      188    0.008    0.000  176.569    0.939 api.py:17(request)
      188    0.007    0.000  176.577    0.939 api.py:47(get)
      188    0.015    0.000  173.922    0.925 connectionpool.py:268(_make_request)
      188    0.015    0.000  174.034    0.926 connectionpool.py:332(urlopen)
        1    0.496    0.496  177.457  177.457 discogsTestFullDump.py:6(main)
      564    0.009    0.000  176.613    0.313 discogs_client.py:66(_response)
      188    0.012    0.000  176.955    0.941 discogs_client.py:83(data)
      188    0.011    0.000   51.759    0.275 httplib.py:363(_read_status)
      188    0.017    0.000   52.520    0.279 httplib.py:400(begin)
      188    0.003    0.000  121.198    0.645 httplib.py:754(connect)
      188    0.007    0.000  121.270    0.645 httplib.py:772(send)
      188    0.005    0.000  121.276    0.645 httplib.py:799(_send_output)
      188    0.003    0.000  121.279    0.645 httplib.py:941(endheaders)
      188    0.003    0.000  121.348    0.645 httplib.py:956(request)
      188    0.016    0.000  121.345    0.645 httplib.py:977(_send_request)
      188    0.009    0.000   52.541    0.279 httplib.py:994(getresponse)
        1    0.000    0.000  177.544  177.544 profile:0(print main(); print)
      188    0.032    0.000  176.322    0.938 sessions.py:225(request)
      188    0.030    0.000  175.513    0.934 sessions.py:408(send)
      752    0.015    0.000  121.088    0.161 socket.py:223(meth)
     2256    0.224    0.000   52.127    0.023 socket.py:406(readline)
      188    0.009    0.000  121.195    0.645 socket.py:537(create_connection)

Does anybody have any idea how to speed this up? I hope that some changes in discogs_client.py would make it faster, maybe switching from httplib to something else. Or maybe it would be faster to use another protocol instead of HTTP?

(The source of discogs_client.py is available here: https://github.com/discogs/discogs_client/blob/master/discogs_client.py)

If anybody has any idea, please respond; a lot of people would benefit from this.

Regards Daniel


Solution

  • UPDATE: From the Discogs documentation: "Requests are throttled by the server to one per second per IP address. Your application should (but doesn't have to) take this into account and throttle requests locally, too."

    The bottleneck seems to be at the (discogs) server end, retrieving individual releases. There is nothing you can really do about that, except give them money for faster servers.
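
The local throttling that the documentation recommends could look roughly like the following. This is a minimal sketch, not part of discogs_client: the `Throttle` class and its parameters are hypothetical names, it assumes Python 3 (for `time.monotonic`), and it assumes the one-request-per-second limit quoted above still applies.

```python
import time


class Throttle(object):
    """Hypothetical helper: keep successive calls at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        # clock and sleep are injectable so the class can be tested without waiting.
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep just long enough to respect the minimum interval, then record the call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

One would create a single `Throttle(1.0)` and call its `wait()` method immediately before each outgoing request. This only keeps the client within the server's limit; it does not make individual requests faster.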

    My suggestion would be to cache the results; it is probably the only thing that will help. Rewrite discogs.APIBase._response as follows:

    def _response(self):
        # Try the on-disk cache first (a helper you write yourself).
        if not self._cached_response:
            self._cached_response = self._load_response_from_disk()
        # Fall back to the network only when nothing usable was cached.
        if not self._cached_response:
            if not self._check_user_agent():
                raise UserAgentError("Invalid or no User-Agent set.")
            self._cached_response = requests.get(self._uri, params=self._params, headers=self._headers)
            self._save_response_to_disk()

        return self._cached_response

    An alternative approach is to write incoming requests to a log and answer "we don't know yet, try again later". A separate process then reads the log, downloads the data, and stores it in a database, so that when the client comes back later the requested data is ready.
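
The log-and-fetch-later idea might be sketched as follows. Everything here is an assumption made for illustration: the table names, the function names `open_store`, `lookup_or_enqueue` and `drain_pending`, the use of SQLite as both log and database, and Python 3. The `fetch` argument stands in for whatever actually downloads a URI.

```python
import sqlite3


def open_store(path):
    """Open (or create) a store with a 'pending' log and a 'results' table."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS pending (uri TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results (uri TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def lookup_or_enqueue(conn, uri):
    """Web side: return the stored body, or None after logging the URI for later."""
    row = conn.execute("SELECT body FROM results WHERE uri = ?", (uri,)).fetchone()
    if row is not None:
        return row[0]
    conn.execute("INSERT OR IGNORE INTO pending (uri) VALUES (?)", (uri,))
    conn.commit()
    return None


def drain_pending(conn, fetch):
    """Worker side: download every logged URI with fetch(uri) and store the result."""
    uris = [r[0] for r in conn.execute("SELECT uri FROM pending").fetchall()]
    for uri in uris:
        body = fetch(uri)
        conn.execute(
            "INSERT OR REPLACE INTO results (uri, body) VALUES (?, ?)", (uri, body)
        )
        conn.execute("DELETE FROM pending WHERE uri = ?", (uri,))
        conn.commit()
    return len(uris)
```

The web application would call `lookup_or_enqueue` and show a "try again later" message when it gets `None`, while a separate worker process calls `drain_pending` periodically, ideally at no more than one request per second.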

    You would need to write _load_response_from_disk() and _save_response_to_disk() yourself. The stored data should be keyed by _uri, _params, and _headers, and should include a timestamp. If the data is too old or not found, return None. As for what counts as too old: under the circumstances I would suggest something on the order of months, but I have no idea whether the numbering is persistent, so I would try days to weeks initially. The storage has to handle concurrent access and offer fast indexed lookups, so it should probably be a database.
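
The two helpers could be built on something like the sketch below. It is only an illustration under stated assumptions: it uses SQLite from the standard library, stores the response body as text rather than the `requests` response object (so `_response` would need adapting), assumes Python 3, and all function names and the one-week default expiry are hypothetical.

```python
import hashlib
import json
import sqlite3
import time

MAX_AGE_SECONDS = 7 * 24 * 3600  # assumed default expiry of one week


def open_cache(path):
    """Open (or create) the cache database; the primary key gives indexed lookups."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS responses ("
        "key TEXT PRIMARY KEY, fetched_at REAL NOT NULL, body TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def cache_key(uri, params, headers):
    """Derive one stable key from the URI, params and headers."""
    raw = json.dumps([uri, params, headers], sort_keys=True)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def load_response(conn, uri, params, headers, max_age=MAX_AGE_SECONDS, now=None):
    """Return the cached body, or None if it is missing or older than max_age."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT fetched_at, body FROM responses WHERE key = ?",
        (cache_key(uri, params, headers),),
    ).fetchone()
    if row is None or now - row[0] > max_age:
        return None
    return row[1]


def save_response(conn, uri, params, headers, body, now=None):
    """Store (or overwrite) the body together with the time it was fetched."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO responses (key, fetched_at, body) VALUES (?, ?, ?)",
        (cache_key(uri, params, headers), now, body),
    )
    conn.commit()
```

SQLite serializes writers with file locking, which may be enough for a small site; a busier application would likely want a server-based database instead.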