
Web scraping urlopen in python


I am trying to get the data from this website: http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS

It seems that urlopen doesn't get the HTML code, and I don't understand why. My code is:

import urllib.request

html = urllib.request.urlopen("http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS").read()
print(html)

The same code returns the HTML source of other webpages, but it doesn't seem to work for this address.

It prints: b''

Maybe another library is more appropriate? Why doesn't urlopen return the HTML code of the webpage? Thanks for any help!


Solution

  • Personally, I write:

    # Python 2.7
    
    import urllib
    
    url = 'http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS'
    sock = urllib.urlopen(url)
    content = sock.read() 
    sock.close()
    
    print content
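
For Python 3, which the question uses, the same open/read/close pattern might look like the sketch below. `fetch_bytes` is a hypothetical helper name and the 30-second timeout is an assumption; note that `read()` returns `bytes` in Python 3, not `str`.

```python
import urllib.request


def fetch_bytes(url, timeout=30):
    """Return the raw body of `url` as bytes (hypothetical helper)."""
    # The context manager closes the response, like sock.close() above.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


# Usage (needs network access, so it is left commented out):
# url = 'http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS'
# print(fetch_bytes(url))
```

Printing the object returned by `urlopen` without calling `read()` shows the response object itself, not the page source.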
    

    And if you speak French: welcome to stackoverflow.com!

    update 1

    In fact, I now prefer the following code, because it is faster:

    # Python 2.7
    
    import httplib
    
    conn = httplib.HTTPConnection(host='www.boursorama.com',timeout=30)
    
    req = '/includes/cours/last_transactions.phtml?symbole=1xEURUS'
    
    try:
        conn.request('GET',req)
    except Exception:
        # Report the failure, then re-raise: there is no response to read.
        print 'connection failed'
        raise
    
    content = conn.getresponse().read()
    
    print content
    

    To adapt this code to Python 3, change httplib to http.client and turn the print statements into print() calls.
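
As a sketch of that adaptation for Python 3, assuming plain HTTP and the same 30-second timeout: `split_target` and `fetch_page` are hypothetical helper names, not part of the original answer.

```python
import http.client
from urllib.parse import urlsplit


def split_target(url):
    """Split a URL into (host, path-with-query) for HTTPConnection."""
    parts = urlsplit(url)
    path = parts.path or '/'
    if parts.query:
        path += '?' + parts.query
    return parts.netloc, path


def fetch_page(url, timeout=30):
    """GET `url` over plain HTTP and return the body as bytes."""
    host, path = split_target(url)
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request('GET', path)
        return conn.getresponse().read()
    finally:
        # Close the connection whether or not the request succeeded.
        conn.close()


# Usage (needs network access, so it is left commented out):
# url = 'http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS'
# print(fetch_page(url))
```

Connection problems surface as `OSError` or `http.client.HTTPException`, so a caller can catch those instead of using a bare `except`.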

    I confirm that both snippets return the page source, which contains the data you are interested in:

            <td class="L20" width="33%" align="center">11:57:44</td>
    
            <td class="L20" width="33%" align="center">1.4486</td>
    
            <td class="L20" width="33%" align="center">0</td>
    
    </tr>
    
                                            <tr>
    
            <td  width="33%" align="center">11:57:43</td>
    
            <td  width="33%" align="center">1.4486</td>
    
            <td  width="33%" align="center">0</td>
    
    </tr>
    

    update 2

    Adding the following snippet to the code above will extract the data I suppose you want:

    for i,line in enumerate(content.splitlines(True)):
        print str(i)+' '+repr(line)
    
    print '\n\n'
    
    
    import re
    
    regx = re.compile(r'\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">(\d\d:\d\d:\d\d)</td>\r\n'
                      r'\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">([\d.]+)</td>\r\n'
                      r'\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">(\d+)</td>\r\n')
    
    print regx.findall(content)
    

    Result (only the end):

    .......................................
    .......................................
    .......................................
    .......................................
    98 'window.config.graphics = {};\n'
    99 'window.config.accordions = {};\n'
    100 '\n'
    101 "window.addEvent('domready', function(){\n"
    102 '});\n'
    103 '</script>\n'
    104 '<script type="text/javascript">\n'
    105 '\t\t\t\tsas_tmstp = Math.round(Math.random()*10000000000);\n'
    106 '\t\t\t\tsas_pageid = "177/(includes/cours/last_transactions)"; // Page : boursorama.com/smartad_test\n'
    107 '\t\t\t\tvar sas_formatids = "8968";\n'
    108 '\t\t\t\tsas_target = "symb=1xEURUS#"; // TargetingArray\n'
    109 '\t\t\t\tdocument.write("<scr"+"ipt src=\\"http://ads.boursorama.com/call2/pubjall/" + sas_pageid + "/" + sas_formatids + "/" + sas_tmstp + "/" + escape(sas_target) + "?\\"></scr"+"ipt>");\t\t\t\t\n'
    110 '\t\t\t</script><div id="_smart1"><script language="javascript">sas_script(1,8968);</script></div><script type="text/javascript">\r\n'
    111 "\twindow.addEvent('domready', function(){\r\n"
    112 'sas_move(1,8968);\t});\r\n'
    113 '</script>\n'
    114 '<script type="text/javascript">\n'
    115 'var _gaq = _gaq || [];\n'
    116 "_gaq.push(['_setAccount', 'UA-1623710-1']);\n"
    117 "_gaq.push(['_setDomainName', 'www.boursorama.com']);\n"
    118 "_gaq.push(['_setCustomVar', 1, 'segment', 'WEB-VISITOR']);\n"
    119 "_gaq.push(['_setCustomVar', 4, 'version', '18']);\n"
    120 "_gaq.push(['_trackPageLoadTime']);\n"
    121 "_gaq.push(['_trackPageview']);\n"
    122 '(function() {\n'
    123 "var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;\n"
    124 "ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';\n"
    125 "var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);\n"
    126 '})();\n'
    127 '</script>\n'
    128 '</body>\n'
    129 '</html>'
    
    
    
    [('12:25:36', '1.4478', '0'), ('12:25:33', '1.4478', '0'), ('12:25:31', '1.4478', '0'), ('12:25:30', '1.4478', '0'), ('12:25:30', '1.4478', '0'), ('12:25:29', '1.4478', '0')]
    

    I hope you don't plan to "play" at trading on the Forex: it's one of the best ways to lose money quickly.

    update 3

    Sorry! I forgot you are using Python 3. In that case, I think you must define the regex like this:

    regx = re.compile(b'\t\t\t\t\t......)

    that is, with b before the string; otherwise you'll get an error like the one in this question.
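
To illustrate the bytes-pattern point in Python 3, here is a minimal sketch. The pattern is an assumed, more tolerant variant of the one above (it ignores the exact tabs and class attributes), `extract_rows` is a hypothetical helper, and `sample` is made-up data shaped like the rows shown earlier, not real page content.

```python
import re

# Bytes pattern (note the b in the rb prefix): three consecutive <td> cells
# holding a time, a decimal number and an integer.
ROW_RE = re.compile(
    rb'<td[^>]*>(\d\d:\d\d:\d\d)</td>\s*'
    rb'<td[^>]*>([\d.]+)</td>\s*'
    rb'<td[^>]*>(\d+)</td>'
)


def extract_rows(content):
    """Return the three captured cells of each row in `content` (bytes) as str tuples."""
    return [tuple(cell.decode('ascii') for cell in match)
            for match in ROW_RE.findall(content)]


# Made-up sample, shaped like the HTML rows shown earlier in this answer.
sample = (
    b'<tr>\r\n'
    b'\t<td class="L20" width="33%" align="center">11:57:44</td>\r\n'
    b'\t<td class="L20" width="33%" align="center">1.4486</td>\r\n'
    b'\t<td class="L20" width="33%" align="center">0</td>\r\n'
    b'</tr>\r\n'
    b'<tr>\r\n'
    b'\t<td  width="33%" align="center">11:57:43</td>\r\n'
    b'\t<td  width="33%" align="center">1.4486</td>\r\n'
    b'\t<td  width="33%" align="center">0</td>\r\n'
    b'</tr>\r\n'
)

print(extract_rows(sample))
# prints [('11:57:44', '1.4486', '0'), ('11:57:43', '1.4486', '0')]
```

Applying a `str` pattern to `bytes` content raises a `TypeError`; the alternative is to decode the content first and keep a `str` pattern.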