
urllib2 doesn't return the same HTML as a browser with the same user agent (encoding error?)


I'm trying to download this page with Python: http://www.francais-thai.com/dicoweb/fran/00012.htm

(The page contains Thai text.)

Here is the code I tried; it should download the page and save it to a file:

# -*- coding: utf-8 -*-
import urllib2

agents = {'User-Agent':"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.125 Safari/537.36"}

url = ("http://www.francais-thai.com/dicoweb/fran/00012.htm")
request = urllib2.Request(url, headers=agents)
page = urllib2.urlopen(request).read()


file = open("00012.htm","w")
file.write(page)
file.close()

But the page I get this way is not at all the same as what Firefox, Chrome, etc. show when I view the source.

Here is the page source as shown by Chrome:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>abeille</title>
<style type="text/css">
<!--
body {font-size: medium;}
.FAG {font-family: Arial, Helvetica, sans-serif; font-weight: bold; color: #000;}
.FTN {font-family: "Times New Roman", Times, serif; font-weight: normal; color: #000;}
.PN {font-family: "Times New Roman", Times, serif; font-weight: normal; color: #00F;}
.PG {font-family: "Times New Roman", Times, serif; font-weight: bold; color: #00F;}
.TN {font-family: "Angsana New"; font-weight: normal; color: #F00;}
-->
</style>
</head>
<body lang="fr">
<span class="FAG">abeille</span><span class="FTN">........................................................................... </span><span class="PN">\phugn</span><span class="FAG"> - </span><span class="TN">ผึ้ง</span><br>
</body>
</html>

And this is the garbled page I get with my code:

<html xmlns="http://www.w3.org/1999/xhtml">
਍㰀栀攀愀搀㸀ഀഀ
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
਍㰀琀椀琀氀攀㸀愀戀攀椀氀氀攀㰀⼀琀椀琀氀攀㸀ഀഀ
<style type="text/css">
਍㰀℀ⴀⴀഀഀ
body {font-size: medium;}
਍⸀䘀䄀䜀 笀昀漀渀琀ⴀ昀愀洀椀氀礀㨀 䄀爀椀愀氀Ⰰ 䠀攀氀瘀攀琀椀挀愀Ⰰ 猀愀渀猀ⴀ猀攀爀椀昀㬀 昀漀渀琀ⴀ眀攀椀最栀琀㨀 戀漀氀搀㬀 挀漀氀漀爀㨀 ⌀   㬀紀ഀഀ
.FTN {font-family: "Times New Roman", Times, serif; font-weight: normal; color: #000;}
਍⸀倀一 笀昀漀渀琀ⴀ昀愀洀椀氀礀㨀 ∀吀椀洀攀猀 一攀眀 刀漀洀愀渀∀Ⰰ 吀椀洀攀猀Ⰰ 猀攀爀椀昀㬀 昀漀渀琀ⴀ眀攀椀最栀琀㨀 渀漀爀洀愀氀㬀 挀漀氀漀爀㨀 ⌀  䘀㬀紀ഀഀ
.PG {font-family: "Times New Roman", Times, serif; font-weight: bold; color: #00F;}
਍⸀吀一 笀昀漀渀琀ⴀ昀愀洀椀氀礀㨀 ∀䄀渀最猀愀渀愀 一攀眀∀㬀 昀漀渀琀ⴀ眀攀椀最栀琀㨀 渀漀爀洀愀氀㬀 挀漀氀漀爀㨀 ⌀䘀  㬀紀ഀഀ
-->
਍㰀⼀猀琀礀氀攀㸀ഀഀ
</head>
਍㰀戀漀搀礀 氀愀渀最㴀∀昀爀∀㸀ഀഀ
<span class="FAG">abeille</span><span class="FTN">........................................................................... </span><span class="PN">\phugn</span><span class="FAG"> - </span><span class="TN">ผึ้ง</span><br>
਍㰀⼀戀漀搀礀㸀ഀഀ
</html>
਍

I tried changing the user agent, and eventually copied the exact user agent my browser sends (captured with Wireshark), but I still download the same garbled page instead of the right one.

How can I get the same HTML text in Python that a normal browser gets?

My guess is an encoding error (the HTML contains Thai), but changing the encoding in various ways did not help.


Solution

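  • The page is actually encoded as UTF-16, not UTF-8 as its meta tag claims, so the user agent is irrelevant:

    import urllib2

    url = "http://www.francais-thai.com/dicoweb/fran/00012.htm"
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)

    # Decode the raw bytes as UTF-16 instead of trusting the declared charset.
    print(response.read().decode("utf-16"))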
    

    Output:

    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>abeille</title>
    <style type="text/css">
    <!--
    body {font-size: medium;}
    .FAG {font-family: Arial, Helvetica, sans-serif; font-weight: bold; color: #000;}
    .FTN {font-family: "Times New Roman", Times, serif; font-weight: normal; color: #000;}
    .PN {font-family: "Times New Roman", Times, serif; font-weight: normal; color: #00F;}
    .PG {font-family: "Times New Roman", Times, serif; font-weight: bold; color: #00F;}
    .TN {font-family: "Angsana New"; font-weight: normal; color: #F00;}
    -->
    </style>
    </head>
    <body lang="fr">
    <span class="FAG">abeille</span><span class="FTN">........................................................................... </span><span class="PN">\phugn</span><span class="FAG"> - </span><span class="TN">ผึ้ง</span><br>
    </body>
    </html>
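
Rather than hard-coding the codec, the raw bytes can be checked for a byte order mark (BOM) before decoding. The following is a minimal Python 3 sketch, not part of the original answer: `decode_html` is a made-up helper name, the bytes are a synthetic stand-in for `response.read()`, and whether this particular page starts with a BOM is an assumption, which is why the helper takes a fallback encoding.

```python
import codecs


def decode_html(raw, fallback="utf-8"):
    """Decode raw response bytes, preferring a BOM over the fallback.

    Hypothetical helper for illustration; not part of urllib2 or requests.
    """
    if raw.startswith(codecs.BOM_UTF8):
        # Strip the three BOM bytes, then decode the rest as UTF-8.
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM, picks the byte order, and drops it.
        return raw.decode("utf-16")
    # No BOM found: fall back to the caller's guess.
    return raw.decode(fallback)


# Synthetic stand-in for response.read(): a BOM followed by UTF-16LE text.
raw = codecs.BOM_UTF16_LE + '<span class="TN">ผึ้ง</span>'.encode("utf-16-le")
text = decode_html(raw)
print(text)
```

A BOM check only helps when the server actually sends one; without it the helper silently uses the fallback, so a wrong fallback still produces garbled text or a `UnicodeDecodeError`.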
    

    requests has the same issue, and chardet reports the encoding as UTF-16LE:

    import chardet
    # response must be a fresh, unread response: read() returns the body only once.
    print chardet.detect(response.read())

    Output:

    {'confidence': 1.0, 'encoding': 'UTF-16LE'}
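
The alternating readable and garbled lines in the saved file are consistent with UTF-16LE bytes being written through a text-mode file on Windows, where every 0x0A byte is expanded to 0x0D 0x0A and the two-byte alignment shifts by one on each line. This is an assumption: the question does not state the platform (although the user agent mentions Windows), and the sample string below is invented for illustration. A minimal Python 3 sketch that reproduces the effect without touching the network:

```python
# Hypothetical sample standing in for the first lines of the page.
sample = "<html>\r\n<head>\r\n<meta>"

# What the server is assumed to send: UTF-16 little-endian bytes.
raw = sample.encode("utf-16-le")

# Simulate Windows text-mode writing ("w" instead of "wb"):
# every LF byte (0x0A) gets a CR byte (0x0D) inserted in front of it.
corrupted = raw.replace(b"\n", b"\r\n")

# Read the corrupted bytes back as UTF-16LE. Each inserted byte shifts the
# two-byte alignment, so the line after every newline is misread until the
# next inserted byte shifts it back.
garbled = corrupted.decode("utf-16-le")

print(repr(raw.decode("utf-16-le")))  # the intact text
print(repr(garbled))                  # "<head>" comes out as CJK-looking characters
```

In this sketch `<head>` is misread as the same `਍㰀栀攀愀搀㸀ഀഀ` sequence that appears in the garbled file above. If that is what happened, opening the output file with `"wb"` instead of `"w"` writes the downloaded bytes unchanged.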