HTTP GET Chinese character using luasocket


I use luasocket to GET a web page which contains Chinese characters "开奖结果" (the page itself is encoded in charset="gb2312"), as below:

local socket = require("socket")
local host = '61.129.89.226'
local fileformat = '/fcopen/cp_kjgg_dfw.jsp?lottery_type=ssq&lottery_issue=%s'
local function getlottery(num)
  local c = assert(socket.connect(host, 80))
  -- HTTP/1.0 request: the server closes the connection after the response
  c:send('GET ' .. string.format(fileformat, num) .. " HTTP/1.0\r\n\r\n")
  -- read line by line until receive() returns nil (connection closed)
  local content = c:receive('*l')
  while content do
    if content:find('开奖结果') then -- failed
      print(content)
    end
    content = c:receive('*l')
  end
  c:close()
end

--http://61.129.89.226/fcopen/cp_kjgg_dfw.jsp?lottery_type=ssq&lottery_issue=2012138
getlottery('2012138')

Unfortunately, it fails to match the expected characters:

content:find('开奖结果') -- failed

I know Lua is capable of finding Unicode characters:

Lua 5.1.4  Copyright (C) 1994-2008 Lua.org, PUC-Rio
> if string.find("This is 开奖结果", "开奖结果") then print("found!") end
found!

So I guess the problem is in how luasocket retrieves data from the web. Could anyone shed some light on this?

Thanks.


Solution

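  • If the page is encoded in GB2312 and your script (the file itself) is encoded in UTF-8, the match cannot work. string.find() compares raw bytes: the literal in your script holds the UTF-8 bytes for these characters, while the page sends the GB2312 bytes, so the search slides right past the text you're looking for. That is also why your interactive test succeeded: there, both strings were in the same encoding. The same four characters in each encoding (hex):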

              开    奖      结     果
    GB      bfaa   bdb1   bde1   b9fb
    UTF-16  5f00   5956   7ed3   679c
    UTF-8   e5bc80 e5a596 e7bb93 e69e9c
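
One possible workaround, sketched here under the assumption that the server really sends GB2312 bytes: spell the search string out as the GB2312 bytes from the table above, so the encoding of the script file no longer matters. The bytes are written as decimal escapes because Lua 5.1 has no \x hex escapes. The names NEEDLE_GB2312 and has_result_heading are made up for this illustration, and the sample strings stand in for lines returned by c:receive('*l').

```lua
-- GB2312 bytes for "开奖结果" (bfaa bdb1 bde1 b9fb) as decimal escapes.
local NEEDLE_GB2312 = "\191\170\189\177\189\225\185\251"

-- Hypothetical helper: true if the line contains the GB2312-encoded phrase.
local function has_result_heading(line)
  -- plain = true asks for a byte-for-byte substring search, no patterns
  return line:find(NEEDLE_GB2312, 1, true) ~= nil
end

-- Stand-in for a line received from a GB2312 page
local sample_gb = "<td>" .. NEEDLE_GB2312 .. "</td>"
-- The same phrase as UTF-8 bytes (e5bc80 e5a596 e7bb93 e69e9c)
local sample_utf8 = "<td>\229\188\128\229\165\150\231\187\147\230\158\156</td>"

print(has_result_heading(sample_gb))   -- GB2312 line matches
print(has_result_heading(sample_utf8)) -- UTF-8 line does not
```

In the original loop this would replace the content:find('开奖结果') test. Note that print(content) still writes GB2312 bytes, so the output will only be readable on a terminal that uses that encoding.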