
How to read files without knowing their encoding and extract the desired URLs with Python 3?


Environment: Python 3.
There are many files; some of them are encoded in GBK, others in UTF-8. I want to extract all the .jpg URLs with a regular expression.

For example, s.html is encoded in GBK. Opening it with the default encoding fails:

tree = open("/tmp/s.html","r").read()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 135: invalid start byte

import re

# specifying the encoding explicitly works
tree = open("/tmp/s.html","r",encoding="gbk").read()
pat = r"http://.+\.jpg"
result = re.findall(pat,tree)
print(result)

['http://somesite/2017/06/0_56.jpg']

It is a huge job to open every file with its specific encoding. I want a smart way to extract the .jpg URLs from all the files.


Solution

  • If the files have mixed encodings, you could try one encoding and fall back to another:

    # first open as binary
    with open(..., 'rb') as f:
        f_contents = f.read()
        try:
            contents = f_contents.decode('UTF-8')
        except UnicodeDecodeError:
            contents = f_contents.decode('gbk')
        ...
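
    To apply that fallback to a whole directory, something like the following might work. It is only a sketch: the helper names (read_text, jpg_urls_in_dir), the "*.html" glob and the last-resort errors="replace" step are assumptions of mine, and the pattern is a tightened, non-greedy variant of the one in the question so that two URLs on one line are not merged into a single match.

```python
from pathlib import Path
import re

# Candidate encodings, tried in order. UTF-8 goes first because it is strict:
# GBK-encoded Chinese text will almost never decode as valid UTF-8, whereas
# UTF-8 bytes can often be decoded as GBK without an error (yielding garbage).
ENCODINGS = ("utf-8", "gbk")

# Hypothetical tightened pattern: stop at whitespace, quotes and angle brackets.
JPG_URL = re.compile(r"http://[^\s\"'<>]+?\.jpg")


def read_text(path, encodings=ENCODINGS):
    """Return the file's text, decoded with the first encoding that works."""
    data = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort (an assumption of this sketch): replace undecodable bytes.
    return data.decode(encodings[0], errors="replace")


def jpg_urls_in_dir(directory, pattern="*.html"):
    """Map each matching file in `directory` to the .jpg URLs found in it."""
    return {
        str(path): JPG_URL.findall(read_text(path))
        for path in sorted(Path(directory).glob(pattern))
    }
```

    Calling jpg_urls_in_dir("/tmp") would then return a dict keyed by file path, so a file that yields nothing is easy to spot and inspect by hand.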
    

    If they are HTML files, you may also be able to find the encoding tag (the charset declared in a meta tag), or search them as binary with a bytes regex:

    import re

    contents = open(..., 'rb').read()
    regex = re.compile(rb'http://.+\.jpg')
    result = regex.findall(contents)
    # each match is a bytes object; URLs are normally plain ASCII,
    # so url.decode('ascii') should be enough to turn them into str
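
    The other idea, finding the encoding tag, could look roughly like this. It is a sketch under assumptions: sniff_charset and decode_html are names I made up, the 2048-byte scan window is arbitrary, and a page can always declare a charset that does not match its actual bytes.

```python
import re

# Matches charset=... in both <meta charset="gbk"> and
# <meta http-equiv="Content-Type" content="text/html; charset=gbk">.
CHARSET_RE = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE)


def sniff_charset(data, default="utf-8", scan_bytes=2048):
    """Return the charset declared near the top of `data`, or `default`."""
    match = CHARSET_RE.search(data[:scan_bytes])
    return match.group(1).decode("ascii") if match else default


def decode_html(data):
    """Decode HTML bytes with the declared charset, replacing bad bytes."""
    encoding = sniff_charset(data)
    try:
        return data.decode(encoding, errors="replace")
    except LookupError:
        # The page declared a charset Python does not know; fall back to UTF-8.
        return data.decode("utf-8", errors="replace")
```

    The declaration itself is plain ASCII, which is why it can be searched for in the raw bytes before the encoding is known.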
    

    Though now that I think of it, you probably don't really want to use a regex to capture the links, since you'll then have to deal with HTML entities (such as &amp;). You may do better with something like pyquery.

    Here's a quick example using pyquery:

    import pyquery

    contents = open(..., 'rb').read()
    pq = pyquery.PyQuery(contents)
    images = pq.find('img')
    for img in images:
        img = pyquery.PyQuery(img)
        src = img.attr('src')
        # skip <img> tags that have no src attribute
        if src and src.endswith('.jpg'):
            print(src)
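
    If installing pyquery is not an option, the standard library's html.parser can do a similar job. This is a sketch, not a drop-in replacement: JpgSrcCollector and jpg_sources are hypothetical names, the case-insensitive .jpg check is my choice, and the input must already be decoded to str (for instance with the UTF-8/GBK fallback shown earlier).

```python
from html.parser import HTMLParser


class JpgSrcCollector(HTMLParser):
    """Collect the src of every <img> tag whose URL ends in .jpg."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        # attrs is a list of (name, value) pairs; entities in the values are
        # already unescaped, and value is None for attributes without one.
        src = dict(attrs).get("src")
        if src and src.lower().endswith(".jpg"):
            self.urls.append(src)


def jpg_sources(html_text):
    """Return the .jpg image URLs found in an HTML string."""
    collector = JpgSrcCollector()
    collector.feed(html_text)
    collector.close()
    return collector.urls
```

    Because the parser unescapes attribute values, a src written with &amp; in the markup comes back with a plain &, which is the entity problem the regex approach runs into.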
    

    I'm not at a computer with these libraries installed, so your mileage with these code samples may vary.