python csv gzip urllib2 urlopen

Error reading a gzipped CSV from a URL in Python: "_csv.Error: line contains NULL byte"


I am trying to read a gzipped CSV file from a URL. It is a large file with more than 50,000 lines. When I run the code below, I get an error: _csv.Error: line contains NULL byte

import csv
import urllib2   
url = '[my-url-to-csv-file].gz'
response = urllib2.urlopen(url)
cr = csv.reader(response)

for row in cr:
    if len(row) <= 1: continue
    print row

If I print the content of the response before parsing it, I get something like this:

?M}?7?M==??7M???z?YJ?????5{Ci?jK??3b??p?

?[?=?j&=????=?0u'???}mwBt??-E?m??Ծ??????WM??wj??Z??ėe?D?VF????4=Y?Y?tA???

How can I read the gzipped csv file from this URL properly?


Solution

  • How to Open a .gz (gzip) csv File from a URL with urllib2.urlopen

    1. Read the URL data into an in-memory file object. For this, you can use StringIO.StringIO().
    2. Decompress the .gz data with gzip.GzipFile().
    3. Read the decompressed data from the GzipFile object.

    To use your example:

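    from StringIO import StringIO
    import csv
    import gzip
    import urllib2

    url = '[my-url-to-csv-file].gz'
    # Buffer the compressed bytes in memory so GzipFile gets a seekable file object.
    mem = StringIO(urllib2.urlopen(url).read())
    f = gzip.GzipFile(fileobj=mem, mode='rb')

    # Parse the decompressed stream directly; looping over f.read() would yield single characters.
    for row in csv.reader(f):
        print row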
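
The code above targets Python 2 (urllib2, StringIO). On Python 3 the same three steps can be sketched roughly as follows. This is only an illustration: the helper name iter_gzipped_csv and the UTF-8 default are assumptions, not something from the question. The demo compresses a small payload in memory in place of the real download so that it runs offline.

```python
import csv
import gzip
import io


def iter_gzipped_csv(raw_stream, encoding="utf-8"):
    """Yield CSV rows from a binary file-like object holding gzip data.

    Hypothetical helper for illustration; the encoding is an assumption.
    """
    # GzipFile decompresses as it is read, so rows are produced
    # incrementally instead of loading the whole file first.
    with gzip.GzipFile(fileobj=raw_stream, mode="rb") as gz:
        # csv.reader in Python 3 expects text, so decode the bytes on the fly.
        text = io.TextIOWrapper(gz, encoding=encoding, newline="")
        for row in csv.reader(text):
            yield row


# Offline demonstration: an in-memory gzip payload stands in for the download.
sample = gzip.compress(b"id,name\n1,alice\n2,bob\n")
rows = list(iter_gzipped_csv(io.BytesIO(sample)))
print(rows)  # [['id', 'name'], ['1', 'alice'], ['2', 'bob']]

# With a real URL, the response object from urllib.request.urlopen(url)
# is a binary file-like object and should be usable as raw_stream:
#
#     import urllib.request
#     with urllib.request.urlopen(url) as response:
#         for row in iter_gzipped_csv(response):
#             print(row)
```

The garbled output shown in the question is the raw gzip stream, which is binary data and can contain NULL bytes. That is why csv.reader fails until the data has been decompressed.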