python-3.x, bz2

Python3: how to read the txt.bz2 file


There is a text file that has been compressed with bz2. The data in the text file looks like the following.

   1  x3, x32, f5

   0  f4, g6, h7, j9

   .............

I know how to load the uncompressed text file with the following code:

    rf = open('small.txt', 'r')
    lines = rf.readlines()
    lst_text = []
    lst_label = []
    for line in lines:
        line = line.rstrip('\n')
        label, text = line.split('\t')
        text_words = text.split(',')
        lst_text.append(text_words)
        lst_label.append(int(label))

But after the text file is compressed to small.txt.bz2, I want to use the following code to read the bz2 file, and I get an error.

import bz2

bz_file = bz2.BZ2File("small.txt.bz2")
lines = bz_file.readlines()
for line in lines:
    line = line.rstrip('\n')
    label, text = line.split('\t')
    text_words = text.split(',')
    print(label)

errors:

      line = line.rstrip('\n')
TypeError: a bytes-like object is required, not 'str'

Could you give me hints on how to deal with it? Code would be best. Thanks!


Solution

  • You get this error because a BZ2File object opens the file in binary mode, so each line is a bytes object, not a str. You could work around that with line = line.rstrip(b'\n'), but the result would still be a bytes object, so the later split('\t') and split(',') calls would fail with the same TypeError unless they were given bytes arguments too.

    But you should probably use bz2.open in text mode instead:

    with bz2.open("small.txt.bz2", "rt") as bz_file:
        for line in bz_file:
            label, text = line.rstrip('\n').split('\t')
            text_words = text.split(',')
            print(label)
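
For completeness, here is a minimal sketch that rebuilds the two lists from the question (lst_label and lst_text) straight from the compressed file. The function name load_labeled_bz2 is hypothetical, and the sketch assumes the file is UTF-8 encoded with a tab between the label and the comma-separated words. Skipping blank lines and stripping spaces around each word are additions that the original code does not make.

```python
import bz2


def load_labeled_bz2(path, encoding="utf-8"):
    """Read 'label<TAB>w1,w2,...' lines from a bz2-compressed text file.

    Hypothetical helper: the name, the UTF-8 default and the whitespace
    stripping are assumptions, not part of the original question.
    """
    labels, texts = [], []
    # "rt" opens the file in text mode, so each line is a str, not bytes.
    with bz2.open(path, "rt", encoding=encoding) as bz_file:
        for line in bz_file:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines instead of failing on split
            # Split only on the first tab: label on the left, words on the right.
            label, text = line.split("\t", 1)
            labels.append(int(label))
            texts.append([word.strip() for word in text.split(",")])
    return labels, texts
```

It could then be called as lst_label, lst_text = load_labeled_bz2("small.txt.bz2"). If the file uses a different encoding, pass it through the encoding argument.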