python numpy character-encoding file-read

python numpy.loadtxt() crashing because of binary character in txt file


I am using this line to read a range of lines from a txt file, skipping the header and footer.

np_data = np.loadtxt(file, delimiter="\t", skiprows=12, max_rows=1024)

The problem is that the footer contains this character: ∞, which causes the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 4729: invalid start byte

Is there a way to skip that character or line? For me the combination of skiprows and max_rows does not seem to work. Thank you


Solution

  • Is there a way to skip that (...)line?

    According to the documentation, the first argument of numpy.loadtxt can be:

    File, filename, list, or generator to read. If the filename extension is .gz or .bz2, the file is first decompressed. Note that generators must return bytes or strings. The strings in a list or produced by a generator are treated as lines.

    So you can wrap the file handle in an iterator that skips the lines you do not want. Consider the following simple example, where file.csv contains

    1,2,3
    4,∞,6
    7,8,9
    

    then

    import numpy as np

    with open("file.csv", "rb") as f:
        # Keep only the lines that do not contain the UTF-8 bytes of ∞.
        lines = filter(lambda line: b"\xe2\x88\x9e" not in line, f)
        arr = np.loadtxt(lines, delimiter=",")
    print(arr)
    

    gives output

    [[1. 2. 3.]
     [7. 8. 9.]]
    

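    Explanation: I open file.csv in binary mode, then use filter to keep only those lines from the file handle f that do not contain the byte sequence \xe2\x88\x9e (the UTF-8 encoding of ∞). If your file uses a different encoding, the bytes to filter on will differ: the traceback in the question reports byte 0xb0, which is not part of that sequence.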
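
Since the rows wanted in the question are known by position (12 header lines, then 1024 data rows), the same idea can select lines by position instead of by content, so the encoding of the footer never matters. The following is only a sketch under that assumption: `load_rows` and the small demo file are hypothetical names and data made up for illustration, not part of the question.

```python
import itertools
import os
import tempfile

import numpy as np


def load_rows(path, skip, count, delimiter="\t"):
    """Parse only `count` lines of `path`, starting after the first `skip` lines."""
    with open(path, "rb") as f:  # binary mode: lines are bytes, nothing is decoded here
        # islice discards the first `skip` lines and stops after `count` more,
        # so the footer is never handed to loadtxt.
        wanted = itertools.islice(f, skip, skip + count)
        return np.loadtxt(wanted, delimiter=delimiter)


# Hypothetical demo file: 2 header lines, 3 data rows, 1 footer line with byte 0xb0.
demo = b"header a\nheader b\n1\t2\n3\t4\n5\t6\nrange: 0 to \xb0\n"
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(demo)
    demo_path = tmp.name

rows = load_rows(demo_path, skip=2, count=3)
os.remove(demo_path)
print(rows)

# Assumption about the asker's file: 0xb0 is the code for ∞ in Mac OS Roman.
guess = b"\xb0".decode("mac_roman")
print(guess)
```

With the numbers from the question, the call would be `load_rows(file, skip=12, count=1024)`. If the file really is Mac OS Roman (a guess based only on byte 0xb0 in the traceback), passing a matching `encoding` argument to `np.loadtxt` together with the original `skiprows` and `max_rows` may also avoid the decode error.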