Tags: python, arrays, numpy, dtype

Numpy string arrays - Strange behavior calling tobytes() on numpy string array


I am trying to use Numpy to vectorize an operation to parse a text file containing lines of numbers and convert the data into a numpy array. The data in the text file looks like this:

*** .txt file ***

1 0 0 0 0
2 1 0 0 0
3 1 1 0 0
4 0 1 0 0
5 0 0 1 0
6 1 0 1 0
7 1 1 1 0
8 0 1 1 0
9 0.5 0.5 0 0
10 0.5 0.5 1 0
11 0.5 0 0.5 0
12 1 0.5 0.5 0
13 0.5 1 0.5 0
14 0 0.5 0.5 0

*** /.txt file ***

My approach is to read the lines in using file.readlines(), then convert that list of line strings into a numpy array as follows (the file.readlines() part is omitted for testing):

short_list = ['1 0 0 0 0\n',
              '2 1 0 0 0\n',
              '3 1 1 0 0\n']

long_list = ['1 0 0 0 0\n',
             '2 1 0 0 0\n',
             '3 1 1 0 0\n',
             '4 0 1 0 0\n',
             '5 0 0 1 0\n',
             '6 1 0 1 0\n',
             '7 1 1 1 0\n',
             '8 0 1 1 0\n',
             '9 0.5 0.5 0 0\n',
             '10 0.5 0.5 1 0\n',
             '11 0.5 0 0.5 0\n',
             '12 1 0.5 0.5 0\n',
             '13 0.5 1 0.5 0\n',
             '14 0 0.5 0.5 0\n']


import numpy as np

def lines_to_npy(lines):
    n_lines = len(lines)
    # Convert the list of str lines to an array of fixed-width bytestrings
    lines_array = np.array(lines).astype('S')
    # Concatenate the raw bytes of every element and decode back to str
    tmp = lines_array.tobytes().decode('ascii')
    print(repr(tmp))
    print(lines_array.dtype)
    # Split on whitespace and reshape to one row per input line
    print(np.array(tmp.split(), dtype=np.int32).reshape(n_lines, -1))

lines_to_npy(short_list)
lines_to_npy(long_list)

Calling the function with short_list produces the following output:

'1 0 0 0 0\n2 1 0 0 0\n3 1 1 0 0\n'
|S10
[[1 0 0 0 0]
 [2 1 0 0 0]
 [3 1 1 0 0]]

This is the desired result (from reading around I gather that "|S10" means that each element in the array is a 10-character string for which the endianness doesn't matter). However, calling with the long list inserts several null characters \x00 at the end of each string, which makes it harder to parse.

'1 0 0 0 0\n\x00\x00\x00\x00\x002 1 0 0 0\n\x00\x00\x00\x00\x003 1 1 0 0\n\x00\x00\x00\x00\x004 0 1 0 0\n\x00\x00\x00\x00\x005 0 0 1 0\n\x00\x00\x00\x00\x006 1 0 1 0\n\x00\x00\x00\x00\x007 1 1 1 0\n\x00\x00\x00\x00\x008 0 1 1 0\n\x00\x00\x00\x00\x009 0.5 0.5 0 0\n\x0010 0.5 0.5 1 0\n11 0.5 0 0.5 0\n12 1 0.5 0.5 0\n13 0.5 1 0.5 0\n14 0 0.5 0.5 0\n'
|S15

Note that an error was raised in my function when loading the null characters into an array, preventing a final result. I know that a "cheap and dirty" solution would be to just strip the null characters off the end, and I also know that I could use Pandas to accomplish the main goal, but I'd like to understand why this behavior occurs.
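
For reference, that quick fix might look something like this (a sketch; note the dtype has to be a float type here, since the long list contains values like 0.5):

import numpy as np

# Cheap and dirty: drop the padding null bytes before splitting.
lines_array = np.array(long_list).astype('S')
tmp = lines_array.tobytes().decode('ascii').replace('\x00', '')
parsed = np.array(tmp.split(), dtype=np.float64).reshape(len(long_list), -1)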

The \x00 bytes are appended to pad each string to a length of 15. This makes some sense: the dtype of the short array was |S10, and each of its strings happened to be exactly 10 characters long. The long array contains 14 strings, its dtype is |S15, and extra \x00 bytes are appended to bring each item in the array up to 15 bytes.

I am confused because the number of elements in the list of strings (3 vs 14) has no correlation to the length of each string, so I don't understand why the dtype changes to |S15 when adding more list elements.


Update: I did some more research on ways to efficiently read data from a text file into a numpy array. I need a fast method because I am reading files with ~10M lines. numpy.loadtxt() and numpy.genfromtxt() are candidate solutions, but they are very slow because they are implemented in Python and basically do the same thing as manually looping through file.readlines(), stripping, and splitting the line strings (source). In my own testing, numpy.loadtxt() was about twice as slow as the aforementioned manual method, which was also noted here.

I found that with pandas.read_csv().to_numpy(), I was able to get a speedup of ~10x over looping through file.readlines(). See this answer here. Hopefully this helps anyone in the future with the same application.
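
For reference, a sketch of that approach (the filename and parameters are illustrative: sep=r'\s+' handles the space-delimited columns, and header=None because the file has no header row):

import pandas as pd

# Parses the whitespace-delimited file into a float64 array in one call.
data = pd.read_csv('myfile.txt', sep=r'\s+', header=None).to_numpy()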


Solution

  • I am trying to use Numpy to vectorize an operation to parse a text file containing lines of numbers and convert the data into a numpy array.

    Vectorization has nothing to do with reading your data. Calling, e.g., tmp.split() still invokes a plain Python function on a plain Python string object, creating lots of Python string objects as a result, all within the main Python bytecode interpreter loop. No amount of surrounding Numpy code will change that.

    That said, there is no meaningful performance gain to be had here anyway. Any halfway reasonable approach to reading and interpreting (i.e., parsing) the file is going to be lightning fast compared to fetching the contents from the hard drive, and much faster than even reading from an SSD.

    My approach is to read the lines in using file.readlines(), then convert that list of line strings into a numpy array as follows (the file.readlines() part is omitted for testing).

    Don't do that. This entire process is much more complex than necessary. Keep reading.

    tmp = lines_array.tobytes().decode('ascii')

    Aside from the padding discussed below, this just gives you back the original contents of the file, which you could have gotten directly with .read() instead of .readlines().
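
    Under the assumption that the whole file fits in memory, this reproduces that string directly (the filename is illustrative):

        with open('myfile.txt') as f:
            tmp = f.read()  # same text, no detour through a Numpy array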

    from reading around I gather that "|S10" means that each element in the array is a 10-character string for which the endianness doesn't matter

    Not quite; the elements are arrays (in the C sense) of 10 bytes each. They are not "strings"; they are raw data which is possibly interpreted as text.

    The string '1 0 0 0 0\n', when encoded to bytes using the default encoding, uses 10 bytes. So do all the other strings in short_list. Thus, "array of 10 bytes" is a suitable dtype.
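
    A quick check of that claim:

        # Every line of short_list encodes to exactly 10 bytes under ASCII.
        assert len('1 0 0 0 0\n'.encode('ascii')) == 10
        assert all(len(s.encode('ascii')) == 10 for s in short_list)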

    calling with the long list inserts several null characters \x00 at the end of each string, which makes it harder to parse.

    It does not insert "null characters"; it inserts null bytes (with a numeric value of 0). It does this because it takes 15 bytes to store the encoded representation of '14 0 0.5 0.5 0\n', and each element has to be the same size.
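
    That 15 is simply the length of the longest encoded line, and a quick check confirms it drives the element size:

        import numpy as np

        # The widest line in long_list determines the itemsize of the dtype.
        assert max(len(s.encode('ascii')) for s in long_list) == 15
        assert np.array(long_list).astype('S').dtype.itemsize == 15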

    Keep in mind here that the symbol 0 in your text is translated into a single byte which does not have a numeric value of zero. It has a numeric value of 48.
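
    To make that concrete:

        # The character '0' encodes to the byte value 48, not to a zero byte.
        assert ord('0') == 48
        assert b'0'[0] == 48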

    Again: all these encoding and re-encoding steps are not useful - you could have just used the original data from the file, via .read(). All that .readlines() does for you here is determine the number of lines in the file.


    But you neither want nor need to do any of that.

    The logic you want is built directly into Numpy. You should have found this out for yourself by using a search engine.

    You can directly ask Numpy to load the file for you, and you should do it that way: numpy.loadtxt('myfile.txt').
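
    A minimal sketch (the filename is illustrative; loadtxt defaults to whitespace-delimited float64, which matches the data shown above):

        import numpy as np

        # One call replaces the readlines/astype/tobytes/decode/split pipeline.
        data = np.loadtxt('myfile.txt')
        print(data.shape)  # (14, 5) for the sample file in the question
        print(data.dtype)  # float64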