byte array numpy Python 2 vs Python 3

I need to generate tuples of the form (string, string) or (string, int).

I have the following code, which works fine in Python 2 but does not return the desired result in Python 3 (tested on Python 3.5):

import string
import numpy as np

global_tab     = []
global_nb_loop = 0

def numpy_test(N=2000000):
    global global_tab
    global global_nb_loop
    global_nb_loop = N

    print("Generate %d lines" % global_nb_loop)
    global_tab = [(u.tostring(),str(v)) for u,v in zip( np.random.choice(list(string.ascii_letters.encode("utf-8")), (N, 15)), np.random.randint(0, 100, N) )]
    print("%d lines generated" % len(global_tab))

numpy_test(10)

for x in range(10):
    print("%d : %s" % (x, global_tab[x]))

In Python 2, the results are:

Generate 10 lines
10 lines generated
0 : ('zvtMIBpQZhjpyqt', '63')
1 : ('mVMkbqBHetqEJdc', '70')
2 : ('uWAwOYIBwzyDdhR', '54')
3 : ('WZvXdFYewrOIYfp', '90')
4 : ('uzszDaTwajsADag', '37')
5 : ('HmBSpSBbQeOixII', '88')
6 : ('VACSDjDtQqqjPWh', '84')
7 : ('XiZJbYQkgpgohMJ', '93')
8 : ('JiFSbeUBYtqhXQk', '93')
9 : ('xLuBXBGYPTogDwo', '41')

In Python 3.5, the results look like this:

Generate 10 lines
10 lines generated
0 : (b'z\x00\x00\x00v\x00\x00\x00t\x00\x00\x00M\x00\x00\x00I\x00\x00\x00B\x00\x00\x00p\x00\x00\x00Q\x00\x00\x00Z\x00\x00\x00h\x00\x00\x00j\x00\x00\x00p\x00\x00\x00y\x00\x00\x00q\x00\x00\x00t\x00\x00\x00', '63')
1 : (b'm\x00\x00\x00V\x00\x00\x00M\x00\x00\x00k\x00\x00\x00b\x00\x00\x00q\x00\x00\x00B\x00\x00\x00H\x00\x00\x00e\x00\x00\x00t\x00\x00\x00q\x00\x00\x00E\x00\x00\x00J\x00\x00\x00d\x00\x00\x00c\x00\x00\x00', '70')
2 : (b'u\x00\x00\x00W\x00\x00\x00A\x00\x00\x00w\x00\x00\x00O\x00\x00\x00Y\x00\x00\x00I\x00\x00\x00B\x00\x00\x00w\x00\x00\x00z\x00\x00\x00y\x00\x00\x00D\x00\x00\x00d\x00\x00\x00h\x00\x00\x00R\x00\x00\x00', '54')
3 : (b'W\x00\x00\x00Z\x00\x00\x00v\x00\x00\x00X\x00\x00\x00d\x00\x00\x00F\x00\x00\x00Y\x00\x00\x00e\x00\x00\x00w\x00\x00\x00r\x00\x00\x00O\x00\x00\x00I\x00\x00\x00Y\x00\x00\x00f\x00\x00\x00p\x00\x00\x00', '90')
4 : (b'u\x00\x00\x00z\x00\x00\x00s\x00\x00\x00z\x00\x00\x00D\x00\x00\x00a\x00\x00\x00T\x00\x00\x00w\x00\x00\x00a\x00\x00\x00j\x00\x00\x00s\x00\x00\x00A\x00\x00\x00D\x00\x00\x00a\x00\x00\x00g\x00\x00\x00', '37')
5 : (b'H\x00\x00\x00m\x00\x00\x00B\x00\x00\x00S\x00\x00\x00p\x00\x00\x00S\x00\x00\x00B\x00\x00\x00b\x00\x00\x00Q\x00\x00\x00e\x00\x00\x00O\x00\x00\x00i\x00\x00\x00x\x00\x00\x00I\x00\x00\x00I\x00\x00\x00', '88')
6 : (b'V\x00\x00\x00A\x00\x00\x00C\x00\x00\x00S\x00\x00\x00D\x00\x00\x00j\x00\x00\x00D\x00\x00\x00t\x00\x00\x00Q\x00\x00\x00q\x00\x00\x00q\x00\x00\x00j\x00\x00\x00P\x00\x00\x00W\x00\x00\x00h\x00\x00\x00', '84')
7 : (b'X\x00\x00\x00i\x00\x00\x00Z\x00\x00\x00J\x00\x00\x00b\x00\x00\x00Y\x00\x00\x00Q\x00\x00\x00k\x00\x00\x00g\x00\x00\x00p\x00\x00\x00g\x00\x00\x00o\x00\x00\x00h\x00\x00\x00M\x00\x00\x00J\x00\x00\x00', '93')
8 : (b'J\x00\x00\x00i\x00\x00\x00F\x00\x00\x00S\x00\x00\x00b\x00\x00\x00e\x00\x00\x00U\x00\x00\x00B\x00\x00\x00Y\x00\x00\x00t\x00\x00\x00q\x00\x00\x00h\x00\x00\x00X\x00\x00\x00Q\x00\x00\x00k\x00\x00\x00', '93')
9 : (b'x\x00\x00\x00L\x00\x00\x00u\x00\x00\x00B\x00\x00\x00X\x00\x00\x00B\x00\x00\x00G\x00\x00\x00Y\x00\x00\x00P\x00\x00\x00T\x00\x00\x00o\x00\x00\x00g\x00\x00\x00D\x00\x00\x00w\x00\x00\x00o\x00\x00\x00', '41')

Of course, if I remove all the \x00 bytes, I get the desired result.

The behaviour is tied to Python 3.5 rather than to the platform: Python 3.5 on both Windows and Linux returns the same kind of byte array.

How can I get the Python 2 form of the result in Python 3.5?

This script will be used to generate packages of 2,000,000 rows. NumPy was the best option for this generation, running faster than a multiprocessing solution, but the final result in Python 3.5 isn't the one expected.

Any ideas? The code must run as fast as possible on several platforms (Windows, Linux, Mac).

Solution

Why

In Python 2, string.ascii_letters is a byte string to begin with. When you call .encode('utf-8') on it, Python 2 first implicitly decodes it with the default encoding and then re-encodes it as requested. The result of encoding is bytes in both Python 2 and 3.

In Python 3, a byte string behaves differently when iterated over: it yields integers, not byte strings of length 1:

In [52]: list(string.ascii_letters.encode('utf-8'))
Out[52]: 
[97,
 98,
 99,
 ...
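
A minimal sketch of the same behaviour outside IPython, for Python 3 only. The variable names are illustrative and not part of the original code:

```python
import string

letters = string.ascii_letters.encode("utf-8")

# Iterating a bytes object in Python 3 yields integers in the range 0-255.
first_three = list(letters)[:3]
print(first_three)       # [97, 98, 99]

# Slicing keeps the bytes type, so length-1 slices give 1-byte strings.
as_single_bytes = [letters[i:i + 1] for i in range(3)]
print(as_single_bytes)   # [b'a', b'b', b'c']
```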

Thus in Python 3 the result of

np.random.choice(list(string.ascii_letters.encode('utf-8')), (N, 15))

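is not N arrays of 15 1-byte string elements. It is N arrays of 15 integers. When you later call .tostring() to obtain the raw bytes of the array, each element is written out as a 4- or 8-byte integer, depending on the platform's default integer size, and the unused bytes are the \x00 padding you see. In your example you seem to get 4 bytes per element; on this machine they are 8.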
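
The element size can be checked directly. The sketch below assumes Python 3 and uses tobytes(), the newer name for tostring() (which later NumPy releases deprecate). The variable names are illustrative:

```python
import string
import numpy as np

codes = list(string.ascii_letters.encode("utf-8"))  # plain Python ints in Python 3
row = np.random.choice(codes, 15)                   # platform default integer dtype

raw = row.tobytes()  # raw memory of the 15 integers
# itemsize is the number of bytes per element (typically 4 or 8),
# so the raw buffer is 15 * itemsize bytes long instead of 15.
print(row.dtype, row.itemsize, len(raw))
```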

Possible fixes

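One option is to add a cast. Note, however, that casting integers to '|S1' keeps only the first character of each number's decimal text (97 becomes b'9', 116 becomes b'1'), so, as the output shows, this variant produces digits rather than the original letters: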

In [63]: [(u.tostring(),str(v)) for u, v in zip(
    np.random.choice(list(string.ascii_letters.encode("utf-8")),
                     (N, 15)).astype('|S1'),  # Cast to array-protocol type string
    np.random.randint(0, 100, N))]
Out[63]: 
[(b'811881611111171', '82'),
 (b'816878668111171', '46'),
 (b'811118881668718', '53'),
 (b'971861817181818', '49'),
 (b'118618991678978', '81'),
 ...
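
The output above contains digits because each integer is converted to its decimal text and truncated to one character. A cast that keeps the letters is a cast to a one-byte integer type. The following is a sketch of that alternative, not something timed in this answer; it uses np.uint8 and tobytes(), and N is kept small for illustration:

```python
import string
import numpy as np

N = 5  # small illustrative size
codes = list(string.ascii_letters.encode("utf-8"))

# Cast to uint8 so each letter occupies exactly one byte.
rows = np.random.choice(codes, (N, 15)).astype(np.uint8)
numbers = np.random.randint(0, 100, N)

result = [(u.tobytes(), str(v)) for u, v in zip(rows, numbers)]
for item in result:
    print(item)
```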

Another would be to skip the encoding entirely, trust the native string types if possible (unless you really do need byte strings) and use str.join():

In [74]: [(''.join(u), str(v)) for u, v in zip( 
    np.random.choice(list(string.ascii_letters),
                     (N, 15)),
    np.random.randint(0, 100, N))]
Out[74]: [('IJTlleYqZXmSJaW', '32')]

A third would be to wrap with bytearray() instead of list(), so that NumPy builds an array with one byte per element:

In [95]: [(u.tostring(), str(v)) for u, v in zip(
    np.random.choice(bytearray(string.ascii_letters.encode('utf-8')),
                     (N, 15)),
    np.random.randint(0, 100, N))]
Out[95]: [(b'MPvbDEQIdAVBQVz', '83')]

Some timings

Here's how they performed on this machine in python 3 with N = 2000000:

The original without the (required) cast:

In [69]: %timeit [(u.tostring(), str(v)) for u, v in zip( np.random.choice(list(string.ascii_letters.encode('utf-8')), (N, 15)), np.random.randint(0, 100, N))]
1 loops, best of 3: 4.62 s per loop

With the cast:

In [70]: %timeit [(u.tostring(), str(v)) for u, v in zip( np.random.choice(list(string.ascii_letters.encode('utf-8')), (N, 15)).astype('|S1'), np.random.randint(0, 100, N))]
1 loops, best of 3: 7.07 s per loop

Using native string type and join:

In [71]: %timeit [(''.join(u), str(v)) for u, v in zip( np.random.choice(list(string.ascii_letters), (N, 15)), np.random.randint(0, 100, N))]
1 loops, best of 3: 12.1 s per loop

Wrapping with bytearray():

In [93]: %timeit [(u.tostring(), str(v)) for u, v in zip( np.random.choice(bytearray(string.ascii_letters.encode('utf-8')), (N, 15)), np.random.randint(0, 100, N))]
1 loops, best of 3: 4.56 s per loop
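
If the per-row method call turns out to be the bottleneck, one variant worth measuring is to keep the data as uint8 and reinterpret each 15-byte row as a fixed-width byte string in a single step. This is an untimed sketch under that assumption, so whether it is actually faster on your machines would need checking with %timeit. The names alphabet, names and result are illustrative:

```python
import string
import numpy as np

N = 5  # small illustrative size
alphabet = np.frombuffer(string.ascii_letters.encode("utf-8"), dtype=np.uint8)

# Sampling from a uint8 array keeps one byte per letter, shape (N, 15).
rows = np.ascontiguousarray(np.random.choice(alphabet, (N, 15)))

# Reinterpret each row of 15 bytes as one 'S15' element, then convert
# the whole column to a list of Python bytes objects at once.
names = rows.view("S15").ravel().tolist()
numbers = np.random.randint(0, 100, N).astype(str).tolist()

result = list(zip(names, numbers))
print(result[0])
```

Call .decode("ascii") on each name if native str values are needed instead of bytes.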