Tags: python, arrays, string, numpy, data-conversion

Strange performance from NumPy array2string


I'm using NumPy's array2string to write an ASCII file. It outperforms Python string formatting in a loop or with map:

aa = np.array2string(array.flatten(), precision=precision, separator=' ', max_line_width=(precision + 4) * ncolumns, prefix='         ', floatmode='fixed')
aa =  '         ' + aa[1:-1] + '\n'

I noticed strange results when the number of elements is below a few thousand. The map-and-join version behaves as I expect: it is quicker for small arrays, where the overhead of the NumPy function dominates, and slower as the array gets large:

[Figure: perfplot benchmark "String Conversion Performance", runtime vs. number of vectors for the Native and NumPy methods]

What causes the spike in numpy.array2string? It is slower for a (100, 3) array than for a (500000, 3) array. NumPy is the best option for the size of my data (>1000), but the spike seems weird. Full code:

import numpy as np
import perfplot


precision = 16
ncolumns = 6

# numpy method
def numpystring(array, precision, ncolumns):
    indent = '          '
    aa = np.array2string(array.flatten(), precision=precision, separator=' ', max_line_width=(precision + 6) * ncolumns,
                     prefix='         ', floatmode='fixed')
    return indent + aa[1:-1] + '\n'

# native python string creation
def nativepython_string(array, precision, ncolumns):
    fmt = '{' + f":.{precision}f" + '}'
    data_str = ''

    # calculate number of full rows
    if array.size <= ncolumns:
        nrows = 1
    else:
        nrows = int(array.size / ncolumns)

    # write full rows
    for row in range(nrows):
        shift = row * ncolumns
        data_str += '          ' + ' '.join(
            map(lambda x: fmt.format(x), array.flatten()[0 + shift:ncolumns + shift])) + '\n'

    # write any remaining data in last non-full row
    if array.size > ncolumns and array.size % ncolumns != 0:
        data_str += '          ' + ' '.join(
            map(lambda x: fmt.format(x), array.flatten()[ncolumns + shift::])) + '\n'

    return data_str

# Benchmark methods
out = perfplot.bench(
    setup=lambda n: np.random.random([n,3]),  # setup random nx3 array
    kernels=[
        lambda a: nativepython_string(a, precision, ncolumns),
        lambda a: numpystring(a, precision, ncolumns)
    ],
    equality_check=None,
    labels=["Native", "NumPy"],
    n_range=[2**k for k in range(16)],
    xlabel="Number of vectors [Nr.]",
    title="String Conversion Performance"

)

out.show(
    time_unit="us",  # set to one of ("auto", "s", "ms", "us", or "ns") to force plot units
)
out.save("perf.png", transparent=True, bbox_inches="tight")
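
One thing worth checking before trusting the timings for large arrays is whether array2string actually emits every element. With NumPy's default print options, arrays with more than threshold elements (1000 by default) are summarized with "...", so far less text is produced. The following is a minimal sketch of that check, assuming the print options have not been changed; the array sizes and variable names are only illustrative:

```python
import sys
import numpy as np

# Assumption: default print options, i.e. threshold == 1000.
print(np.get_printoptions()["threshold"])

small = np.arange(900, dtype=float)    # fewer elements than the default threshold
large = np.arange(3000, dtype=float)   # more elements than the default threshold

s_small = np.array2string(small, precision=3, floatmode='fixed')
s_large = np.array2string(large, precision=3, floatmode='fixed')
# Raising the threshold forces every element to be formatted.
s_full = np.array2string(large, precision=3, floatmode='fixed',
                         threshold=sys.maxsize)

print('...' in s_small)            # is the small array summarized?
print('...' in s_large)            # is the large array summarized?
print(len(s_large), len(s_full))   # compare the amount of text produced
```

If the large array comes back summarized, the NumPy curve in the benchmark is not formatting the same amount of data as the native version beyond that size, and passing threshold=sys.maxsize to array2string would make the comparison like for like.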

Solution

  • An example of using savetxt with a small 2D array:

    In [87]: np.savetxt('test.txt', np.arange(24).reshape(3,8), fmt='%5d')
    In [88]: cat test.txt
        0     1     2     3     4     5     6     7
        8     9    10    11    12    13    14    15
       16    17    18    19    20    21    22    23
    
    In [90]: np.savetxt('test.txt', np.arange(24).reshape(3,8), fmt='%5d', newline=' ')
    In [91]: cat test.txt
        0     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20    21    22    23 
    

    Internally, savetxt builds a row format string from the fmt parameter and the number of columns:

    In [95]: fmt=' '.join(['%5d']*8)
    In [96]: fmt
    Out[96]: '%5d %5d %5d %5d %5d %5d %5d %5d'
    

    and then formats each row with it and writes the resulting line to the file:

    In [97]: fmt%tuple(np.arange(8))
    Out[97]: '    0     1     2     3     4     5     6     7'
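
The same idea can be applied to the layout in the question (a fixed number of fixed-precision floats per indented row). The sketch below is only an illustration, not a benchmarked replacement: the helper name savetxt_string and its defaults are hypothetical. It writes the full rows with np.savetxt into an in-memory buffer and formats any leftover elements by hand, since savetxt needs a rectangular array:

```python
import io
import numpy as np

def savetxt_string(array, precision=16, ncolumns=6, indent=" " * 10):
    """Return the flattened array as indented rows of `ncolumns` floats.

    Hypothetical helper: full rows go through np.savetxt, and a shorter
    final row (if any) is formatted with the % operator.
    """
    flat = np.asarray(array, dtype=float).ravel()
    nfull = (flat.size // ncolumns) * ncolumns   # elements that fill complete rows
    cell = f"%.{precision}f"                     # format for a single value
    buf = io.StringIO()

    if nfull:
        # A multi-format string with one specifier per column; savetxt
        # applies it to each row and appends the newline.
        row_fmt = indent + " ".join([cell] * ncolumns)
        np.savetxt(buf, flat[:nfull].reshape(-1, ncolumns), fmt=row_fmt)

    if nfull < flat.size:
        # Remaining elements that do not fill a complete row.
        rest = flat[nfull:]
        rest_fmt = " ".join([cell] * rest.size)
        buf.write(indent + rest_fmt % tuple(rest) + "\n")

    return buf.getvalue()

# Example call on a random (4, 3) array, as in the question's benchmark setup.
print(savetxt_string(np.random.random((4, 3)), precision=4), end="")
```

Whether this is faster than array2string or the map-and-join loop would need to be measured, for example by adding it as another kernel in the perfplot benchmark.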