
High-speed and low-memory way to store a float number inside a Numpy array


I have this number: 19576.4125. I want to save it inside a NumPy array, and I assumed that the fewer bits the dtype uses, the better. Is that right?

I tried saving it as a half and as a single, but I don't understand why the number changes.

  • My number: 19576.4125
  • Half: 19580.0
  • Single: 19576.412
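The values above are exactly what floating-point rounding predicts: a half (float16) has a 10-bit significand, so near 19576 adjacent representable values are 16 apart, while a single (float32) has a 24-bit significand, giving a spacing of about 0.002 at that magnitude. A minimal sketch using `np.spacing` to see this:

```python
import numpy as np

x = 19576.4125

# float16: 10-bit significand -> adjacent values near 19576 are 16 apart,
# so x snaps to the nearest multiple of 16 (stored as 19584.0; numpy
# prints its shortest round-tripping decimal, which can look like 19580.0).
print(np.float16(x), np.spacing(np.float16(x)))

# float32: 24-bit significand -> spacing here is 2**-9 ~= 0.00195,
# so the stored value is within ~0.001 of x but not exact.
print(np.float32(x), np.spacing(np.float32(x)))

# float64 carries ~15-16 significant decimal digits, plenty for 9 digits.
print(np.float64(x))
```

So a number with 9 significant decimal digits simply does not fit in float16 or float32; float64 is the first size that holds it exactly.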

This number is generated by a method I wrote to convert a datetime to a float. I could use a regular timestamp, but I don't need the seconds and milliseconds, so I wrote my own method that keeps only the date, hours, and minutes. (My database doesn't accept datetimes or timedeltas.)

This is my generator method:

from datetime import datetime


def get_timestamp() -> float:
    # replace() returns a new datetime; assign the result back,
    # otherwise the seconds/microseconds are not actually stripped
    now = datetime.now().replace(microsecond=0, second=0)
    _1970 = datetime(1970, 1, 1, 0, 0, 0)
    td = now - _1970
    days = td.days
    hours, remainder = divmod(td.seconds, 3600)
    minutes, _seconds = divmod(remainder, 60)
    timestamp = days + hours / 24 + minutes / 1440
    return round(timestamp, 4)
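Since `days + hours/24 + minutes/1440` is just the total number of minutes divided by 1440, the same value can be computed more directly. A hypothetical condensed version (`get_timestamp_simple` is my name, not from the original):

```python
from datetime import datetime


def get_timestamp_simple(now=None) -> float:
    # Same result as the original: minute-resolution days since the epoch.
    now = (now or datetime.now()).replace(second=0, microsecond=0)
    td = now - datetime(1970, 1, 1)
    # With seconds stripped, total_seconds() / 86400 equals
    # days + hours/24 + minutes/1440 exactly.
    return round(td.total_seconds() / 86400, 4)
```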

How I'm creating the array:

from numpy import array, half, single


__td = get_timestamp()
print(__td)
__array = array([__td], dtype=half)
print(type(__array[0]))
print(__array[0])
__array = array([__td], dtype=single)
print(type(__array[0]))
print(__array[0])

EDIT (08/07, 11:02 AM)

Hello. As the comments said, this number can't be stored exactly in a half or a single. So how do I save it with the best performance? Is it better to save it as an int (multiplied by 10000), as a float64, or as a string?

And no, I don't want a better way to save datetimes; I want a better way to save this float number with good performance. But thank you for the other replies.


Solution

  • I modified your function to take a timestamp argument:

    In [48]: def get_timestamp(now) -> float:
        ...:     #now = datetime.now()
    ...:     now = now.replace(microsecond=0, second=0)
       ...
        ...:     return round(timestamp, 4)
        ...:     
    

    and made a list of dates:

    In [49]: alist = [datetime.now() for _ in range(1000)]
    
    In [50]: timeit alist = [datetime.now() for _ in range(1000)]
    885 µs ± 2.27 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    

    Then I timed your function while making an array:

    In [51]: arr = np.array([get_timestamp(d) for d in alist])
    
    In [52]: timeit arr = np.array([get_timestamp(d) for d in alist])
    7.7 ms ± 16.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In [53]: arr.nbytes
    Out[53]: 8000
    

    and did the same, but using numpy's own conversion to an 8-byte element:

    In [54]: barr = np.array(alist,dtype='datetime64[m]')
    
    In [55]: barr.nbytes
    Out[55]: 8000
    
    In [56]: timeit barr = np.array(alist,dtype='datetime64[m]')
    7.87 ms ± 38.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    So the conversion time is basically the same. In terms of computation and memory, your function is just as good.

    Saving as a 4-byte element (float or int) would cut the memory use in half, but unless you are hitting memory errors with millions of values, that effort is rarely worth it.
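    If the 4-byte route is ever needed, the int-times-10000 idea from the question works, because minute-resolution values scaled by 10**4 stay well below the int32 limit. A sketch under that assumption:

    ```python
    import numpy as np

    arr = np.array([19576.3799, 19576.4125])          # float64, 8 bytes each

    # Fixed-point encoding: 4 decimal places -> scale by 10**4.
    # 19576.4125 * 10000 = 195_764_125, far below the int32 max (~2.1e9).
    scaled = np.round(arr * 10_000).astype(np.int32)  # 4 bytes each

    restored = scaled / 10_000                        # back to float64
    print(scaled.nbytes, arr.nbytes)                  # half the memory
    print(np.allclose(restored, arr))
    ```

    The trade-off is an extra multiply/divide on every round trip, which only pays off when memory is the actual bottleneck.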

    datetime64 has already worked out the conversion both ways. I imagine the interface to pandas is also good, though pandas has its own datetime formats and tricks; after all, it's designed to handle time series.
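    To make "both ways" concrete: `np.datetime64` truncates to the unit you ask for, the array's underlying storage is just an int64 count of minutes since the epoch, and `.item()` gets you back to a stdlib datetime. A small sketch:

    ```python
    import numpy as np
    from datetime import datetime

    d = datetime(2023, 8, 7, 9, 7, 30)

    m = np.datetime64(d, 'm')            # truncates seconds: 2023-08-07T09:07
    print(m)

    # The storage is simply an int64 count of minutes since 1970-01-01.
    print(m.astype('int64'))

    # And back to a stdlib datetime (via second precision for .item()).
    print(m.astype('datetime64[s]').item())
    ```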

    pandas

    In [64]: import pandas as pd
    
    In [65]: df = pd.DataFrame({'a':arr, 'b':barr})
    
    In [66]: df
    Out[66]: 
                  a                   b
    0    19576.3799 2023-08-07 09:07:00
    1    19576.3799 2023-08-07 09:07:00
    2    19576.3799 2023-08-07 09:07:00
    3    19576.3799 2023-08-07 09:07:00
    4    19576.3799 2023-08-07 09:07:00
    ..          ...                 ...
    995  19576.3799 2023-08-07 09:07:00
    996  19576.3799 2023-08-07 09:07:00
    997  19576.3799 2023-08-07 09:07:00
    998  19576.3799 2023-08-07 09:07:00
    999  19576.3799 2023-08-07 09:07:00
    
    [1000 rows x 2 columns]
    
    In [67]: df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1000 entries, 0 to 999
    Data columns (total 2 columns):
     #   Column  Non-Null Count  Dtype        
    ---  ------  --------------  -----        
     0   a       1000 non-null   float64      
     1   b       1000 non-null   datetime64[s]
    dtypes: datetime64[s](1), float64(1)
    memory usage: 15.8 KB
    

    Interestingly, if I save the datetime list directly to a dataframe, it's faster:

    In [81]: df = pd.DataFrame({'c':alist})
    
    In [82]: df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1000 entries, 0 to 999
    Data columns (total 1 columns):
     #   Column  Non-Null Count  Dtype         
    ---  ------  --------------  -----         
     0   c       1000 non-null   datetime64[ns]
    dtypes: datetime64[ns](1)
    memory usage: 7.9 KB
    
    In [83]: timeit df = pd.DataFrame({'c':alist})
    5.29 ms ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)