Tags: python, pandas, dataframe, data-analysis, real-time-data

How to specify the number of rows a pandas dataframe will have?


I have a pandas DataFrame to which I append one row of data every second, as shown below.

df.loc[time.strftime("%Y-%m-%d %H:%M:%S")] = [reading1, reading2, reading3]
>>> df
                     sensor1 sensor2 sensor3
2015-04-14 08:50:23    5.4     5.6     5.7
2015-04-14 08:50:24    5.5     5.6     5.8
2015-04-14 08:50:26    5.2     5.3     5.4

If I keep doing this, I will eventually run into memory problems, because the DataFrame grows without bound (and each append operates on the whole DataFrame).

I only need to keep the most recent X rows. That is, after the next append, the DataFrame should look like this:

>>> df
                     sensor1 sensor2 sensor3
(this row is gone)
2015-04-14 08:50:24    5.5     5.6     5.8
2015-04-14 08:50:26    5.2     5.3     5.4
2015-04-14 08:50:27    5.2     5.4     5.6

Is there a way to specify a maximum number of rows so that, whenever a new row is added, the oldest row is dropped at the same time, WITHOUT an explicit "check the length of the DataFrame; if it exceeds X, remove the first row; append the new row" step?

Like this, but for a Pandas DataFrame: https://stackoverflow.com/a/10155753/4783578


Solution

  • This example initializes a DataFrame with max_rows rows, all filled with None. It then iterates over a list of new rows: for each one, it shifts the DataFrame up by one position (which drops the oldest row) and writes the new row into the last position. You didn't specify how you wanted to treat the index, so I ignored it.

    import numpy as np
    import pandas as pd
    
    max_rows = 5
    cols = list('AB')
    
    # Initialize a placeholder DataFrame of max_rows rows filled with None
    df = pd.DataFrame({c: np.repeat([None], [max_rows]) for c in cols})
    
    new_rows = [pd.DataFrame({'A': [1], 'B': [10]}), 
                pd.DataFrame({'A': [2], 'B': [11]}),
                pd.DataFrame({'A': [3], 'B': [12]}),
                pd.DataFrame({'A': [4], 'B': [13]}),
                pd.DataFrame({'A': [5], 'B': [14]}),
                pd.DataFrame({'A': [6], 'B': [15]}),
                pd.DataFrame({'A': [7], 'B': [16]})]
    
    for row in new_rows:
        df = df.shift(-1)               # move every row up; the oldest row is dropped
        df.iloc[-1, :] = row.values[0]  # write the new row (as a 1-D array) into the last slot
    
    >>> df
       A   B
    0  3  12
    1  4  13
    2  5  14
    3  6  15
    4  7  16
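
    The shift-and-overwrite step above can be wrapped in a small function. The sketch below is one way to do that, not part of the original answer: push_row is a hypothetical helper name, and the buffer is pre-filled with NaN rather than None (an assumption) so that the columns stay float64 instead of becoming object dtype.

```python
import numpy as np
import pandas as pd

def push_row(df, values):
    """Hypothetical helper: drop the oldest row and write `values` into the last slot."""
    df = df.shift(-1)          # move every row up by one; the last row becomes NaN
    df.iloc[-1, :] = values    # overwrite the freed last row
    return df

max_rows = 5
# Assumption: NaN placeholders instead of None, so the columns stay float64
df = pd.DataFrame(np.nan, index=range(max_rows), columns=list("AB"))

# Same seven rows as in the example above
for a, b in zip(range(1, 8), range(10, 17)):
    df = push_row(df, [a, b])

print(df)
```

    The frame keeps exactly max_rows rows no matter how many rows are pushed; only the five most recent ones remain.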
    

    Let's use a real example with one year of stock prices for AAPL.

    from datetime import timedelta
    from pandas_datareader.data import DataReader
    
    # Note: the "yahoo" data source may no longer work in current pandas-datareader releases
    aapl = DataReader("AAPL", data_source="yahoo", start="2014-1-1", end="2015-1-1")
    cols = aapl.columns
    df = pd.DataFrame({c: np.repeat([None], [max_rows]) for c in aapl.columns})[cols]
    # Initialize a placeholder datetime index ending the day before the first observation
    df.index = pd.date_range(end=aapl.index[0] + timedelta(days=-1), periods=max_rows, freq='D')
    
    for timestamp, row in aapl.iterrows():
        df = df.shift(-1)              # drop the oldest row of values
        df.iloc[-1, :] = row.values    # write the newest row into the last slot
        idx = df.index[1:].tolist()    # drop the oldest label too, so labels stay aligned with values
        idx.append(timestamp)
        df.index = idx
    
    >>> df
                  Open    High     Low   Close       Volume Adj Close
    2014-12-24  112.58  112.71  112.01  112.01  1.44796e+07    111.57
    2014-12-26   112.1  114.52  112.01  113.99   3.3721e+07    113.54
    2014-12-29  113.79  114.77   113.7  113.91  2.75989e+07    113.46
    2014-12-30  113.64  113.92  112.11  112.52  2.98815e+07    112.08
    2014-12-31  112.82  113.13  110.21  110.38  4.14034e+07    109.95
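
If the rows do not have to live in a DataFrame at all times, a different technique from the one above is to buffer them in a collections.deque with maxlen set, which discards the oldest entry automatically, and to build a DataFrame only when one is needed. The following is a minimal sketch under that assumption; the readings are the ones from the question, and to_frame is a hypothetical helper name.

```python
from collections import deque

import pandas as pd

max_rows = 3
buffer = deque(maxlen=max_rows)  # the oldest entry is dropped automatically when full

# Readings taken from the question: (timestamp, sensor1, sensor2, sensor3)
readings = [
    ("2015-04-14 08:50:23", 5.4, 5.6, 5.7),
    ("2015-04-14 08:50:24", 5.5, 5.6, 5.8),
    ("2015-04-14 08:50:26", 5.2, 5.3, 5.4),
    ("2015-04-14 08:50:27", 5.2, 5.4, 5.6),
]
for reading in readings:
    buffer.append(reading)

def to_frame(buf):
    """Hypothetical helper: build a time-indexed DataFrame from the buffered rows."""
    frame = pd.DataFrame(list(buf), columns=["time", "sensor1", "sensor2", "sensor3"])
    frame["time"] = pd.to_datetime(frame["time"])
    return frame.set_index("time")

df = to_frame(buffer)
print(df)
```

Appending to the deque is cheap, and the cost of constructing the DataFrame is paid only when it is actually read. Whether that trade-off suits a once-per-second sensor loop depends on how often the frame is consumed.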