
Inserting rows in specific location using pandas


I have a CSV file with the following structure (date, time, price; no header row):

2015-01-02,09:30:00,64.815
2015-01-02,09:35:00,64.8741
2015-01-02,09:55:00,65.0255
2015-01-02,10:00:00,64.9269

Using pandas, I would like to fill the missing 5-minute intervals by repeating the 2nd row: copies of it should be inserted after the 2nd row until the next existing timestamp, so that its price appears four times in total. Eventually, it should look like:

2015-01-02,09:30:00,64.815
2015-01-02,09:35:00,64.8741
2015-01-02,09:40:00,64.8741
2015-01-02,09:45:00,64.8741
2015-01-02,09:50:00,64.8741
2015-01-02,09:55:00,65.0255
2015-01-02,10:00:00,64.9269
2015-01-02,10:05:00,64.815

I have the following code:

    df = pd.read_csv("csv.file", header=0, names=['date', 'minute', 'price'])

    for i in range(len(df)):
        if i != len(df)-1:
            next_i = i+1
            if df.loc[next_i, 'date'] == df.loc[i, 'date'] and df.loc[i, 'minute'] != "16:00:00":
                now = int(df.loc[i, "minute"][:2]+df.loc[i, "minute"][3:5])
                future = int(df.loc[next_i, "minute"][:2]+df.loc[next_i, "minute"][3:5])
                
                while now + 5 != future and df.loc[next_i, "minute"][3:5] != "00" and df.loc[next_i, "minute"][3:5] != "60":
                    newminutes = str(int(df.loc[i, "minute"][3:5])+5*a)
                    newtime = df.loc[next_i, "minute"][:2] +":"+newminutes+":00"
                    df.loc[next_i-0.5] = [df.loc[next_i, 'date'], newtime , df.loc[i, 'price']]
                    df = df.sort_index().reset_index(drop=True)
                    now = int(newtime[:2]+newtime[3:5])
                    future = int(df.loc[next_i+1, "minute"][:2]+df.loc[next_i+1, "minute"][3:5])

However, it's not working.


Solution

  • I see there is an extra row in the expected output, 2015-01-02,10:05:00,64.815. To accommodate that as well, you can reindex using pd.date_range.

    Creating data

    import pandas as pd

    data = {
        'date': ['2015-01-02', '2015-01-02', '2015-01-02', '2015-01-02'],
        'time': ['09:30:00', '09:35:00', '09:55:00', '10:00:00'],
        'val': [64.815, 64.8741, 65.0255, 64.9269]
    }
    df = pd.DataFrame(data)
    

    Creating datetime column for reindexing

    df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])
    df.set_index('datetime', inplace=True)
    

    Generating output

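    # Put the data on a 5-minute grid, extend the grid to 10:05, then forward-fill the gaps
    df = df.resample('5min').asfreq().reindex(pd.date_range('2015-01-02 09:30:00', '2015-01-02 10:05:00', freq='5min')).ffill()
    # Rebuild the date and time columns from the new index
    df[['date', 'time']] = df.index.astype(str).to_series().str.split(' ', expand=True).values
    df.reset_index(drop=True)  # returns a new frame; df itself keeps the datetime index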
    

    Output

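    This gives the expected output, except that the forward-filled 10:05:00 row carries 64.9269 (the last known value) rather than the 64.815 shown in the question: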

             date      time      val
    0  2015-01-02  09:30:00  64.8150
    1  2015-01-02  09:35:00  64.8741
    2  2015-01-02  09:40:00  64.8741
    3  2015-01-02  09:45:00  64.8741
    4  2015-01-02  09:50:00  64.8741
    5  2015-01-02  09:55:00  65.0255
    6  2015-01-02  10:00:00  64.9269
    7  2015-01-02  10:05:00  64.9269
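
To see what each part of the chained call contributes, the steps can be run one at a time. This is a minimal sketch on the same sample data; the intermediate names gaps, grid, extended and filled are only illustrative and do not appear in the answer's code.

```python
import pandas as pd

# Same sample data as above, indexed by a combined timestamp.
df = pd.DataFrame({
    'date': ['2015-01-02'] * 4,
    'time': ['09:30:00', '09:35:00', '09:55:00', '10:00:00'],
    'val': [64.815, 64.8741, 65.0255, 64.9269],
})
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])
df = df.set_index('datetime')

# Step 1: put the data on a regular 5-minute grid; missing slots become NaN.
gaps = df[['val']].resample('5min').asfreq()

# Step 2: extend the grid past the last observation (10:05 is not in the data).
grid = pd.date_range('2015-01-02 09:30:00', '2015-01-02 10:05:00', freq='5min')
extended = gaps.reindex(grid)

# Step 3: carry the last known value forward into every NaN slot.
filled = extended.ffill()

print(gaps)
print(filled)
```

Printing gaps shows NaN at 09:40, 09:45 and 09:50; after the reindex there is one more NaN at 10:05, and ffill replaces all of them with the nearest earlier value.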
    

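    However, if that was a typo and you don't want the last row, run this instead of the previous snippet, starting again from the frame indexed by datetime:

    # Reindex only between the first and last existing timestamps
    df = df.resample('5min').asfreq().reindex(pd.date_range(df.index[0], df.index[-1], freq='5min')).ffill()
    df[['date', 'time']] = df.index.astype(str).to_series().str.split(' ', expand=True).values
    df.reset_index(drop=True)  # returns a new frame; df itself keeps the datetime index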
    

    which gives

             date      time      val
    0  2015-01-02  09:30:00  64.8150
    1  2015-01-02  09:35:00  64.8741
    2  2015-01-02  09:40:00  64.8741
    3  2015-01-02  09:45:00  64.8741
    4  2015-01-02  09:50:00  64.8741
    5  2015-01-02  09:55:00  65.0255
    6  2015-01-02  10:00:00  64.9269
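
For the original CSV layout from the question, the same idea can be sketched end to end. This is a hedged example, not the answer's own code: it assumes the file has no header row (so header=None is used rather than header=0), uses io.StringIO as a stand-in for the file, and relies on resample('5min').ffill(), which covers the range from the first to the last existing timestamp. The names raw, filled and csv_text are illustrative.

```python
import io
import pandas as pd

# Stand-in for the CSV file from the question (it has no header row).
raw = io.StringIO(
    "2015-01-02,09:30:00,64.815\n"
    "2015-01-02,09:35:00,64.8741\n"
    "2015-01-02,09:55:00,65.0255\n"
    "2015-01-02,10:00:00,64.9269\n"
)

# header=None keeps the first line as data instead of treating it as column names.
df = pd.read_csv(raw, header=None, names=['date', 'minute', 'price'])

# Build a timestamp index, then forward-fill onto a 5-minute grid.
df.index = pd.to_datetime(df['date'] + ' ' + df['minute'])
filled = df[['price']].resample('5min').ffill()

# Split the timestamp back into the original date / minute columns.
filled.insert(0, 'date', filled.index.strftime('%Y-%m-%d'))
filled.insert(1, 'minute', filled.index.strftime('%H:%M:%S'))

# Write back in the same headerless layout as the input.
csv_text = filled.to_csv(index=False, header=False)
print(csv_text)
```

Note that on a file spanning several days, a plain resample would also fill the overnight gap between the last row of one day and the first row of the next. If that is not wanted, the data would need to be handled one date at a time, for example by grouping on the date column before resampling.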