Search code examples
pythonpandasdataframetime-seriesfinancial

Lag values and differences in pandas dataframe with missing quarterly data


Though Pandas has time series functionality, I am still struggling with dataframes that have incomplete time series data.

See the pictures below, the lower picture has complete data, the upper has gaps. Both pics show correct values. In red are the columns that I want to calculate using the data in black. Column Cumm_Issd shows the accumulated issued shares during the year, MV is market value.

I want to calculate the issued shares per quarter (IssdQtr), the quarterly change in Market Value (D_MV_Q) and the MV of last year (L_MV_Y).

See for underlying cvs data this link for the full data and this link for the gapped data. There are two firms 1020180 and 1020201.

However, when I try Pandas shift method it fails when there are gaps, try yourself using the csv files and the code below. All columns (DiffEq, Dif1MV, Lag4MV) differ - for some quarters - from IssdQtr, D_MV_Q, L_MV_Y, respectively.

Are there ways to deal with gaps in data using Pandas?

import pandas as pd
import numpy as np
import os

dfg = pd.read_csv('example_soverflow_gaps.csv',low_memory=False)
dfg['date']     = pd.to_datetime(dfg['Period'], format='%Y%m%d')
dfg['Q']        = pd.DatetimeIndex(dfg['date']).to_period('Q')
dfg['year']     = dfg['date'].dt.year

dfg['DiffEq']    = dfg.sort_values(['Q']).groupby(['Firm','year'])['Cumm_Issd'].diff()
dfg['Dif1MV']    = dfg.groupby(['Firm'])['MV'].diff(1)
dfg['Lag4MV']    = dfg.groupby(['Firm'])['MV'].shift(4)

Gapped data:

enter image description here

Full data:

enter image description here


Solution

  • Solved the basic problem by using a merge. First, create a variable that shows the lagged date or quarter. Here we want last year's MV (4 quarters back):

    from pandas.tseries.offsets import QuarterEnd 
    dfg['lagQ'] = dfg['date'] + QuarterEnd(-4)
    

    Then create a data-frame with the keys (Firm and date) and the relevant variable (here MV).

    lagset=dfg[['Firm','date', 'MV']].copy()
    lagset.rename(columns={'MV':'Lag_MV', 'date':'lagQ'}, inplace=True)
    

    Lastly, merge the new frame into the existing one:

    dfg=pd.merge(dfg, lagset,  on=['Firm', 'lagQ'], how='left')