Tags: python, pandas, dataframe, performance, optimization

Survey data with many periods: transforming to current and previous period (wide to long format)


I have a data frame (survey data) called df that looks like this (this is sample data):

respondent_id  r1age  r2age  r3age  r4age  r1smoke  r2smoke  r3smoke  r4smoke  r1income  r2income  r3income  r4income
        16178     35     38     41     44        1        1        1        1        60        62        68        70
       161719     65     68     71     74        0        0        0        1        50        52        54        56
       161720     47     50     53     56        0        1        0        1        80        82        85        87
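
For reference, the sample above can be reproduced with (values taken directly from the table):

import pandas as pd

df = pd.DataFrame({
    "respondent_id": [16178, 161719, 161720],
    "r1age": [35, 65, 47], "r2age": [38, 68, 50],
    "r3age": [41, 71, 53], "r4age": [44, 74, 56],
    "r1smoke": [1, 0, 0], "r2smoke": [1, 0, 1],
    "r3smoke": [1, 0, 0], "r4smoke": [1, 1, 1],
    "r1income": [60, 50, 80], "r2income": [62, 52, 82],
    "r3income": [68, 54, 85], "r4income": [70, 56, 87],
})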

The number after the "r" (or "h" and "s" in my real data; see below) represents the wave or period of each interview. In this particular example there are only four interviews per respondent, and data for 3 different variables: age, whether the respondent smokes, and gross annual income (in $10,000s).

I'm interested in transforming this to get the following instead:

respondent_id  t_1_period  t_age  t_1_age  t_smoke  t_1_smoke  t_income  t_1_income
        16178           1     38       35        1          1        62          60
        16178           2     41       38        1          1        68          62
        16178           3     44       41        1          1        70          68
       161719           1     68       65        0          0        52          50
       161719           2     71       68        0          0        54          52
       161719           3     74       71        1          0        56          54
       161720           1     50       47        1          0        82          80
       161720           2     53       50        0          1        85          82
       161720           3     56       53        1          0        87          85

That is, each respondent is repeated so that they have one observation per transition, i.e. the number of interviews/waves minus 1. For each variable there must be a t (current period) and a t_1 (previous period) column for every transition. Additionally, I add a t_1_period column recording the number of the previous period for that observation.

I have tried the following:

import pandas as pd

df = pd.melt(df, id_vars=["respondent_id"])

variable_names = ["age", "smoke", "income"]

new_rows = []
for respondent_id in df["respondent_id"].unique():
    df_temp = df[df["respondent_id"] == respondent_id]

    for i in range(2, 5):  # waves 2-4, each paired with the previous wave
        new_row = {"respondent_id": respondent_id, "t_1_period": i - 1}
        for var in variable_names:
            # all columns in this sample use the "r" prefix; in my real data
            # some variables use "h" or "s" instead and get the same treatment
            current_var = f"r{i}{var}"
            previous_var = f"r{i - 1}{var}"
            new_row[f"t_{var}"] = df_temp.loc[df_temp["variable"] == current_var, "value"].values[0]
            new_row[f"t_1_{var}"] = df_temp.loc[df_temp["variable"] == previous_var, "value"].values[0]

        new_rows.append(new_row)

df_periods = pd.DataFrame(new_rows)

In my real data I have many more than 3 variables, sometimes up to 100, and many respondents, up to 50,000 for example. All variables are present for all periods (the columns always exist), although some may contain NaNs. Note that some variables start with "h" instead of "r", and others with "s" (not present in this example).

My question: is there a faster way of doing this transformation? Every time I want to reshape the data into this t vs. t-1 format for the variables I include in variable_names, I have to wait a long time. I believe there must be a better way. I appreciate your help, thank you.


Solution

  • There are many ways to approach this. wide_to_long is an option, but you would need to pre-process the column names, since it expects the stubnames as prefixes rather than suffixes; a sketch of that pre-processing follows.
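
    For illustration, the pre-processing could look something like this (a sketch; the renaming regex assumes the prefix/wave/name column pattern shown above):

    import re
    import pandas as pd

    # move the wave number to the end so the stubnames become prefixes,
    # e.g. 'r1age' -> 'age1'; names without a wave number, like
    # 'respondent_id', don't match the pattern and are left untouched
    renamed = df.rename(columns=lambda c: re.sub(r'^[rhs](\d+)(\D+)$', r'\2\1', c))

    long_df = pd.wide_to_long(renamed, stubnames=['age', 'smoke', 'income'],
                              i='respondent_id', j='period')

    This still requires listing the stubnames, and you would still need a shift step (as below) to pair each period with the previous one.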

    I'd suggest using a MultiIndex and stack instead; here is an example that doesn't require knowing the stubnames:

    import pandas as pd

    # set aside respondent_id and split the remaining column names into
    # (wave number, variable name) pairs to build a MultiIndex
    tmp = df.set_index('respondent_id')
    tmp.columns = pd.MultiIndex.from_frame(tmp.columns.str.extract(r'[rh](\d+)(\D+)'),
                                           names=['t_1_period', None])
    
    # reshape: move the wave level into the index (future_stack needs pandas >= 2.1)
    tmp = tmp.stack(0, future_stack=True)
    
    # concatenate the long format with a shifted version of itself;
    # shift(-1) within each respondent aligns every row with the next wave,
    # so the shifted frame holds the "current" (t) values
    out = (pd.concat([tmp.groupby(level=0).shift(-1), tmp], keys=['t', 't_1'], axis=1)
             .sort_index(axis=1, level=1, sort_remaining=False)
          )
    
    # flatten the MultiIndex columns into names like 't_age' / 't_1_age'
    out.columns = out.columns.map('_'.join)
    out.reset_index(inplace=True)
    
    # drop the last wave of each respondent (it has no following period)
    out = out[out['respondent_id'].duplicated(keep='last')].convert_dtypes()
    

    Output:

        respondent_id t_1_period  t_age  t_1_age  t_income  t_1_income  t_smoke  t_1_smoke
    0           16178          1     38       35        62          60        1          1
    1           16178          2     41       38        68          62        1          1
    2           16178          3     44       41        70          68        1          1
    4          161719          1     68       65        52          50        0          0
    5          161719          2     71       68        54          52        0          0
    6          161719          3     74       71        56          54        1          0
    8          161720          1     50       47        82          80        1          0
    9          161720          2     53       50        85          82        0          1
    10         161720          3     56       53        87          85        1          0
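
    Since some of your real columns start with "s", the extraction pattern above can be widened to r'[rhs](\d+)(\D+)'. And if you only want a subset of the variables (your variable_names), one option is to pre-filter the columns before reshaping. A sketch, where keep is a hypothetical selection list:

    import re

    keep = ['age', 'smoke', 'income']  # hypothetical subset of variables
    pattern = re.compile(rf"^[rhs]\d+(?:{'|'.join(keep)})$")
    cols = ['respondent_id'] + [c for c in df.columns if pattern.fullmatch(c)]

    # then run the same pipeline as above on the filtered frame
    tmp = df[cols].set_index('respondent_id')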