Tags: python, pandas, dataframe, performance, optimization

Survey data with many periods: transforming to current and previous period (wide to long format)


I have a data frame (survey data) called df that looks like this (this is sample data):

respondent_id  r1age  r2age  r3age  r4age  r1smoke  r2smoke  r3smoke  r4smoke  r1income  r2income  r3income  r4income
        16178     35     38     41     44        1        1        1        1        60        62        68        70
       161719     65     68     71     74        0        0        0        1        50        52        54        56
       161720     47     50     53     56        0        1        0        1        80        82        85        87
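
For reference, the sample above can be reproduced with (values taken directly from the table):

import pandas as pd

df = pd.DataFrame({
    "respondent_id": [16178, 161719, 161720],
    "r1age": [35, 65, 47], "r2age": [38, 68, 50],
    "r3age": [41, 71, 53], "r4age": [44, 74, 56],
    "r1smoke": [1, 0, 0], "r2smoke": [1, 0, 1],
    "r3smoke": [1, 0, 0], "r4smoke": [1, 1, 1],
    "r1income": [60, 50, 80], "r2income": [62, 52, 82],
    "r3income": [68, 54, 85], "r4income": [70, 56, 87],
})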

The number after the "r" (or "h" and "s" in my real data; see below) represents the wave or period of each interview. In this particular example there are only four interviews per respondent, and data for 3 different variables: age, whether the respondent smokes, and gross annual income (in $10,000s).

I'm interested in transforming this to get the following instead:

respondent_id  t_1_period  t_age  t_1_age  t_smoke  t_1_smoke  t_income  t_1_income
        16178           1     38       35        1          1        62          60
        16178           2     41       38        1          1        68          62
        16178           3     44       41        1          1        70          68
       161719           1     68       65        0          0        52          50
       161719           2     71       68        0          0        54          52
       161719           3     74       71        1          0        56          54
       161720           1     50       47        1          0        82          80
       161720           2     53       50        0          1        85          82
       161720           3     56       53        1          0        87          85

That is, each respondent is repeated so that they have one observation per transition, i.e. the number of interviews/waves minus 1. For each variable there must be a t (current period) and a t_1 (previous period) column for every transition. Additionally, I add a t_1_period column recording the number of the previous period for that observation.

I have tried the following:

import pandas as pd

df = pd.melt(df, id_vars=["respondent_id"])

variable_names = ["age", "smoke", "income"]

new_rows = []
for respondent_id in df["respondent_id"].unique():
    df_temp = df[df["respondent_id"] == respondent_id]

    for i in range(2, 5):  # waves 2-4, each paired with the previous wave
        new_row = {"respondent_id": respondent_id, "t_1_period": i - 1}
        for var in variable_names:
            # all columns in this sample use the "r" prefix; in my real data
            # some variables use "h" or "s" instead and get the same treatment
            current_var = f"r{i}{var}"
            previous_var = f"r{i - 1}{var}"
            new_row[f"t_{var}"] = df_temp.loc[df_temp["variable"] == current_var, "value"].values[0]
            new_row[f"t_1_{var}"] = df_temp.loc[df_temp["variable"] == previous_var, "value"].values[0]

        new_rows.append(new_row)

df_periods = pd.DataFrame(new_rows)

In my real data I have many more than 3 variables, sometimes up to 100, and many respondents, up to 50,000 for example. All variables are present for all periods (the columns always exist), although some may contain NaNs. Note that some variables start with "h" instead of "r", and others with "s" (not present in this example).

My question: is there a faster way of doing this transformation? Every time I want to reshape the data into this t vs. t-1 format for the variables I include in variable_names, I have to wait a long time. I believe there must be a better way. I appreciate your help, thank you.


Solution

  • There are many ways to approach this. wide_to_long is an option, but you would need to pre-process the column names, since it expects the stubnames as prefixes rather than suffixes; a sketch of that pre-processing follows.
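
    For illustration, the pre-processing could look something like this (a sketch; the renaming regex assumes the prefix/wave/name column pattern shown above):

    import re
    import pandas as pd

    # move the wave number to the end so the stubnames become prefixes,
    # e.g. 'r1age' -> 'age1'; names without a wave number, like
    # 'respondent_id', don't match the pattern and are left untouched
    renamed = df.rename(columns=lambda c: re.sub(r'^[rhs](\d+)(\D+)$', r'\2\1', c))

    long_df = pd.wide_to_long(renamed, stubnames=['age', 'smoke', 'income'],
                              i='respondent_id', j='period')

    This still requires listing the stubnames, and you would still need a shift step (as below) to pair each period with the previous one.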

    I'd suggest using a MultiIndex and stack instead; here is an example that doesn't require knowing the stubnames:

    import pandas as pd

    # set aside respondent_id and split the remaining column names into
    # (wave number, variable name) pairs to build a MultiIndex
    tmp = df.set_index('respondent_id')
    tmp.columns = pd.MultiIndex.from_frame(tmp.columns.str.extract(r'[rh](\d+)(\D+)'),
                                           names=['t_1_period', None])
    
    # reshape: move the wave level into the index (future_stack needs pandas >= 2.1)
    tmp = tmp.stack(0, future_stack=True)
    
    # concatenate the long format with a shifted version of itself;
    # shift(-1) within each respondent aligns every row with the next wave,
    # so the shifted frame holds the "current" (t) values
    out = (pd.concat([tmp.groupby(level=0).shift(-1), tmp], keys=['t', 't_1'], axis=1)
             .sort_index(axis=1, level=1, sort_remaining=False)
          )
    
    # flatten the MultiIndex columns into names like 't_age' / 't_1_age'
    out.columns = out.columns.map('_'.join)
    out.reset_index(inplace=True)
    
    # drop the last wave of each respondent (it has no following period)
    out = out[out['respondent_id'].duplicated(keep='last')].convert_dtypes()
    

    Output:

        respondent_id t_1_period  t_age  t_1_age  t_income  t_1_income  t_smoke  t_1_smoke
    0           16178          1     38       35        62          60        1          1
    1           16178          2     41       38        68          62        1          1
    2           16178          3     44       41        70          68        1          1
    4          161719          1     68       65        52          50        0          0
    5          161719          2     71       68        54          52        0          0
    6          161719          3     74       71        56          54        1          0
    8          161720          1     50       47        82          80        1          0
    9          161720          2     53       50        85          82        0          1
    10         161720          3     56       53        87          85        1          0
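
    Since some of your real columns start with "s", the extraction pattern above can be widened to r'[rhs](\d+)(\D+)'. And if you only want a subset of the variables (your variable_names), one option is to pre-filter the columns before reshaping. A sketch, where keep is a hypothetical selection list:

    import re

    keep = ['age', 'smoke', 'income']  # hypothetical subset of variables
    pattern = re.compile(rf"^[rhs]\d+(?:{'|'.join(keep)})$")
    cols = ['respondent_id'] + [c for c in df.columns if pattern.fullmatch(c)]

    # then run the same pipeline as above on the filtered frame
    tmp = df[cols].set_index('respondent_id')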