I have a data frame (survey data) called df that looks like this (sample data):
respondent_id | r1age | r2age | r3age | r4age | r1smoke | r2smoke | r3smoke | r4smoke | r1income | r2income | r3income | r4income |
---|---|---|---|---|---|---|---|---|---|---|---|---|
16178 | 35 | 38 | 41 | 44 | 1 | 1 | 1 | 1 | 60 | 62 | 68 | 70 |
161719 | 65 | 68 | 71 | 74 | 0 | 0 | 0 | 1 | 50 | 52 | 54 | 56 |
161720 | 47 | 50 | 53 | 56 | 0 | 1 | 0 | 1 | 80 | 82 | 85 | 87 |
The number after the "r" (or "h") represents the wave or period of each interview. In this particular example, there are only four interviews per respondent, and data for 3 different variables: age, whether the respondent smokes, and their gross annual income in units of $10,000.
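For reproducibility, here is a minimal snippet that rebuilds the sample frame above (values copied from the table):

import pandas as pd

df = pd.DataFrame({
    "respondent_id": [16178, 161719, 161720],
    "r1age": [35, 65, 47], "r2age": [38, 68, 50],
    "r3age": [41, 71, 53], "r4age": [44, 74, 56],
    "r1smoke": [1, 0, 0], "r2smoke": [1, 0, 1],
    "r3smoke": [1, 0, 0], "r4smoke": [1, 1, 1],
    "r1income": [60, 50, 80], "r2income": [62, 52, 82],
    "r3income": [68, 54, 85], "r4income": [70, 56, 87],
})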
I'm interested in transforming this to get the following instead:
respondent_id | t_1_period | t_age | t_1_age | t_smoke | t_1_smoke | t_income | t_1_income |
---|---|---|---|---|---|---|---|
16178 | 1 | 38 | 35 | 1 | 1 | 62 | 60 |
16178 | 2 | 41 | 38 | 1 | 1 | 68 | 62 |
16178 | 3 | 44 | 41 | 1 | 1 | 70 | 68 |
161719 | 1 | 68 | 65 | 0 | 0 | 52 | 50 |
161719 | 2 | 71 | 68 | 0 | 0 | 54 | 52 |
161719 | 3 | 74 | 71 | 1 | 0 | 56 | 54 |
161720 | 1 | 50 | 47 | 1 | 0 | 82 | 80 |
161720 | 2 | 53 | 50 | 0 | 1 | 85 | 82 |
161720 | 3 | 56 | 53 | 1 | 0 | 87 | 85 |
I'm interested in repeating the respondents so that the number of observations per respondent is the number of interviews/waves minus 1 (that is, the unique transitions), and so that each variable gets a t (current period) and a t_1 (previous period) column for each transition. Additionally, I add a t_1_period column representing the number of the previous period for that observation.
I have tried the following:
import pandas as pd

df = pd.melt(df, id_vars=["respondent_id"])
variable_names = ["age", "smoke", "income"]
new_rows = []
for respondent_id in df["respondent_id"].unique():
    df_temp = df[df["respondent_id"] == respondent_id]
    for i in range(2, 5):  # waves 2..4, each paired with the previous wave
        new_row = {"respondent_id": respondent_id, "t_1_period": i - 1}
        for var in variable_names:
            if var not in ["income"]:
                current_var = f"r{i}{var}"
                previous_var = f"r{i-1}{var}"
            else:  # income columns use the "h" prefix in my real data
                current_var = f"h{i}{var}"
                previous_var = f"h{i-1}{var}"
            new_row[f"t_{var}"] = df_temp[df_temp["variable"] == current_var]["value"].values[0]
            new_row[f"t_1_{var}"] = df_temp[df_temp["variable"] == previous_var]["value"].values[0]
        new_rows.append(new_row)
df_periods = pd.DataFrame(new_rows)
In my real data, I have many more than 3 variables: sometimes up to 100. Additionally, all variables are present for all periods (the columns always exist), although some of them can contain NaNs. I can also have a lot of respondents: as many as 50,000, for example. Note that some variables start with "h" instead of "r", and others with "s" (not present in this example).
My question: is there a faster way of doing this transformation? Every time I want to reshape the data into this t vs. t-1 format for the variables I include in variable_names, I have to wait a long time. I believe there must be a better way of doing this. I appreciate your help, thank you.
There are many ways to approach this. wide_to_long is an option, but you would need to pre-process the column names, since it expects the stubnames as prefixes, not suffixes; see the sketch just below.
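For completeness, a sketch of that pre-processing route; the renaming regex and the hardcoded stubnames are assumptions based on this sample:

import re
import pandas as pd

# move the wave number from prefix to suffix, e.g. 'r1age' -> 'age_1';
# columns that don't match the pattern (like 'respondent_id') are left untouched
renamed = df.rename(columns=lambda c: re.sub(r'^[rhs](\d+)(\D+)$', r'\2_\1', c))
long_df = pd.wide_to_long(renamed, stubnames=['age', 'smoke', 'income'],
                          i='respondent_id', j='period', sep='_')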
Instead, I'd suggest using a MultiIndex and stack; here is an example that doesn't require knowing the stubnames:
# set aside respondent_id and create a MultiIndex
tmp = df.set_index('respondent_id')
tmp.columns = pd.MultiIndex.from_frame(tmp.columns.str.extract(r'[rh](\d+)(\D+)'),
                                       names=['t_1_period', None])
# reshape
tmp = tmp.stack(0, future_stack=True)
# concatenate the long format with a shifted version of itself
out = (pd.concat([tmp.groupby(level=0).shift(-1), tmp], keys=['t', 't_1'], axis=1)
         .sort_index(axis=1, level=1, sort_remaining=False)
       )
# flatten MultiIndex
out.columns = out.columns.map('_'.join)
out.reset_index(inplace=True)
# drop each respondent's last period (its shifted "t" values are all NaN)
out = out[out['respondent_id'].duplicated(keep='last')].convert_dtypes()
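Note: future_stack=True was introduced in pandas 2.1; unlike the legacy behavior, it does not drop rows whose stacked values are all NaN, which matters here since some of your variables can contain NaNs. On older versions, tmp.stack(0, dropna=False) should behave similarly.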
Output:
    respondent_id  t_1_period  t_age  t_1_age  t_income  t_1_income  t_smoke  t_1_smoke
0           16178           1     38       35        62          60        1          1
1           16178           2     41       38        68          62        1          1
2           16178           3     44       41        70          68        1          1
4          161719           1     68       65        52          50        0          0
5          161719           2     71       68        54          52        0          0
6          161719           3     74       71        56          54        1          0
8          161720           1     50       47        82          80        1          0
9          161720           2     53       50        85          82        0          1
10         161720           3     56       53        87          85        1          0
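Since some of your real columns start with "s" as well, the extraction pattern generalizes directly; a sketch:

tmp.columns = pd.MultiIndex.from_frame(tmp.columns.str.extract(r'[rhs](\d+)(\D+)'),
                                       names=['t_1_period', None])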