I hope this question is clear but I am looking into the Zillow Home Sales Data and having an issue being able to merge all of the months from certain years to a newly declared "Year" variable. This should essentially bucket all of the data that have Jan-YY, Feb-YY, Mar-YY...etc to just YY.
I have tried Pandas built in functions such as Stack() & Pivot() and these have not seemed to work.
If there is no feasible way to do this, what are my alternatives? Thanks in Advance!
Example: take Column1 = '1/31/1996' and Column2 = '2/28/1996'…etc. and Column12 = '12/31/1996' and combine into a new Column called Y1996. This will be easier to analyze rather than a monthly breakdown.
My code:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline
zil = pd.read_csv('zillow.csv')
df_zil = pd.DataFrame(df_zil)
df_zil.head(4)
#My attempt at merging into one
y1996 = (df_zil['1/31/1996'] + df_zil['3/31/1996'] + df_zil['4/30/1996'] + df_zil['5/31/1996'] +
df_zil['6/30/1996'] + df_zil['7/31/1996'] + df_zil['8/31/1996'] + df_zil['9/30/1996'] +
df_zil['10/31/1996'] + df_zil['11/30/1996'] + df_zil['12/31/1996'])
Screen shot of how data is formatted via excel -- Starts on Column I
Reference to Zillow Data: https://www.zillow.com/research/data/
I think you need the opposite of a pivot - a melt. You have your data in "wide" format, and it's easier to do this summary if you convert the data to "tall" format. Once you have the data in tall format, you can use a groupby to sum up values within the same year.
I downloaded the House Inventory and Sales dataset, and I wrote a short program to sum up all values in the same year.
Code:
import pandas as pd
df = pd.read_csv("Metro_invt_fs_uc_sfrcondo_smoothed_month.csv")
# Take all of the columns after the index and convert them into additional rows
df = df.melt(id_vars=["RegionID", "SizeRank", "RegionName", "RegionType", "StateName"], var_name="Date")
# Drop date, but keep year
df["Year"] = pd.to_datetime(df["Date"]).dt.year
df = df.drop("Date", axis="columns")
# Aggregate each year
df = df.groupby(["RegionID", "SizeRank", "RegionName", "RegionType", "StateName", "Year"], as_index=False).sum()
print(df)
Output:
RegionID SizeRank RegionName RegionType StateName Year value
0 394304 74 Akron, OH Msa OH 2017 3576.0
1 394304 74 Akron, OH Msa OH 2018 42625.0
2 394304 74 Akron, OH Msa OH 2019 39078.0
3 394304 74 Akron, OH Msa OH 2020 21532.0
4 394308 60 Albany, NY Msa NY 2017 2969.0
.. ... ... ... ... ... ... ...
475 753906 75 North Port-Sarasota-Bradenton, FL Msa FL 2020 73953.0
476 753924 54 Urban Honolulu, HI Msa HI 2017 3735.0
477 753924 54 Urban Honolulu, HI Msa HI 2018 50079.0
478 753924 54 Urban Honolulu, HI Msa HI 2019 57413.0
479 753924 54 Urban Honolulu, HI Msa HI 2020 35522.0
[480 rows x 7 columns]