Search code examples
pythonpandascsvzillow

Is there a way to combine multiple columns in Pandas that can join all data from certain columns into one column (12 columns to 1)


I hope this question is clear but I am looking into the Zillow Home Sales Data and having an issue being able to merge all of the months from certain years to a newly declared "Year" variable. This should essentially bucket all of the data that have Jan-YY, Feb-YY, Mar-YY...etc to just YY.

I have tried Pandas built in functions such as Stack() & Pivot() and these have not seemed to work.

If there is no feasible way to do this, what are my alternatives? Thanks in Advance!

Example: take Column1 = '1/31/1996' and Column2 = '2/28/1996'…etc. and Column12 = '12/31/1996' and combine into a new Column called Y1996. This will be easier to analyze rather than a monthly breakdown.

My code:

  import pandas as pd
  import numpy as np
  import statsmodels.api as sm
  import matplotlib.pyplot as plt 
  %matplotlib inline

  zil = pd.read_csv('zillow.csv')
  df_zil = pd.DataFrame(df_zil)
  df_zil.head(4)

  #My attempt at merging into one
  y1996 = (df_zil['1/31/1996'] + df_zil['3/31/1996'] + df_zil['4/30/1996'] + df_zil['5/31/1996'] + 
  df_zil['6/30/1996'] + df_zil['7/31/1996'] + df_zil['8/31/1996'] + df_zil['9/30/1996'] + 
  df_zil['10/31/1996'] + df_zil['11/30/1996'] + df_zil['12/31/1996'])

Screen shot of how data is formatted via excel -- Starts on Column I

Reference to Zillow Data: https://www.zillow.com/research/data/


Solution

  • I think you need the opposite of a pivot - a melt. You have your data in "wide" format, and it's easier to do this summary if you convert the data to "tall" format. Once you have the data in tall format, you can use a groupby to sum up values within the same year.

    I downloaded the House Inventory and Sales dataset, and I wrote a short program to sum up all values in the same year.

    Code:

    import pandas as pd
    
    df = pd.read_csv("Metro_invt_fs_uc_sfrcondo_smoothed_month.csv")
    # Take all of the columns after the index and convert them into additional rows
    df = df.melt(id_vars=["RegionID", "SizeRank", "RegionName", "RegionType", "StateName"], var_name="Date")
    # Drop date, but keep year
    df["Year"] = pd.to_datetime(df["Date"]).dt.year
    df = df.drop("Date", axis="columns")
    # Aggregate each year
    df = df.groupby(["RegionID", "SizeRank", "RegionName", "RegionType", "StateName", "Year"], as_index=False).sum()
    print(df)
    

    Output:

         RegionID  SizeRank                         RegionName RegionType StateName  Year    value
    0      394304        74                          Akron, OH        Msa        OH  2017   3576.0
    1      394304        74                          Akron, OH        Msa        OH  2018  42625.0
    2      394304        74                          Akron, OH        Msa        OH  2019  39078.0
    3      394304        74                          Akron, OH        Msa        OH  2020  21532.0
    4      394308        60                         Albany, NY        Msa        NY  2017   2969.0
    ..        ...       ...                                ...        ...       ...   ...      ...
    475    753906        75  North Port-Sarasota-Bradenton, FL        Msa        FL  2020  73953.0
    476    753924        54                 Urban Honolulu, HI        Msa        HI  2017   3735.0
    477    753924        54                 Urban Honolulu, HI        Msa        HI  2018  50079.0
    478    753924        54                 Urban Honolulu, HI        Msa        HI  2019  57413.0
    479    753924        54                 Urban Honolulu, HI        Msa        HI  2020  35522.0
    
    [480 rows x 7 columns]