Tags: python, regression, panel-data

Any efficient way to build a regression model on panel data?


I have two-dimensional panel data consisting of the frequency of certain crime types in each region and the corresponding house prices over the years. I want to understand the possible association between crime frequency in certain regions and house price fluctuation. Initially I tried linear regression, but it didn't work well. Now I want to try PCA on my data, but I still can't get meaningful results from it. How can I perform an efficient PCA on panel data for the purpose of regression? Is there any efficient workaround to make this happen? Thanks.

data:

Because my data has many columns, it is a bit difficult to make a fully reproducible example here, so here is what the panel data looks like:

(screenshots of the panel data)

Here is a cloud link where you can browse the input panel data: example data snippet.

update: my attempt:

Since @flyingmeatball pointed out that using PCA is not a good idea, I tried a simple linear regression, but it didn't help me capture the relation between crime frequencies and house prices. Here is what I did:

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
import re
import urllib.request
import pandas as pd

# download data from cloud
u = "https://filebin.net/ml0sjn455gr8pvh3/crime_realEstate?t=7dkm15wq"
urllib.request.urlretrieve(u, "crime_realEstate.csv")

# or just manually download data first and read
crime_realEstate = pd.read_csv('crime_realEstate.csv')
cols_2012 = crime_realEstate.filter(regex='_2012').columns
crime_realEstate['Area_Name']=crime_realEstate['Area_Name'].apply(lambda x: re.sub(' ', '_', str(x)))
regDF_2012 = crime_realEstate[cols_2012]
regDF_2012 = regDF_2012.assign(community_code=crime_realEstate['community_area'])
regDF_2012.dropna(inplace=True)
X_feats = regDF_2012.drop(['Avg_Price_2012'], axis=1)
y_label = regDF_2012['Avg_Price_2012'].values

poly = PolynomialFeatures(degree=2)
sc_y = StandardScaler()
X = poly.fit_transform(X_feats)
# crime counts are non-negative, so log1p avoids log(0) problems
X = np.log1p(X)
# log-transform the price first, then standardize (taking the log of an
# already standardized value fails because standardized values can be negative)
y = sc_y.fit_transform(np.log1p(y_label).reshape(-1, 1)).flatten()
regModel = LinearRegression()
regModel.fit(X, y)

The above code doesn't help me, because I want to see which features contributed to house price fluctuation over the years. Any thoughts on how to make this happen?
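
For completeness, here is how one could map the fitted coefficients back to the generated feature names (just a sketch built on the poly and regModel objects above; get_feature_names_out needs scikit-learn 1.0+, older versions have get_feature_names instead):

# pair each polynomial feature name with its fitted coefficient
feat_names = poly.get_feature_names_out(X_feats.columns)
coefs = pd.Series(regModel.coef_, index=feat_names).sort_values()
print(coefs.head(10))  # strongest negative associations
print(coefs.tail(10))  # strongest positive associations

Even with that, a single year's cross-section doesn't capture the fluctuation along the years that I am after.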

goal:

What I am trying to achieve is to build a model that explains the dynamics between crime frequency in certain regions and the respective house price fluctuation. Is there any efficient workaround to make this happen?

update:

If PCA is not a good idea, is there any regression model that can capture the relation between crime frequencies in certain community areas and house price fluctuation? Any ideas?


Solution

  • A couple thoughts:

    1) Please post complete code. I don't see where crime_realEstate is defined anywhere. If you leave out the line where you read your data into that variable, it makes it really hard to reproduce your error, and you're less likely to get help. Also, you should organize all of your import statements so they are at the top of your code. It isn't really a functional requirement, more of a convention that everyone expects and that makes the code easier to read.

    2) When you reference panel data, are you really talking about a pandas DataFrame? That is sort of the "typical" way to store this kind of stuff for analysis. You may want to get in the habit of referring to data as dataframes so it's clearer to your audience. You should also post the full error traceback so we can see what line of code is bombing exactly.

    3) I think you may be misunderstanding PCA, or at least what it is for. PCA (principal component analysis) is a data transformation method where you capture the variation that exists across multiple variables and restate the data as a smaller number of components that explain the same amount (or less, depending on how many components you keep) of variability. Once you run PCA, you won't be able to see which features are contributing to crime, because they will have been replaced by entirely new components. If it is important to identify the features that are correlated with crime, then PCA is a bad idea (see the sketch after this list).

    Please fix items above.
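
    To illustrate point 3, here is a minimal sketch (using made-up crime counts; the column names are hypothetical) of how PCA replaces the original columns with components:

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # toy data standing in for per-area crime counts (hypothetical columns)
    rng = np.random.RandomState(0)
    toy = pd.DataFrame(rng.poisson(5, size=(20, 4)),
                       columns=['theft', 'assault', 'burglary', 'robbery'])

    # PCA restates the four columns as two components, each a linear mix of all four
    pca = PCA(n_components=2)
    components = pca.fit_transform(StandardScaler().fit_transform(toy))
    print(components.shape)               # (20, 2) - the original feature names are gone
    print(pca.explained_variance_ratio_)  # share of variance captured by each component

    Once you are working with those components, you can no longer point at theft or assault individually as a driver, which is exactly the problem described above.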

    EDIT

    I'm not saying PCA is wrong, I'm just saying that the question you asked above ("how do I apply PCA and why is my code bombing"), isn't really the right question. PCA should be used if you think that you have many correlated variables that need to be reduced to a lower level of dimensionality. I wouldn't start there though - see what kind of accuracy you can get without doing that. You've now reformulated to a much broader question of "how do I make a predictive model for this data, preferably using a regression?", which should probably go to https://datascience.stackexchange.com/ instead, but I'll give you a starting point of how I would approach coding that solution.

    First - PCA is probably not the ideal starting point because from just looking at the data/columns, your problem isn't dimensionality. You basically have 10 different crimes over 5 years. You also only have 58 different rows...or is that just the sample data? Also, your data is a bit weird - you have the same prices for multiple rows, but the crimes differ. I can't tell if it's just because you're posting sample data. If this is indeed the full dataset, stop your analysis now and get more data/go do something else.

    I made some executive decisions about how I would approach the problem, all of them just for the purpose of demonstrating how to code the regression. I summed crime across all years (you might want the average? The highest? The change over time? Those are all design decisions for you). My target metric was the change in price from 2012 to 2016, the timeframe for which you have crime data. I normalized crime counts by type of crime and didn't scale the target variable.

    Here's how I would start:

    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import r2_score
    import pandas as pd
    
    # Load data
    filePath = 'L:\\crime_realEstate.txt'
    crime_df = pd.read_csv(filePath, sep = '\t').drop(['Unnamed: 0','community_area'],axis = 1)
    
    #calculate price change between 2016 and 2012 - same timeframe you have crime data
    crime_df['price_change'] = crime_df['Avg_Price_2016'] - crime_df['Avg_Price_2012']
    crime_df.drop(['Avg_Price_2012','Avg_Price_2013','Avg_Price_2014','Avg_Price_2015','Avg_Price_2016','Avg_Price_2017','Avg_Price_2018','Avg_Price_2019'],axis = 1,inplace = True)
    
    #split the year out of column names like 'theft_2012' into a (year, crime) MultiIndex
    crime_df.columns = pd.MultiIndex.from_tuples(
        [(x.split('_20')[1] if '_20' in x else x, x.split('_20')[0]) for x in crime_df.columns])
    #sum across years for the crime fields
    crime_df = crime_df.groupby(level=1, axis=1).sum()
    
    #split out tgt var
    price_growth = crime_df['price_change']
    
    #create dummy variable from area name
    dummy_df = pd.get_dummies(crime_df['Area_Name'])
    
    
    crime_df.drop(['Area_Name','price_change'],axis = 1,inplace = True)
    
    #scales crime variables
    scaler = StandardScaler()
    crime_df[crime_df.columns] = scaler.fit_transform(crime_df)
    
    crime_df = pd.merge(crime_df,dummy_df,left_index = True, right_index = True)
    
    regModel = LinearRegression()
    
    #split to training testing
    train_df = crime_df.sample(frac=0.8,random_state=200)
    test_df = crime_df.drop(train_df.index)
    
    regModel.fit(train_df, price_growth[train_df.index])
    
    #R2 on the held-out test rows
    print(r2_score(price_growth.drop(train_df.index), regModel.predict(test_df)))
    # 0.7355837132941521
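
    If you want to see which variables the fit leans on, a quick follow-up (just a sketch reusing the train_df and regModel from above) is to pair the fitted coefficients with their column names:

    # pair each coefficient with its column and sort by absolute size
    coef_series = pd.Series(regModel.coef_, index=train_df.columns)
    print(coef_series.reindex(coef_series.abs().sort_values(ascending=False).index).head(15))

    Keep in mind that the crime columns were standardized while the area dummies were not, so coefficient magnitudes aren't directly comparable across those two groups.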
    

    Simpler answer to your analysis: wherever the white people live in Chicago, the property is expensive.