python r statistics linear-regression

Linear Regression Model for Quote Data


I would like to build a linear regression model to determine the influence of various parameters on quote prices. The data of the quotes were collected over 10 years.

[Figure: density plot of quote prices over 10 years]

y = Price

X = [System size (int), ZIP, Year, module_manufacturer, module_name, inverter_manufacturer, inverter_name, battery storage (binary), number of installers/offerers in the region (int), installer_density, new_construction (binary), self_installation (binary), household density]

Questions:

  1. What type of regression model is suitable for this dataset?
  2. Due to technological progress, quote prices decrease over the years. How can I account for the different years in the model? I found some examples where years were treated as binary (dummy) variables. Another option would be a separate regression model for each year. Is there a way to combine these separate models?
  3. Is the dataset a type of panel data?

Unfortunately, I have not yet found any information that could explicitly help me with my data. But maybe I didn't use the right search terms. I would be very happy about any suggestions that nudge me in the right direction.


Solution

  • Suppose you have a data.frame called data with columns price, system_size, zip, year, battery_storage, etc. Then you can start with an ordinary multiple linear regression:

    # factor() makes sure zip is treated as categorical even if it is stored as a number
    lm(price ~ system_size + factor(zip) + year + battery_storage, data = data)
    

    Because year is in the model, changes over time are taken into account. If you want to remove batch effects (e.g. differences between regions or ZIP codes) and only care about modelling the price once the effect of location is accounted for, you can fit a linear mixed model with a random intercept for each ZIP code:

    lmerTest::lmer(price ~ system_size + year + battery_storage + (1|zip), data = data)
    

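    If you expect the effect of one variable to depend on another, e.g. the price per unit of system_size changing over the years, you might want to include an interaction term like year:system_size in your formula. Note that this is not the same as the two predictors merely being correlated. As a rule of thumb, you need at least about 10 samples per estimated coefficient to get a reasonable fit, and each level of a categorical variable such as zip counts as its own coefficient. If you have more variables than your sample size supports, do a variable selection first.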
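
To make the options for handling year more concrete, here is a minimal sketch in base R. It assumes simulated data: the data frame `quotes`, all of its values, and the assumed price trend are made up for illustration and do not come from the question. It compares year as a numeric trend, year as dummy variables, and a year:system_size interaction.

```r
# Simulated stand-in for the quote data (hypothetical values, for illustration only).
set.seed(1)
n <- 500
quotes <- data.frame(
  system_size     = runif(n, 3, 15),                       # assumed size range
  year            = sample(2010:2019, n, replace = TRUE),  # assumed 10-year span
  battery_storage = rbinom(n, 1, 0.3)
)

# Assumed data-generating process: the price per unit of size falls every year.
quotes$price <- with(quotes,
  2000 + (1800 - 120 * (year - 2010)) * system_size +
    4000 * battery_storage + rnorm(n, sd = 1500))

# (a) Year as a numeric trend: a single slope, i.e. a constant change per year.
fit_trend <- lm(price ~ system_size + year + battery_storage, data = quotes)

# (b) Year as dummy variables: a separate level shift for each year.
fit_dummy <- lm(price ~ system_size + factor(year) + battery_storage, data = quotes)

# (c) Interaction: the effect of system_size is allowed to change with year.
#     system_size * year expands to system_size + year + system_size:year.
fit_inter <- lm(price ~ system_size * year + battery_storage, data = quotes)

# F-tests for nested models: does the extra flexibility improve the fit?
print(anova(fit_trend, fit_dummy))
print(anova(fit_trend, fit_inter))

# Estimated change in the system_size effect per additional year.
print(coef(fit_inter)["system_size:year"])
```

Fitting one model with factor(year), or with an interaction, is one way to combine what would otherwise be separate per-year regressions: all years share the same data set and residual variance, and only the terms you let vary by year are estimated separately.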