Insert cells as needed below to write a short EDA/data section that summarizes the data for someone who has never opened it before.
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from statsmodels.formula.api import ols as sm_ols
import matplotlib.pyplot as plt
from statsmodels.iolib.summary2 import summary_col
housing = pd.read_csv('input_data2/housing_train.csv')
housing
|   | parcel | v_MS_SubClass | v_MS_Zoning | v_Lot_Frontage | v_Lot_Area | v_Street | v_Alley | v_Lot_Shape | v_Land_Contour | v_Utilities | ... | v_Pool_Area | v_Pool_QC | v_Fence | v_Misc_Feature | v_Misc_Val | v_Mo_Sold | v_Yr_Sold | v_Sale_Type | v_Sale_Condition | v_SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1056_528110080 | 20 | RL | 107.0 | 13891 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 1 | 2008 | New | Partial | 372402 |
1 | 1055_528108150 | 20 | RL | 98.0 | 12704 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 1 | 2008 | New | Partial | 317500 |
2 | 1053_528104050 | 20 | RL | 114.0 | 14803 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2008 | New | Partial | 385000 |
3 | 2213_909275160 | 20 | RL | 126.0 | 13108 | Pave | NaN | IR2 | HLS | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2007 | WD | Normal | 153500 |
4 | 1051_528102030 | 20 | RL | 96.0 | 12444 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 11 | 2008 | New | Partial | 394617 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1936 | 2524_534125210 | 190 | RL | 79.0 | 13110 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 7 | 2006 | WD | Normal | 146500 |
1937 | 2846_909131125 | 190 | RH | NaN | 7082 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 7 | 2006 | WD | Normal | 160000 |
1938 | 2605_535382020 | 190 | RL | 60.0 | 10800 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2006 | ConLD | Normal | 160000 |
1939 | 1516_909101180 | 190 | RL | 55.0 | 5687 | Pave | Grvl | Reg | Bnk | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 3 | 2008 | WD | Normal | 135900 |
1940 | 1387_905200100 | 190 | RL | 60.0 | 12900 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 1 | 2008 | WD | Alloca | 95541 |
1941 rows × 81 columns
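Before running anything, here is a minimal EDA sketch summarizing the sample size, a few key variables, and missingness (the key_vars list is just an illustrative choice of columns used later in the regressions):
# Quick EDA: sample size, summary stats for key variables, and missing values
print(f"Rows: {housing.shape[0]}, Columns: {housing.shape[1]}")
key_vars = ['v_SalePrice', 'v_Lot_Area', 'v_Lot_Frontage', 'v_Overall_Qual', 'v_Yr_Sold']
print(housing[key_vars].describe().round(1))           # distributions of the variables used below
missing = housing.isna().sum().sort_values(ascending=False)
print(missing[missing > 0].head(10))                    # columns with the most missing values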
Run these regressions on the RAW data, even if you found data issues that you think should be addressed.
Insert cells as needed below to run these regressions. Note that $i$ is indexing a given house, and $t$ indexes the year of sale (v_Yr_Sold).

Bonus formatting trick: Instead of reporting all regressions separately, report all seven regressions in a single table using summary_col.
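For reference, the seven regressions run in the code below can be written as (shortening the v_ variable names, with $u_{i,t}$ as the error term):

$$
\begin{aligned}
(1)\;& \text{SalePrice}_{i,t} = \alpha + \beta_1 \,\text{LotArea}_i + u_{i,t} \\
(2)\;& \text{SalePrice}_{i,t} = \alpha + \beta_1 \log(\text{LotArea}_i) + u_{i,t} \\
(3)\;& \log(\text{SalePrice}_{i,t}) = \alpha + \beta_1 \,\text{LotArea}_i + u_{i,t} \\
(4)\;& \log(\text{SalePrice}_{i,t}) = \alpha + \beta_1 \log(\text{LotArea}_i) + u_{i,t} \\
(5)\;& \log(\text{SalePrice}_{i,t}) = \alpha + \beta_1 \,\text{YrSold}_t + u_{i,t} \\
(6)\;& \log(\text{SalePrice}_{i,t}) = \alpha + \beta_1 \mathbf{1}(\text{YrSold}_t = 2007) + \beta_2 \mathbf{1}(\text{YrSold}_t = 2008) + u_{i,t} \\
(7)\;& \log(\text{SalePrice}_{i,t}) = \alpha + \beta_1 \,\text{LotArea}_i + \beta_2 \,\text{YrSold}_t + \text{LotConfig, Neighborhood, and OverallQual terms} + u_{i,t}
\end{aligned}
$$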
reg1 = sm_ols("v_SalePrice ~ v_Lot_Area", data=housing).fit()                  # 1: level-level
reg2 = sm_ols("v_SalePrice ~ np.log(v_Lot_Area)", data=housing).fit()          # 2: level-log
reg3 = sm_ols("np.log(v_SalePrice) ~ v_Lot_Area", data=housing).fit()          # 3: log-level
reg4 = sm_ols("np.log(v_SalePrice) ~ np.log(v_Lot_Area)", data=housing).fit()  # 4: log-log
reg5 = sm_ols("np.log(v_SalePrice) ~ v_Yr_Sold", data=housing).fit()           # 5: year as a continuous trend
reg6 = sm_ols("np.log(v_SalePrice) ~ v_Yr_Sold==2007 + v_Yr_Sold==2008", data=housing).fit()  # 6: year dummies (2006 omitted)
reg7 = sm_ols("np.log(v_SalePrice) ~ v_Lot_Area + v_Yr_Sold + v_Lot_Config + v_Neighborhood + v_Overall_Qual", data=housing).fit()  # 7: multiple controls
# now I'll format an output table
# I'd like to include extra info in the table (not just coefficients)
info_dict={'R-squared' : lambda x: f"{x.rsquared:.2f}",
'Adj R-squared' : lambda x: f"{x.rsquared_adj:.2f}",
'No. observations' : lambda x: f"{int(x.nobs):d}"}
# This summary col function combines a bunch of regressions into one nice table
print('='*107)
print(' y = Sale Price if not specified, else log(Sale Price)')
print(summary_col(results=[reg1,reg2,reg3,reg4,reg5,reg6,reg7], # list the result obj here
float_format='%0.2f',
stars = True, # stars are easy way to see if anything is statistically significant
model_names=['1','2','3','4','5','6','7'], # simple numeric names; usually you'd just use the y variable name
info_dict=info_dict,
regressor_order=['Intercept','v_Lot_Area','np.log(v_Lot_Area)','v_Yr_Sold','v_Lot_Config',
'v_Neighborhood','v_Overall_Qual']
)
)
===========================================================================================================
y = Sale Price if not specified, else log(Sale Price)
===============================================================================================
1 2 3 4 5 6 7
-----------------------------------------------------------------------------------------------
v_Lot_Area 2.65*** 0.00*** 0.00***
(0.23) (0.00) (0.00)
np.log(v_Lot_Area) 56028.17*** 0.29***
(3315.14) (0.02)
v_Overall_Qual 0.17***
(0.00)
Intercept 154789.55*** -327915.80*** 11.89*** 9.41*** 22.29 12.02*** 20.82*
(2911.59) (30221.35) (0.01) (0.15) (22.94) (0.02) (10.94)
v_Neighborhood[T.SawyerW] -0.00
(0.05)
v_Neighborhood[T.NAmes] -0.08*
(0.05)
v_Neighborhood[T.NPkVill] -0.19**
(0.08)
v_Neighborhood[T.NWAmes] -0.00
(0.05)
v_Neighborhood[T.NoRidge] 0.27***
(0.05)
v_Neighborhood[T.NridgHt] 0.18***
(0.05)
v_Neighborhood[T.OldTown] -0.21***
(0.05)
v_Neighborhood[T.SWISU] -0.13**
(0.06)
v_Neighborhood[T.Sawyer] -0.08*
(0.05)
v_Neighborhood[T.MeadowV] -0.26***
(0.06)
v_Neighborhood[T.Somerst] 0.02
(0.05)
v_Neighborhood[T.StoneBr] 0.13**
(0.05)
v_Neighborhood[T.Timber] 0.09*
(0.05)
v_Neighborhood[T.Veenker] 0.10
(0.06)
v_Yr_Sold -0.01 -0.00
(0.01) (0.01)
v_Yr_Sold == 2007[T.True] 0.03
(0.02)
v_Neighborhood[T.Mitchel] -0.04
(0.05)
v_Neighborhood[T.Landmrk] -0.19
(0.20)
v_Neighborhood[T.BrDale] -0.42***
(0.06)
v_Neighborhood[T.IDOTRR] -0.33***
(0.05)
v_Lot_Config[T.CulDSac] 0.04*
(0.02)
v_Lot_Config[T.FR2] -0.06*
(0.03)
v_Lot_Config[T.FR3] -0.01
(0.06)
v_Lot_Config[T.Inside] 0.01
(0.01)
v_Neighborhood[T.Blueste] -0.22**
(0.10)
v_Yr_Sold == 2008[T.True] -0.01
(0.02)
v_Neighborhood[T.BrkSide] -0.21***
(0.05)
v_Neighborhood[T.ClearCr] 0.03
(0.06)
v_Neighborhood[T.CollgCr] 0.01
(0.05)
v_Neighborhood[T.Crawfor] 0.07
(0.05)
v_Neighborhood[T.Edwards] -0.16***
(0.05)
v_Neighborhood[T.Gilbert] -0.02
(0.05)
v_Neighborhood[T.Greens] -0.21**
(0.10)
v_Neighborhood[T.GrnHill] 0.30**
(0.14)
R-squared 0.07 0.13 0.06 0.13 0.00 0.00 0.78
R-squared Adj. 0.07 0.13 0.06 0.13 -0.00 0.00 0.78
R-squared 0.07 0.13 0.06 0.13 0.00 0.00 0.78
Adj R-squared 0.07 0.13 0.06 0.13 -0.00 0.00 0.78
No. observations 1941 1941 1941 1941 1941 1941 1941
===============================================================================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
Insert cells as needed below to answer these questions. Note that $i$ is indexing a given house, and $t$ indexes the year of sale.
If you used the summary_col trick above, list $\beta_1$ for Models 1-6 to make it easier on your graders.
#1
print("N/A")
N/A
#2
reg2.summary()
y10k = reg2.params[0]+np.log(10000)*reg2.params[1]
y10100 = reg2.params[0]+np.log(10100)*reg2.params[1]  # 10,100 sq ft = a 1% larger lot
print(f"""Reg 2:
intercept : {reg2.params[0]}
beta1: {reg2.params[1]}
y at Lot Area == 10,000: {y10k}
y at Lot Area == 10,100: {y10100}
Going from Lot Area 10k to 10100, reg2 predicts Sale Price changes: {y10100-y10k}
""")
Reg 2:
intercept : -327915.80232023844
beta1: 56028.16996046535
y at Lot Area == 10,000: 188122.7134345788
y at Lot Area == 10,100: 188680.2122627829
Going from Lot Area 10k to 10100, reg2 predicts Sale Price changes: 557.4988282041159
Model 2: The intercept implies a predicted Sale Price of -327,915.80 when log(Lot Area) = 0 (i.e., a one square foot lot), which is not economically meaningful on its own. The slope: "A 1% increase in Lot Area is associated with an increase of about 557.50 in Sale Price." At Lot Area = 10,000, E(y) is 188,122.71; going from 10,000 to 10,100 square feet (a 1% increase), the predicted Sale Price rises by 557.50.
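As a check on the 557.50 figure, the level-log prediction change can be computed directly from the reg2 estimates printed above:

$$
\Delta \widehat{\text{SalePrice}} = \hat{\beta}_1 \left[\log(10{,}100) - \log(10{,}000)\right] = 56{,}028.17 \times \log(1.01) \approx 557.50
$$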
#3
reg3.summary()
y10k = reg3.params[0]+10000*reg3.params[1]
y10001 = reg3.params[0]+10001*reg3.params[1]  # one additional square foot
print(f"""Reg 3:
intercept : {reg3.params[0]}
beta1: {reg3.params[1]}
log(y) at Lot Area == 10,000: {y10k}
log(y) at Lot Area == 10,001: {y10001}
Going from Lot Area 10k to 10,001, reg3 predicts log(Sale Price) changes: {y10001-y10k}
""")
Reg 3:
intercept : 11.89407251466273
beta1: 1.3092338465836551e-05
log(y) at Lot Area == 10,000: 12.024995899321095
log(y) at Lot Area == 10,001: 12.025008991659561
Going from Lot Area 10k to 10,001, reg3 predicts log(Sale Price) changes: 1.3092338466691444e-05
Model 3: A 1 square foot increase in Lot Area is associated with a proportional increase of about 0.0013% in Sale Price. Going from a Lot Area of 10,000 to 10,001, the predicted Sale Price increases by roughly 0.0013%.
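As a check, the standard log-level approximation using the reg3 estimates printed above gives the same number:

$$
\%\Delta \widehat{\text{SalePrice}} \approx 100 \times \hat{\beta}_1 \times \Delta\text{LotArea} = 100 \times (1.31 \times 10^{-5}) \times 1 \approx 0.0013\%
$$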
I think Model 2 best explains the data because it shows how a proportional increase in Lot Area affects Sale Price in dollars, which is exactly what we are looking for in this scenario.
#5
reg5.summary()
y2006 = reg5.params[0]+2006*reg5.params[1]
y2007 = reg5.params[0]+2007*reg5.params[1]
print(f"""Reg 5:
intercept : {reg5.params[0]}
beta1: {reg5.params[1]}
y at Year Sold == 2006: {y2006}
y at Year Sold == 2007: {y2007}
Going from Year Sold 2006 to 2007, reg5 predicts Sale Price changes: {y2007-y2006}
""")
Reg 5:
intercept : 22.293213132062135
beta1: -0.005114348195977281
y at Year Sold == 2006: 12.03383065093171
y at Year Sold == 2007: 12.028716302735733
Going from Year Sold 2006 to 2007, reg5 predicts Sale Price changes: -0.005114348195977669
Model 5: A 1 year increase in Year Sold is associated with an approximate 0.51% decrease in Sale Price (β1 = -0.0051). Going from Year Sold 2006 to 2007, the predicted Sale Price falls by about 0.51%.
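As a check, the per-year effect implied by the reg5 estimates printed above:

$$
\%\Delta \widehat{\text{SalePrice}} \approx 100 \times \hat{\beta}_1 = 100 \times (-0.0051) \approx -0.51\% \text{ per additional year of Year Sold}
$$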
Alpha is the intercept: it is the predicted value of the dependent variable when all of the independent variables equal zero. In the simple model $y_i = \alpha + \beta_1 x_i + u_i$, setting $x_i = 0$ gives $E(y_i) = \alpha$.
#7
reg6.summary()
# NOTE: the next two lines mix reg6's intercept with reg5's slope (reg5.params[1]),
# so the "y at Year Sold" values printed below are not meaningful predictions from reg6.
y2007 = reg6.params[0]+2007*reg5.params[1]
y2008 = reg6.params[0]+2008*reg5.params[1]
print(f"""Reg 6:
intercept : {reg6.params[0]}
beta1: {reg6.params[1]}
y at Year Sold == 2007: {y2007}
y at Year Sold == 2008: {y2008}
Going from Year Sold 2007 to 2008, reg6 predicts Sale Price changes: {y2008-y2007}
""")
Reg 6:
intercept : 12.022869210751955
beta1: 0.02559031997164936
y at Year Sold == 2007: 1.758372381425552
y at Year Sold == 2008: 1.7532580332295744
Going from Year Sold 2007 to 2008, reg6 predicts Sale Price changes: -0.005114348195977669
Model 6: Here β1 is the coefficient on the 2007 dummy, so houses sold in 2007 have log Sale Prices about 0.026 higher than houses sold in 2006 (the omitted year), i.e., roughly 2.6% higher prices. The 2008 dummy is about -0.01, so going from 2007 to 2008 the predicted Sale Price falls by roughly 3-4%.
Unfortunately, mine did not come out right, so I can't really comment with exact numbers, but if I had to guess, I would say it is because housing prices moved so much around 2007-2008 that the year a house was sold is highly correlated with its sale price.
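For completeness, here is a minimal sketch (not the code run above) of how the 2007 and 2008 effects could be read straight off reg6's fitted dummy coefficients, assuming the parameter order shown in the summary table (Intercept, 2007 dummy, 2008 dummy):
# Sketch: recover year effects directly from reg6's dummy coefficients
# (2006 is the omitted baseline year in this specification)
b = reg6.params
log_price_2007 = b[0] + b[1]    # intercept + 2007 dummy
log_price_2008 = b[0] + b[2]    # intercept + 2008 dummy
print(f"2007 vs 2006, approx % change in Sale Price: {100 * b[1]:.2f}%")
print(f"2008 vs 2007, approx % change in Sale Price: {100 * (b[2] - b[1]):.2f}%")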
The variables I included in my Model 7 were v_Lot_Area, v_Yr_Sold, v_Lot_Config, v_Neighborhood, and v_Overall_Qual.
The R-squared of Model 7 is 0.78.
Yes, because buying a house in 2008 as opposed to 2007 clearly correlates with higher sale prices.
Probably not, because buying a house in 2007 versus 2006 did not necessarily correlate with higher prices.