Logo

Andrew Schachter

LinkedIn
Resume
GitHub

Part 1: EDA

Insert cells as needed below to write a short EDA/data section that summarizes the data for someone who has never opened it before.

import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from statsmodels.formula.api import ols as sm_ols
import matplotlib.pyplot as plt
from statsmodels.iolib.summary2 import summary_col
housing = pd.read_csv('input_data2/housing_train.csv')
housing
parcel v_MS_SubClass v_MS_Zoning v_Lot_Frontage v_Lot_Area v_Street v_Alley v_Lot_Shape v_Land_Contour v_Utilities ... v_Pool_Area v_Pool_QC v_Fence v_Misc_Feature v_Misc_Val v_Mo_Sold v_Yr_Sold v_Sale_Type v_Sale_Condition v_SalePrice
0 1056_528110080 20 RL 107.0 13891 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 1 2008 New Partial 372402
1 1055_528108150 20 RL 98.0 12704 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 1 2008 New Partial 317500
2 1053_528104050 20 RL 114.0 14803 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 6 2008 New Partial 385000
3 2213_909275160 20 RL 126.0 13108 Pave NaN IR2 HLS AllPub ... 0 NaN NaN NaN 0 6 2007 WD Normal 153500
4 1051_528102030 20 RL 96.0 12444 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 11 2008 New Partial 394617
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1936 2524_534125210 190 RL 79.0 13110 Pave NaN IR1 Lvl AllPub ... 0 NaN MnPrv NaN 0 7 2006 WD Normal 146500
1937 2846_909131125 190 RH NaN 7082 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 7 2006 WD Normal 160000
1938 2605_535382020 190 RL 60.0 10800 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2006 ConLD Normal 160000
1939 1516_909101180 190 RL 55.0 5687 Pave Grvl Reg Bnk AllPub ... 0 NaN NaN NaN 0 3 2008 WD Normal 135900
1940 1387_905200100 190 RL 60.0 12900 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 1 2008 WD Alloca 95541

1941 rows × 81 columns

Takeaways

Part 2: Running Regressions

Run these regressions on the RAW data, even if you found data issues that you think should be addressed.

Insert cells as needed below to run these regressions. Note that $i$ is indexing a given house, and $t$ indexes the year of sale.

  1. $\text{Sale Price}_{i,t} = \alpha + \beta_1 * \text{v_Lot_Area}$
  2. $\text{Sale Price}_{i,t} = \alpha + \beta_1 * log(\text{v_Lot_Area})$
  3. $log(\text{Sale Price}_{i,t}) = \alpha + \beta_1 * \text{v_Lot_Area}$
  4. $log(\text{Sale Price}_{i,t}) = \alpha + \beta_1 * log(\text{v_Lot_Area})$
  5. $log(\text{Sale Price}_{i,t}) = \alpha + \beta_1 * \text{v_Yr_Sold}$
  6. $log(\text{Sale Price}_{i,t}) = \alpha + \beta_1 * (\text{v_Yr_Sold==2007})+ \beta_2 * (\text{v_Yr_Sold==2008})$
  7. Choose your own adventure: Pick any five variables from the dataset that you think will generate good R2. Use them in a regression of $log(\text{Sale Price}_{i,t})$
    • Tip: You can transform/create these five variables however you want, even if it creates extra variables. For example: I’d count Model 6 above as only using one variable: v_Yr_Sold.
    • I got an R2 of 0.877 with just “5” variables. How close can you get? I won’t be shocked if someone beats that!

Bonus formatting trick: Instead of reporting all regressions separately, report all seven regressions in a single table using summary_col.

reg1 = sm_ols("v_SalePrice ~ v_Lot_Area", data=housing).fit()
reg2 = sm_ols("v_SalePrice ~ np.log(v_Lot_Area)", data=housing).fit()
reg3 = sm_ols("np.log(v_SalePrice) ~ v_Lot_Area", data=housing).fit()
reg4 = sm_ols("np.log(v_SalePrice) ~ np.log(v_Lot_Area)", data=housing).fit()
reg5 = sm_ols("np.log(v_SalePrice) ~ v_Yr_Sold", data=housing).fit()
reg6 = sm_ols("np.log(v_SalePrice) ~ v_Yr_Sold==2007 + v_Yr_Sold==2008", data=housing).fit()
reg7 = sm_ols("np.log(v_SalePrice) ~ v_Lot_Area + v_Yr_Sold + v_Lot_Config + v_Neighborhood + v_Overall_Qual", data=housing).fit()
# now I'll format an output table
# I'd like to include extra info in the table (not just coefficients)
info_dict={'R-squared' : lambda x: f"{x.rsquared:.2f}",
           'Adj R-squared' : lambda x: f"{x.rsquared_adj:.2f}",
           'No. observations' : lambda x: f"{int(x.nobs):d}"} 

# This summary col function combines a bunch of regressions into one nice table
print('='*107)
print('                  y = Sale  Price if not specified, log(Sale Price else)')
print(summary_col(results=[reg1,reg2,reg3,reg4,reg5,reg6,reg7], # list the result obj here
                  float_format='%0.2f',
                  stars = True, # stars are easy way to see if anything is statistically significant
                  model_names=['1','2',' 3','4','5','6','7'], # these are bad names, lol. Usually, just use the y variable name
                  info_dict=info_dict,
                  regressor_order=[ 'v_SalePrice','v_Lot_Area','np.log(v_Lot_Area)','v_yr_Sold','v_Lot_Config',
                                  'v_Neighborhood','v_Overall_Qual']
                  )
     )
===========================================================================================================
                  y = Sale  Price if not specified, log(Sale Price else)

===============================================================================================
                               1             2           3       4       5       6        7    
-----------------------------------------------------------------------------------------------
v_Lot_Area                2.65***                    0.00***                           0.00*** 
                          (0.23)                     (0.00)                            (0.00)  
np.log(v_Lot_Area)                     56028.17***            0.29***                          
                                       (3315.14)              (0.02)                           
v_Overall_Qual                                                                         0.17*** 
                                                                                       (0.00)  
Intercept                 154789.55*** -327915.80*** 11.89*** 9.41*** 22.29   12.02*** 20.82*  
                          (2911.59)    (30221.35)    (0.01)   (0.15)  (22.94) (0.02)   (10.94) 
v_Neighborhood[T.SawyerW]                                                              -0.00   
                                                                                       (0.05)  
v_Neighborhood[T.NAmes]                                                                -0.08*  
                                                                                       (0.05)  
v_Neighborhood[T.NPkVill]                                                              -0.19** 
                                                                                       (0.08)  
v_Neighborhood[T.NWAmes]                                                               -0.00   
                                                                                       (0.05)  
v_Neighborhood[T.NoRidge]                                                              0.27*** 
                                                                                       (0.05)  
v_Neighborhood[T.NridgHt]                                                              0.18*** 
                                                                                       (0.05)  
v_Neighborhood[T.OldTown]                                                              -0.21***
                                                                                       (0.05)  
v_Neighborhood[T.SWISU]                                                                -0.13** 
                                                                                       (0.06)  
v_Neighborhood[T.Sawyer]                                                               -0.08*  
                                                                                       (0.05)  
v_Neighborhood[T.MeadowV]                                                              -0.26***
                                                                                       (0.06)  
v_Neighborhood[T.Somerst]                                                              0.02    
                                                                                       (0.05)  
v_Neighborhood[T.StoneBr]                                                              0.13**  
                                                                                       (0.05)  
v_Neighborhood[T.Timber]                                                               0.09*   
                                                                                       (0.05)  
v_Neighborhood[T.Veenker]                                                              0.10    
                                                                                       (0.06)  
v_Yr_Sold                                                             -0.01            -0.00   
                                                                      (0.01)           (0.01)  
v_Yr_Sold == 2007[T.True]                                                     0.03             
                                                                              (0.02)           
v_Neighborhood[T.Mitchel]                                                              -0.04   
                                                                                       (0.05)  
v_Neighborhood[T.Landmrk]                                                              -0.19   
                                                                                       (0.20)  
v_Neighborhood[T.BrDale]                                                               -0.42***
                                                                                       (0.06)  
v_Neighborhood[T.IDOTRR]                                                               -0.33***
                                                                                       (0.05)  
v_Lot_Config[T.CulDSac]                                                                0.04*   
                                                                                       (0.02)  
v_Lot_Config[T.FR2]                                                                    -0.06*  
                                                                                       (0.03)  
v_Lot_Config[T.FR3]                                                                    -0.01   
                                                                                       (0.06)  
v_Lot_Config[T.Inside]                                                                 0.01    
                                                                                       (0.01)  
v_Neighborhood[T.Blueste]                                                              -0.22** 
                                                                                       (0.10)  
v_Yr_Sold == 2008[T.True]                                                     -0.01            
                                                                              (0.02)           
v_Neighborhood[T.BrkSide]                                                              -0.21***
                                                                                       (0.05)  
v_Neighborhood[T.ClearCr]                                                              0.03    
                                                                                       (0.06)  
v_Neighborhood[T.CollgCr]                                                              0.01    
                                                                                       (0.05)  
v_Neighborhood[T.Crawfor]                                                              0.07    
                                                                                       (0.05)  
v_Neighborhood[T.Edwards]                                                              -0.16***
                                                                                       (0.05)  
v_Neighborhood[T.Gilbert]                                                              -0.02   
                                                                                       (0.05)  
v_Neighborhood[T.Greens]                                                               -0.21** 
                                                                                       (0.10)  
v_Neighborhood[T.GrnHill]                                                              0.30**  
                                                                                       (0.14)  
R-squared                 0.07         0.13          0.06     0.13    0.00    0.00     0.78    
R-squared Adj.            0.07         0.13          0.06     0.13    -0.00   0.00     0.78    
R-squared                 0.07         0.13          0.06     0.13    0.00    0.00     0.78    
Adj R-squared             0.07         0.13          0.06     0.13    -0.00   0.00     0.78    
No. observations          1941         1941          1941     1941    1941    1941     1941    
===============================================================================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01

Part 3: Regression interpretation

Insert cells as needed below to answer these questions. Note that $i$ is indexing a given house, and $t$ indexes the year of sale.

  1. If you didn’t use the summary_col trick, list $\beta_1$ for Models 1-6 to make it easier on your graders.
  2. Interpret $\beta_1$ in Model 2.
  3. Interpret $\beta_1$ in Model 3.
    • HINT: You might need to print out more decimal places. Show at least 2 non-zero digits.
  4. Of models 1-4, which do you think best explains the data and why?
  5. Interpret $\beta_1$ In Model 5
  6. Interpret $\alpha$ in Model 6
  7. Interpret $\beta_1$ in Model 6
  8. Why is the R2 of Model 6 higher than the R2 of Model 5?
  9. What variables did you include in Model 7?
  10. What is the R2 of your Model 7?
  11. Speculate (not graded): Could you use the specification of Model 6 in a predictive regression?
  12. Speculate (not graded): Could you use the specification of Model 5 in a predictive regression?
#1
print("N/A")
N/A
#2
reg2.summary()
y10k = reg2.params[0]+np.log(10000)*reg2.params[1]
y20k = reg2.params[0]+np.log(10100)*reg2.params[1]
print(f"""Reg 2: 
intercept              :  {reg2.params[0]}
beta1:                    {reg2.params[1]}
y at Lot Area == 10,000: {y10k}
y at Lot Area == 10,100: {y20k}
Going from Lot Area 10k to 10100, reg2 predicts Sale Price changes: {y20k-y10k}
""")
Reg 2: 
intercept              :  -327915.80232023844
beta1:                    56028.16996046535
y at Lot Area == 10,000: 188122.7134345788
y at Lot Area == 10,100: 188680.2122627829
Going from Lot Area 10k to 10100, reg2 predicts Sale Price changes: 557.4988282041159

Model 2: Intercept –> A Sale Price of -327915.80. if log(Lot Area)=0. “A 1% increase in Lot Area is associated with an increase of 557.50 in Sales Price.” @X=10,000, E(y) is 188122.71 10,000 to 10,100 –> Sales Price increases 557.50

#3
reg3.summary()
y10k = reg3.params[0]+10000*reg3.params[1]
y20k = reg3.params[0]+10001*reg3.params[1]
print(f"""Reg 3: 
intercept              :  {reg3.params[0]}
beta1:                    {reg3.params[1]}
log(y) at Lot Area == 10,000: {y10k}
log(y) at Lot Area == 10,001: {y20k}
Going from Lot Area 10k to 10,001, reg3 predicts log(Sale Price) changes: {y20k-y10k}
""")
Reg 3: 
intercept              :  11.89407251466273
beta1:                    1.3092338465836551e-05
log(y) at Lot Area == 10,000: 12.024995899321095
log(y) at Lot Area == 10,001: 12.025008991659561
Going from Lot Area 10k to 10,001, reg3 predicts log(Sale Price) changes: 1.3092338466691444e-05

Model 3: “A 1 unit increase in Lot Area is associated with a PROPORTIONAL increase of 0.0013% in Sales Price.” Lot Area 10000 to 10001 –> Sales Price increase 0.0013%

#4

I think Model 2 best explains the data because it shows how a proportional increase in lot area affects sales prices which is exactly what we are looking for in this scenario.

#5
reg5.summary()
y2006 = reg5.params[0]+2006*reg5.params[1]
y2007 = reg5.params[0]+2007*reg5.params[1]
print(f"""Reg 5: 
intercept              :  {reg5.params[0]}
beta1:                    {reg5.params[1]}
y at Year Sold == 2006: {y2006}
y at Year Sold == 2007: {y2007}
Going from Year Sold 2006 to 2007, reg5 predicts Sale Price changes: {y2007-y2006}
""")
Reg 5: 
intercept              :  22.293213132062135
beta1:                    -0.005114348195977281
y at Year Sold == 2006: 12.03383065093171
y at Year Sold == 2007: 12.028716302735733
Going from Year Sold 2006 to 2007, reg5 predicts Sale Price changes: -0.005114348195977669

Model 5: “A 1 year increase in Year Sold is associated with a PROPORTIONAL increase of 0.51% in interest rates.” Year Sold 2006 to 2007 –> Sales Price increase 0.51%

#6

Alpha is the intercept so when the dependent variable is equal to zero alpha is equal to the independent variable.

#7
reg6.summary()
y2007 = reg6.params[0]+2007*reg5.params[1]
y2008 = reg6.params[0]+2008*reg5.params[1]
print(f"""Reg 6: 
intercept              :  {reg6.params[0]}
beta1:                    {reg6.params[1]}
y at Year Sold == 2007: {y2007}
y at Year Sold == 2008: {y2008}
Going from Year Sold 2007 to 2008, reg6 predicts Sale Price changes: {y2008-y2007}
""")
Reg 6: 
intercept              :  12.022869210751955
beta1:                    0.02559031997164936
y at Year Sold == 2007: 1.758372381425552
y at Year Sold == 2008: 1.7532580332295744
Going from Year Sold 2007 to 2008, reg6 predicts Sale Price changes: -0.005114348195977669

Model 6: “A 1 year increase in Year Sold is associated with a PROPORTIONAL increase of 0.51% in interest rates.” Year Sold 2007 to 2008 –> Sales Price increase 0.51%

#8

Unfortunately mine is not so I can’t really comment with numbers but if I had to guess I would say it is because housing prices climbed so much in 2008 so buying a house in 2007 versus 2008 is highly correlated with a higher price in 2008.

#9

The variables I included in my Model 7 were v_Lot_Area, v_Yr_Sold, v_Lot_Config, v_Neighborhood, and v_Overall_Qual

#10

0.78

#11

Yes because buying a house in 2008 as opposed to 2007 clearly correlate to higher sales prices

#12

Probably not because buying a house in 2007 vs 2006 did not necessarily correlate to higher prices.