# QUANT homework

## Introduction

Ideally, the goal is to select explanatory variables that build the simplest regression equation capable of producing a good estimate of the response variable (parsimony). This simplifies interpretation of the regression coefficients. Selecting good variables for a regression analysis usually involves some initial research.

To use simple linear regression analysis, there needs to be a linear relation between the response variable and the explanatory variable. Investigate the linear relations by examining:

1. Correlation: the numerical measure of strength of the relation between the explanatory and response variables.
2. Scatter plot of the response variable against the explanatory variable:
   - A linear dot pattern indicates the explanatory variable likely needs to be included in the regression equation.
   - A non-linear dot pattern suggests a transformation of an explanatory variable may be needed.
   - A horizontal dot pattern suggests the variable should not be used in the regression equation.
3. Residual plots: look for a scattered, cloudy dot pattern with consistent variance at each level of the explanatory variable to help confirm a linear relation likely exists. If a pattern is observed, transformations may need to be applied.

## correlation

Response variable (the variable of interest): typically placed on the y-axis. Explanatory variable: typically placed on the x-axis. Choosing the role of the variables depends on how you think the variables are related.

The correlation coefficient (r) quantifies the strength of the linear relation between the two variables. In Excel, use the =CORREL function.

Properties of the correlation coefficient:
- Ranges from -1 to 1; the sign gives the direction of the association.
- Unitless measure; not affected by changes to the center or scale of the variables (depends only on the z-scores).
- Sensitive to outliers.
- Does not imply causation (lurking variables could stand behind the association).
- Check the scatter plot for straightness even if the value of r is high.
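A minimal sketch of these properties, using illustrative data (not from the homework). It computes r as the mean product of z-scores and checks that r matches the built-in estimator and is unchanged by rescaling:

```python
import numpy as np

# Illustrative data (not from the homework): two positively related series.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# r is the mean product of z-scores (population std, ddof=0).
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
r = (zx * zy).mean()

# Matches the built-in estimator (what Excel's =CORREL computes):
assert np.isclose(r, np.corrcoef(x, y)[0, 1])

# Shifting or rescaling a variable leaves r unchanged (unitless measure):
assert np.isclose(r, np.corrcoef(10 * x + 3, y)[0, 1])
print(round(r, 4))
```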

## scatter

Scatterplots display the relation (association) between two quantitative variables, letting the investigator see patterns, trends, and outliers. Examine:

1. Direction of the association: negative or positive.
2. Form of the association: nonlinear vs. linear.
3. Strength of the association: how discernible is the data pattern or trend? Weak, moderate, or strong.

## residual

Question: What line can be used to model the linear data pattern displayed in the scatterplot? The equation of the best-fit line is the regression line (model). The estimates for the response variable made through the regression line are called the predicted values.

Residuals: to find a residual, subtract the predicted value (calculated with the regression equation) from the observed value.
- Negative residual: the regression equation is overestimating.
- Positive residual: the regression equation is underestimating.

The best-fitting (least-squares) line is the one whose sum of squared residuals is smallest.
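The ideas above can be sketched numerically with illustrative data (not from the homework): fit the least-squares line, form the residuals, and verify that any other line gives a larger sum of squared residuals:

```python
import numpy as np

# Illustrative data (not from the homework).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.3, 9.7])

slope, intercept = np.polyfit(x, y, 1)   # least-squares best-fit line
predicted = slope * x + intercept
residuals = y - predicted                # observed minus predicted

# With an intercept in the model, the residuals sum to (about) zero,
# and any other line yields a larger sum of squared residuals:
sse_best = np.sum(residuals ** 2)
sse_other = np.sum((y - ((slope + 0.1) * x + intercept)) ** 2)
assert sse_other > sse_best
```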

## Example1A

Question: Is there a linear relation between the response variable (salary) and housing price? Check:

1. Correlation (=CORREL function in Excel)
2. Scatter plot (select the variables and use the Insert > Charts menu)
3. Residual plot (run the regression in Excel using the Data Analysis toolbar and select Residual Plot)

| Salary (\$10,000) | Housing Price (\$10,000) |
|---|---|
| 1 | 45 |
| 2 | 47 |
| 3 | 57 |
| 4.5 | 59 |
| 5.2 | 60 |
| 8.4 | 65 |
| 11 | 66 |
| 12.4 | 67 |
| 12.5 | 69 |
| 13.1 | 70 |
| 15 | 70 |
| 17 | 72 |
| 22.2 | 73 |
| 22.6 | 74 |
| 25.8 | 76 |

1) Correlation: 0.9086846363. A correlation of approximately 0.91 implies that salary and housing price have a strong, positive, linear relation where housing price increases as salary increases in a systematically predictable way.

2) Scatter plot: the relation appears to be positive and strong. However, there also appears to be a slight bend in the dot pattern, so further investigation is needed to determine whether the trend is linear.

3) Residual plot: there is a pattern (the dot pattern is not scattered) evident in the residual plot. This suggests that some relation beyond a linear one would likely better represent the relation between salary and housing price. Hence a transformation should be performed.
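The correlation step can be reproduced outside Excel; a sketch using the tabulated data:

```python
import numpy as np

# The salary/housing data tabulated above (both in $10,000s).
salary = np.array([1, 2, 3, 4.5, 5.2, 8.4, 11, 12.4, 12.5,
                   13.1, 15, 17, 22.2, 22.6, 25.8])
housing = np.array([45, 47, 57, 59, 60, 65, 66, 67, 69,
                    70, 70, 72, 73, 74, 76], dtype=float)

# 1) Correlation, mirroring Excel's =CORREL:
r = np.corrcoef(salary, housing)[0, 1]
print(round(r, 4))

# 3) Residuals from regressing salary on housing price; a curved
# pattern in these is what motivates the transformation discussed above.
slope, intercept = np.polyfit(housing, salary, 1)
residuals = salary - (slope * housing + intercept)
```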

[Scatter plot of Salary and Housing Price (\$10,000), plotting the data tabulated above]

## Transform1

Goal: make the dot pattern of a scatterplot more nearly linear.

- Calculate transforms for the explanatory and/or the response variables.
- Look at scatterplots of combinations of transformed and/or non-transformed values.
- Use Adjusted R-Squared to help choose between the candidate linear equation models that result.

Non-linear dot patterns in scatterplots suggest possible transformations to straighten them. Example data:

| year | salary (in \$10,000) |
|---|---|
| 1980 | 1 |
| 1982 | 2 |
| 1990 | 3 |
| 1990 | 4.5 |
| 1991 | 5.2 |
| 1996 | 8.4 |
| 1997 | 11 |
| 1998 | 12.4 |
| 1999 | 12.5 |
| 1999 | 13.1 |
| 1999 | 15 |
| 2001 | 17 |
| 2001 | 22.2 |
| 2004 | 22.6 |
| 2005 | 25.8 |

Candidate transformed pairings to try:

- (x, LN(y))
- (x, SQRT(y))
- (x, 1/y)
- (x, y^2)
- (LN(x), y)
- (LN(x), LN(y))
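A rough way to screen candidate pairings is to compare how close |r| is to 1 for each. A sketch using the year/salary data from the table above (this is only a screening heuristic; the scatterplot and residual plot checks still apply):

```python
import numpy as np

# Year/salary data from the table above.
year = np.array([1980, 1982, 1990, 1990, 1991, 1996, 1997, 1998,
                 1999, 1999, 1999, 2001, 2001, 2004, 2005], dtype=float)
salary = np.array([1, 2, 3, 4.5, 5.2, 8.4, 11, 12.4, 12.5,
                   13.1, 15, 17, 22.2, 22.6, 25.8])

# A few of the candidate pairings applied to the response variable:
pairings = {
    "(x, y)":       (year, salary),
    "(x, LN(y))":   (year, np.log(salary)),
    "(x, SQRT(y))": (year, np.sqrt(salary)),
    "(x, 1/y)":     (year, 1 / salary),
}
for name, (x, y) in pairings.items():
    print(name, round(abs(np.corrcoef(x, y)[0, 1]), 4))
```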

## Significance

What else to check in the regression output:

- The overall significance of the regression equation (in the ANOVA table).
- The regression coefficient: the average change in y expected per unit change in the explanatory variable, when all other explanatory variables are held constant.
- How significant are the regression equation and/or the coefficients? If the significance and/or p-values are less than .05, one or more of the explanatory variables are good predictors of the response variable.
- Also check R-squared (use Adjusted R-squared).

Standard approach we will use (one of the following will be the investigation conclusion):

1. If the regression equation is not significant, the model is not a good fit to the data.
2. If the regression model is significant but the coefficient is not, the model provides improved fit over using the expected value of the response variable as the estimated prediction.
3. If the regression model is significant and the coefficient is significant, the model is a good fit for the data and the explanatory variable is contributing significantly toward the quality of prediction.
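A minimal sketch of the overall-significance check, using the MS values from the salary regression output in these notes. The F statistic is MS regression divided by MS residual; the 5% critical value for df = (1, 13) is an approximate looked-up constant, not something from the homework:

```python
# Model-significance check from an ANOVA table.
ms_regression = 719.7849352684   # MS for Regression (from the output in these notes)
ms_residual = 10.7609536973      # MS for Residual
f_stat = ms_regression / ms_residual

f_critical = 4.667               # approximate 5% critical value for df = (1, 13)
model_significant = f_stat > f_critical
print(round(f_stat, 4), model_significant)
```

In simple regression with one explanatory variable, this F test and the coefficient's t test are the same test, which is why Significance F equals the slope's p-value in the output tables.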

## Example2

Question: Is the regression model with the square-root-transformed response variable (salary) the better model? Yes, according to the analysis (scatter plots, residual plots, Adjusted R-squared, and standard error comparisons).

| year (actual) | year (Explanatory) | salary in \$10,000 (Response) | SQRT(Salary) |
|---|---|---|---|
| 1980 | 80 | 1 | 1 |
| 1982 | 82 | 2 | 1.4142135624 |
| 1990 | 90 | 3 | 1.7320508076 |
| 1990 | 90 | 4.5 | 2.1213203436 |
| 1991 | 91 | 5.2 | 2.2803508502 |
| 1996 | 96 | 8.4 | 2.8982753492 |
| 1997 | 97 | 11 | 3.3166247904 |
| 1998 | 98 | 12.4 | 3.5213633723 |
| 1999 | 99 | 12.5 | 3.5355339059 |
| 1999 | 99 | 13.1 | 3.6193922142 |
| 1999 | 99 | 15 | 3.8729833462 |
| 2001 | 101 | 17 | 4.1231056256 |
| 2001 | 101 | 22.2 | 4.7116875958 |
| 2004 | 104 | 22.6 | 4.7539457296 |
| 2005 | 105 | 25.8 | 5.0793700397 |

- Compare scatter plots: which looks visually straighter?
- Compare residual plots: which has a cloudy, scattered dot pattern?
- Compare Adjusted R-squared values: which is largest?
- Compare standard errors: which is smallest?

Both models are significant. The regression model with the transformed response variable has the largest Adjusted R-squared value and the smallest standard error.

Response variable: Salary

| Regression Statistics | |
|---|---|
| Multiple R | 0.9150264296 |
| R Square | 0.8372733668 |
| Adjusted R Square | 0.8247559335 |
| Standard Error | 3.28038926 |
| Observations | 15 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 719.7849352684 | 719.7849352684 | 66.8885821383 | 0.0000017547 |
| Residual | 13 | 139.892398065 | 10.7609536973 | | |
| Total | 14 | 859.6773333333 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | -80.4842432619 | 11.3048781078 | -7.1194260119 | 0.0000078229 | -104.906947591 | -56.0615389328 |
| year (Explanatory) | 0.9657567381 | 0.1180841892 | 8.1785440109 | 0.0000017547 | 0.710651357 | 1.2208621192 |

Response variable: Square Root of Salary

| Regression Statistics | |
|---|---|
| Multiple R | 0.9668776273 |
| R Square | 0.9348523462 |
| Adjusted R Square | 0.9298409882 |
| Standard Error | 0.3337444496 |
| Observations | 15 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 20.7785720521 | 20.7785720521 | 186.5467103404 | 0.0000000044 |
| Residual | 13 | 1.4480096496 | 0.1113853577 | | |
| Total | 14 | 22.2265817017 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | -12.4661593835 | 1.1501501874 | -10.8387230813 | 0.0000000701 | -14.9509077987 | -9.9814109684 |
| year (Explanatory) | 0.164087017 | 0.0120138007 | 13.6582103637 | 0.0000000044 | 0.1381327785 | 0.1900412554 |
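The Adjusted R-squared comparison can be reproduced directly from the data; a sketch for the simple-regression case, where R-squared is the squared correlation:

```python
import numpy as np

# Year (explanatory) and salary data from the table above.
year = np.array([80, 82, 90, 90, 91, 96, 97, 98, 99, 99,
                 99, 101, 101, 104, 105], dtype=float)
salary = np.array([1, 2, 3, 4.5, 5.2, 8.4, 11, 12.4, 12.5,
                   13.1, 15, 17, 22.2, 22.6, 25.8])

def adj_r2(x, y):
    # Simple regression: R^2 = r^2, adjusted for n points and 1 predictor.
    n = len(x)
    r2 = np.corrcoef(x, y)[0, 1] ** 2
    return 1 - (1 - r2) * (n - 1) / (n - 2)

raw = adj_r2(year, salary)                   # ~0.8248 per the output above
transformed = adj_r2(year, np.sqrt(salary))  # ~0.9298 per the output above
print(round(raw, 4), round(transformed, 4))
```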


## regression equation after trans

To undo an LN transformation and determine a regression equation in the original salary units, take the exponential (e) of both sides of the equation.

To Undo the LN transformations

(x, LN(y)):  LN(y) = ax + b  →  y = e^(ax + b) = e^(ax) · e^b

(LN(x), y):  y = a·LN(x) + b  (no back-transformation needed)

(LN(x), LN(y)):  LN(y) = a·LN(x) + b  →  LN(y) = LN(x^a) + b  →  y = x^a · e^b
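A numerical check of the back-transformations; a and b are illustrative placeholder coefficients, not fitted values from the homework data:

```python
import numpy as np

# Placeholder coefficients and a few x values for illustration only.
a, b = 0.15, -1.2
x = np.array([2.0, 4.0, 8.0])

# (x, LN(y)):  LN(y) = a*x + b   =>  y = e^(a*x + b) = e^(a*x) * e^b
y_from_lny = np.exp(a * x + b)
assert np.allclose(y_from_lny, np.exp(a * x) * np.exp(b))

# (LN(x), LN(y)):  LN(y) = a*LN(x) + b  =>  y = x^a * e^b  (a product, not a sum)
y_from_loglog = x ** a * np.exp(b)
assert np.allclose(y_from_loglog, np.exp(a * np.log(x) + b))
```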

## numerical summaries

Numerical Summaries: Excel Regression Output Table

- y-intercept: the average value of the response variable when all explanatory variables are 0 (the impact of the explanatory variables is not considered).
- Regression coefficients: the average predicted change in the response variable per unit change in the explanatory variable.
- Correlation Coefficient (Multiple R): strength of the linear relation.
- Coefficient of Determination (r^2), use Adjusted R^2: the fraction of the variation in the data accounted for by the regression equation; sometimes reported as a percentage; between 0 and 1.
- SEE (Standard Error of the Estimate): the standard deviation of the residuals; how spread out the observations are from the regression line.
- Confidence Intervals: the range, within a specified level of confidence, in which the population parameter resides.

| Regression Statistics | |
|---|---|
| Multiple R | 0.9150264296 |
| R Square | 0.8372733668 |
| Adjusted R Square | 0.8247559335 |
| Standard Error | 3.28038926 |
| Observations | 15 |

ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 719.7849352684 | 719.7849352684 | 66.8885821383 | 0.0000017547 |
| Residual | 13 | 139.892398065 | 10.7609536973 | | |
| Total | 14 | 859.6773333333 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | -80.4842432619 | 11.3048781078 | -7.1194260119 | 0.0000078229 | -104.906947591 | -56.0615389328 |
| year (Explanatory) | 0.9657567381 | 0.1180841892 | 8.1785440109 | 0.0000017547 | 0.710651357 | 1.2208621192 |
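The SEE definition can be verified arithmetically from the ANOVA table: it is the square root of the residual mean square (SS residual divided by its degrees of freedom):

```python
import math

# Check the reported Standard Error against the ANOVA table values:
# SEE = sqrt(SS_residual / df_residual) = sqrt(MS_residual).
ss_residual = 139.892398065
df_residual = 13
see = math.sqrt(ss_residual / df_residual)
print(round(see, 8))  # matches the reported Standard Error, 3.28038926
```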

## Influential Points

Does the regression equation change significantly when an isolated point is removed? Would a different conclusion result from the analysis? If yes, then the point is influential. In the regression output, look at:

- the significance of the model or coefficients
- the numerical summaries

and compare the results with and without the suspected influential point.
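A sketch of this with-and-without comparison, using illustrative data (not from the homework): five points on a perfect line plus one isolated point that breaks the pattern:

```python
import numpy as np

# Illustrative data: y = 2x for the first five points, plus one
# isolated point that does not follow the pattern.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 5.0])

slope_with, _ = np.polyfit(x, y, 1)          # all points
slope_without, _ = np.polyfit(x[:-1], y[:-1], 1)  # isolated point removed

# A large change in the fitted slope flags the removed point as influential.
print(round(slope_with, 3), round(slope_without, 3))
```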

## recode

Recode multiple-category variables (3 or more categories) into 2-category (dummy) variables.

- Gender: male (1), female (0)
- Political Party: republican (0), democrat (1), libertarian (2), green (3)

| Student | IQ | Study hours | Gender | Political Party | Test score |
|---|---|---|---|---|---|
| 1 | 110 | 40 | 1 | 0 | 100 |
| 2 | 110 | 40 | 0 | 2 | 95 |
| 3 | 120 | 30 | 1 | 1 | 90 |
| 4 | 110 | 40 | 1 | 3 | 85 |
| 5 | 100 | 20 | 0 | 3 | 80 |
| 6 | 110 | 40 | 1 | 1 | 75 |
| 7 | 90 | 0 | 0 | 0 | 70 |
| 8 | 110 | 40 | 0 | 2 | 65 |
| 9 | 80 | 10 | 1 | 3 | 60 |
| 10 | 80 | 10 | 0 | 1 | 55 |

The dummy variables are 1 when (zero otherwise): x1 = republican, x2 = democrat, x3 = libertarian. Green is the baseline category (all three dummies zero).

| Student | IQ | Study hours | Gender | x1 | x2 | x3 | Test score |
|---|---|---|---|---|---|---|---|
| 1 | 110 | 40 | 1 | 1 | 0 | 0 | 100 |
| 2 | 110 | 40 | 0 | 0 | 0 | 1 | 95 |
| 3 | 120 | 30 | 1 | 0 | 1 | 0 | 90 |
| 4 | 110 | 40 | 1 | 0 | 0 | 0 | 85 |
| 5 | 100 | 20 | 0 | 0 | 0 | 0 | 80 |
| 6 | 110 | 40 | 1 | 0 | 1 | 0 | 75 |
| 7 | 90 | 0 | 0 | 1 | 0 | 0 | 70 |
| 8 | 110 | 40 | 0 | 0 | 0 | 1 | 65 |
| 9 | 80 | 10 | 1 | 0 | 0 | 0 | 60 |
| 10 | 80 | 10 | 0 | 0 | 1 | 0 | 55 |
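The dummy coding above can be sketched with pandas; the party codes are the ten values from the table, and the column selection reproduces the x1/x2/x3 scheme with green as the omitted baseline:

```python
import pandas as pd

# Political Party codes for students 1-10, from the table above.
party = pd.Series([0, 2, 1, 3, 3, 1, 0, 2, 3, 1])
labels = party.map({0: "republican", 1: "democrat",
                    2: "libertarian", 3: "green"})

# One dummy column per kept category; green is the baseline (all zeros).
dummies = pd.get_dummies(labels)[["republican", "democrat", "libertarian"]]
dummies.columns = ["x1", "x2", "x3"]
print(dummies.astype(int))
```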