QUANT homework

Introduction

Ideally the goal is to select explanatory variables to build the simplest regression equation possible that will produce a good estimate of the response variable (parsimony)
this is to simplify interpretation of the regression coefficients
selecting good variables for use in a regression analysis usually involves initially engaging in some research
To use simple linear regression analysis, there needs to be a linear relation between:
1) the response variable and the explanatory variable
Investigate the linear relations by examining:
1) Correlation: the numerical measure of strength in the relation between the explanatory and response variables
2) scatter plot of the response variable against the independent variable
the plot can show a linear dot pattern, indicating the explanatory variable likely needs to be included in the regression equation
the plot can show a non-linear dot pattern, suggesting a transformation of an explanatory variable may be needed
the plot can show a horizontal dot pattern, suggesting the variable should not be used in the regression equation
3) residual plots
looking for a scattered, cloudy dot pattern with consistent variance at each level of the explanatory variable, which helps confirm that a linear relation likely exists
if a pattern is observed, this suggests transformations may need to be applied

correlation

Response variable (the variable of interest): typically placed on the y-axis
explanatory variable: typically placed on the x-axis
Choosing the role of the variables depends on how you think the variables are related
Correlation coefficient (r): permits you to quantify the strength of the linear relation between the 2 variables
In Excel, use the =CORREL() function
properties of correlation coefficient
range from -1 to 1
sign gives the direction of the association
unitless measure
not affected by changes to the center or scale of the variables
depends only on the z-scores
sensitive to outliers
does not imply causation (could be lurking variables standing behind the association)
check the scatter plot for straightness even if the value of r is high
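A minimal sketch of computing r outside Excel (assumed NumPy setup, not part of the notes; the values are the first five (salary, housing price) pairs from Example1A below):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.5, 5.2])       # explanatory values
y = np.array([45.0, 47.0, 57.0, 59.0, 60.0])  # response values

r = np.corrcoef(x, y)[0, 1]                   # Pearson correlation coefficient, like =CORREL()
print(round(r, 4))

# r is unchanged by shifting or rescaling a variable (it depends only on z-scores)
print(round(np.corrcoef(x * 10000, y)[0, 1], 4))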

scatter

Scatterplots
displays relation (association) between 2 quantitative variables
the investigator can see patterns, trends, and outliers
1. Direction of the association
negative vs positive
2. Form of the association
nonlinear vs linear
3. Strength of association
How discernable is the data pattern or trend?
weak, moderate, strong
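A minimal plotting sketch (assumed matplotlib setup; the x and y values are made up for illustration):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6, 7, 8]                   # explanatory variable (hypothetical)
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.9]   # response variable (hypothetical)

plt.scatter(x, y)                              # inspect direction, form, and strength
plt.xlabel("explanatory variable (x)")
plt.ylabel("response variable (y)")
plt.show()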

residual

Question: What line can be used to model the linear data pattern displayed in the scatterplot?
the equation of the best fit line is the regression line (model)
the estimates for the response variable made through the regression line are called the predicted values
residuals
To find residuals, subtract the predicted value (calculated with the regression equation) from the observed value
negative residual: the regression equation is overestimating
positive residual: the regression equation is underestimating
Best-fitting line means the sum of the squared residuals is smallest (the least-squares criterion)
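A minimal sketch (assumed NumPy setup, with made-up x and y values) of fitting the least-squares line and computing residuals as observed minus predicted:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

slope, intercept = np.polyfit(x, y, 1)   # least-squares fit of y = slope*x + intercept
predicted = slope * x + intercept
residuals = y - predicted                # negative -> overestimate, positive -> underestimate

print(np.round(residuals, 3))
print(round(np.sum(residuals**2), 4))    # the least-squares line makes this sum as small as possible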

Example1A

Question: Is there a linear relation between the response variable (salary) and housing price?
1) Correlation (the =CORREL() function in Excel)
2) Scatter plot (select the two variables and use the Insert > Charts menu)
3) Residual plot (run the regression in Excel with the Data Analysis ToolPak and select Residual Plots)

1) Correlation: 0.9086846363
A correlation of approximately .91 implies that salary and housing price have a strong, positive, linear relation where salary increases as housing price increases in a systematically predictable way

Salary ($10,000)  Housing Price ($10,000)
1  45
2  47
3  57
4.5  59
5.2  60
8.4  65
11  66
12.4  67
12.5  69
13.1  70
15  70
17  72
22.2  73
22.6  74
25.8  76

2) Scatter plot: From the scatter plot it can be seen that the relation appears to be positive and strong. However, there also appears to be a slight bend in the dot pattern. Further investigation is needed in order to determine if the trend is linear
3) Residual plot: There is a pattern (the dot pattern is not scattered) evident in the residual plot. This suggests that some other relation, beyond a linear relation, would likely better represent the relation between salary and housing price. Hence a transformation should be performed

(Chart: scatter plot of Housing Price ($10,000) against Salary ($10,000))
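A minimal sketch (assumed NumPy setup, not part of the Excel workflow) reproducing the checks for Example1A:

import numpy as np

salary  = np.array([1, 2, 3, 4.5, 5.2, 8.4, 11, 12.4, 12.5, 13.1, 15, 17, 22.2, 22.6, 25.8])
housing = np.array([45, 47, 57, 59, 60, 65, 66, 67, 69, 70, 70, 72, 73, 74, 76])

# 1) Correlation (compare with the =CORREL value reported above)
print(round(np.corrcoef(salary, housing)[0, 1], 4))

# 2)/3) Fit the least-squares line and inspect the residuals; a curved,
# non-random residual pattern suggests a transformation is needed
slope, intercept = np.polyfit(salary, housing, 1)
residuals = housing - (slope * salary + intercept)
print(np.round(residuals, 2))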

Transform1

Goal: Make the dot pattern of a scatterplot more nearly linear
calculate transformations of the explanatory and/or the response variables
look at scatterplots of combinations of transformed and/or non-transformed values
Can use Adjusted R-Squared to help choose between the candidate linear equation models that result
Non-linear dot patterns in scatterplots and possible transformations to straighten
year salary (in $10,000) LN(salary)
1980 1
1982 2
1990 3
1990 4.5
1991 5.2
1996 8.4
1997 11
1998 12.4
1999 12.5
1999 13.1
1999 15
2001 17
2001 22.2
2004 22.6
2005 25.8

Candidate (explanatory, response) pairs plotted to look for a straighter dot pattern:
(x, LN(y))
(x, SQRT(y))
(x, 1/y)
(x, y^2)
(LN(x), y)
(LN(x), LN(y))
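A minimal sketch (assumed approach, not from the notes) that applies candidate transformations to the year/salary data above and uses |r| as a rough screen for which transformed pair looks most linear:

import numpy as np

year   = np.array([1980, 1982, 1990, 1990, 1991, 1996, 1997, 1998, 1999, 1999, 1999, 2001, 2001, 2004, 2005])
salary = np.array([1, 2, 3, 4.5, 5.2, 8.4, 11, 12.4, 12.5, 13.1, 15, 17, 22.2, 22.6, 25.8])

candidates = {
    "(x, y)":       (year, salary),
    "(x, LN(y))":   (year, np.log(salary)),
    "(x, SQRT(y))": (year, np.sqrt(salary)),
    "(x, 1/y)":     (year, 1 / salary),
    "(x, y^2)":     (year, salary**2),
}

for name, (x, y) in candidates.items():
    r = np.corrcoef(x, y)[0, 1]
    print(f"{name}: |r| = {abs(r):.3f}")

A high |r| alone does not prove linearity, so the scatter and residual plots should still be checked.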

Significance

What else to check in the Regression Output
The overall significance of the regression equation (in the ANOVA table)
The regression coefficient is the average change in y expected per unit change in the explanatory variable when all other explanatory variables are held constant
How significant are the regression equation and/or the coefficients?
if the Significance F and/or coefficient p-values are less than .05, 1 or more of the explanatory variables are good predictors of the response variable
also, check R-squared (use adjusted R-squared)
Standard Approach we will use (1 of the following will be the investigation conclusion)
If regression equation is not significant, the model is not a good fit to the data
If the regression model is significant but the coefficient is not, the model provides improved fit over using the expected value of the response variable as the estimated prediction
If the regression model is significant and the coefficient is significant, the model is a good fit for the data and the explanatory variable is contributing significantly towards the quality of prediction
Check both the model significance (Significance F in the ANOVA table) and the coefficient significance (the coefficient p-values)
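A minimal sketch (assuming Python with statsmodels is available; statsmodels is not part of the Excel workflow in these notes) that reads the same significance quantities off a fitted model, using the year/salary data from Example2 below:

import numpy as np
import statsmodels.api as sm

year   = np.array([80, 82, 90, 90, 91, 96, 97, 98, 99, 99, 99, 101, 101, 104, 105])
salary = np.array([1, 2, 3, 4.5, 5.2, 8.4, 11, 12.4, 12.5, 13.1, 15, 17, 22.2, 22.6, 25.8])

model = sm.OLS(salary, sm.add_constant(year)).fit()

print(model.f_pvalue)       # overall model significance (Significance F)
print(model.pvalues)        # p-values for the intercept and the coefficient
print(model.rsquared_adj)   # adjusted R-squared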

Example2

Question: Is the regression model with the square-root-transformed response variable (salary) the better model?
Yes, according to the analysis (scatter plots, residual plots, adjusted R-squared, and standard error comparisons)
year (actual)  year (Explanatory)  salary (in $10,000) (Response)  SQRT(Salary)
1980  80  1  1
1982  82  2  1.4142135624
1990  90  3  1.7320508076
1990  90  4.5  2.1213203436
1991  91  5.2  2.2803508502
1996  96  8.4  2.8982753492
1997  97  11  3.3166247904
1998  98  12.4  3.5213633723
1999  99  12.5  3.5355339059
1999  99  13.1  3.6193922142
1999  99  15  3.8729833462
2001  101  17  4.1231056256
2001  101  22.2  4.7116875958
2004  104  22.6  4.7539457296
2005  105  25.8  5.0793700397

Compare scatter plots: which looks visually straighter?
Compare residual plots (Response: Salary vs Response: Square Root of Salary): which has a cloudy, scattered dot pattern?
Compare Adjusted R-squared values: which is largest?
Compare Standard Error values: which is smallest?
Both models are significant
The regression model with the transformed response variable has the largest Adjusted R-squared value and smallest standard error
Response Variable: Salary
Regression Statistics
Multiple R  0.9150264296
R Square  0.8372733668
Adjusted R Square  0.8247559335
Standard Error  3.28038926
Observations  15
ANOVA
  df  SS  MS  F  Significance F
Regression  1  719.7849352684  719.7849352684  66.8885821383  0.0000017547
Residual  13  139.892398065  10.7609536973
Total  14  859.6773333333
  Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept  -80.4842432619  11.3048781078  -7.1194260119  0.0000078229  -104.906947591  -56.0615389328
year (Explanatory)  0.9657567381  0.1180841892  8.1785440109  0.0000017547  0.710651357  1.2208621192

Response Variable: Square Root Of Salary
Regression Statistics
Multiple R  0.9668776273
R Square  0.9348523462
Adjusted R Square  0.9298409882
Standard Error  0.3337444496
Observations  15
ANOVA
  df  SS  MS  F  Significance F
Regression  1  20.7785720521  20.7785720521  186.5467103404  0.0000000044
Residual  13  1.4480096496  0.1113853577
Total  14  22.2265817017
  Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept  -12.4661593835  1.1501501874  -10.8387230813  0.0000000701  -14.9509077987  -9.9814109684
year (Explanatory)  0.164087017  0.0120138007  13.6582103637  0.0000000044  0.1381327785  0.1900412554

(Charts: scatter plots of salary (in $10,000) and SQRT(Salary) against year (Explanatory))
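A minimal sketch (assumed statsmodels usage) comparing the two models on adjusted R-squared and the standard error of the estimate; the printed values should match the regression output tables above:

import numpy as np
import statsmodels.api as sm

year   = np.array([80, 82, 90, 90, 91, 96, 97, 98, 99, 99, 99, 101, 101, 104, 105])
salary = np.array([1, 2, 3, 4.5, 5.2, 8.4, 11, 12.4, 12.5, 13.1, 15, 17, 22.2, 22.6, 25.8])

X = sm.add_constant(year)
raw_model  = sm.OLS(salary, X).fit()
sqrt_model = sm.OLS(np.sqrt(salary), X).fit()

for name, m in [("Salary", raw_model), ("SQRT(Salary)", sqrt_model)]:
    see = np.sqrt(m.mse_resid)   # standard error of the estimate
    print(name, round(m.rsquared_adj, 4), round(see, 4), m.f_pvalue)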

regression equation after trans

To undo the LN transformation to determine a regression equation for the original salary units, take the exponential (e) of both sides of the equation

To Undo the LN transformations

(x, LN(y))    LN(y) = ax + b
              y = e^(ax + b)
              y = e^(ax) * e^b

(LN(x), y)    y = a*LN(x) + b    (the response is untransformed, so no undoing is needed)

(LN(x), LN(y))    LN(y) = a*LN(x) + b
                  LN(y) = LN(x^a) + b
                  y = (x^a) * e^b
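A minimal sketch (hypothetical slope a, intercept b, and new x value, for illustration only) of predicting in the original units after fitting LN(y) on x:

import numpy as np

a, b = 0.16, -12.5          # hypothetical coefficients from a fit of LN(y) = ax + b
x_new = 100.0               # hypothetical new value of the explanatory variable

ln_y_hat = a * x_new + b    # prediction on the LN scale
y_hat = np.exp(ln_y_hat)    # back-transform: y = e^(ax + b)
print(round(y_hat, 2))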

numerical summaries

Numerical Summaries: Excel Regression Output Table (color coded)
Regression Coefficients
y-intercept: average value of the response variable when all explanatory variables are 0
(the impact of the explanatory variables is not considered)
regression coefficients: the average predicted change in the response variable per unit change in the explanatory variable
Correlation Coefficient: strength of the linear relation
Coefficient of Determination (r^2) – use Adjusted R^2
fraction of the variation in the data accounted for by the regression equation
sometimes reported as a percentage
between 0 and 1
SEE (Standard Error of the Estimate)
standard deviation of the residuals
how spread out the observations are from the regression line
Confidence Intervals: range, within a specified level of confidence, in which the population parameter resides

Excel Regression Output Table
Regression Statistics
Multiple R  0.9150264296
R Square  0.8372733668
Adjusted R Square  0.8247559335
Standard Error  3.28038926
Observations  15
ANOVA
  df  SS  MS  F  Significance F
Regression  1  719.7849352684  719.7849352684  66.8885821383  0.0000017547
Residual  13  139.892398065  10.7609536973
Total  14  859.6773333333
  Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept  -80.4842432619  11.3048781078  -7.1194260119  0.0000078229  -104.906947591  -56.0615389328
year (Explanatory)  0.9657567381  0.1180841892  8.1785440109  0.0000017547  0.710651357  1.2208621192
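A minimal sketch (plain Python, standard formulas) recomputing Adjusted R Square and the Standard Error of the Estimate from the ANOVA sums of squares in the table above, where n is the number of observations and k the number of explanatory variables:

ss_res, ss_tot = 139.892398065, 859.6773333333   # Residual SS and Total SS from the ANOVA table
n, k = 15, 1                                     # observations, explanatory variables

r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
see = (ss_res / (n - k - 1)) ** 0.5              # standard deviation of the residuals

print(round(r2, 10), round(adj_r2, 10), round(see, 8))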

Influential Points

Does the regression equation change significantly when the isolated point is removed?
Would a different result from the analysis be concluded?
If yes, then the point is influential
In the regression output look at:
significance of model or coefficients
numerical summaries
Example
with influential point vs without influential point
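A minimal sketch (made-up data with one isolated point) of checking influence by refitting without the suspect point:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 20.0])   # the last point is isolated
y = np.array([2, 4, 6, 8, 10, 5.0])

slope_all, intercept_all = np.polyfit(x, y, 1)             # fit with the point
slope_wo, intercept_wo   = np.polyfit(x[:-1], y[:-1], 1)   # fit without the point

print("with point:   ", round(slope_all, 3), round(intercept_all, 3))
print("without point:", round(slope_wo, 3), round(intercept_wo, 3))
# A large change in the coefficients (or in their significance) marks the point as influential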

recode

Recode a multiple-category variable (3 or more categories) into several 2-category (dummy) variables
Gender: male (1) female (0)
Political Party (0) republican, (1) democrat, (2) libertarian, (3) green
Student IQ Study hours Gender Political Party Test score
1 110 40 1 0 100
2 110 40 0 2 95
3 120 30 1 1 90
4 110 40 1 3 85
5 100 20 0 3 80
6 110 40 1 1 75
7 90 0 0 0 70
8 110 40 0 2 65
9 80 10 1 3 60
10 80 10 0 1 55
The dummy variables are 1 when (zero otherwise):
x1: republican
x2: democrat
x3: libertarian
Student IQ Study hours Gender x1 x2 x3 Test score
1 110 40 1 1 0 0 100
2 110 40 0 0 0 1 95
3 120 30 1 0 1 0 90
4 110 40 1 0 0 0 85
5 100 20 0 0 0 0 80
6 110 40 1 0 1 0 75
7 90 0 0 1 0 0 70
8 110 40 0 0 0 1 65
9 80 10 1 0 0 0 60
10 80 10 0 0 1 0 55
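A minimal sketch (assumed pandas usage) building the three dummy variables from the Political Party codes in the table above, with green (3) as the reference category:

import pandas as pd

party = pd.Series([0, 2, 1, 3, 3, 1, 0, 2, 3, 1], name="party")  # codes for students 1-10

dummies = pd.DataFrame({
    "x1": (party == 0).astype(int),   # republican
    "x2": (party == 1).astype(int),   # democrat
    "x3": (party == 2).astype(int),   # libertarian
})
print(dummies)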