Sunday, January 21, 2018

Sample Answer for Regression

Applied Linear Statistical Models 5th Edition

1.20
R code
data1 <- read.table("CH01PR20.txt", header=F)
colnames(data1) <- c("y","x")
attach(data1)
data1reg <- lm(y~x, data=data1)
summary(data1reg)

Y = the total number of minutes spent by the service person
X = the number of copiers serviced
a.
Estimated regression function: Yhat = -0.5802 + 15.0352X
For each additional copier serviced, the estimated total number of minutes spent by the service person increases by 15.0352 minutes.
b.
Based on the plotted data, the regression line appears to fit the data well: it passes through the middle of the plotted points.
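A minimal sketch to reproduce the plot referred to here, using the objects defined above:
# Scatterplot of the data with the fitted regression line overlaid
plot(data1$x, data1$y, xlab = "Copiers serviced", ylab = "Minutes")
abline(data1reg)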

c. b0 = -0.5802. Taken literally, it says that when no copiers are serviced, the number of minutes spent by the service person is -0.5802. Therefore b0 provides no relevant information here, because the number of minutes spent cannot be negative when 0 copiers are serviced.

d. Yhat = -0.5802 + 15.0352(5) = 74.5958
An estimated 74.5958 minutes are needed to service 5 copiers.
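The same point estimate can be obtained with predict():
predict(data1reg, newdata = data.frame(x = 5))   # ~74.596 minutes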

1.24
R code:
data1reg$residuals



"x"
"1" -9.49033942558754
"2" 0.439164490861531
"3" 1.47441253263708
"4" 11.5096605744125
"5" -2.455091383812
"6" -12.7723237597911
"7" -6.59608355091384
"8" 14.4039164490862
"9" -10.455091383812
"10" 2.50966057441254
"11" 9.26292428198433
"12" 6.22767624020888
"13" 3.36866840731071
"14" -8.52558746736292
"15" 12.4391644908616
"16" -19.7018276762402
"17" 0.333420365535253
"18" 11.2981723237598
"19" -22.7723237597911
"20" -2.56083550913838
"21" -8.59608355091384
"22" -3.66657963446475
"23" 4.33342036553525
"24" -0.596083550913837
"25" -0.737075718015668
"26" 7.33342036553525
"27" -11.4903394255875
"28" -1.59608355091384
"29" 6.33342036553525
"30" 6.36866840731071
"31" 3.29817232375979
"32" 15.4039164490862
"33" -9.49033942558746
"34" -1.49033942558746
"35" -11.455091383812
"36" -2.56083550913838
"37" 11.4039164490862
"38" -2.73707571801567
"39" 7.33342036553525
"40" 12.544908616188
"41" -3.73707571801567
"42" 4.50966057441254
"43" -2.49033942558746
"44" 1.43916449086162
"45" 2.40391644908616



R code:
anova(data1reg)

Response: y
          Df Sum Sq
x          1  76960
Residuals 43   3416
a.
Sum of squares of the residuals (SSE) = 3416
Mean square of the residuals (MSE) = SSE / degrees of freedom = 3416/43 = 79.44186
Point estimate of the population variance = MSE = 79.44186
Point estimate of the population standard deviation = √MSE = 8.913016
The standard deviation is expressed in minutes.
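A sketch of extracting these quantities directly from the anova object:
a <- anova(data1reg)
mse <- a["Residuals", "Sum Sq"] / a["Residuals", "Df"]   # 79.44186
sqrt(mse)                                                # 8.913016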
2.5
a.
R code:
> confint(data1reg, "x", level=0.90)
       5 %     95 %
x 14.22314 15.84735
Interpretation: with 90% confidence, the mean service time increases by between 14.22314 and 15.84735 minutes for each additional copier serviced.
b. The p-value is in the summary output:
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.5802     2.8039  -0.207    0.837
x            15.0352     0.4831  31.123   <2e-16 ***
The t value is 31.123 with p-value ≈ 0.
Under the hypotheses:
Ho: β1 = 0 (there is no linear association between the total number of minutes spent by the service person and the number of copiers serviced)
Ha: β1 ≠ 0 (there is a linear association between the total number of minutes spent by the service person and the number of copiers serviced)
Critical value t(0.95; 43) = 1.681. Since 31.123 > 1.681, we reject Ho.
Conclusion: there is a linear association between the total number of minutes spent by the service person and the number of copiers serviced.
c. Yes; at the 90% level, the confidence interval for β1 in part (a) does not contain 0, which is consistent with the conclusion of the test in part (b).
d. Under the hypotheses:
Ho: β1 = 14 (the mean required time increases by 14 minutes for each additional copier serviced on a service call)
Ha: β1 > 14 (the mean required time increases by more than 14 minutes for each additional copier serviced on a service call)
t* = (15.0352 - 14)/0.4831 = 2.143
One-tailed p-value = 0.0189
Since t* = 2.143 > t(0.95; 43) = 1.681 (equivalently, p = 0.0189 < 0.05), we reject Ho.
Conclusion: the mean required time increases by more than 14 minutes for each additional copier serviced on a service call.
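A sketch of this computation in R, using the coefficient table above:
b1  <- coef(summary(data1reg))["x", "Estimate"]
se1 <- coef(summary(data1reg))["x", "Std. Error"]
tstar <- (b1 - 14) / se1     # ~2.14
1 - pt(tstar, df = 43)       # one-sided p-value, ~0.019
qt(0.95, df = 43)            # critical value, ~1.681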
2.14
a
> predict(data1reg, newdata=data.frame(x=6), se.fit=T, interval="confidence", level=0.90)
$fit
       fit      lwr     upr
1 89.63133 87.28387 91.9788

Interpretation: we are 90% confident that the mean service time on calls in which six copiers are serviced is between 87.28387 and 91.9788 minutes.
b.
predict(data1reg, newdata=data.frame(x=6), se.fit=T, interval="prediction", level=0.90)
$fit
       fit      lwr      upr
1 89.63133 74.46433 104.7983
Interpretation: we are 90% confident that the service time on a single call in which six copiers are serviced will be between 74.46433 and 104.7983 minutes.
2.24
b.
Under the hypotheses:
Ho: β1 = 0 (there is no linear association between time spent and number of copiers serviced)
Ha: β1 ≠ 0 (there is a linear association between time spent and number of copiers serviced)

F* statistic: 968.7
Critical F value F(0.90; 1, 43) = 2.826
Since F* > the critical F value, we reject Ho.
Conclusion: there is a linear association between time spent and number of copiers serviced.

c. > summary(data1reg)$r.squared
[1] 0.9574955
95.75% of the total variation in minutes spent on a call is explained by the number of copiers serviced.
This is a relatively large reduction in variation.

d. r = √R² = +0.97852. The sign is positive because it is the same as the sign of the slope.
e. R² has the more clear-cut operational interpretation, since it directly gives the proportionate reduction in the variation of Y.

1. Page 147 3.4
c
  The decimal point is 1 digit(s) to the right of the |

  -2 | 30
  -1 |
  -1 | 3110
  -0 | 99997
  -0 | 44333222111
   0 | 001123334
   0 | 5666779
   1 | 112234
   1 | 5

Among the 45 residuals from the regression, the distribution starts to look normal: slightly skewed to the left, with most values located near 0 within the interval -13 to 15 and one point (about -23) slightly away from the group. A further test is required to settle whether these residuals are normally distributed.

d
Both plots provide almost identical information; the only difference is the scale of the X axis. The first is the plot of the regression residuals against the number of copiers serviced (X), and the second is the plot of the residuals against the estimated number of minutes spent by the service person (Yhat). Since the fitted function is Yhat = -0.5802 + 15.0352X, each increase of X by one increases Yhat by 15.0352, so the two plots are essentially identical. Both show a slight curve: the residuals rise toward the middle of the X (Yhat) range and drop again as X (Yhat) grows larger.

According to the histogram, the residuals are slightly skewed to the left. A further test is needed to check normality; based on the histogram alone, it is doubtful that these residuals are normally distributed.
e
Based on this plot, the lower values of X deviate from the line, and as the values increase they fit the line better. The lower values of X show a subtle departure from normality.


stderr <- summary(data1reg)$sigma
n <- length(data1reg$residuals)
expvals <- sapply(1:n, function(k) stderr * qnorm((k-.375)/(n+.25)))
cor(expvals, sort(data1reg$residuals))

sort(data1reg$residuals)


-22.77232376
-19.70182768
-12.77232376
-11.49033943
-11.45509138
-10.45509138
-9.490339426
-9.490339426
-8.596083551
-8.525587467
-6.596083551
-3.737075718
-3.666579634
-2.737075718
-2.560835509
-2.560835509
-2.490339426
-2.455091384
-1.596083551
-1.490339426
-0.737075718
-0.596083551
0.333420366
0.439164491
1.439164491
1.474412533
2.403916449
2.509660574
3.298172324
3.368668407
4.333420366
4.509660574
6.22767624
6.333420366
6.368668407
7.333420366
7.333420366
9.262924282
11.29817232
11.40391645
11.50966057
12.43916449
12.54490862
14.40391645
15.40391645



Expected value under normality
[1] -19.6327219 -16.0464284 -14.0092857 -12.5174953 -11.3117711 -10.2836100
 [7]  -9.3766584  -8.5576522  -7.8051932  -7.1046288  -6.4454331  -5.8197483
[13]  -5.2215182  -4.6459445  -4.0891323  -3.5478477  -3.0193496  -2.5012663
[19]  -1.9915043  -1.4881790  -0.9895583  -0.4940175   0.0000000   0.4940175
[25]   0.9895583   1.4881790   1.9915043   2.5012663   3.0193496   3.5478477
[31]   4.0891323   4.6459445   5.2215182   5.8197483   6.4454331   7.1046288
[37]   7.8051932   8.5576522   9.3766584  10.2836100  11.3117711  12.5174953
[43]  14.0092857  16.0464284  19.6327219

Correlation coefficient between the ordered residuals and the expected values under normality = 0.9891021
Under the hypotheses:
Ho: the distribution of the error terms does not depart substantially from a normal distribution.
Ha: the distribution of the error terms departs substantially from a normal distribution.
Critical value for the correlation test with alpha = 0.1 and n = 45, interpolated: (0.977 + 0.981)/2 = 0.979
Since the observed correlation coefficient 0.9891 > 0.979, we conclude that the distribution of the error terms does not depart substantially from a normal distribution.
This resolves the earlier doubt raised by the histogram and the residual plots.


h

The plot of the residuals against x2 (mean operational age of the copiers, in months) appears to show a positive relationship. This is a sign that the variable should be added to the regression and tested for whether it contributes significantly to the fitted values. The plot of the residuals against x3, by contrast, shows no pattern and roughly constant variance; this variable need not be included in the regression.
2.
With the command
> length(unique(x))
[1] 10
we obtain 10 distinct values of the x variable.
SSPE = 2797.658
SSE = 3416.377
SSLF = 3416.377 - 2797.658 = 618.72
df(LF) = c - 2 = 8
df(PE) = n - c = 45 - 10 = 35
F(LF) = MSLF/MSPE = (618.72/8)/(2797.658/35) = 0.9676
R output
 > data1lof<-lm(y~as.factor(x), data=data1)
> anova(data1reg,data1lof)
Analysis of Variance Table
Model 1: y ~ x
Model 2: y ~ as.factor(x)
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     43 3416.4
2     35 2797.7  8    618.72 0.9676 0.4766

P-value of the F statistic:
1-pf(0.9676, 8, 35)
[1] 0.4766119
Under the hypotheses:
H0: the linear regression model fits the data
Ha: the linear regression model does not fit the data
Since the p-value of the observed F statistic (0.4766) is larger than alpha = 0.10, we fail to reject the null hypothesis. The linear regression model fits the data; there is no evidence of lack of fit.
3.17
a
Yes, a positive linear relationship appears, and it is strong.

c
> data317$sqrty317<- sqrt(data317$y317)
> data317reg2<-lm(data317$sqrty317~x317)
> summary(data317reg2)

            Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.26093    0.21290   48.20 3.80e-11 ***
x317         1.07629    0.03988   26.99 3.83e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Estimated regression function: √Yhat = 10.26093 + 1.07629X
Interpretation: each additional year increases the square root of annual sales by 1.07629 (in square-root thousands of units), starting from 10.26093 in year 0.

plot(x317, data317$sqrty317)
The regression line appears to be a good fit to the transformed data; the points are spread evenly around the line.
 
plot(predict(data317reg2), data317reg2$residuals, ylab="residuals", xlab="fitted values")
The plots suggest that the residuals behave well: there is no pattern, and the variance is constant (the band of residuals is roughly box-shaped). On the normal probability plot, most of the points lie close to the line, with a couple of points at the top right and bottom left deviating from it. Since these deviations are symmetric, the plot is still considered acceptable.

Yhat = (10.26093 + 1.07629X)²
     = 10.26093² + 2(10.26093)(1.07629)X + 1.07629²X²
In the original units, estimated annual sales in year X are (10.26093 + 1.07629X)² thousands of units.
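A sketch of back-transforming the fitted values to the original sales scale, using data317reg2 from above:
yhat_sqrt <- predict(data317reg2)   # fitted values on the square-root scale
yhat <- yhat_sqrt^2                 # estimated sales in thousands of units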

4.7
Bonferroni method:
B = t(0.975; 43) = 2.016692
b0 = -0.5802
b1 = 15.0352
s{b0} = 2.8039
s{b1} = 0.4831

Xh = 3:  44.52559 ± 2.016692(9.069025058) gives
18.28943028 ≤ E{Yh} ≤ 62.81502028
Interpretation: under the Bonferroni method with 90% family confidence coefficient, when 3 copiers are to be serviced, the estimated expected number of minutes spent is between 18.28943028 and 62.81502028 minutes.
Xh = 5:  74.59608 ± 2.016692(9.011665523) gives
56.42232623 ≤ E{Yh} ≤ 92.76983377
Interpretation: under the Bonferroni method with 90% family confidence coefficient, when 5 copiers are to be serviced, the estimated expected number of minutes spent is between 56.42232623 and 92.76983377 minutes.
Xh = 7:  104.6666 ± 2.016692(9.05758221) gives
86.40024642 ≤ E{Yh} ≤ 122.9329536
Interpretation: under the Bonferroni method with 90% family confidence coefficient, when 7 copiers are to be serviced, the estimated expected number of minutes spent is between 86.40024642 and 122.9329536 minutes.

Working-Hotelling method
F(0.90; 2, 43) = 2.430407
W = √(2F) = √(2 × 2.430407) = 2.204725
Xh = 3:  44.52559 ± 2.204725(1.67491955) gives 41.14779314 ≤ E{Yh} ≤ 47.90338686
Interpretation: under the Working-Hotelling method with 90% family confidence coefficient, when 3 copiers are to be serviced, the estimated expected number of minutes spent is between 41.14779314 and 47.90338686 minutes.
Xh = 5:  74.59608 ± 2.204725(1.329757684) gives 71.91436832 ≤ E{Yh} ≤ 77.27779168
Interpretation: under the Working-Hotelling method with 90% family confidence coefficient, when 5 copiers are to be serviced, the estimated expected number of minutes spent is between 71.91436832 and 77.27779168 minutes.
Xh = 7:  104.6666 ± 2.204725(1.611811248) gives 101.4160732 ≤ E{Yh} ≤ 107.9171268
Interpretation: under the Working-Hotelling method with 90% family confidence coefficient, when 7 copiers are to be serviced, the estimated expected number of minutes spent is between 101.4160732 and 107.9171268 minutes.
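A sketch of the Working-Hotelling computation that reproduces the three intervals above:
W  <- sqrt(2 * qf(0.90, 2, 43))   # 2.204725
pr <- predict(data1reg, newdata = data.frame(x = c(3, 5, 7)), se.fit = TRUE)
cbind(fit = pr$fit, lwr = pr$fit - W * pr$se.fit, upr = pr$fit + W * pr$se.fit)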

Bonferroni:
Family alpha = 0.1
Split across k = 2 statements, two-tailed: 0.1/(2 × 2) = 0.025
Bonferroni critical value with k = 2: B = t(0.975; 43) = 2.016692
Scheffé method:
S = √(k × F(0.90; 2, 43)) = √(2 × 2.430407) = 2.204725
Comparing the two, the Bonferroni critical value is smaller, hence I choose the Bonferroni method.
Xh = 4:  59.56084 ± 2.016692(9.027475727) gives 41.35520192 ≤ Yh(new) ≤ 77.76647808
Interpretation: under the Bonferroni method with 90% family confidence coefficient, when 4 copiers are to be serviced, the predicted number of minutes spent is between 41.35520192 and 77.76647808 minutes.
Xh = 7:  104.6666 ± 2.016692(9.05758221) gives 86.40024642 ≤ Yh(new) ≤ 122.9329536
Interpretation: under the Bonferroni method with 90% family confidence coefficient, when 7 copiers are to be serviced, the predicted number of minutes spent is between 86.40024642 and 122.9329536 minutes.
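A sketch of the Bonferroni prediction intervals at Xh = 4 and 7; the standard error of prediction adds the MSE to the squared standard error of the fit:
B  <- qt(1 - 0.10/(2*2), 43)   # 2.016692
pr <- predict(data1reg, newdata = data.frame(x = c(4, 7)), se.fit = TRUE)
se_pred <- sqrt(pr$residual.scale^2 + pr$se.fit^2)
cbind(fit = pr$fit, lwr = pr$fit - B * se_pred, upr = pr$fit + B * se_pred)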




6.9 a

data69<-read.table("CH06PR09.txt", header =F)
> colnames(data69)<- c("y69", "x691", "x692", "x693")
> attach(data69)
stem(x691)
  The decimal point is 5 digit(s) to the right of the |
  2 | 13
  2 | 555556677777777777888999999
  3 | 00000011222233
  3 | 5778
  4 | 1134
  4 | 7

stem(x692)

  The decimal point is at the |

  4 | 6
  5 | 8
  6 | 123345577888999
  7 | 0122222334466777888999
  8 | 00011224566
  9 | 07

          y69        x691       x692       x693
y69  1.0000000 0.20766494 0.06002960 0.81057940
x691 0.2076649 1.00000000 0.08489639 0.04565698
x692 0.0600296 0.08489639 1.00000000 0.11337076
x693 0.8105794 0.04565698 0.11337076 1.00000000
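The matrix above comes from:
cor(data69)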


[Figure: boxplot of the residuals]

6.10
a
data69reg <- lm(y69 ~ x691 + x692 + x693, data = data69)
summary(data69reg)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.150e+03  1.956e+02  21.220  < 2e-16 ***
x691         7.871e-04  3.646e-04   2.159   0.0359 *
x692        -1.317e+01  2.309e+01  -0.570   0.5712
x693         6.236e+02  6.264e+01   9.954 2.94e-13 ***
Yhat = 4150 + 0.0007871X1 - 13.17X2 + 623.6X3
Interpretation
1. For each additional case shipped (x1), total labor hours increase by 0.0007871 hours, adjusted for the effects of the other predictors.
2. For each one-point increase in the indirect cost percentage (x2), total labor hours decrease by 13.17 hours, adjusted for the effects of the other predictors.
3. For each one-unit increase of holiday (x3), total labor hours increase by 623.6 hours; in other words, if the week has a holiday, total labor hours increase by 623.6 hours.
b.
The boxplot shows that the fourth quartile of the residuals (>100) is longer than the other three quartiles, meaning there are slightly more large positive residual values than negative ones.
c.
Residuals against fitted values
plot(predict(data69reg), data69reg$resi, ylab="Total labor hours residuals", xlab="Predicted total labor hours (Yhat)")
The scatterplot of the regression residuals against the predicted values clearly shows distinct clustering of the residuals for predicted total labor hours below 4800 hours versus above 4800 hours.
Residuals against number of cases shipped (x1)
plot(x691, data69reg$resi, ylab="Total labor hours residuals", xlab="Number of cases shipped (x1)")

The scatterplot shows subtle clustering between numbers of cases shipped below 350,000 and above 350,000, with more residuals falling below 350,000 cases.
Residuals against indirect cost percentage (x2)
plot(x692, data69reg$resi, ylab="Total labor hours residuals", xlab="Indirect cost (x2)")
The scatterplot shows the residuals concentrated between 6% and 9% of the indirect cost percentage, with two outliers below 5% and one above 9%. Overall, however, there is no pattern and the variance is roughly constant.
Residuals against the qualitative predictor holiday
plot(x693, data69reg$resi, ylab="Total labor hours residuals", xlab="Holiday (x3)")
The scatterplot shows that the residual variance is not constant across holiday status: weeks with a holiday (x3 = 1) show smaller variance than weeks without one.

Residuals against the interaction between number of cases shipped and indirect cost percentage
x694 <- x691*x692   # interaction term: product of x1 and x2
plot(x694, data69reg$resi, ylab="Total labor hours residuals", xlab="Cases x indirect cost (x1*x2)")
Interpretation: the scatterplot shows subtle clustering at the lower values of the interaction term compared to the higher values.

Conclusion: in summary, most of the residual plots show a clustering pattern at the lower values compared to the higher values.
d.
Time plot of residuals
This plot of the residuals against time (the sequence in which the data were collected) shows no sign that the error terms are correlated. The residuals appear to have constant variance and are widely spread.
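A minimal sketch of the time plot described here, assuming the rows are in collection order:
plot(data69reg$residuals, type = "b", xlab = "Week (time order)", ylab = "Residual")
abline(h = 0)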

6.11
A
Under the hypotheses:
Ho: β1 = β2 = β3 = 0 (none of the predictors is related to the response variable)
Ha: not all βk (k = 1, 2, 3) equal zero (at least one predictor is related to the response variable)
anova(data69reg)
Analysis of Variance Table

Response: y69
          Df  Sum Sq Mean Sq F value    Pr(>F)
x691       1  136366  136366  6.6417   0.01309 *
x692       1    5726    5726  0.2789   0.59987
x693       1 2034514 2034514 99.0905 2.941e-13 ***
Residuals 48  985530   20532

From the anova table:
SSR = 136366 + 5726 + 2034514 = 2176606, with df = 3
MSR = 2176606/3 = 725535
SSE = 985530, with df = 48
MSE = 985530/48 = 20531.9
F* = 725535/20531.9 = 35.337
> summary(data69reg)
Residual standard error: 143.3 on 48 degrees of freedom
Multiple R-squared:  0.6883,    Adjusted R-squared:  0.6689
F-statistic: 35.34 on 3 and 48 DF,  p-value: 3.316e-12

These values match the F statistic reported in the summary table in R.
Critical F value = qf(0.95, 3, 48) = 2.79806
35.34 > 2.79806
Significance level = 0.05
P-value = 1 - pf(35.34, 3, 48) = 3.3e-12 (almost 0)
3.3e-12 < 0.05
Reject Ho.
Interpretation: since the probability of observing such an F statistic when none of the predictors is related to the response is essentially 0 (below the 0.05 significance level), we reject the null hypothesis. There is significant evidence that at least one of the predictors is related to the response variable.
b.
Bonferroni method for b1 and b3.
k = 2
Significance level = 0.05
Per-interval Bonferroni level: 0.05/2 = 0.025
confint(data69reg, level=0.975)
                    1.25 %       98.75 %
(Intercept)   3.697369e+03  4.602406e+03
x691         -5.646080e-05  1.630622e-03
x692         -6.659796e+01  4.026592e+01
x693          4.786096e+02  7.684993e+02
From the summary table:
standard error of b1 = 3.646e-04
standard error of b3 = 6.264e+01
Bonferroni tail area = significance level / (two tails × k statements) = 0.05/2/2 = 0.0125
Bonferroni critical value = qt(1-0.0125, 48) = 2.313899
For b1: 7.871e-04 ± 2.313899(3.646e-04)  gives  -5.646080e-05 ≤ β1 ≤ 1.630622e-03
For b3: 6.236e+02 ± 2.313899(6.264e+01)  gives  4.786096e+02 ≤ β3 ≤ 7.684993e+02

Interpretation:
Under the Bonferroni method with 95% family confidence, each additional case shipped changes total labor hours by between -0.000058 and +0.00163 hours, and a holiday week increases total labor hours by between 478.61 and 768.50 hours.

c
From the anova table:
Sum of squares of regression (SSR) = 2176606
Sum of squares of error (SSE) = 985530
Total sum of squares = SSR + SSE = 2176606 + 985530 = 3162136

R squared = SSR/SSTO = 2176606/3162136 = 0.6883

It can also be read from the summary table in R:
Residual standard error: 143.3 on 48 degrees of freedom
Multiple R-squared:  0.6883,    Adjusted R-squared:  0.6689
F-statistic: 35.34 on 3 and 48 DF,  p-value: 3.316e-12

Interpretation:
68.8% of the total variation in total labor hours is explained/reduced by the model. The remaining 31.2% is unexplained.

6.12
Under the Bonferroni method:
Significance level = 0.05
Two-tailed: 0.05/2 = 0.025
Per-statement Bonferroni level (g = 5 intervals): 0.025/5 = 0.005
Bonferroni critical value = B = qt(0.995, 48) = 2.682204

Under the Working-Hotelling method:
W = √(p × F(1-alpha; p, n-p))
p = 4 (b0, b1, b2, b3)
F(0.95; 4, 48) = 2.565241
W = √(4 × 2.565241) = 3.203273

Since the critical value of the Bonferroni method is smaller, the Bonferroni method is used.
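The two constants compared above, as a sketch in R:
B <- qt(1 - 0.05/(2*5), 48)     # Bonferroni, g = 5 intervals: 2.682204
W <- sqrt(4 * qf(0.95, 4, 48))  # Working-Hotelling, p = 4: 3.203273
c(B = B, W = W)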

R code (Bonferroni method):
predict(data69reg, newdata=data.frame(x691=c(302000, 245000, 280000, 350000, 295000), x692=c(7.2, 7.4, 6.9, 7.0, 6.7), x693=c(0,0,0,0,1)), se.fit=T, interval="confidence", level=0.99)
$fit
       fit      lwr      upr
1 4292.790 4235.507 4350.073
2 4245.293 4165.626 4324.960
3 4279.424 4213.859 4344.989
4 4333.203 4255.609 4410.798
5 4917.418 4749.781 5085.055

$se.fit
       1        2        3        4        5
21.35667 29.70206 24.44444 28.92934 62.49980
$df
[1] 48
$residual.scale
[1] 143.2895

    x1    x2   x3   fitted value               interval
302000   7.2    0       4292.790   4235.507 ≤ E{Yh} ≤ 4350.073
245000   7.4    0       4245.293   4165.626 ≤ E{Yh} ≤ 4324.960
280000   6.9    0       4279.424   4213.859 ≤ E{Yh} ≤ 4344.989
350000   7.0    0       4333.203   4255.609 ≤ E{Yh} ≤ 4410.798
295000   6.7    1       4917.418   4749.781 ≤ E{Yh} ≤ 5085.055



Interpretation: under the Bonferroni method with 95% family confidence, the mean total labor hours for each of the five shipment profiles above lies within the corresponding interval.

b. Maximum observed values:
x1 (cases) = 472476
x2 (indirect percentage) = 9.65
x3 (holiday) = 1
With 40000 cases, an indirect percentage of 7.2, and a holiday week, the condition is still within the region of the observed data.
With 40000 cases, an indirect percentage of 9.9, and a holiday week, the condition is not within the region of the observed data, because the maximum observed indirect percentage is 9.65. This condition suggests extrapolation.
plot(x692, x691, xlab="Indirect percentage", ylab="Number of cases", xlim=c(1, 9.7))

6.13
Bonferroni significance level = 0.05/2/4 = 0.00625
Bonferroni critical value = qt(0.99375, 48) = 2.595323

Working-Hotelling:
W = 3.203273 (calculated in 6.12)

Scheffé:
S = √(g × F(1-alpha; g, n-p))
S = √(4 × 2.56524)
S = 3.20327

Since the Bonferroni critical value is the smallest, we will use the Bonferroni method.

> predict(data69reg, newdata=data.frame(x691=c(230000, 250000, 280000, 340000), x692=c(7.5, 7.3, 7.1, 6.9), x693=c(0,0,0,0)), se.fit=T, interval="prediction", level=0.9875)
$fit
       fit      lwr      upr
1 4232.171 3849.911 4614.430
2 4250.545 3871.478 4629.613
3 4276.791 3900.122 4653.460
4 4326.649 3947.913 4705.385

$se.fit
       1        2        3        4
34.08561 28.30225 23.06451 27.63605

$df
[1] 48

$residual.scale
[1] 143.2895

       fit
1 4232.171   3849.911 ≤ Yh(new) ≤ 4614.430
2 4250.545   3871.478 ≤ Yh(new) ≤ 4629.613
3 4276.791   3900.122 ≤ Yh(new) ≤ 4653.460
4 4326.649   3947.913 ≤ Yh(new) ≤ 4705.385

This problem uses individual prediction intervals because management wants predictions of the handling times for these particular shipments, not estimates of the mean handling time.


8.4 a
We center the predictor first.
> data127$centx127 <- data127$x127 - mean(data127$x127)
> data127reg2<-lm(y127~centx127+I(centx127^2), data=data127)
> summary(data127reg2)

Call:
lm(formula = y127 ~ centx127 + I(centx127^2), data = data127)

Residuals:
    Min      1Q  Median      3Q     Max
-15.086  -6.154  -1.088   6.220  20.578

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  82.935749   1.543146  53.745   <2e-16 ***
centx127     -1.183958   0.088633 -13.358   <2e-16 ***
I(centx127^2) 0.014840   0.008357   1.776   0.0811 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.026 on 57 degrees of freedom
Multiple R-squared:  0.7632,    Adjusted R-squared:  0.7549
F-statistic: 91.84 on 2 and 57 DF,  p-value: < 2.2e-16

Regression function:
Yhat = 82.9357 - 1.1840x + 0.01484x², where x is the centered age
R² = 0.7632

Interpretation
The regression function appears to fit the data reasonably well. The coefficient of the quadratic term is small (about 0.015), and its p-value for the t test (0.0811) is above 0.05, so the quadratic term is not significant. A partial F test (part e) can confirm whether this term can be dropped.

8.4 b
Under the hypotheses:
Ho: β1 = β11 = 0
Ha: not both β1 and β11 equal zero
From the summary table, F* = 91.84; this is the global F test for the overall regression.
qf(0.9, 2, 57)
[1] 2.398157
Since the observed F statistic exceeds the critical F value (91.84 > 2.398157), we reject the null hypothesis: at least one of the terms is related to the response variable.



8.4 c
Since we use centered data, x = 48 corresponds to centx = 48 - mean(x) = -11.98333333.

predict(data127reg2, newdata=data.frame(centx127=c(-11.98333333)), se.fit=T, interval="confidence", level=0.95)
$fit
       fit      lwr      upr
1 99.25461 96.28436 102.2249

$se.fit
[1] 1.483295

$df
[1] 57

$residual.scale
[1] 8.025521

Interpretation: we are 95% confident that the mean muscle mass of women aged 48 is between 96.28436 and 102.2249.

8.4 d
predict(data127reg2, newdata=data.frame(centx127=c(-11.98333333)), se.fit=T, interval="prediction", level=0.95)
$fit
       fit     lwr      upr
1 99.25461 82.9116 115.5976
 
$se.fit
[1] 1.483295
 
$df
[1] 57
 
$residual.scale
[1] 8.025521

Interpretation: we are 95% confident that the muscle mass of an individual woman aged 48 will be between 82.9116 and 115.5976.

8.4 e
Response: y127
              Df  Sum Sq Mean Sq  F value  Pr(>F)
centx127       1 11627.5 11627.5 180.5258 < 2e-16 ***
I(centx127^2)  1   203.1   203.1   3.1538 0.08109 .
Residuals     57  3671.3    64.4
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The F value in the I(centx127^2) row compares the full model with the reduced model without that term, so this output automatically gives the F statistic of the partial test.

Under the hypotheses:
Ho: β11 = 0 (the quadratic term is not related to the response variable)
Ha: β11 ≠ 0 (the quadratic term is related to the response variable)
Full model: Yhat = b0 + b1x + b11x²
Reduced model: Yhat = b0 + b1x

F* = 3.1538
Critical F value = qf(0.95, 1, 57) = 4.009868
Since F* < the critical F value, we fail to reject Ho.
Interpretation: there is no significant evidence that the quadratic term is related to the response variable; it can be dropped.
 



8.4 f
In terms of the original variable:
data127reg3<-lm(y127~x127+I(x127^2), data=data127)
> summary(data127reg3)
> summary(data127reg3)


Call:
lm(formula = y127 ~ x127 + I(x127^2), data = data127)

Residuals:
    Min      1Q  Median      3Q     Max
-15.086  -6.154  -1.088   6.220  20.578

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 207.349608  29.225118   7.095 2.21e-09 ***
x127         -2.964323   1.003031  -2.955  0.00453 **
I(x127^2)     0.014840   0.008357   1.776  0.08109 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.026 on 57 degrees of freedom
Multiple R-squared:  0.7632,    Adjusted R-squared:  0.7549
F-statistic: 91.84 on 2 and 57 DF,  p-value: < 2.2e-16

In terms of the original variable, the regression function is:
Yhat = 207.349608 - 2.964323X + 0.014840X²

Interpretation: muscle mass decreases as age increases. The expected change per additional year of age is the derivative dYhat/dX = -2.964323 + 2(0.014840)X, so the decline flattens slightly at higher ages.


8.4 g
cor(data127)
                y127        x127    centx127   centx127q       x127q
y127       1.0000000 -0.86606398 -0.86606398  0.14760734 -0.85257317
x127      -0.8660640  1.00000000  1.00000000 -0.03835694  0.99609392
centx127  -0.8660640  1.00000000  1.00000000 -0.03835694  0.99609392
centx127q  0.1476073 -0.03835694 -0.03835694  1.00000000  0.05002801
x127q     -0.8525732  0.99609392  0.99609392  0.05002801  1.00000000

Correlation between the linear and quadratic terms, uncentered = 0.99609392
Correlation between the linear and quadratic terms, centered = -0.03835694
With the same data set, simply shifting the predictor removes almost all (about 99%) of the multicollinearity between the linear and quadratic terms.
Centering the data is therefore useful when fitting a quadratic term.
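A sketch reproducing the two correlations, assuming the column names used above:
with(data127, cor(x127, x127^2))                               # ~0.996 (uncentered)
with(data127, cor(x127 - mean(x127), (x127 - mean(x127))^2))   # ~-0.038 (centered)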

8.25

> centx691 <- x691 - mean(x691)
> data69reg2<-lm(y69~centx691+I(centx691^2)+x693+centx691:x693+I(centx691^2):x693)
> summary(data69reg2)

Call:
lm(formula = y69 ~ centx691 + I(centx691^2) + x693 + centx691:x693 +
    I(centx691^2):x693)

Residuals:
    Min      1Q  Median      3Q     Max
-289.03  -95.85  -12.22   73.69  295.31

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         4.296e+03  2.735e+01 157.068  < 2e-16 ***
centx691            9.031e-04  6.398e-04   1.412    0.165
I(centx691^2)      -1.577e-09  6.328e-09  -0.249    0.804
x693                6.144e+02  9.447e+01   6.504 5.07e-08 ***
centx691:x693      -1.884e-04  1.142e-03  -0.165    0.870
I(centx691^2):x693  1.808e-09  1.309e-08   0.138    0.891
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 146.8 on 46 degrees of freedom
Multiple R-squared:  0.6867,    Adjusted R-squared:  0.6526
F-statistic: 20.16 on 5 and 46 DF,  p-value: 1.345e-10










8.25 b
Full model:
Yhat = b0 + b1x1 + b2x1² + b3x3 + b4x1x3 + b5x1²x3
anova(data69reg2)
Analysis of Variance Table

Response: y69
                    Df  Sum Sq Mean Sq F value    Pr(>F)
centx691             1  136366  136366  6.3313   0.01541 *
I(centx691^2)        1  115766  115766  5.3749   0.02493 *
x693                 1 1918614 1918614 89.0791 2.477e-12 ***
centx691:x693        1     216     216  0.0100   0.92063
I(centx691^2):x693   1     411     411  0.0191   0.89077
Residuals           46  990762   21538
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Reduced model:
Yhat = b0 + b1x1 + b3x3

> anova(data69regr)
Analysis of Variance Table

Response: y69
                                 Df          Sum Sq                Mean Sq             F value                    Pr(>F)   
centx691              1            136366                                  136366                6.7344                   0.01244 * 
x693                       1             2033565                2033565             100.4276             1.875e-13 ***
Residuals             49          992204                 20249                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> anova(data69regr,data69reg2)
Analysis of Variance Table

Model 1: y69 ~ centx691 + x693
Model 2: y69 ~ centx691 + I(centx691^2) + x693 + centx691:x693 + I(centx691^2):x693
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     49 992204
2     46 990762  3    1441.9 0.0223 0.9954

Since the p-value (0.9954) is far above any usual significance level, we fail to reject the reduced model: the quadratic and interaction terms are not needed.








8.39

datac2<-read.table("APPENC02.txt", header=F)
colnames (datac2)<-c("id","county","state","landarea","totalpop","percentpop1", "percentpop2","numphysic","numhospital","totseriuscrimes","percenthsgrad","percentbach","percentbelowpov","percentunemp","percapitaincome","totalpersonalincome","geographicregion")
datac2839<-lm(numphysic~totalpop+totalpersonalincome+as.factor(geographicregion), data=datac2)

datac2$grf<-as.factor(datac2$geographicregion)
datac2$basew<-relevel(datac2$grf, ref="4")
attach(datac2)
datac2839<-lm(numphysic~totalpop+totalpersonalincome+basew, data=datac2)

Call:
lm(formula = numphysic ~ totalpop + totalpersonalincome + basew,
    data = datac2)

Residuals:
    Min      1Q  Median      3Q     Max
-1866.8  -207.7   -81.5    72.4  3721.7

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)             -2.075e+02  7.028e+01  -2.952  0.00332 **
totalpop (X1)            5.515e-04  2.835e-04   1.945  0.05243 .
totalpersonalincome (X2) 1.070e-01  1.325e-02   8.073  6.8e-15 ***
basew1 (X3)              1.490e+02  8.683e+01   1.716  0.08685 .
basew2 (X4)              1.455e+02  8.515e+01   1.709  0.08817 .
basew3 (X5)              1.912e+02  8.003e+01   2.389  0.01731 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 566.1 on 434 degrees of freedom
Multiple R-squared:  0.9011,    Adjusted R-squared:  0.8999
F-statistic: 790.7 on 5 and 434 DF,  p-value: < 2.2e-16

Regression function:
Yhat = -207.5 + 0.0005515X1 + 0.107X2 + 149.0X3 + 145.5X4 + 191.2X5





b.
90% confidence interval for the difference between the NE region and NC region effects:
Difference of the estimates: b3 - b4 = 149.0 - 145.5 = 3.5
Standard error used here: 86.83 - 85.15 = 1.68
Critical value = t(1-0.1/2; 434) = 1.648372
Confidence interval:

3.5 ± 1.648372(1.68), i.e., 0.7307 ≤ β3 - β4 ≤ 6.2693
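Strictly, the variance of a difference of two estimated coefficients is s²{b3} + s²{b4} - 2·cov{b3, b4}; the covariance term is not shown in the output above, but a sketch of the standard computation with vcov() would be:
est <- coef(datac2839)["basew1"] - coef(datac2839)["basew2"]
v   <- vcov(datac2839)
se  <- sqrt(v["basew1","basew1"] + v["basew2","basew2"] - 2*v["basew1","basew2"])
est + c(-1, 1) * qt(0.95, 434) * se   # 90% CI for beta3 - beta4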

c.
Full model: Yhat = b0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5

Under the hypotheses:
Ho: β3 = β4 = β5 = 0
Ha: not all of β3, β4, β5 equal zero (at least one region effect is nonzero)

Reduced model: Yhat = b0 + b1x1 + b2x2

F* = [(SSE(R) - SSE(F))/(df(R) - df(F))] / [SSE(F)/df(F)]
   = ((140967081 - 139093455)/(437 - 434)) / (139093455/434) = 1.9487

> anova(datac2839r, datac2839)
Analysis of Variance Table

Model 1: numphysic ~ totalpop + totalpersonalincome
Model 2: numphysic ~ totalpop + totalpersonalincome + basew
  Res.Df       RSS Df Sum of Sq      F Pr(>F)
1    437 140967081
2    434 139093455  3   1873626 1.9487  0.121

Significance level alpha = 0.1
Since the p-value (0.121) is larger than 0.1, we fail to reject the null hypothesis that
β3 = β4 = β5 = 0: geographic region does not add significantly to the model.









9.18

> step(data910reg, direction="both")
Start:  AIC=74.95
y910 ~ x9101 + x9102 + x9103 + x9104

        Df Sum of Sq     RSS     AIC
- x9102  1     12.22  348.20  73.847
<none>                335.98  74.954
- x9104  1    260.74  596.72  87.314
- x9101  1    759.83 1095.81 102.509
- x9103  1   1064.15 1400.13 108.636

Step:  AIC=73.85
y910 ~ x9101 + x9103 + x9104

        Df Sum of Sq     RSS     AIC
<none>                348.20  73.847
+ x9102  1     12.22  335.98  74.954
- x9104  1    258.46  606.66  85.727
- x9101  1    763.12 1111.31 100.861
- x9103  1   1324.39 1672.59 111.081

Call:
lm(formula = y910 ~ x9101 + x9103 + x9104)

Coefficients:
(Intercept)        x9101        x9103        x9104 
  -124.2000       0.2963       1.3570       0.5174 

Final model: Yhat = b0 + b1x1 + b3x3 + b4x4







9.18 b
jobprof.lmf<-lm(score~x1+x3+x4, data=jobprof)
> summary(jobprof.lmf)
 
Call:
lm(formula = score ~ x1 + x3 + x4, data = jobprof)
 
Residuals:
    Min      1Q  Median      3Q     Max 
-5.4579 -3.1563 -0.2057  1.8070  6.6083 
 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -124.20002    9.87406 -12.578 3.04e-11 ***
x1             0.29633    0.04368   6.784 1.04e-06 ***
x3             1.35697    0.15183   8.937 1.33e-08 ***
x4             0.51742    0.13105   3.948 0.000735 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
Residual standard error: 4.072 on 21 degrees of freedom
Multiple R-squared:  0.9615,  Adjusted R-squared:  0.956 
F-statistic:   175 on 3 and 21 DF,  p-value: 5.16e-15

AIC for the final (reduced) model = 73.85 (from the step output above)

BIC for the reduced model:
BIC(jobprof.lmf)
[1] 152.8886

PRESS for the reduced model:
> sum((jobprof.lmf$residuals/(1-hatvalues(jobprof.lmf)))^2)
[1] 471.452

Mallows Cp for the reduced model (using the MSE of the full model, 335.98/20 ≈ 16.8):
348.2/16.8-(length(jobprof.lmf$fit)-2*4)
[1] 3.72619


Mallows Cp for the full model (Cp = p for the full model):
336.0/16.8-(length(jobprof.lm1$fit)-2*5)
[1] 5

Full model:
Multiple R-squared:  0.9629,  Adjusted R-squared:  0.9555

BIC for the full model:
BIC(jobprof.lm1)
[1] 155.2144

PRESS for the full model:
sum((jobprof.lm1$residuals/(1-hatvalues(jobprof.lm1)))^2)
[1] 518.9885






Criterion        Full model   Reduced model   Comment
R squared        0.9629       0.9615          Full model slightly higher
Adj. R squared   0.9555       0.956           Both models similar
Mallows Cp       5            3.726           Reduced model lower
BIC              155.2144     152.8886        Reduced model lower
AIC              74.95        73.85           Reduced model lower
PRESS            518.9885     471.452         Reduced model lower

On nearly every criterion the reduced model (x1, x3, x4) is preferred.

10.17 a
The added-variable plot of the residuals of age against the residuals of satisfaction shows a clear negative relationship: this predictor has a strong relationship with the response variable and should be included in the model.
The residuals of severity show only a slight relationship with the response variable, but a relationship exists; this variable should be included in the model.
Similar to age, the residuals of anxiety have a negative relationship with the residuals of satisfaction; this variable should be included in the model.
10.17 b
All of the added-variable plots show a relationship between the predictor residuals and the response, so the regression relationships in the fitted function are appropriate.
10.11 a
Studentized deleted residuals
Bonferroni procedure with family significance level = 0.1
Two-tailed: 0.1/2 = 0.05
Number of observations = 46
Per-case Bonferroni significance level = 0.05/46 = 0.001086957
Critical value for the studentized deleted residuals = qt(1-0.001086957, 46) = 3.27151

Under the hypotheses:
Ho: the case is not an outlier.
Ha: the case is an outlier.

Based on the studentized deleted residual table, no observation exceeds 3.27151 in absolute value. We fail to reject the null hypothesis for every case; hence there are no outliers in this dataset.
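A sketch of this test, assuming the fitted patient-satisfaction model is stored in an object called data10reg (the name is hypothetical):
tres   <- rstudent(data10reg)        # studentized deleted residuals
cutoff <- qt(1 - 0.001086957, 46)    # Bonferroni critical value used above
any(abs(tres) > cutoff)              # FALSE, so no case is flagged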
 
10.11 b
Hat matrix
Leverage (hat) values of the observations

Twice the mean leverage = 2p/n = 2(4)/46 = 0.173913

Three observations exceed this value: cases 9, 28, and 30.
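A sketch of flagging the high-leverage cases (same hypothetical model object name):
h <- hatvalues(data10reg)
which(h > 2*4/46)   # cases with leverage above 2p/n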

10.11 c





10.11 d

Case    DFFITS   DFBETAS b0   DFBETAS b1   DFBETAS b2   DFBETAS b3   Cook's D
11      0.5688       0.0991      -0.3631      -0.1900       0.3900     0.0766
17      0.6657      -0.4491       0.4432       0.4432       0.0893     0.1051
27     -0.6087      -0.0172      -0.2499      -0.2499       0.1614     0.0867
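The values in the table can be reproduced with the built-in influence functions (model object name assumed as before):
dffits(data10reg)[c(11, 17, 27)]
dfbetas(data10reg)[c(11, 17, 27), ]
cooks.distance(data10reg)[c(11, 17, 27)]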





10.11 f
A case is influential if its Cook's distance exceeds the 50th percentile of the F distribution with df1 = 4 and df2 = 42.
Under the hypotheses:
Ho: the case is not influential
Ha: the case is influential

The critical value is qf(0.50, 4, 42) = 0.8528731.
No observation has a Cook's distance above 0.8528731; therefore, according to Cook's distance, no case in the dataset is influential.

10.17
a. Correlation matrix
Based on the scatterplot matrix and the correlation matrix, there is some multicollinearity among the predictors, but it is not strong.
10.17 b

The VIFs of the three predictors are all below 2.1, which is considerably low; no individual predictor shows multicollinearity strong enough to cause problems. The mean VIF is about 1.9, so multicollinearity should not cause problems for the regression equation.
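A sketch of the VIF computation; car::vif is one option (the car package and the model object name are assumptions):
library(car)
vif(data10reg)   # one VIF per predictor; all below ~2.1 here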
11.7
a
Regression function: Yhat = -5.750 + 0.18750X
Interpretation: based on this model, the number of defective items increases by 0.1875 for each one-unit increase in speed.
11.7 c
The plot of the squared residuals against speed shows that the residual variances are not constant. The data set needs a remedial measure (weighted least squares) to make the estimates for defective items more precise.
11.7 d
Variance function
The estimated variance function, obtained by regressing the squared residuals on speed, is: estimated variance = -180.083 + 1.2437(speed). It exists only to construct the weights for weighted least squares (w = 1/fitted variance), so its coefficients are not interpreted directly.

11.7 e
Regression function from weighted least squares: estimated defective = -6.2322 + 0.18911(speed)
Unweighted regression function: Yhat = -5.750 + 0.18750(speed)
The two regression functions are similar.
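A sketch of the WLS steps described in parts c-e, with hypothetical object and column names:
fit1 <- lm(defective ~ speed, data = d117)   # unweighted fit
vfit <- lm(fit1$residuals^2 ~ d117$speed)    # variance function from the squared residuals
w    <- 1 / vfit$fitted.values               # weights = 1 / estimated variance
fitw <- lm(defective ~ speed, data = d117, weights = w)
summary(fitw)$coefficients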

11.7 f
Comparing the standard errors of the weighted and unweighted fits, the weighted least squares standard errors are smaller. Weighting changes the fitted equation only slightly but reduces the standard errors of both coefficients.
11.7 g
Second weighted regression equation: estimated defective = -6.233 + 0.18911(speed)
The equation does not change appreciably, so the first set of weights adequately represents the error variances.

12.6
Under the hypotheses:
Ho: there is no autocorrelation among the error terms (ρ = 0)
Ha: there is positive autocorrelation among the error terms (ρ > 0)
From the Durbin-Watson table, we reject the null hypothesis if the DW value is less than 1.29 and fail to reject it if the DW value is greater than 1.38; values in between are inconclusive.
Our DW value is greater than 1.38, so we fail to reject Ho: there is no significant evidence of autocorrelation among the error terms.
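The DW statistic itself can be computed directly from the residuals (model object name hypothetical):
e <- resid(data126reg)
sum(diff(e)^2) / sum(e^2)   # Durbin-Watson statistic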



The plot of the residuals against their index (the order in which the data were collected) shows that the residuals are scattered with no visible pattern.