Read the data set
ex1 <- read.table("lecture2ex1.csv", sep=",", header=TRUE)
Find the descriptive statistics of the variables in the data set
library(psych)
## Warning: package 'psych' was built under R version 4.2.3
describe(ex1)
## vars n mean sd median trimmed mad min max range skew
## homepress 1 100 50.25 10.76 51.00 50.76 11.12 18.00 76.00 58.00 -0.44
## workpress 2 100 51.03 10.63 51.00 51.34 9.64 22.00 73.00 51.00 -0.34
## stress 3 100 52.10 11.33 53.92 52.12 12.91 22.81 78.09 55.27 -0.07
## kurtosis se
## homepress 0.15 1.08
## workpress 0.02 1.06
## stress -0.60 1.13
Test whether each correlation in a correlation matrix is significantly different from 0.
corr.test(ex1)
## Call:corr.test(x = ex1)
## Correlation matrix
## homepress workpress stress
## homepress 1.00 0.55 0.57
## workpress 0.55 1.00 0.48
## stress 0.57 0.48 1.00
## Sample Size
## [1] 100
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## homepress workpress stress
## homepress 0 0 0
## workpress 0 0 0
## stress 0 0 0
##
## To see confidence intervals of the correlations, print with the short=FALSE option
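For a single pair of variables, the same test is available in base R through cor.test(). The sketch below uses simulated numbers as a stand-in for the lecture2ex1.csv data, which is not reproduced here:

```r
# Base-R analogue of one entry in corr.test(): test H0: rho = 0 for one pair.
# Simulated data stand in for lecture2ex1.csv.
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)
ct <- cor.test(x, y)   # Pearson correlation test
ct$estimate            # sample correlation r
ct$p.value             # two-sided p value
```

corr.test() from psych does the same test for every pair at once and additionally reports multiplicity-adjusted p values above the diagonal.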
Run multiple regression where stress is the outcome and homepress and workpress are predictors. Save the output of the multiple regression as out1 and use summary to see the output.
out1 <- lm(stress ~ homepress + workpress, data=ex1)
summary(out1)
##
## Call:
## lm(formula = stress ~ homepress + workpress, data = ex1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.2903 -5.4976 0.2526 5.7299 23.3002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.0114 5.0209 3.189 0.00192 **
## homepress 0.4588 0.1026 4.473 2.09e-05 ***
## workpress 0.2553 0.1038 2.461 0.01564 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.142 on 97 degrees of freedom
## Multiple R-squared: 0.3626, Adjusted R-squared: 0.3495
## F-statistic: 27.6 on 2 and 97 DF, p-value: 3.254e-10
Find the standardized regression coefficients and their significance tests (\(\ne\) 0) with the betaDelta package.
library(betaDelta)
## Warning: package 'betaDelta' was built under R version 4.2.3
BetaDelta(out1)
## Call:
## BetaDelta(object = out1)
##
## Standardized regression slopes with MVN standard errors:
## est se t df p 0.05% 0.5% 2.5% 97.5% 99.5%
## homepress 0.4355 0.0898 4.8509 97 0.0000 0.1308 0.1996 0.2573 0.6137 0.6714
## workpress 0.2396 0.0944 2.5388 97 0.0127 -0.0807 -0.0084 0.0523 0.4268 0.4875
## 99.95%
## homepress 0.7402
## workpress 0.5598
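What BetaDelta() reports can be checked by hand: a standardized slope is the raw slope times sd(predictor)/sd(outcome), which is also what lm() returns when every variable is z-scored first. A sketch with simulated data (the course data file is not included here):

```r
# Simulated stand-in data
set.seed(2)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 0.4 * d$x1 + 0.3 * d$x2 + rnorm(100)
fit <- lm(y ~ x1 + x2, data = d)
# Rescale raw slopes into standardized slopes
bstd <- coef(fit)[-1] * sapply(d[c("x1", "x2")], sd) / sd(d$y)
# Identical to the slopes from a fit on standardized variables
fitz <- lm(y ~ x1 + x2, data = data.frame(scale(d)))
all.equal(unname(bstd), unname(coef(fitz)[-1]))  # TRUE
```

The point estimates match either way; what betaDelta adds is standard errors and confidence intervals that are valid on the standardized scale.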
Find the variance partitioning with the anova function. Be careful: the resulting sums of squares are Type 1 (sequential), so each term controls only for the variables entered on previous lines.
anova(out1)
## Analysis of Variance Table
##
## Response: stress
## Df Sum Sq Mean Sq F value Pr(>F)
## homepress 1 4106.5 4106.5 49.1369 3.187e-10 ***
## workpress 1 506.0 506.0 6.0547 0.01564 *
## Residuals 97 8106.6 83.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
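The order dependence of Type 1 sums of squares is easy to demonstrate. In this sketch with simulated correlated predictors (assumed data, not the course file), the sequential sum of squares attributed to the same predictor changes with its position in the formula:

```r
set.seed(3)
x1 <- rnorm(100)
x2 <- 0.6 * x1 + rnorm(100)   # x2 correlated with x1
y  <- x1 + x2 + rnorm(100)
# x1 entered first vs. entered after x2
ss_first  <- anova(lm(y ~ x1 + x2))["x1", "Sum Sq"]
ss_second <- anova(lm(y ~ x2 + x1))["x1", "Sum Sq"]
c(ss_first, ss_second)        # different values
```

Only when predictors are exactly uncorrelated do the two orders give the same partitioning.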
Read the data set
ex2 <- read.table("lecture2ex2.csv", sep=",", header=TRUE)
Find the descriptive statistics of the variables in the data set
describe(ex2)
## vars n mean sd median trimmed mad min max range skew kurtosis
## face 1 50 51.64 15.26 51.50 51.02 15.57 21.00 89.00 68.00 0.27 -0.51
## shape 2 50 50.96 16.75 49.00 50.75 15.57 13.00 99.00 86.00 0.25 0.09
## habit 3 50 48.14 15.27 47.00 47.62 14.08 16.00 90.00 74.00 0.31 0.06
## close 4 50 50.40 16.40 49.50 50.15 17.79 19.00 88.00 69.00 0.14 -0.67
## cute 5 50 52.85 18.50 54.54 53.57 19.76 6.34 88.84 82.51 -0.37 -0.33
## se
## face 2.16
## shape 2.37
## habit 2.16
## close 2.32
## cute 2.62
Test whether each correlation in a correlation matrix is significantly different from 0.
corr.test(ex2)
## Call:corr.test(x = ex2)
## Correlation matrix
## face shape habit close cute
## face 1.00 0.45 -0.02 0.33 0.65
## shape 0.45 1.00 -0.19 0.13 0.31
## habit -0.02 -0.19 1.00 0.54 0.12
## close 0.33 0.13 0.54 1.00 0.23
## cute 0.65 0.31 0.12 0.23 1.00
## Sample Size
## [1] 50
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## face shape habit close cute
## face 0.00 0.01 1.00 0.13 0.00
## shape 0.00 0.00 0.73 1.00 0.18
## habit 0.90 0.18 0.00 0.00 1.00
## close 0.02 0.37 0.00 0.00 0.54
## cute 0.00 0.03 0.42 0.11 0.00
##
## To see confidence intervals of the correlations, print with the short=FALSE option
Run multiple regression, save the output, and run the summary function.
out2 <- lm(cute ~ face + shape + habit + close, data=ex2)
summary(out2)
##
## Call:
## lm(formula = cute ~ face + shape + habit + close, data = ex2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.372 -11.143 1.349 8.074 30.234
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.36760 10.90784 0.217 0.829
## face 0.80841 0.15839 5.104 6.5e-06 ***
## shape 0.06468 0.13999 0.462 0.646
## habit 0.23651 0.16807 1.407 0.166
## close -0.11788 0.16246 -0.726 0.472
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.27 on 45 degrees of freedom
## Multiple R-squared: 0.4534, Adjusted R-squared: 0.4048
## F-statistic: 9.332 on 4 and 45 DF, p-value: 1.402e-05
Find the standardized regression coefficients and their significance tests (\(\ne\) 0) with the betaDelta package.
BetaDelta(out2)
## Call:
## BetaDelta(object = out2)
##
## Standardized regression slopes with MVN standard errors:
## est se t df p 0.05% 0.5% 2.5% 97.5% 99.5%
## face 0.6672 0.1036 6.4408 45 0.0000 0.3025 0.3886 0.4586 0.8758 0.9458
## shape 0.0586 0.1202 0.4872 45 0.6285 -0.3646 -0.2647 -0.1835 0.3007 0.3819
## habit 0.1952 0.1313 1.4868 45 0.1440 -0.2670 -0.1579 -0.0692 0.4597 0.5484
## close -0.1045 0.1365 -0.7655 45 0.4480 -0.5851 -0.4717 -0.3795 0.1705 0.2627
## 99.95%
## face 1.0319
## shape 0.4817
## habit 0.6575
## close 0.3761
Find the variance partitioning with the anova function. Be careful: the resulting sums of squares are Type 1 (sequential), so each term controls only for the variables entered on previous lines.
anova(out2)
## Analysis of Variance Table
##
## Response: cute
## Df Sum Sq Mean Sq F value Pr(>F)
## face 1 7190.4 7190.4 35.3173 3.796e-07 ***
## shape 1 3.6 3.6 0.0177 0.8947
## habit 1 298.8 298.8 1.4678 0.2320
## close 1 107.2 107.2 0.5264 0.4719
## Residuals 45 9161.8 203.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Center the predictors as follows: face is centered at 50, shape is centered at its mean, habit is centered at 50, and close is centered at 50. I() is used to run a mathematical operation on predictors before putting them in the multiple regression.
out2a <- lm(cute ~ I(face - 50) + I(shape - mean(shape)) + I(habit - 50) + I(close - 50), data=ex2)
summary(out2a)
##
## Call:
## lm(formula = cute ~ I(face - 50) + I(shape - mean(shape)) + I(habit -
## 50) + I(close - 50), data = ex2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.372 -11.143 1.349 8.074 30.234
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.01609 2.05732 25.283 < 2e-16 ***
## I(face - 50) 0.80841 0.15839 5.104 6.5e-06 ***
## I(shape - mean(shape)) 0.06468 0.13999 0.462 0.646
## I(habit - 50) 0.23651 0.16807 1.407 0.166
## I(close - 50) -0.11788 0.16246 -0.726 0.472
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.27 on 45 degrees of freedom
## Multiple R-squared: 0.4534, Adjusted R-squared: 0.4048
## F-statistic: 9.332 on 4 and 45 DF, p-value: 1.402e-05
The intercept changes but the slopes do not. The standardized coefficients and the anova table are also unchanged.
BetaDelta(out2a)
## Call:
## BetaDelta(object = out2a)
##
## Standardized regression slopes with MVN standard errors:
## est se t df p 0.05% 0.5% 2.5%
## I(face - 50) 0.6672 0.1036 6.4408 45 0.0000 0.3025 0.3886 0.4586
## I(shape - mean(shape)) 0.0586 0.1202 0.4872 45 0.6285 -0.3646 -0.2647 -0.1835
## I(habit - 50) 0.1952 0.1313 1.4868 45 0.1440 -0.2670 -0.1579 -0.0692
## I(close - 50) -0.1045 0.1365 -0.7655 45 0.4480 -0.5851 -0.4717 -0.3795
## 97.5% 99.5% 99.95%
## I(face - 50) 0.8758 0.9458 1.0319
## I(shape - mean(shape)) 0.3007 0.3819 0.4817
## I(habit - 50) 0.4597 0.5484 0.6575
## I(close - 50) 0.1705 0.2627 0.3761
anova(out2a)
## Analysis of Variance Table
##
## Response: cute
## Df Sum Sq Mean Sq F value Pr(>F)
## I(face - 50) 1 7190.4 7190.4 35.3173 3.796e-07 ***
## I(shape - mean(shape)) 1 3.6 3.6 0.0177 0.8947
## I(habit - 50) 1 298.8 298.8 1.4678 0.2320
## I(close - 50) 1 107.2 107.2 0.5264 0.4719
## Residuals 45 9161.8 203.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
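This invariance can be verified directly in base R. A minimal sketch with simulated data standing in for lecture2ex2.csv:

```r
set.seed(4)
d <- data.frame(x = rnorm(50, mean = 50, sd = 10))
d$y <- 2 + 0.5 * d$x + rnorm(50)
raw      <- lm(y ~ x, data = d)
centered <- lm(y ~ I(x - 50), data = d)
# The slope is identical; only the intercept shifts by 50 * slope
all.equal(unname(coef(raw)[2]), unname(coef(centered)[2]))   # TRUE
all.equal(unname(coef(centered)[1]),
          unname(coef(raw)[1] + 50 * coef(raw)[2]))          # TRUE
```

The intercept of the centered model is the predicted outcome at x = 50, which is usually more interpretable than the prediction at x = 0.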
Read and view the data set.
ex3 <- read.table("lecture2ex3.csv", sep=",", header=TRUE)
head(ex3, 15)
## id chilimg region sex age
## 1 1 4.08 1 1 23
## 2 2 11.15 3 2 43
## 3 3 7.75 2 2 36
## 4 4 0.00 1 1 39
## 5 5 20.09 4 2 24
## 6 6 6.78 2 2 30
## 7 7 2.36 2 1 59
## 8 8 6.18 2 1 45
## 9 9 1.81 2 2 43
## 10 10 0.00 1 1 52
## 11 11 7.47 3 1 21
## 12 12 0.05 2 2 56
## 13 13 1.24 2 1 18
## 14 14 0.00 2 2 50
## 15 15 6.89 1 2 20
Change sex to a factor variable where the first group is the reference group, which is female here. When a variable is in the factor format, R knows that each number represents a different category (not viewed as low or high as in a standard numeric variable).
ex3[,"sex"] <- factor(ex3[,"sex"], labels=c("female", "male"))
Change region to a factor variable where the first group is the reference group, which is north here.
ex3[,"region"] <- factor(ex3[,"region"], labels=c("north", "center", "northeast", "south"))
Change the reference group to south with the relevel function and view the data set again.
ex3[,"region"] <- relevel(ex3[,"region"], ref="south")
head(ex3, 15)
## id chilimg region sex age
## 1 1 4.08 north female 23
## 2 2 11.15 northeast male 43
## 3 3 7.75 center male 36
## 4 4 0.00 north female 39
## 5 5 20.09 south male 24
## 6 6 6.78 center male 30
## 7 7 2.36 center female 59
## 8 8 6.18 center female 45
## 9 9 1.81 center male 43
## 10 10 0.00 north female 52
## 11 11 7.47 northeast female 21
## 12 12 0.05 center male 56
## 13 13 1.24 center female 18
## 14 14 0.00 center male 50
## 15 15 6.89 north male 20
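The same two steps on a toy vector make the mechanics visible (hypothetical codes, not the course data):

```r
# factor(): sorted unique codes 1-4 are mapped to the labels in order
reg <- factor(c(1, 3, 2, 1, 4), labels = c("north", "center", "northeast", "south"))
levels(reg)   # "north" "center" "northeast" "south"; the first level is the reference
# relevel(): move "south" to the front so it becomes the reference group
reg <- relevel(reg, ref = "south")
levels(reg)   # "south" "north" "center" "northeast"
```

Only the level ordering changes; the data values themselves are untouched.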
Find the descriptive statistics of the variables in the data set
describe(ex3)
Find the frequency tables of both categorical variables.
with(ex3, table(sex))
## sex
## female male
## 200 200
with(ex3, table(region))
## region
## south north center northeast
## 100 100 100 100
Test the correlations, excluding the categorical variables.
corr.test(ex3[,c("chilimg", "age")])
## Call:corr.test(x = ex3[, c("chilimg", "age")])
## Correlation matrix
## chilimg age
## chilimg 1.00 -0.09
## age -0.09 1.00
## Sample Size
## [1] 400
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## chilimg age
## chilimg 0.00 0.08
## age 0.08 0.00
##
## To see confidence intervals of the correlations, print with the short=FALSE option
Run multiple regression where age is centered at 40. sex and region will be transformed to dummy variables automatically because both variables were set as factors previously.
out3 <- lm(chilimg ~ I(age - 40) + sex + region, data=ex3)
summary(out3)
##
## Call:
## lm(formula = chilimg ~ I(age - 40) + sex + region, data = ex3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2202 -2.5846 -0.0098 2.0427 10.2763
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.67700 0.36355 37.621 <2e-16 ***
## I(age - 40) -0.03047 0.01239 -2.459 0.0144 *
## sexmale 0.49914 0.32511 1.535 0.1255
## regionnorth -10.89126 0.45981 -23.686 <2e-16 ***
## regioncenter -9.89767 0.46027 -21.504 <2e-16 ***
## regionnortheast -6.13874 0.45978 -13.351 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.251 on 394 degrees of freedom
## Multiple R-squared: 0.6402, Adjusted R-squared: 0.6356
## F-statistic: 140.2 on 5 and 394 DF, p-value: < 2.2e-16
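To see exactly which dummy variables lm() builds from a factor, model.matrix() can be inspected. A toy sketch with hypothetical values:

```r
d  <- data.frame(region = factor(c("south", "north", "center", "south")))
mm <- model.matrix(~ region, data = d)
colnames(mm)  # "(Intercept)" "regionnorth" "regionsouth"
# "center" is first alphabetically, so it is the reference level and gets no
# column; relevel() would change which level is dropped.
```

Each dummy column is 1 for rows in that level and 0 otherwise, so each slope compares that level against the reference.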
Find the variance partitioning with the anova function. Be careful: the resulting sums of squares are Type 1 (sequential), so each term controls only for the variables entered on previous lines.
anova(out3)
## Analysis of Variance Table
##
## Response: chilimg
## Df Sum Sq Mean Sq F value Pr(>F)
## I(age - 40) 1 86.2 86.22 8.1574 0.004516 **
## sex 1 25.0 25.01 2.3661 0.124802
## region 3 7297.6 2432.52 230.1526 < 2.2e-16 ***
## Residuals 394 4164.2 10.57
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To test each set of predictors, multiple regression results with and without the set of predictors can be compared by the anova function. First, check whether region is significant.
out3noregion <- lm(chilimg ~ I(age - 40) + sex, data=ex3)
summary(out3noregion)
##
## Call:
## lm(formula = chilimg ~ I(age - 40) + sex, data = ex3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2233 -4.3438 -0.8815 3.7121 16.9294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.94329 0.38001 18.271 <2e-16 ***
## I(age - 40) -0.03545 0.02044 -1.735 0.0836 .
## sexmale 0.50009 0.53733 0.931 0.3526
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.373 on 397 degrees of freedom
## Multiple R-squared: 0.009611, Adjusted R-squared: 0.004621
## F-statistic: 1.926 on 2 and 397 DF, p-value: 0.1471
anova(out3noregion, out3)
## Analysis of Variance Table
##
## Model 1: chilimg ~ I(age - 40) + sex
## Model 2: chilimg ~ I(age - 40) + sex + region
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 397 11461.8
## 2 394 4164.2 3 7297.6 230.15 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The anova function is necessary to test region because region consists of several dummy variables. sex and age can be compared by anova too, but their p values are the same as the t tests in the summary output of the multiple regression with all predictors.
out3nosex <- lm(chilimg ~ I(age - 40) + region, data=ex3)
anova(out3nosex, out3)
## Analysis of Variance Table
##
## Model 1: chilimg ~ I(age - 40) + region
## Model 2: chilimg ~ I(age - 40) + sex + region
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 395 4189.2
## 2 394 4164.2 1 24.913 2.3571 0.1255
out3noage <- lm(chilimg ~ sex + region, data=ex3)
anova(out3noage, out3)
## Analysis of Variance Table
##
## Model 1: chilimg ~ sex + region
## Model 2: chilimg ~ I(age - 40) + sex + region
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 395 4228.2
## 2 394 4164.2 1 63.907 6.0465 0.01436 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
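The equivalence between a one-df model comparison and the t test can be checked directly: the comparison F statistic is exactly the squared t statistic of the dropped predictor. A sketch with simulated data:

```r
set.seed(5)
d <- data.frame(x = rnorm(100), z = rnorm(100))
d$y <- 0.3 * d$x + 0.2 * d$z + rnorm(100)
full    <- lm(y ~ x + z, data = d)
reduced <- lm(y ~ z, data = d)
Fval <- anova(reduced, full)$F[2]                   # model-comparison F
tval <- summary(full)$coefficients["x", "t value"]  # t for the dropped term
all.equal(Fval, tval^2)                             # TRUE
```

This is why the p values agree for single-df predictors; for region, the comparison drops three dummy variables at once, so no single t test corresponds to it.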
Even though it is possible to find standardized regression coefficients of dummy variables, they cannot be used directly to compare whether age, sex, or region has the most predictive power on the dependent variable, because region has three dummy variables whereas the other variables each have only one predictor.
BetaDelta(out3)
## Call:
## BetaDelta(object = out3)
##
## Standardized regression slopes with MVN standard errors:
## est se t df p 0.05% 0.5% 2.5%
## I(age - 40) -0.0745 0.0301 -2.4725 394 0.0138 -0.1744 -0.1525 -0.1337
## sexmale 0.0464 0.0300 1.5457 394 0.1230 -0.0531 -0.0313 -0.0126
## regionnorth -0.8768 0.0389 -22.5656 394 0.0000 -1.0056 -0.9773 -0.9532
## regioncenter -0.7968 0.0401 -19.8661 394 0.0000 -0.9298 -0.9006 -0.8756
## regionnortheast -0.4942 0.0390 -12.6839 394 0.0000 -0.6234 -0.5950 -0.5708
## 97.5% 99.5% 99.95%
## I(age - 40) -0.0153 0.0035 0.0254
## sexmale 0.1054 0.1241 0.1459
## regionnorth -0.8004 -0.7762 -0.7480
## regioncenter -0.7179 -0.6930 -0.6638
## regionnortheast -0.4176 -0.3933 -0.3650
To compare all predictors, \(\Delta R^2\) shows how much unique variance of the dependent variable each variable (or set of dummy variables) explains. summary(lmoutput)$r.squared is used to extract \(R^2\): applying the summary method to an lm output returns a list, one element of which is called r.squared, and $ extracts that element.
summary(out3)$r.squared - summary(out3noage)$r.squared
## [1] 0.005522048
summary(out3)$r.squared - summary(out3nosex)$r.squared
## [1] 0.002152659
summary(out3)$r.squared - summary(out3noregion)$r.squared
## [1] 0.6305658
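These \(\Delta R^2\) values line up with the model-comparison sums of squares: \(\Delta R^2\) equals the comparison's Sum of Sq divided by the total sum of squares of the outcome. A sketch with simulated data:

```r
set.seed(6)
d <- data.frame(x = rnorm(100), z = rnorm(100))
d$y <- 0.3 * d$x + 0.2 * d$z + rnorm(100)
full <- lm(y ~ x + z, data = d)
noX  <- lm(y ~ z, data = d)
deltaR2  <- summary(full)$r.squared - summary(noX)$r.squared
sschange <- anova(noX, full)[2, "Sum of Sq"]
sstotal  <- sum((d$y - mean(d$y))^2)
all.equal(deltaR2, sschange / sstotal)   # TRUE
```

So \(\Delta R^2\) rescales each unique sum of squares onto a common 0-1 metric, which is what makes predictors with different numbers of dummy variables comparable.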