Notes & Code

2024-07-09 11:27 | Source: compiled from the web


The multiple linear regression model and its parameter estimation

Steps for building a multiple linear regression model

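1. Identify the dependent variable y of interest and the k independent variables that affect it.
2. Assume that y is linearly related to the k independent variables and set up the linear model.
3. Estimate and test the model.
4. Check whether the model suffers from multicollinearity and, if it does, deal with it.
5. Use the regression equation for prediction.
6. Run diagnostics on the regression model.

Regression model and regression equation

Least-squares estimation of the parameters

Example: using survey data from 25 restaurants, build a multiple linear regression model and interpret each regression coefficient.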

# Fit the regression model
> model1 <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = example10_1)
> summary(model1)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5, data = example10_1)

Residuals:
     Min       1Q   Median       3Q      Max
-16.7204  -6.0600   0.7152   3.2144  21.4805

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.2604768 10.4679833   0.407  0.68856
x1           0.1273254  0.0959790   1.327  0.20037
x2           0.1605660  0.0556834   2.884  0.00952 **
x3           0.0007636  0.0013556   0.563  0.57982
x4          -0.3331990  0.3986248  -0.836  0.41362
x5          -0.5746462  0.3087506  -1.861  0.07826 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.65 on 19 degrees of freedom
Multiple R-squared:  0.8518, Adjusted R-squared:  0.8128
F-statistic: 21.84 on 5 and 19 DF,  p-value: 2.835e-07

Estimated multiple linear regression equation: ŷ = 4.2604768 + 0.1273254 x1 + 0.1605660 x2 + 0.0007636 x3 - 0.3331990 x4 - 0.5746462 x5
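The coefficients reported by lm() are least-squares estimates, i.e. the solution of the normal equations (X'X)b = X'y. The sketch below illustrates this on simulated data; the data, the variable names and the true coefficients are assumptions made for illustration and are not the restaurant survey.

```r
# Simulated data (hypothetical; NOT the example10_1 survey data)
set.seed(1)
n  <- 25
x1 <- rnorm(n, mean = 50, sd = 10)
x2 <- rnorm(n, mean = 100, sd = 20)
y  <- 5 + 0.2 * x1 + 0.15 * x2 + rnorm(n, sd = 3)

# Design matrix with a leading column of ones for the intercept
X <- cbind(Intercept = 1, x1 = x1, x2 = x2)

# Least-squares estimate: solve the normal equations (X'X) b = X'y
b_manual <- solve(crossprod(X), crossprod(X, y))

# The same fit through lm()
fit  <- lm(y ~ x1 + x2)
b_lm <- coef(fit)

# The two columns should agree up to rounding error
print(cbind(manual = drop(b_manual), lm = b_lm))
```

lm() itself uses a QR decomposition rather than forming X'X, which is numerically more stable; the normal equations are shown here only because they make the estimator explicit.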

# Compute confidence intervals for the regression coefficients
> confint(model1, level = 0.95)
                    2.5 %       97.5 %
(Intercept) -17.649264072 26.170217667
x1           -0.073561002  0.328211809
x2            0.044019355  0.277112598
x3           -0.002073719  0.003600932
x4           -1.167530271  0.501132297
x5           -1.220868586  0.071576251

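Interpretation for x1: holding x2, x3, x4 and x5 constant, for each change of 10,000 people in the surrounding resident population, the 95% confidence interval for the change in average daily revenue is -0.073561002 to 0.328211809. The interval contains 0, which agrees with the non-significant t test for x1 (p = 0.20037).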
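Each interval is the estimate plus or minus a t critical value times the standard error. For x1, with 19 residual degrees of freedom the critical value is about 2.093, so 0.1273254 ± 2.093 × 0.0959790 reproduces the interval (-0.0736, 0.3282). A minimal sketch of the same calculation on simulated data (the data and names are hypothetical):

```r
# Simulated data (hypothetical; not the survey data)
set.seed(2)
n  <- 25
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

est   <- coef(fit)                           # point estimates
se    <- sqrt(diag(vcov(fit)))               # standard errors
tcrit <- qt(0.975, df = df.residual(fit))    # two-sided 95% critical value

# estimate +/- t * SE
ci_manual <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)
ci_r      <- confint(fit, level = 0.95)

print(ci_manual)
print(ci_r)
```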

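# Print the analysis of variance table
> anova(model1)
Analysis of Variance Table

Response: y
          Df  Sum Sq Mean Sq F value    Pr(>F)
x1         1 10508.9 10508.9 92.7389 9.625e-09 ***
x2         1  1347.1  1347.1 11.8878  0.002696 **
x3         1    85.4    85.4  0.7539  0.396074
x4         1    40.5    40.5  0.3573  0.557082
x5         1   392.5   392.5  3.4641  0.078262 .
Residuals 19  2153.0   113.3
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Goodness of fit and significance tests

Goodness of fit of the model
Multiple coefficient of determination: R^2 = SSR/SST
Standard error of the estimate: Se

Significance tests of the model
Test of the linear relationship: a test of the overall significance of the model
Tests of the regression coefficients: a separate t test for each coefficient

Example: test the linear relationship and the regression coefficients of the example model for significance (α = 0.05).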
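The goodness-of-fit quantities and the overall F statistic can be recovered by hand from the ANOVA table, with n = 25 observations and k = 5 independent variables. The worked calculation below uses only the sums of squares printed by anova(model1):

```latex
\begin{aligned}
SSR &= 10508.9 + 1347.1 + 85.4 + 40.5 + 392.5 = 12374.4 \\
SSE &= 2153.0, \qquad SST = SSR + SSE = 14527.4 \\
R^2 &= \frac{SSR}{SST} = \frac{12374.4}{14527.4} \approx 0.8518 \\
S_e &= \sqrt{\frac{SSE}{n-k-1}} = \sqrt{\frac{2153.0}{19}} \approx 10.65 \\
F   &= \frac{SSR/k}{SSE/(n-k-1)} = \frac{12374.4/5}{2153.0/19} \approx 21.84
\end{aligned}
```

These values match the Multiple R-squared, Residual standard error and F-statistic reported by summary(model1).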

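Significance test of the linear relationship: H0: β1 = β2 = β3 = β4 = β5 = 0. The first code block gives F = 21.84 with p = 2.835e-07 < 0.05, so H0 is rejected and the linear relationship is significant. In the same output, the only individual t test that is significant at α = 0.05 is the one for x2 (p = 0.00952).

Pairwise correlations among the independent variables and their significance tests:

> library(psych)
> corr.test(example10_1[3:7], use = "complete")
Call:corr.test(x = example10_1[3:7], use = "complete")
Correlation matrix
      x1    x2    x3    x4    x5
x1  1.00  0.74  0.88 -0.62 -0.28
x2  0.74  1.00  0.55 -0.54 -0.32
x3  0.88  0.55  1.00 -0.52 -0.29
x4 -0.62 -0.54 -0.52  1.00  0.10
x5 -0.28 -0.32 -0.29  0.10  1.00
Sample Size
[1] 25
Probability values (Entries above the diagonal are adjusted for multiple tests.)
     x1   x2   x3   x4   x5
x1 0.00 0.00 0.00 0.01 0.47
x2 0.00 0.00 0.03 0.03 0.46
x3 0.00 0.00 0.00 0.04 0.47
x4 0.00 0.01 0.01 0.00 0.65
x5 0.18 0.12 0.16 0.65 0.00

 To see confidence intervals of the correlations, print with the short=FALSE option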

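Only x5 shows no significant correlation with the other four independent variables; x1 through x4 are all significantly correlated with one another.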

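Multicollinearity may also be present when:
1. the test of the linear relationship (F test) is significant, yet the t tests of almost all regression coefficients are not;
2. the signs of the regression coefficients are the opposite of what is expected.

It can be identified with the tolerance and the variance inflation factor (VIF):

> library(carData)
> library(car)
> vif(model1)
      x1       x2       x3       x4       x5
8.233159 2.629940 5.184365 1.702361 1.174053
> 1/vif(model1)
       x1        x2        x3        x4        x5
0.1214601 0.3802368 0.1928877 0.5874195 0.8517500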
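The VIF of a predictor is 1/(1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others, and the tolerance is its reciprocal. The sketch below computes both in base R on simulated data, without the car package; the data and variable names are assumptions for illustration only.

```r
# Simulated data (hypothetical); x2 is deliberately correlated with x1
set.seed(3)
n  <- 25
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)
x3 <- rnorm(n)
y  <- 1 + x1 + x2 + x3 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)

predictors <- c("x1", "x2", "x3")

# For each predictor, regress it on the remaining predictors
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  aux <- lm(reformulate(others, response = v), data = dat)
  1 / (1 - summary(aux)$r.squared)   # VIF_j = 1 / (1 - R_j^2)
})

tolerance <- 1 / vif_manual          # tolerance_j = 1 - R_j^2

print(round(rbind(VIF = vif_manual, tolerance = tolerance), 3))
```

For a model with only numeric predictors, these values should agree with car::vif() applied to the fitted lm object.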

All tolerances are greater than 0.1 and all VIFs are less than 10.

The reduced model model2, which the following sections treat as the stepwise regression model, keeps x1, x2 and x5:

> model2 <- lm(y ~ x1 + x2 + x5, data = example10_1)
> summary(model2)

Call:
lm(formula = y ~ x1 + x2 + x5, data = example10_1)

Residuals:
    Min      1Q  Median      3Q     Max
-14.027  -5.361  -1.560   2.304  23.001

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.68928    6.25242  -0.270  0.78966
x1           0.19022    0.04848   3.923  0.00078 ***
x2           0.15763    0.05052   3.120  0.00518 **
x5          -0.56979    0.29445  -1.935  0.06656 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.39 on 21 degrees of freedom
Multiple R-squared:  0.8439, Adjusted R-squared:  0.8216
F-statistic: 37.85 on 3 and 21 DF,  p-value: 1.187e-08
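The notes describe this three-variable model as the result of stepwise regression. One common way to run such a selection in R is stats::step(), which adds or drops terms to reduce AIC. Whether the original notes used step() is an assumption; the sketch below only illustrates the technique on simulated data with hypothetical variable names.

```r
# Simulated data (hypothetical): only x1 and x2 truly affect y
set.seed(4)
n   <- 60
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 2 + 1.5 * dat$x1 - 1.0 * dat$x2 + rnorm(n)

# Full model with every candidate predictor
full <- lm(y ~ x1 + x2 + x3 + x4, data = dat)

# Stepwise selection by AIC; trace = 0 suppresses the step-by-step log
reduced <- step(full, direction = "both", trace = 0)

# Formula of the selected model
print(formula(reduced))
```

Because step() only accepts a move when it lowers AIC, the selected model never has a higher AIC than the starting model.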

Estimated equation: ŷ = -1.68928 + 0.19022 x1 + 0.15763 x2 - 0.56979 x5

Finally, run diagnostics to check whether the model satisfies its assumptions by drawing the model diagnostic plots.

plot(model2)

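The diagnostic plots show that the residuals follow a curved pattern, which suggests that a quadratic term may need to be added to the model. They also indicate a problem with the normality assumption for the residuals.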

Relative importance and model comparison

Relative importance of the independent variables

Standardized regression equation

Example: compute the standardized regression coefficients and analyze the relative importance of each independent variable in predicting average daily revenue.

> library(lm.beta)
> model1.beta <- lm.beta(model1)
> summary(model1.beta)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5, data = example10_1)

Residuals:
     Min       1Q   Median       3Q      Max
-16.7204  -6.0600   0.7152   3.2144  21.4805

Coefficients:
              Estimate Standardized Std. Error t value Pr(>|t|)
(Intercept)  4.2604768    0.0000000 10.4679833   0.407  0.68856
x1           0.1273254    0.3361822  0.0959790   1.327  0.20037
x2           0.1605660    0.4130034  0.0556834   2.884  0.00952 **
x3           0.0007636    0.1132753  0.0013556   0.563  0.57982
x4          -0.3331990   -0.0963203  0.3986248  -0.836  0.41362
x5          -0.5746462   -0.1781104  0.3087506  -1.861  0.07826 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.65 on 19 degrees of freedom
Multiple R-squared:  0.8518, Adjusted R-squared:  0.8128
F-statistic: 21.84 on 5 and 19 DF,  p-value: 2.835e-07

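The standardized regression coefficients are the numbers in the Standardized column. Rank them by absolute value: the variable with the largest absolute value is the most important. Here x2 (0.413) ranks first, followed by x1 (0.336), x5 (-0.178), x3 (0.113) and x4 (-0.096).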
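A standardized coefficient equals the raw coefficient multiplied by sd(x_j)/sd(y), which is the same as refitting the model after z-scoring every variable. The sketch below checks this equivalence in base R on simulated data; the data and names are hypothetical and the lm.beta package is not required.

```r
# Simulated data (hypothetical; not the survey data)
set.seed(5)
n  <- 25
x1 <- rnorm(n, mean = 50, sd = 10)
x2 <- rnorm(n, mean = 100, sd = 30)
y  <- 3 + 0.2 * x1 + 0.1 * x2 + rnorm(n, sd = 4)
dat <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = dat)
b   <- coef(fit)[-1]                      # raw slopes, intercept dropped

# Standardized coefficient: b_j * sd(x_j) / sd(y)
beta_manual <- b * sapply(dat[names(b)], sd) / sd(dat$y)

# Equivalent route: z-score every column, then refit
fit_z       <- lm(y ~ x1 + x2, data = as.data.frame(scale(dat)))
beta_scaled <- coef(fit_z)[-1]

print(rbind(beta_manual, beta_scaled))

# Predictors ordered by absolute standardized coefficient
print(names(sort(abs(beta_manual), decreasing = TRUE)))
```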

Model comparison

Nested models consist of a full model and a reduced model.

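Example: compare the regression results of the two models built by multiple linear regression and by stepwise regression (α = 0.05). The model with 5 variables is treated as the full model and the model with 3 variables as the reduced model.

# Inspect the two models
> model1

Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5, data = example10_1)

Coefficients:
(Intercept)           x1           x2           x3           x4           x5
  4.2604768    0.1273254    0.1605660    0.0007636   -0.3331990   -0.5746462

> model2

Call:
lm(formula = y ~ x1 + x2 + x5, data = example10_1)

Coefficients:
(Intercept)           x1           x2           x5
    -1.6893       0.1902       0.1576      -0.5698

Compare the models: if there is no significant difference, the stepwise regression model predicts no worse than the full model, and by the principle of parsimony the stepwise regression model should be chosen.

> anova(model1, model2)
Analysis of Variance Table

Model 1: y ~ x1 + x2 + x3 + x4 + x5
Model 2: y ~ x1 + x2 + x5
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     19 2153.0
2     21 2267.2 -2   -114.17 0.5038 0.6121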

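Since p = 0.6121 > 0.05, the null hypothesis is not rejected: there is no evidence that the two models differ significantly. Note that the anova() function requires the models to be nested.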
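The F statistic in this comparison is the partial F test for the two dropped variables (x3 and x4). Using the residual sums of squares and degrees of freedom printed by anova(model1, model2):

```latex
F = \frac{(RSS_{\text{reduced}} - RSS_{\text{full}}) / (df_{\text{reduced}} - df_{\text{full}})}
         {RSS_{\text{full}} / df_{\text{full}}}
  = \frac{114.17 / 2}{2153.0 / 19}
  \approx \frac{57.085}{113.316}
  \approx 0.5038
```

The statistic is compared with an F distribution on 2 and 19 degrees of freedom, which gives the reported p-value of 0.6121.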

The AIC criterion can also be used:

> AIC(model2, model1)
       df      AIC
model2  5 193.6325
model1  7 196.3408

model2 has the smaller AIC, which means the stepwise regression model is preferable to the model containing all 5 variables.
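For a linear model, R's AIC() is -2 × log-likelihood + 2 × df, where df counts the regression coefficients plus the error variance; this is why the three-predictor model shows df = 5. Plugging n = 25 and RSS = 2267.2 into the formula gives about 193.63, matching the output. The sketch below implements the formula on simulated data (hypothetical data and names) and compares it with AIC().

```r
# Simulated data (hypothetical; not the survey data)
set.seed(6)
n   <- 25
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 + rnorm(n)

fit_full    <- lm(y ~ x1 + x2 + x3, data = dat)
fit_reduced <- lm(y ~ x1, data = dat)

# AIC of a Gaussian linear model written out explicitly
aic_manual <- function(fit) {
  n   <- nobs(fit)
  rss <- sum(residuals(fit)^2)
  p   <- length(coef(fit)) + 1              # coefficients + error variance
  n * (log(2 * pi) + log(rss / n) + 1) + 2 * p
}

print(c(manual = aic_manual(fit_full),    builtin = AIC(fit_full)))
print(c(manual = aic_manual(fit_reduced), builtin = AIC(fit_reduced)))
```

Unlike anova(), AIC does not require the models to be nested, only that they are fitted to the same response and observations.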

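Prediction with the regression equation

Example: using the regression equation from the stepwise regression model, find the 95% confidence intervals and prediction intervals for average daily revenue.

> x pre res zre con_int pre_int mysummary round(mysummary,3)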
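A summary table of this kind typically gathers, for every observation, the fitted value, the residual, the standardized residual and the two intervals. The sketch below is one way to build it; it reuses the object names pre, res, zre, con_int, pre_int and mysummary from the notes, but the right-hand sides, the column names and the simulated data are all assumptions and are not the original code.

```r
# Simulated data (hypothetical; not the survey data)
set.seed(7)
n   <- 25
dat <- data.frame(x1 = rnorm(n, 50, 15), x2 = rnorm(n, 100, 30), x5 = rnorm(n, 10, 4))
dat$y <- -2 + 0.2 * dat$x1 + 0.15 * dat$x2 - 0.5 * dat$x5 + rnorm(n, sd = 5)
fit <- lm(y ~ x1 + x2 + x5, data = dat)

pre <- fitted(fit)       # point predictions
res <- residuals(fit)    # residuals
zre <- rstandard(fit)    # standardized residuals

# 95% confidence interval for the mean response
con_int <- predict(fit, newdata = dat, interval = "confidence", level = 0.95)
# 95% prediction interval for an individual response
pre_int <- predict(fit, newdata = dat, interval = "prediction", level = 0.95)

mysummary <- data.frame(
  y = dat$y, pre = pre, res = res, zre = zre,
  conf_lwr = con_int[, "lwr"], conf_upr = con_int[, "upr"],
  pred_lwr = pre_int[, "lwr"], pred_upr = pre_int[, "upr"]
)
print(round(mysummary, 3))
```

The prediction interval is always wider than the confidence interval at the same point, because it adds the variance of an individual error term to the uncertainty in the estimated mean.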

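> x0 <- data.frame(x1 = 50, x2 = 100, x5 = 10)
> predict(model2, newdata = x0)
       1
17.88685

Using the regression equation of the stepwise regression model, find the point prediction, confidence interval and prediction interval for average daily revenue when x1 = 50, x2 = 100 and x5 = 10.

> predict(model2, data.frame(x1 = 50, x2 = 100, x5 = 10), interval = "confidence", level = 0.95)
       fit      lwr      upr
1 17.88685 10.98784 24.78585
> predict(model2, data.frame(x1 = 50, x2 = 100, x5 = 10), interval = "prediction", level = 0.95)
       fit       lwr      upr
1 17.88685 -4.795935 40.56963

Dummy-variable regression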

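A dummy variable is a categorical independent variable whose text categories have to be represented by numeric codes.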

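Introducing dummy variables into the model

Regression with one dummy variable

Continuing the example: suppose that, when analyzing the factors that affect average daily revenue, we also consider the variable "traffic convenience" (交通方便程度), which takes the values "convenient" (方便) and "inconvenient" (不方便). To keep things simple, of the original 5 independent variables we keep only one numeric variable, average meal spending (用餐平均支出). The survey data are assumed to be available as a data table.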

Build two models.

Simple linear regression of average daily revenue (日均营业额) on average meal spending (用餐平均支出):

> model_s <- lm(日均营业额 ~ 用餐平均支出, data = example10_7)
> summary(model_s)

Call:
lm(formula = 日均营业额 ~ 用餐平均支出, data = example10_7)

Residuals:
     Min       1Q   Median       3Q      Max
-19.7604 -10.7832   0.7195   4.3343  28.9301

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  -5.75023    5.25068  -1.095    0.285
用餐平均支出  0.32394    0.04482   7.227 2.34e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.9 on 23 degrees of freedom
Multiple R-squared:  0.6943, Adjusted R-squared:  0.681
F-statistic: 52.23 on 1 and 23 DF,  p-value: 2.343e-07

p < 0.05, so the linear relationship is significant.

Regression with the dummy variable traffic convenience (交通方便程度) added:

> model_dummy <- lm(日均营业额 ~ 用餐平均支出 + 交通方便程度, data = example10_7)
> summary(model_dummy)

Call:
lm(formula = 日均营业额 ~ 用餐平均支出 + 交通方便程度, data = example10_7)

Residuals:
    Min      1Q  Median      3Q     Max
-19.443 -11.579  -1.256   8.607  23.456

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      -8.45413    4.69817  -1.799  0.08568 .
用餐平均支出      0.28641    0.04145   6.909 6.15e-07 ***
交通方便程度方便 14.62088    5.17802   2.824  0.00989 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.17 on 22 degrees of freedom
Multiple R-squared:  0.7756, Adjusted R-squared:  0.7552
F-statistic: 38.02 on 2 and 22 DF,  p-value: 7.269e-08
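The coefficient name 交通方便程度方便 shows that R coded "convenient" (方便) as 1 and used "inconvenient" (不方便) as the baseline. The fitted model is therefore two parallel lines: ŷ = -8.45413 + 0.28641 x for inconvenient locations and ŷ = (-8.45413 + 14.62088) + 0.28641 x = 6.16675 + 0.28641 x for convenient ones. The sketch below shows the same coding on simulated data; the English variable names and the data are assumptions for illustration.

```r
# Simulated data (hypothetical; English names stand in for the survey columns)
set.seed(8)
n       <- 25
spend   <- runif(n, min = 30, max = 200)
traffic <- factor(rep(c("inconvenient", "convenient"), length.out = n),
                  levels = c("inconvenient", "convenient"))  # first level = baseline
revenue <- -8 + 0.3 * spend + 15 * (traffic == "convenient") + rnorm(n, sd = 10)
dat <- data.frame(revenue, spend, traffic)

fit <- lm(revenue ~ spend + traffic, data = dat)

# lm() expands the factor into a 0/1 column named "trafficconvenient"
mm <- model.matrix(fit)
print(head(mm))

# Two parallel lines: same slope, intercepts differ by the dummy coefficient
b <- coef(fit)
intercept_inconvenient <- unname(b["(Intercept)"])
intercept_convenient   <- unname(b["(Intercept)"] + b["trafficconvenient"])
print(c(inconvenient = intercept_inconvenient, convenient = intercept_convenient))
```

To change the baseline category, reorder the factor levels (for example with relevel()) before fitting; the dummy coefficient then changes sign but the fitted values stay the same.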
