时间序列模型

2023-11-12 11:17| 来源: 网络整理| 查看: 265

ARIMA模型和因子预测

文章目录 ARIMA模型和因子预测一、ARIMA模型（整个周期）1.数据预处理2.展示时序图2.数据建模（1）差分d（2）p和q（3）选择模型（4）检验残差序列（5）观察是否呈正态分布（6）残差序列Ljung-Box检验，也叫Q检验预测（7）预测（8）结论二、ARIMA模型（增长时期）1.分析2.数学建模（1）展示时序图（2）一阶差分d=1（3）p和q（4）选择模型（5）检验残差序列（6）是否符合正态分布（7）残差序列Ljung-Box检验，也叫Q检验预测（8）预测这次接触到的算法为时序模型。数据比较复杂，且没有明显的规律。时序图在前期比较规律，成周期性上升，之后出现了断点，之后的数据呈不规律的现象。由于刚开始对时序模型了解不够，因此刚上手时，就直接对整个时期的数据进行ARIMA处理，而没有认真分析数据的周期性和基本走向。在得到指点之后，对前期比较规律的数据进行处理和预测。但是采用该模型得到的预测结果过于平缓，与实际的数据有较大的差距，因此，在前期的基础上采用了时序模型的因子预测（baseline）。具体实现过程如下所示：

一、ARIMA模型（整个周期） 1.数据预处理

前期对于数据的预处理过程不再赘述，处理之后的数据类型如图所示：在这里插入图片描述

2.展示时序图 from __future__ import print_function import pandas as pd import numpy as np from scipy import stats import matplotlib.pyplot as plt import statsmodels.api as sm import time from datetime import datetime from statsmodels.graphics.api import qqplot dta=pd.Series(dta) #dta.index = pd.Index(sm.tsa.datetools.dates_from_range('2021-09-27','2022-05-04')) #应该是2090，不是2100 dta.index=pd.date_range('2021-10-22','2022-04-07') dta.plot(figsize=(12,8)) #训练集为2021-10-22到2022-04-07的数据

运行处理的结果如下图所示：整个时期的时序图为：在这里插入图片描述

2.数据建模（1）差分d

ARIMA 模型对时间序列的要求是平稳型。因此，当你得到一个非平稳的时间序列时，首先要做的即是做时间序列的差分，直到得到一个平稳时间序列。如果你对时间序列做d次差分才能得到一个平稳序列，那么可以使用ARIMA(p,d,q)模型，其中d是差分次数。

fig = plt.figure(figsize=(12,8)) ax1= fig.add_subplot(111) diff1 = dta.diff(1) diff1.plot(ax=ax1) #一阶差分d=1

差分后为：在这里插入图片描述

现在我们已经得到一个基本平稳的时间序列，二阶差分相差效果也不大。接来下就是选择合适的ARIMA模型，即ARIMA模型中合适的p,q

（2）p和q

检查自相关图(ACF)和偏自相关图(PACF)，如下所示：

diff1= dta.diff(1)#原文有错误应该是diff1= dta.diff(1)，而非dta= dta.diff(1) fig = plt.figure(figsize=(12,8)) ax1=fig.add_subplot(211) fig = sm.graphics.tsa.plot_acf(dta,lags=40,ax=ax1) ax2 = fig.add_subplot(212) fig = sm.graphics.tsa.plot_pacf(dta,lags=40,ax=ax2)

其中lags 表示滞后的阶数，以上分别得到自相关图和偏自相关图在这里插入图片描述

（3）选择模型

根据上图，猜测有以下模型可以供选择： ARMA(0,1)模型：即自相关图在滞后1阶之后缩小为0，且偏自相关缩小至0，则是一个阶数q=1的移动平均模型； ARMA(7,0)模型：即偏自相关图在滞后7阶之后缩小为0，且自相关缩小至0，则是一个阶层p=7的自回归模型； //原文错写为3 ARMA(7,1)模型：即使得自相关和偏自相关都缩小至零。则是一个混合模型。有其他供选择的模型（实际上下文选了ARMA(8,0)） ARMA(8,0)的得出的结果最佳，因此选了ARMA(8,0)

现在有以上这么多可供选择的模型，我们通常采用ARMA模型的AIC法则。我们知道：增加自由参数的数目提高了拟合的优良性，AIC鼓励数据拟合的优良性但是尽量避免出现过度拟合(Overfitting)的情况。所以优先考虑的模型应是AIC值最小的那一个。赤池信息准则的方法是寻找可以最好地解释数据但包含最少自由参数的模型。不仅仅包括AIC准则，目前选择模型常用如下准则：

AIC=-2 ln(L) + 2 k 中文名字：赤池信息量 akaike information criterionBIC=-2 ln(L) + ln(n)*k 中文名字：贝叶斯信息量 bayesian information criterionHQ=-2 ln(L) + ln(ln(n))*k hannan-quinn criterion

构造这些统计量所遵循的统计思想是一致的，就是在考虑拟合残差的同时，依自变量个数施加“惩罚”。但要注意的是，这些准则不能说明某一个模型的精确度，也即是说，对于三个模型Ａ，Ｂ，Ｃ，我们能够判断出Ｃ模型是最好的，但不能保证Ｃ模型能够很好地刻画数据，因为有可能三个模型都是糟糕的。

arma_mod70 = sm.tsa.ARMA(dta,(7,0)).fit() print(arma_mod70.aic,arma_mod70.bic,arma_mod70.hqic) arma_mod30 = sm.tsa.ARMA(dta,(0,1)).fit() print(arma_mod30.aic,arma_mod30.bic,arma_mod30.hqic) arma_mod71 = sm.tsa.ARMA(dta,(7,1)).fit() print(arma_mod71.aic,arma_mod71.bic,arma_mod71.hqic) arma_mod80 = sm.tsa.ARMA(dta,(8,0)).fit() print(arma_mod80.aic,arma_mod80.bic,arma_mod80.hqic) arma_mod74 = sm.tsa.ARMA(dta,(7,4)).fit() print(arma_mod74.aic,arma_mod74.bic,arma_mod74.hqic)

运行结果如下所示：

2129.656666131619 2157.7723419462486 2141.0673765077413 2291.007243099765 2300.379135037975 2294.810813225139 2131.538695375416 2162.7783351694484 2144.2172624599957 2121.897731126717 2153.1373709207496 2134.576298211297 2125.4594548268437 2166.070986559086 2141.9415920367974

这样的话应该是ARMA(8，0)模型拟合效果最好。

（4）检验残差序列 resid = arma_mod80.resid fig = plt.figure(figsize=(12,8)) ax1 = fig.add_subplot(211) fig = sm.graphics.tsa.plot_acf(resid.values.squeeze(), lags=40, ax=ax1) ax2 = fig.add_subplot(212) fig = sm.graphics.tsa.plot_pacf(resid, lags=40, ax=ax2) plt.show()

残差的ACF和PACF图，可以看到序列残差基本为白噪声在这里插入图片描述进一步进行D-W检验，德宾-沃森（Durbin-Watson）检验。德宾-沃森检验,简称D-W检验，是目前检验自相关性最常用的方法，但它只使用于检验一阶自相关性。因为自相关系数ρ的值介于-1和1之间，所以 0≤DW≤４。并且 DW＝O＝＞ρ＝１　　即存在正自相关性 DW＝４＜＝＞ρ＝－１　即存在负自相关性 DW＝２＜＝＞ρ＝０　　即不存在（一阶）自相关性因此，当DW值显著的接近于O或４时，则存在自相关性，而接近于２时，则不存在（一阶）自相关性。这样只要知道ＤＷ统计量的概率分布，在给定的显著水平下，根据临界值的位置就可以对原假设Ｈ0进行检验。

print(sm.stats.durbin_watson(arma_mod80.resid.values))

结果为2.0133314048143824，所以残差序列不存在自相关性。

（5）观察是否呈正态分布

这里使用QQ图，它用于直观验证一组数据是否来自某个分布，或者验证某两组数据是否来自同一（族）分布。

print(stats.normaltest(resid)) fig = plt.figure(figsize=(12,8)) ax = fig.add_subplot(111) fig = qqplot(resid, line='q', ax=ax, fit=True) #plt.show()

在这里插入图片描述结果表明基本符合正态分布

（6）残差序列Ljung-Box检验，也叫Q检验预测 r,q,p = sm.tsa.acf(resid.values.squeeze(), qstat=True) data = np.c_[range(1,41), r[1:], q, p] table = pd.DataFrame(data, columns=['lag', "AC", "Q", "Prob(>Q)"]) print(table.set_index('lag'))

结果为：

```python AC Q Prob(>Q) lag 1.0 -0.030173 0.155691 0.693155 2.0 0.008053 0.166850 0.919960 3.0 0.048691 0.577219 0.901627 4.0 -0.001428 0.577574 0.965523 5.0 -0.019167 0.641946 0.986002 6.0 0.037645 0.891782 0.989384 7.0 -0.016444 0.939748 0.995743 8.0 0.079051 2.055212 0.979272 9.0 -0.061182 2.727584 0.974124 10.0 -0.066925 3.537199 0.965820 11.0 -0.011707 3.562132 0.981024 12.0 -0.051549 4.048626 0.982543 13.0 -0.040926 4.357254 0.986764 14.0 0.101101 6.252850 0.959710 15.0 0.025625 6.375421 0.972728 16.0 -0.029551 6.539499 0.981150 17.0 -0.081411 7.793074 0.970773 18.0 0.064239 8.578780 0.968717 19.0 0.045146 8.969444 0.973983 20.0 -0.072731 9.990243 0.968349 21.0 0.097076 11.821146 0.944321 22.0 0.051323 12.336409 0.950037 23.0 0.010156 12.356723 0.964592 24.0 -0.001530 12.357188 0.975577 25.0 0.034176 12.590458 0.981137 26.0 -0.017893 12.654848 0.986836 27.0 0.006812 12.664247 0.991227 28.0 -0.078873 13.933306 0.987655 29.0 0.106693 16.272206 0.972315 30.0 -0.161050 21.640081 0.866813 31.0 -0.133496 25.355192 0.751608 32.0 0.000244 25.355204 0.791393 33.0 -0.038534 25.669338 0.814774 34.0 0.011979 25.699921 0.846091 35.0 0.091071 27.480951 0.813603 36.0 -0.021248 27.578633 0.841875 37.0 0.052502 28.179588 0.851110 38.0 0.013256 28.218192 0.876500 39.0 -0.107110 30.758160 0.824196 40.0 0.053220 31.390132 0.832942

prob值均大于0.05，所以残差序列不存在自相关性

（7）预测 predict_dta = arma_mod80.predict('2022-04-01', '2022-05-05', dynamic=True) print(predict_dta) fig, ax = plt.subplots(figsize=(12, 8)) ax = dta.ix['2021':].plot(ax=ax) fig = arma_mod80.plot_predict('2022-04-01', '2022-05-05', dynamic=True, ax=ax, plot_insample=False) plt.show()

结果如下：

2022-04-01 652.749971 2022-04-02 684.499544 2022-04-03 665.002454 2022-04-04 625.281021 2022-04-05 658.218001 2022-04-06 679.779735 2022-04-07 603.998849 2022-04-08 568.153215 2022-04-09 572.121355 2022-04-10 547.806658 2022-04-11 527.050779 2022-04-12 538.175429 2022-04-13 531.915602 2022-04-14 498.532942 2022-04-15 485.598870 2022-04-16 481.851214 2022-04-17 466.009817 2022-04-18 456.888313 2022-04-19 458.740678 2022-04-20 450.692260 2022-04-21 436.480672 2022-04-22 431.249928 2022-04-23 426.935136 2022-04-24 418.493018 2022-04-25 414.512014 2022-04-26 413.593327 2022-04-27 408.285758 2022-04-28 402.317694 2022-04-29 399.823316 2022-04-30 396.757807 2022-05-01 392.651952 2022-05-02 390.792554 2022-05-03 389.534999 2022-05-04 386.620759 2022-05-05 384.085042

与实际数据相比的结果为：在这里插入图片描述

（8）结论

由于预测值和实际值差距过大，且整体数据的周期性不明显，因此后续选用了前期趋势且周期明显的数据再对该模型进行验证和分析。

二、ARIMA模型（增长时期）

增长时期的分析和整体时期的相似，过程不再赘述。

1.分析

该组数据的整体的时序图为：在这里插入图片描述故选择红线以前的增长趋势且周期明显的数据进行分析

2.数学建模（1）展示时序图 dta=pd.Series(dta) dta.index=pd.date_range('2021-10-22','2022-01-04') dta.plot(figsize=(12,8))

选用了2021-10-22至2022-01-04的数据，时序图如下所示：在这里插入图片描述

（2）一阶差分d=1 fig = plt.figure(figsize=(12,8)) ax1= fig.add_subplot(111) diff1 = dta.diff(1) diff1.plot(ax=ax1)

在这里插入图片描述差分之后基本平稳，接来下就是选择合适的ARIMA模型，即ARIMA模型中合适的p,q。

（3）p和q

检查自相关图(ACF)和偏自相关图(PACF)，如下所示：

diff1= dta.diff(1)#原文有错误应该是diff1= dta.diff(1)，而非dta= dta.diff(1) fig = plt.figure(figsize=(12,8)) ax1=fig.add_subplot(211) fig = sm.graphics.tsa.plot_acf(dta,lags=40,ax=ax1)#自相关性 ax2 = fig.add_subplot(212) fig = sm.graphics.tsa.plot_pacf(dta,lags=40,ax=ax2)#偏自相关性

在这里插入图片描述

（4）选择模型

由于截取前半部分的数据，因此数据量较小，因此整体数据采用的ARIMA（８，０）无法使用，权衡之下，采用了ARIMA(7,0)模型，代码和结果如下：

arma_mod70 = sm.tsa.ARMA(dta,(7,0)).fit() print(arma_mod70.aic,arma_mod70.bic,arma_mod70.hqic) arma_mod30 = sm.tsa.ARMA(dta,(0,1)).fit() print(arma_mod30.aic,arma_mod30.bic,arma_mod30.hqic) 909.4966211297485 930.3540141515753 917.8247491327668 977.9256002839423 984.8780646245513 980.7016429516151

因此模型选择了ARIMA(7,0)模型

（5）检验残差序列 resid = arma_mod70.resid fig = plt.figure(figsize=(12,8)) ax1 = fig.add_subplot(211) fig = sm.graphics.tsa.plot_acf(resid.values.squeeze(), lags=40, ax=ax1) ax2 = fig.add_subplot(212) fig = sm.graphics.tsa.plot_pacf(resid, lags=40, ax=ax2) plt.show()

在这里插入图片描述 print(sm.stats.durbin_watson(arma_mod70.resid.values)) 运行的结果为：1.7382355364281972，所以残差序列不存在自相关性。

（6）是否符合正态分布 print(stats.normaltest(resid)) fig = plt.figure(figsize=(12,8)) ax = fig.add_subplot(111) fig = qqplot(resid, line='q', ax=ax, fit=True) #plt.show()

在这里插入图片描述符合正态分布。

（7）残差序列Ljung-Box检验，也叫Q检验预测 r,q,p = sm.tsa.acf(resid.values.squeeze(), qstat=True) data = np.c_[range(1,41), r[1:], q, p] table = pd.DataFrame(data, columns=['lag', "AC", "Q", "Prob(>Q)"]) print(table.set_index('lag'))

结果为：

AC Q Prob(>Q) lag 1.0 -0.071408 0.397933 0.528159 2.0 0.052961 0.619821 0.733513 3.0 0.103103 1.472462 0.688641 4.0 0.106855 2.401171 0.662416 5.0 0.062941 2.728004 0.741832 6.0 0.049266 2.931145 0.817438 7.0 0.193439 6.108963 0.527086 8.0 0.093908 6.869078 0.550822 9.0 0.044186 7.039911 0.632964 10.0 0.002296 7.040379 0.721627 11.0 0.000201 7.040383 0.795804 12.0 0.099488 7.947691 0.789205 13.0 0.106332 9.000827 0.772881 14.0 0.086737 9.713081 0.782845 15.0 0.093249 10.550013 0.783792 16.0 -0.071777 11.054293 0.806122 17.0 0.088422 11.832767 0.810156 18.0 0.004166 11.834525 0.855662 19.0 0.013138 11.852326 0.891840 20.0 0.029368 11.942885 0.918028 21.0 0.126790 13.662086 0.883672 22.0 0.080898 14.375193 0.887630 23.0 -0.068815 14.901103 0.898143 24.0 -0.031568 15.013943 0.920350 25.0 0.000519 15.013974 0.941055 26.0 0.048989 15.296817 0.951662 27.0 -0.021997 15.355032 0.964152 28.0 -0.038363 15.535869 0.972206 29.0 0.016043 15.568180 0.980061 30.0 0.037380 15.747493 0.984780 31.0 -0.034871 15.907088 0.988567 32.0 -0.017748 15.949393 0.991995 33.0 -0.062280 16.482731 0.992693 34.0 0.069382 17.160785 0.992786 35.0 0.029572 17.287041 0.994726 36.0 -0.048253 17.631821 0.995662 37.0 -0.087770 18.802551 0.994388 38.0 0.020357 18.867235 0.996015 39.0 -0.041768 19.147097 0.996832 40.0 -0.010745 19.166148 0.997837

prob值均大于0.05，所以残差序列不存在自相关性

（8）预测 predict_dta = arma_mod70.predict('2022-01-05', '2022-02-05', dynamic=True) print(predict_dta) fig, ax = plt.subplots(figsize=(12, 8)) ax = diff1.ix['2021':].plot(ax=ax) fig = arma_mod70.plot_predict('2022-01-05', '2022-02-05', dynamic=True, ax=ax, plot_insample=True) plt.show()

预测结果为：

2022-01-05 557.074462 2022-01-06 691.337169 2022-01-07 830.895302 2022-01-08 860.555486 2022-01-09 813.367771 2022-01-10 730.149683 2022-01-11 631.156654 2022-01-12 616.607622 2022-01-13 691.528138 2022-01-14 774.658239 2022-01-15 810.980619 2022-01-16 790.727217 2022-01-17 727.331466 2022-01-18 663.653218 2022-01-19 650.828923 2022-01-20 690.857934 2022-01-21 744.782255 2022-01-22 775.795757 2022-01-23 766.481570 2022-01-24 724.795391 2022-01-25 682.440919 2022-01-26 670.068608 2022-01-27 691.523000 2022-01-28 726.688352 2022-01-29 750.119375 2022-01-30 746.606986 2022-01-31 720.405650 2022-02-01 691.863006 2022-02-02 680.805836 2022-02-03 692.029022 2022-02-04 714.605396 2022-02-05 731.392701 Freq: D, dtype: float64

预测的结果如图所示：在这里插入图片描述由于实际一月份的数据有很多的缺失数据，因此没有将预测数据和实际数据进行比较。但是和整体数据相比，本次预测的结果好了很多。但是效果还是不明显，为了进一步优化，决定采用时间序列模型的因子预测。如下所示：

参考链接：https://blog.csdn.net/u012856866/article/details/106994799

【本文地址】

公司简介

联系我们