LSTM for time-series forecasting (multivariate-to-multivariate prediction)

2023-08-12 16:11 | Source: compiled from the web

Questions to consider:

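How should the raw data be handled: missing values, meaningless columns, normalization, outliers?
How is the (X, Y) data format constructed for a time-series forecasting problem?
How is an LSTM used to forecast time-series data: what are the inputs and outputs, and what needs attention when stacking several LSTM layers (return_sequences=True)?
How is the trained model used for prediction, and what needs attention when producing results? Predictions are inverse-normalized before they are compared with the actual values.
What needs attention when normalizing and inverse-normalizing?
How are model parameters tuned, for example the learning rate?

1. Introduction: air pollution forecasting with an LSTM

This article uses an air quality dataset: hourly weather and air pollution readings collected over the five years from 2010 to 2014 at the US Embassy in Beijing.

Goal: use the weather conditions and pollution data of the previous hour (or previous several hours) to predict the pollution level, measured as PM2.5 concentration, at the next (current) time step or at several following time steps. The example in this article uses the previous five time steps to predict the following five.

The raw data has the following columns:

1. No: row number
2. year: year
3. month: month
4. day: day
5. hour: hour
6. pm2.5: PM2.5 concentration
7. DEWP: dew point
8. TEMP: temperature
9. PRES: atmospheric pressure
10. cbwd: wind direction
11. Iws: cumulated wind speed
12. Is: cumulated snow
13. Ir: cumulated rain

[figure not included]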

2. Data preprocessing

1. Data cleaning

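Merge the date fields (year, month, day, hour) into a single datetime and use it as the table index.
Drop the No column, which carries no information.
Rename the columns to more readable names.
Handle missing values (NA) by filling them with 0.
PM2.5 is NA on January 1, 2010, so that day's data (the first 24 rows) is removed.
Save the result as a new table.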

Import packages

# Import the packages used in this article
import pandas as pd
from datetime import datetime
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM, Dropout
from numpy import concatenate
from math import sqrt
from keras.callbacks import ReduceLROnPlateau

Data cleaning (with detailed comments):

def read_raw():
    dataset = pd.read_csv('raw.csv')
    # Merge year/month/day/hour into one datetime index. pd.to_datetime can
    # assemble a datetime from columns with these names; this avoids the nested
    # parse_dates / date_parser arguments of read_csv, which newer pandas
    # releases deprecate.
    dataset.index = pd.to_datetime(dataset[['year', 'month', 'day', 'hour']])
    # Drop the meaningless No column and the date parts that are now in the index
    dataset = dataset.drop(columns=['No', 'year', 'month', 'day', 'hour'])
    # Manually specify more readable column names
    dataset.columns = ['pollution', 'dew', 'temp', 'press', 'wnd_dir', 'wnd_spd', 'snow', 'rain']
    dataset.index.name = 'date'
    # Mark all NA values with 0
    dataset['pollution'] = dataset['pollution'].fillna(0)
    # Drop the first 24 hours (the first day has no PM2.5 readings)
    dataset = dataset[24:]
    # Summarize the first 5 rows
    print(dataset.head(5))
    # Save to file
    dataset.to_csv('pollution.csv')
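The cleaning code above fills missing PM2.5 readings with 0. As a point of comparison only, the sketch below contrasts that with linear interpolation on a tiny made-up series. The values and variable names are assumptions for illustration and are not taken from the dataset or from the original method.

```python
import numpy as np
import pandas as pd

# Toy PM2.5 readings with gaps (made-up values, not from the dataset).
pollution = pd.Series([129.0, np.nan, 159.0, np.nan, np.nan, 138.0])

filled_zero = pollution.fillna(0)         # approach used in the article
filled_interp = pollution.interpolate()   # linear interpolation between neighbours

print(filled_zero.tolist())
print(filled_interp.tolist())
```

Filling with 0 makes each gap look like an hour of zero pollution, which both the scaler and the model then treat as a real reading. Whether interpolation is a better fit probably depends on how long the gaps are.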

Resulting data format: [figure not included]

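2. Plot every column over the five years of data, except the date. The code also skips the categorical wind-direction column (index 4). Plotting code: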

def drow_pollution():
    dataset = pd.read_csv('pollution.csv', header=0, index_col=0)
    values = dataset.values  # raw values from the file
    # Columns to plot (index 4, wnd_dir, is categorical and left out)
    groups = [0, 1, 2, 3, 5, 6, 7]
    # Plot each column in its own subplot
    pyplot.figure(figsize=(10, 10))
    for i, group in enumerate(groups, start=1):
        pyplot.subplot(len(groups), 1, i)
        pyplot.plot(values[:, group])
        pyplot.title(dataset.columns[group], y=0.5, loc='right')
    pyplot.show()

Result: [figure not included]

3. Framing the time series for the LSTM

series_to_supervised(data, n_in=5, n_out=5, dropnan=True) determines the model's inputs and outputs: n_in=5 means five time steps of data are used as input, and n_out=5 means the following five time steps are predicted. With n_in=1 and n_out=1, the previous time step is used to predict the next one. The data is first framed as follows. Code:

def series_to_supervised(data, n_in=5, n_out=5, dropnan=True):
    # Convert a series to a supervised-learning table
    n_vars = 1 if type(data) is list else data.shape[1]
    df = pd.DataFrame(data)
    cols, names = list(), list()
    # Input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j + 1, i)) for j in range(n_vars)]
    # Forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j + 1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j + 1, i)) for j in range(n_vars)]
    # Put it all together
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    # Drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    # For the 8-column pollution data with n_in=5 and n_out=5 the shape is
    # (43791, 80): 8 variables x (5 input + 5 output) time steps
    return agg

Extract the required inputs:

def cs_to_sl():
    # Load dataset
    dataset = pd.read_csv('pollution.csv', header=0, index_col=0)
    values = dataset.values
    # Integer-encode the wind direction: fit() builds the label-to-integer
    # mapping internally and transform() applies it to the data
    encoder = LabelEncoder()
    values[:, 4] = encoder.fit_transform(values[:, 4])
    # Ensure all data is float
    values = values.astype('float32')
    # Frame as supervised learning
    reframed = series_to_supervised(values, 5, 5)
    # Drop the output columns we do not want to predict, keeping columns
    # 40, 48, 56, 64, 72 (pollution at each of the 5 forecast hours)
    reframed.drop(reframed.columns[[41, 42, 43, 44, 45, 46, 47,
                                    49, 50, 51, 52, 53, 54, 55,
                                    57, 58, 59, 60, 61, 62, 63,
                                    65, 66, 67, 68, 69, 70, 71,
                                    73, 74, 75, 76, 77, 78, 79]], axis=1, inplace=True)
    print(reframed.head())
    return reframed  # reframed.shape: (43791, 45)
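The hard-coded drop list above follows a regular pattern. A small sketch, assuming 8 variables per time step with pollution as the first one (as in this dataset), derives the same indices. The names `n_vars`, `n_in`, `n_out`, `keep_cols`, and `drop_cols` are chosen here for illustration.

```python
n_vars, n_in, n_out = 8, 5, 5

first_output_col = n_vars * n_in         # column index of var1(t)
total_cols = n_vars * (n_in + n_out)     # columns before anything is dropped

# Keep the pollution column (var1) of every forecast step and drop the other 7.
keep_cols = [first_output_col + step * n_vars for step in range(n_out)]
drop_cols = [c for c in range(first_output_col, total_cols) if c not in keep_cols]

print(keep_cols)       # [40, 48, 56, 64, 72]
print(len(drop_cols))  # 35
```

Deriving the list this way keeps it in step with n_in and n_out if a different window length is tried.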

Generated result: [figure not included]

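A brief explanation of the code above: suppose the previous three time steps are used to predict the following three. shift() produces the lagged and lead copies of the table, which are then concatenated column-wise. Rows containing NaN are dropped, and finally the columns that are not prediction targets are removed, leaving only the required information. [figure not included]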
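The three-in, three-out case described above can be sketched on a toy single-variable series. The values are made up for illustration; the shift, concat, and dropna steps mirror those in series_to_supervised.

```python
import pandas as pd

# Toy univariate series; values are made up for illustration.
df = pd.DataFrame({"var1": [10, 20, 30, 40, 50, 60, 70, 80]})

n_in, n_out = 3, 3
cols, names = [], []
for i in range(n_in, 0, -1):      # inputs: t-3, t-2, t-1
    cols.append(df.shift(i))
    names.append(f"var1(t-{i})")
for i in range(n_out):            # outputs: t, t+1, t+2
    cols.append(df.shift(-i))
    names.append("var1(t)" if i == 0 else f"var1(t+{i})")

framed = pd.concat(cols, axis=1)
framed.columns = names
framed = framed.dropna()          # rows without a full window are removed

print(framed)
```

Of the 8 original rows, the first 3 lack a full input window and the last 2 lack a full output window, so 3 rows remain.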

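4. Splitting into training and test sets

First normalize the data, and return the scaler so that it can be reused later for inverse normalization. Normalization discards absolute magnitudes and preserves the relative relationships between values.
The first three years of data form the training set and the remaining data form the test set.
Reshape the model input to three dimensions (the LSTM requires 3D input).

def train_test(reframed):
    # Split into train and test sets
    values = reframed.values
    # Note: the scaler is fitted on the full table, test rows included
    scaler11 = MinMaxScaler(feature_range=(0, 1))
    values = scaler11.fit_transform(values)
    n_train_hours = 365 * 24 * 3
    train = values[:n_train_hours, :]
    test = values[n_train_hours:, :]
    # Split into inputs and outputs (the last 5 columns are the targets)
    train_X, train_y = train[:, :-5], train[:, -5:]
    test_X, test_y = test[:, :-5], test[:, -5:]
    # Reshape input to 3D [samples, timesteps, features]:
    # train_X goes from (26280, 40) to (26280, 1, 40); train_y is (26280, 5)
    train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
    test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
    print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
    return train_X, train_y, test_X, test_y, scaler11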
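The function above fits the scaler on the whole table and feeds each sample to the LSTM as one time step of 40 features. The sketch below shows two variations one might consider, using plain NumPy and random stand-in data: taking the min-max statistics from the training rows only, and reshaping to (samples, 5, 8) so the LSTM sees five time steps of eight features. All names and sizes are assumptions for illustration, not the article's method.

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_vars, n_out = 5, 8, 5
# Random stand-in for reframed.values: 40 input columns + 5 target columns.
data = rng.random((100, n_steps * n_vars + n_out)) * 50.0

n_train = 70
train, test = data[:n_train], data[n_train:]

# Min-max statistics come from the training rows only,
# so no information from the test rows leaks into the scaling.
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min
train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range   # may fall slightly outside [0, 1]

# Input columns are ordered (t-5: 8 vars), (t-4: 8 vars), ..., (t-1: 8 vars),
# so a plain reshape recovers the (samples, time steps, features) layout.
train_X = train_scaled[:, :-n_out].reshape((-1, n_steps, n_vars))
train_y = train_scaled[:, -n_out:]

print(train_X.shape, train_y.shape)
```

With this layout the first LSTM layer would take input_shape=(5, 8), and any later step that flattens the inputs back to two dimensions would need to reshape to (samples, 40) rather than rely on the last axis alone.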

Result: [figure not included]

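5. Building the model

The model stacks three LSTM layers with 50 units each, followed by a Dense output layer with 5 units. The input holds the features of the previous five time steps (5 x 8 = 40 values, which the reshape in train_test presents to the network as a single time step of 40 features). The loss function is Mean Absolute Error (MAE), the optimizer is Adam, and the model is trained for 100 epochs with a batch size of 72.
The validation_data argument is set so that both training and test losses are recorded, and the loss curves are plotted once training finishes.
The Keras callback ReduceLROnPlateau() lowers the learning rate during training when the validation loss stops improving.

Note: the predictions must be combined with part of the test data before the scaling is inverted. The scaling has to be inverted because the raw data was preprocessed with MinMaxScaler(), so the loss is computed on scaled values; to measure the error in the original units, the data must be converted back. The array passed to the inverse transform must have the same number of columns as the array the scaler was fitted on, otherwise an error is raised. After this step, the RMSE (root mean squared error) is computed.

def fit_network(train_X, train_y, test_X, test_y, scaler):
    model = Sequential()
    # Every LSTM layer that feeds another LSTM layer needs return_sequences=True
    model.add(LSTM(50, return_sequences=True, input_shape=(train_X.shape[1], train_X.shape[2])))
    model.add(Dropout(0.3))
    model.add(LSTM(50, return_sequences=True))
    model.add(Dropout(0.3))
    # The last LSTM layer returns only its final output, which feeds the Dense layer
    model.add(LSTM(50))
    model.add(Dense(5))
    model.compile(loss='mae', optimizer='adam')
    # Lower the learning rate when val_loss has not improved for 10 epochs
    reduce_lr = ReduceLROnPlateau(monitor='val_loss', patience=10, mode='auto')
    # model.fit returns an object whose history attribute records the losses
    history = model.fit(train_X, train_y, epochs=100, batch_size=72,
                        validation_data=(test_X, test_y), verbose=2,
                        shuffle=False, callbacks=[reduce_lr])
    # Plot the loss history
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='test')
    pyplot.legend()
    pyplot.show()
    # Make a prediction
    yhat = model.predict(test_X)
    test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))
    # Invert scaling for the forecast: rebuild the 45-column layout the scaler
    # was fitted on (40 inputs + 5 outputs), then keep the last 5 columns
    inv_yhat = concatenate((test_X, yhat), axis=1)
    inv_yhat = scaler.inverse_transform(inv_yhat)
    inv_yhat = inv_yhat[:, -5:]
    print(inv_yhat[-1])  # last predicted 5-step window, in original units
    # Invert scaling for the actual values
    inv_y = concatenate((test_X, test_y), axis=1)
    inv_y = scaler.inverse_transform(inv_y)
    inv_y = inv_y[:, -5:]
    # Calculate RMSE
    rmse = sqrt(mean_squared_error(inv_y, inv_yhat))
    print('Test RMSE: %.3f' % rmse)
    return inv_yhat, inv_y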
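The concatenation before the inverse transform is needed only because the scaler expects all 45 columns. Since min-max scaling works column by column, the target columns can also be inverted on their own from their per-column minimum and range. The sketch below checks this with plain NumPy on random stand-in data. The arrays `data_min` and `data_range` are computed by hand here; scikit-learn's fitted MinMaxScaler stores the corresponding values in its `data_min_` and `data_range_` attributes.

```python
import numpy as np

rng = np.random.default_rng(1)
raw = rng.random((50, 45)) * 300.0   # random stand-in for the 45-column table

# Column-wise min-max scaling to [0, 1].
data_min = raw.min(axis=0)
data_range = raw.max(axis=0) - data_min
scaled = (raw - data_min) / data_range

yhat_scaled = scaled[:, -5:]         # pretend these are the model's predictions

# Route 1 (as in the article): rebuild all 45 columns, invert, slice the targets.
full = np.concatenate((scaled[:, :-5], yhat_scaled), axis=1)
inv_route1 = (full * data_range + data_min)[:, -5:]

# Route 2: invert only the 5 target columns with their own min and range.
inv_route2 = yhat_scaled * data_range[-5:] + data_min[-5:]

print(np.allclose(inv_route1, inv_route2))
```

Both routes give the same values; the second avoids having to carry the input columns along purely to satisfy the expected shape.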

Experimental results: [figure not included]

Summary:

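The error is somewhat larger than when the previous time step is used to predict only the next one, but this model applies to a wider range of problems.
Once multivariate forecasting is understood, univariate forecasting follows easily.
The model can be refined further to suit specific needs.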
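To see how the error is spread across the five forecast steps, the RMSE can be computed per column instead of as a single number. The helper below is a hypothetical addition, not part of the original code, and the sample values are made up.

```python
import numpy as np

def rmse_per_horizon(y_true, y_pred):
    """Return one RMSE per forecast step (one value per column)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))

# Made-up numbers: 3 samples, 2 forecast steps.
y_true = np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]])
y_pred = np.array([[11.0, 24.0], [29.0, 36.0], [51.0, 64.0]])

print(rmse_per_horizon(y_true, y_pred))
```

Applied to the arrays returned by fit_network, the call would be rmse_per_horizon(inv_y, inv_yhat), giving one value for each of the five predicted hours.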

Reference: 1. Hands-on LSTM with Keras (使用Keras进行LSTM实战)
