探讨EMD数据泄露问题的时序预测模型：EMD-CNN-LSTM实现与分析

EMD-CNN-LSTM是一种结合了经验模态分解（EMD）、卷积神经网络（CNN）和长短期记忆网络（LSTM）的模型，用于处理时序数据分析，能够自适应地提取时频特征并建模序列依赖关系。

1. 前言

在进行EMD-CNN-LSTM模型之前先简单介绍一下EMD信号分解方法及作者见解：

EMD是一种基于局部特性的信号分解方法，将非线性非平稳信号分解成一系列固有模态函数（IMFs），每个IMF代表特定频率的振荡模式，不同于VMD信号分解（VMD可指定分解数）EMD是一种自适应的方法，根据信号的局部特性自动确定分解出的固有模态函数（IMF）的数量，这就导致了在模型构建过程中存在一定的数据泄露——建立深度学习模型需要划分数据集为训练集、验证集、测试集，在模型构建流程中为避免数据泄露问题都会采用训练集的统计指标来对验证集和测试集进行相应的处理，而EMD对训练集、验证集、测试集进行单独分解时因为其实现方式的原因是不能保证分解数IMF是一致的从而模型构建会存在问题，所以只有对所有数据集进行EMD分解保证IMF数量一致然后划分数据集然而在作者眼里这是存在一定数据泄露的因为未来信息泄露到了训练集中，当然有人会说那为什么不直接采用VMD分解呢？可以人为进行指定分解数从而保证训练集、验证集、测试集的分解数一致，这种情况下又存在另外一个问题就是，假如我的训练集最后的分解数为10然后验证集、测试集的分解数也确定为10，但是验证集、测试集的最好分解数可能不为10导致训练集的IMF1和验证集、测试集的IMF1就不是一回事了，最后的预测效果也不好，当然以上都是作者的理解，目前也存在很多论文采用这些方法，如果有更好的解决方法，欢迎私信友好交流，以下代码只考虑如何实现EMD-CNN-LSTM模型。

2. 代码实现

2.1 读取数据

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

plt.rcParams['font.sans-serif'] = 'SimHei' # 设置中文显示

plt.rcParams['axes.unicode_minus'] = False



df = pd.read_excel('数据.xlsx',index_col=0, parse_dates=['数据时间'])

2.2 在不同的数据集上进行EMD分解

2.2.1 数据集划分

train_ratio = 0.7  # 训练集所占比例

n = len(df)

train_size = int(n * train_ratio)



train_df = df.iloc[:train_size]

test_df = df.iloc[train_size:]



print("训练集长度:", len(train_df))

print("测试集长度:", len(test_df))

在这里只把数据集划分为训练集、测试集来验证EMD分解可能会在不同数据集下产生不同数量的分解数IMF，从而不能建立模型。

2.2.2 EMD在训练集上分解

from PyEMD import EMD

# 执行EMD分解

emd = EMD()

IMFs = emd(np.array(train_df).reshape(-1))



# 可视化原始信号和IMFs

plt.figure(figsize=(20, 15))



plt.subplot(len(IMFs) + 1, 1, 1)

plt.plot(train_df.index, np.array(train_df).reshape(-1), 'r')

plt.title('原始信号')



for i, IMF in enumerate(IMFs):

    plt.subplot(len(IMFs) + 1, 1, i + 2)

    plt.plot(train_df.index, IMF)

    plt.title(f'IMF {i + 1}')



plt.tight_layout()

plt.show()

可以发现在训练集上EMD分解存在8个IMF。

2.2.3 EMD在测试集上分解

emd = EMD()

IMFs = emd(np.array(test_df).reshape(-1))



plt.figure(figsize=(20, 15))



plt.subplot(len(IMFs) + 1, 1, 1)

plt.plot(test_df.index, np.array(test_df).reshape(-1), 'r')

plt.title('原始信号')



for i, IMF in enumerate(IMFs):

    plt.subplot(len(IMFs) + 1, 1, i + 2)

    plt.plot(test_df.index, IMF)

    plt.title(f'IMF {i + 1}')



plt.tight_layout()

plt.show()

可以发现在测试集上EMD分解只存在6个IMF，与在训练集上的8个IMF数量不相等，从而导致模型不能正常构建于是只能考虑对整体数据进行EMD分解保证IMF数量一致，然后在进行数据集划分，当然这种做法存在数据泄露问题。

2.3 整体数据进行EMD分解

from PyEMD import EMD



emd = EMD()

IMFs = emd(np.array(df).reshape(-1))



# 计算残差

residue = np.array(df).reshape(-1) - np.sum(IMFs, axis=0)



plt.figure(figsize=(20, 15))



plt.subplot(len(IMFs) + 2, 1, 1)

plt.plot(df.index, np.array(df).reshape(-1), 'r')

plt.title('原始信号')



for i, IMF in enumerate(IMFs):

    plt.subplot(len(IMFs) + 2, 1, i + 2)

    plt.plot(df.index, IMF)

    plt.title(f'IMF {i + 1}')



plt.subplot(len(IMFs) + 2, 1, len(IMFs) + 2)

plt.plot(df.index, residue)

plt.title('残差')



plt.tight_layout()

plt.show()

EMD分解信号时会产生残差，这些残差是原始信号与所有提取的固有模态函数（IMFs）之和的差异，在建模时，可以考虑将残差加入模型中，以提高模型的性能，也可以不纳入模型构建，根据实际情况选择，这里我们只考虑IMF。

2.4 对各个IMF划分数据集

# 保存模态分解结果

imf_df = pd.DataFrame(index=df.index)

for i, IMF in enumerate(IMFs):

    imf_df[f'IMF{i + 1}'] = IMF



# 定义划分比例

train_ratio = 0.7

val_ratio = 0.1

test_ratio = 0.2



# 计算划分的索引

train_split = int(train_ratio * len(imf_df))

val_split = int((train_ratio + val_ratio) * len(imf_df))



# 划分数据集

train_set = imf_df.iloc[:train_split]

val_set = imf_df.iloc[train_split:val_split]

test_set = imf_df.iloc[val_split:]

将数据集按照70%的训练集、10%的验证集和20%的测试集进行划分，训练集用于训练模型，验证集用于调整模型超参数和监控模型性能，测试集用于评估模型的泛化性能。

2.5 数据集标准化

from sklearn.preprocessing import MinMaxScaler



def normalize_dataframe(train_set, val_set, test_set):

    scaler = MinMaxScaler()

    scaler.fit(train_set)  # 在训练集上拟合归一化模型



    train = pd.DataFrame(scaler.transform(train_set), columns=train_set.columns, index = train_set.index)

    val = pd.DataFrame(scaler.transform(val_set), columns=val_set.columns, index = val_set.index)

    test = pd.DataFrame(scaler.transform(test_set), columns=test_set.columns, index = test_set.index)

    return train, val, test, scaler



train, val, test, scaler = normalize_dataframe(train_set, val_set, test_set)

对数据进行标准化处理，以提高模型的训练稳定性和收敛速度。

2.6 时间窗口划分

def prepare_data(data, win_size, target_feature_idxs):

    num_features = data.shape[1]

    X = []  

    y = []  

    for i in range(len(data) - win_size):

        temp_x = data[i:i + win_size, :]  

        temp_y = [data[i + win_size, idx] for idx in target_feature_idxs]  

        X.append(temp_x)

        y.append(temp_y)

    X = np.asarray(X)

    y = np.asarray(y)



    return X, y



win_size = 12 # 时间窗口

target_feature_idxs = np.arange(9) # 指定待预测特征列索引

train_x, train_y = prepare_data(train.values, win_size, target_feature_idxs)

val_x, val_y = prepare_data(val.values, win_size, target_feature_idxs)

test_x, test_y = prepare_data(test.values, win_size, target_feature_idxs)

print("训练集形状:", train_x.shape, train_y.shape)

print("验证集形状:", val_x.shape, val_y.shape)

print("测试集形状:", test_x.shape, test_y.shape)

这里不对时间窗口这个内容作过多描述。

2.7 模型构建

from tensorflow.keras.layers import LSTM, Dense

from tensorflow.keras.models import Model

from tensorflow.keras.layers import Input

from tensorflow.keras.layers import Conv1D, MaxPooling1D, Reshape, Flatten



# 输入维度

input_shape = Input(shape=(train_x.shape[1], train_x.shape[2]))



# LSTM层

lstm_layer = LSTM(256, activation='relu')(input_shape)

# 卷积

shape = Reshape((256, 1))(lstm_layer)

conv = Conv1D(filters=64, kernel_size=7, activation='relu')(shape)

maxpool = MaxPooling1D(pool_size=2)(conv)

fal = Flatten()(maxpool)

# 全连接层

dense_1 = Dense(128, activation='relu')(fal)

dense_2 = Dense(64, activation='relu')(dense_1)



# 输出层

output_1 = Dense(1, name='IMF1')(dense_2)

output_2 = Dense(1, name='IMF2')(dense_2)

output_3 = Dense(1, name='IMF3')(dense_2)

output_4 = Dense(1, name='IMF4')(dense_2)

output_5 = Dense(1, name='IMF5')(dense_2)

output_6 = Dense(1, name='IMF6')(dense_2)

output_7 = Dense(1, name='IMF7')(dense_2)

output_8 = Dense(1, name='IMF8')(dense_2)

output_9 = Dense(1, name='IMF9')(dense_2)



model = Model(inputs = input_shape, outputs=[output_1, output_2, output_3, output_4, output_5, output_6, 

                                             output_7, output_8, output_9])

model.compile(loss='mse', optimizer='adam')

# 模型拟合

history = model.fit(train_x, [train_y[:,i] for i in range(train_y.shape[1])], epochs=50, batch_size=32, 

                    validation_data=(val_x, [val_y[:,i] for i in range(val_y.shape[1])]))



plt.figure()

plt.plot(history.history['loss'], c='b', label='loss')

plt.plot(history.history['val_loss'], c='g', label='val_loss')

plt.legend()

plt.show()

model.summary()

该模型是一个结合了LSTM和卷积神经网络的多输入多输出模型，用于处理时序数据，其中每个输出对应于经验模态分解（EMD）产生的不同的固有模态函数（IMF），当然也可以单独对每一个IMF建立一个不同参数的预测模型，不一定要像这样统一建模。

2.8 模型向后预测

def predict_next_11_days(model, input_data):

    input_sequence = input_data.copy()



    # 预测未来 11 天的数据

    future_predictions = []

    for _ in range(11):

        predictions = model.predict(np.expand_dims(input_sequence[-1], axis=0))

        next_data = np.append(input_sequence[-1, 1:], np.array(predictions).reshape(1,9), axis=0)

        input_sequence = np.append(input_sequence, [next_data], axis=0)

        future_predictions.append(predictions)

    future_predictions = np.array(future_predictions).reshape(11, 9)



    return future_predictions



future_predictions = predict_next_11_days(model, test_x[-1:])

future_predictions

2.9 反归一化

min_values = train_set.min()

max_values = train_set.max()



# 测试集预测结果反归一化

IMFs = []

for i in range(9):  

    IMF = y_pred[i] * (max_values[i] - min_values[i]) + min_values[i]

    IMFs.append(IMF)



# 向后预测反归一化

IMFn = []

for i in range(9): 

    IMF = future_predictions[:,i] * (max_values[i] - min_values[i]) + min_values[i]

    IMFn.append(IMF)

2.10 模型预测可视化

plt.figure(figsize=(20, 30), dpi=300)



for i in range(9):  # 假设有12个IMF

    plt.subplot(3, 3, i+1)

    plt.plot(train_set[f'IMF{i+1}'], label='训练集', color='blue', alpha=0.8)

    plt.plot(val_set[f'IMF{i+1}'], label='验证集', color='gold', alpha=0.8)

    plt.plot(test_set[f'IMF{i+1}'], label='测试集', color='r', alpha=0.8)

    plt.plot(pd.date_range(start='2020-12-19', end='2021-08-31', freq='D'), IMFs[i].reshape(-1), label='测试集预测', color='navy', alpha=0.8)

    plt.plot(pd.date_range(start='2021-08-31', end='2021-09-10', freq='D'), IMFn[i].reshape(-1), label='向后预测10天', color='limegreen', alpha=0.8)

    plt.legend()

    plt.title(f'IMF{i+1}')

    plt.grid(True)



plt.tight_layout()

plt.show()

可以发现模型预测，在每一个IMF的表现是不一样的，甚至可能差异比较大，这里我们就可以对这种预测结果不理想的IMF进行单独建立模型，以提高其模型拟合度，模型拟合效果较好的还是考虑统一建模以减轻工作量。

2.11 各IMF预测结果相加得到最后的预测结果

sum_IMFs = np.sum(IMFs, axis=0)

sum_IMFn = np.sum(IMFn, axis=0)



plt.figure(figsize=(15, 5))

plt.plot(np.sum(train_set, axis=1), label='训练集', color='blue', alpha=0.8)

plt.plot(np.sum(val_set, axis=1), label='验证集', color='gold', alpha=0.8)

plt.plot(np.sum(test_set, axis=1), label='测试集', color='r', alpha=0.8)

plt.plot(pd.date_range(start='2020-12-19', end='2021-08-31', freq='D'), sum_IMFs, label='测试集预测', color='navy', alpha=0.8)

plt.plot(pd.date_range(start='2021-08-31', end='2021-09-10', freq='D'), sum_IMFn, label='向后预测10天', color='limegreen', alpha=0.8)

plt.legend()

plt.title('预测结果')

plt.grid(True)

plt.tight_layout()

plt.show()

mse = metrics.mean_squared_error(np.sum(test_set.iloc[12:], axis=1), np.array([i for arr in sum_IMFs for i in arr]))

rmse = np.sqrt(mse)

mae = metrics.mean_absolute_error(np.sum(test_set.iloc[12:], axis=1), np.array([i for arr in sum_IMFs for i in arr]))

r2 = r2_score(np.sum(test_set.iloc[12:], axis=1), np.array([i for arr in sum_IMFs for i in arr]))

print("均方误差 (MSE):", mse)

print("均方根误差 (RMSE):", rmse)

print("平均绝对误差 (MAE):", mae)

print("拟合优度:", r2)

最后计算了总预测结果的评价指标，可以发现拟合优度R^2为0.8786，虽然拟合优度有明显提示但是作者认为这是数据泄露的功劳，且目前作者对于使用EMD分解如何避免数据泄露问题还没有发现一个好的思路。

文章转自微信公众号@Python机器学习AI