「日拱一码」022 Machine Learning

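Single random split methods

Plain single random split (train_test_split)

Stratified single random split (train_test_split with the stratify parameter)

Repeated random split methods

Plain repeated random split (ShuffleSplit)

Stratified repeated random split (StratifiedShuffleSplit)

Cross-validation methods

K-fold cross-validation (KFold)

Stratified K-fold cross-validation (StratifiedKFold)

Group-based split methods

Group random split (GroupShuffleSplit)

Group K-fold cross-validation (GroupKFold)

Stratified group K-fold cross-validation (StratifiedGroupKFold)

Time-series split methods

Custom split methods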

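Data splitting is an important step in data preprocessing. It divides a dataset into training, validation, and test sets so that a model can be trained, its hyperparameters tuned, and its performance evaluated. The common splitting methods are described below.

Single random split methods

Plain single random split (train_test_split)

A plain single random split randomly divides the dataset into a training set and a test set (or into training, validation, and test sets). It suits data whose distribution is fairly even, because a purely random draw makes no attempt to preserve class proportions.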

## Single random split methods
# Create the dataset
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create a simple classification dataset: 100 samples, 4 features (2 informative, 2 redundant)
X, y = make_classification(n_samples=100, n_features=4, n_informative=2, n_redundant=2, random_state=42)

# Convert the data to a DataFrame
df = pd.DataFrame(X, columns=['Feature1', 'Feature2', 'Feature3', 'Feature4'])
df['Label'] = y

# Add group information (assume each sample belongs to one group)
# Note: this draw is not seeded, so the group sizes in the outputs of later sections differ from run to run
df['Group'] = np.random.choice(['Group1', 'Group2', 'Group3'], size=len(df), p=[0.4, 0.3, 0.3])

# Add timestamp information (assume the data was generated in chronological order)
df['Timestamp'] = pd.date_range(start='2025-01-01', periods=len(df), freq='D')

# print(df)
#     Feature1  Feature2  Feature3  Feature4  Label   Group  Timestamp
# 0  -1.053839 -1.027544 -0.329294  0.826007      1  Group1 2025-01-01
# 1   1.569317  1.306542 -0.239385 -0.331376      0  Group1 2025-01-02
# 2  -0.358856 -0.691021 -1.225329  1.652145      1  Group2 2025-01-03
# 3  -0.136856  0.460938  1.896911 -2.281386      0  Group3 2025-01-04
# 4  -0.048629  0.502301  1.778730 -2.171053      0  Group2 2025-01-05
# ..       ...       ...       ...       ...    ...     ...        ...
# 95 -2.241820 -1.248690  2.357902 -2.009185      0  Group2 2025-04-06
# 96  0.573042  0.362054 -0.462814  0.341294      1  Group2 2025-04-07
# 97 -0.375121 -0.149518  0.588465 -0.575002      0  Group2 2025-04-08
# 98  1.594888  0.780256 -2.030223  1.863789      1  Group1 2025-04-09
# 99 -0.149941 -0.566037 -1.416933  1.804741      1  Group1 2025-04-10
#
# [100 rows x 7 columns]

# Plain single random split (train_test_split)
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=['Label', 'Group', 'Timestamp']), df['Label'],
    test_size=0.2, random_state=42)

print("Plain single random split result:")
print("Training set size:", X_train.shape)  # (80, 4)
print("Test set size:", X_test.shape)  # (20, 4)
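The text above also mentions a three-way split into training, validation, and test sets, which a single train_test_split call does not produce. One common way to get it, sketched here under the assumption of a 60/20/20 ratio (the ratio and the variable names are illustrative, not taken from the original), is to call train_test_split twice:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, n_features=4, n_informative=2,
                           n_redundant=2, random_state=42)

# Step 1: hold out 20% of all samples as the test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Step 2: take 25% of the remaining 80 samples as the validation set (0.25 * 80 = 20)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

print(X_train.shape, X_val.shape, X_test.shape)  # (60, 4) (20, 4) (20, 4)
```

The second test_size is relative to what is left after the first split, which is why 0.25 rather than 0.2 yields a validation set of 20 samples.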

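Stratified single random split (train_test_split with the stratify parameter)

A stratified single random split is a random split that additionally keeps the distribution of the target variable in each resulting subset consistent with the original dataset. This matters most for classification problems, especially when some classes have only a few samples.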

# Stratified single random split
X_train_strat, X_test_strat, y_train_strat, y_test_strat = train_test_split(
    df.drop(columns=['Label', 'Group', 'Timestamp']), df['Label'],
    test_size=0.2, stratify=df['Label'], random_state=42)

print("Stratified single random split result:")
print("Training set label distribution:", y_train_strat.value_counts())
# Label
# 0    40
# 1    40
# Name: count, dtype: int64
print("Test set label distribution:", y_test_strat.value_counts())
# Label
# 1    10
# 0    10
# Name: count, dtype: int64
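The dataset in this post is balanced (50 samples per class), so the benefit of stratify is hard to see. The sketch below uses a hypothetical imbalanced label vector (90 negatives, 10 positives, made up for illustration) to compare a plain split with a stratified one:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 0, 10 samples of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)  # placeholder features

# Plain split: the number of positives in the test set depends on the random draw
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=0)

# Stratified split: the 90/10 ratio is preserved in the test set
_, _, _, y_test_strat = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

plain_counts = np.bincount(y_test_plain, minlength=2)
strat_counts = np.bincount(y_test_strat, minlength=2)
print("plain test set [class 0, class 1]:", plain_counts)
print("stratified test set [class 0, class 1]:", strat_counts)  # [18  2]
```

With so few positives, a plain split can leave the test set with too few minority samples to evaluate on, while the stratified split keeps the class ratio.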

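Repeated random split methods

Plain repeated random split (ShuffleSplit)

A plain repeated random split shuffles the data and then splits it into a training set and a test set at the specified ratio, repeating this n_splits times with a fresh shuffle each time. It does not take the distribution of the target variable (the labels) into account.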

## Repeated random split methods
# Plain repeated random split (ShuffleSplit)
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
for fold, (train_idx, test_idx) in enumerate(ss.split(df)):
    train_data = df.iloc[train_idx]
    test_data = df.iloc[test_idx]
    print(f"Split {fold + 1}:")
    print("Training set size:", train_data.shape)
    print("Test set size:", test_data.shape)
    print("Training set label distribution:", train_data['Label'].value_counts())
    print("Test set label distribution:", test_data['Label'].value_counts())
    print()
# Split 1:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Training set label distribution: Label
# 0    42
# 1    38
# Name: count, dtype: int64
# Test set label distribution: Label
# 1    12
# 0     8
# Name: count, dtype: int64
#
# Split 2:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Training set label distribution: Label
# 0    43
# 1    37
# Name: count, dtype: int64
# Test set label distribution: Label
# 1    13
# 0     7
# Name: count, dtype: int64
#
# Split 3:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Training set label distribution: Label
# 1    40
# 0    40
# Name: count, dtype: int64
# Test set label distribution: Label
# 1    10
# 0    10
# Name: count, dtype: int64
#
# Split 4:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Training set label distribution: Label
# 1    42
# 0    38
# Name: count, dtype: int64
# Test set label distribution: Label
# 0    12
# 1     8
# Name: count, dtype: int64
#
# Split 5:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Training set label distribution: Label
# 0    41
# 1    39
# Name: count, dtype: int64
# Test set label distribution: Label
# 1    11
# 0     9
# Name: count, dtype: int64

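Stratified repeated random split (StratifiedShuffleSplit)

A stratified repeated random split keeps the distribution of the target variable (the labels) in every split consistent with the original dataset.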

# Stratified repeated random split (StratifiedShuffleSplit)
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
for fold, (train_idx, test_idx) in enumerate(sss.split(df.drop(columns=['Label']), df['Label'])):
    train_data = df.iloc[train_idx]
    test_data = df.iloc[test_idx]
    print(f"Split {fold + 1}:")
    print("Training set size:", train_data.shape)
    print("Test set size:", test_data.shape)
    print("Training set label distribution:", train_data['Label'].value_counts())
    print("Test set label distribution:", test_data['Label'].value_counts())
    print()
# Split 1:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Training set label distribution: Label
# 0    40
# 1    40
# Name: count, dtype: int64
# Test set label distribution: Label
# 1    10
# 0    10
# Name: count, dtype: int64
#
# Split 2:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Training set label distribution: Label
# 1    40
# 0    40
# Name: count, dtype: int64
# Test set label distribution: Label
# 0    10
# 1    10
# Name: count, dtype: int64
#
# Split 3:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Training set label distribution: Label
# 1    40
# 0    40
# Name: count, dtype: int64
# Test set label distribution: Label
# 0    10
# 1    10
# Name: count, dtype: int64
#
# Split 4:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Training set label distribution: Label
# 1    40
# 0    40
# Name: count, dtype: int64
# Test set label distribution: Label
# 1    10
# 0    10
# Name: count, dtype: int64
#
# Split 5:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Training set label distribution: Label
# 0    40
# 1    40
# Name: count, dtype: int64
# Test set label distribution: Label
# 1    10
# 0    10
# Name: count, dtype: int64

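Cross-validation methods

Cross-validation is a more robust way to split data: by splitting the dataset several times and training a model on each split, it gives a more reliable estimate of model performance than a single split.

K-fold cross-validation (KFold)

The dataset is divided into K subsets (folds). Each fold serves as the test set once while the remaining K-1 folds form the training set, so the procedure repeats K times.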

## Cross-validation methods
# K-fold cross-validation
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(df)):
    train_data_kfold = df.iloc[train_idx]
    test_data_kfold = df.iloc[test_idx]
    print(f"Fold {fold + 1}:")
    print("Training set size:", train_data_kfold.shape)
    print("Test set size:", test_data_kfold.shape)
# Fold 1:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Fold 2:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Fold 3:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Fold 4:
# Training set size: (80, 7)
# Test set size: (20, 7)
# Fold 5:
# Training set size: (80, 7)
# Test set size: (20, 7)
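The sizes printed by KFold and ShuffleSplit look identical (80/20 five times), but the two differ in how the test sets relate to each other. The sketch below counts how often each sample lands in a test set. The helper name count_test_appearances is made up for this illustration:

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X = np.zeros((100, 1))  # placeholder features; only the number of rows matters

def count_test_appearances(splitter, X):
    """Count how many times each sample is placed in a test set."""
    counts = np.zeros(len(X), dtype=int)
    for _, test_idx in splitter.split(X):
        counts[test_idx] += 1
    return counts

kf_counts = count_test_appearances(KFold(n_splits=5, shuffle=True, random_state=42), X)
ss_counts = count_test_appearances(ShuffleSplit(n_splits=5, test_size=0.2, random_state=42), X)

# K-fold partitions the data: every sample is tested exactly once
print("KFold appearance counts:", np.unique(kf_counts))  # [1]
# Repeated random splits are drawn independently, so some samples may be
# tested several times and others never
print("ShuffleSplit appearance counts:", np.unique(ss_counts))
```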

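Stratified K-fold cross-validation (StratifiedKFold)

Stratified K-fold builds on K-fold cross-validation and keeps the distribution of the target variable in each fold consistent with the original dataset.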

# Stratified K-fold cross-validation
from sklearn.model_selection import StratifiedKFold  # this import was missing in the original

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(df.drop(columns=['Label', 'Group', 'Timestamp']), df['Label'])):
    train_data_skfold = df.iloc[train_idx]
    test_data_skfold = df.iloc[test_idx]
    print(f"Fold {fold + 1}:")
    print("Training set label distribution:", train_data_skfold['Label'].value_counts())
    print("Test set label distribution:", test_data_skfold['Label'].value_counts())
# Fold 1:
# Training set label distribution: Label
# 1    40
# 0    40
# Name: count, dtype: int64
# Test set label distribution: Label
# 0    10
# 1    10
# Name: count, dtype: int64
# Fold 2:
# Training set label distribution: Label
# 0    40
# 1    40
# Name: count, dtype: int64
# Test set label distribution: Label
# 1    10
# 0    10
# Name: count, dtype: int64
# Fold 3:
# Training set label distribution: Label
# 1    40
# 0    40
# Name: count, dtype: int64
# Test set label distribution: Label
# 0    10
# 1    10
# Name: count, dtype: int64
# Fold 4:
# Training set label distribution: Label
# 1    40
# 0    40
# Name: count, dtype: int64
# Test set label distribution: Label
# 1    10
# 0    10
# Name: count, dtype: int64
# Fold 5:
# Training set label distribution: Label
# 1    40
# 0    40
# Name: count, dtype: int64
# Test set label distribution: Label
# 0    10
# 1    10
# Name: count, dtype: int64
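The loops above only print the splits. To actually evaluate a model, a splitter can be handed to cross_val_score through its cv argument. The following is a minimal sketch. The choice of LogisticRegression and the results dictionary are assumptions for illustration, not part of the original post:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=100, n_features=4, n_informative=2,
                           n_redundant=2, random_state=42)
model = LogisticRegression(max_iter=1000)  # illustrative model choice

splitters = {
    "KFold": KFold(n_splits=5, shuffle=True, random_state=42),
    "StratifiedKFold": StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
}

results = {}
for name, cv in splitters.items():
    # cross_val_score fits a fresh copy of the model on each training fold and
    # returns one score per fold (accuracy by default for classifiers)
    scores = cross_val_score(model, X, y, cv=cv)
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Reporting the standard deviation alongside the mean shows how much the estimate depends on the particular split.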

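Group-based split methods

When the data has a group structure (for example users or experimental groups), every group must land entirely in one subset after splitting, and samples from the same group must not be spread across different subsets. Otherwise the model is tested on groups it has already seen during training, which tends to inflate the score.

Group random split (GroupShuffleSplit)

Group random split is a random splitting method that selects whole groups at random for the training set and the test set. This ensures that samples from the same group are never split across subsets.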

## Group-based split methods
# Group random split (GroupShuffleSplit)
from sklearn.model_selection import GroupShuffleSplit

# test_size is the proportion of groups (not samples) placed in the test set;
# with 3 groups, 0.2 rounds up to one whole group
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(df.drop(columns=['Label', 'Group']), df['Label'], groups=df['Group']))
train_data_gss = df.iloc[train_idx]
test_data_gss = df.iloc[test_idx]

print("Group random split result:")
print("Training set size:", train_data_gss.shape)  # (64, 7)
print("Test set size:", test_data_gss.shape)  # (36, 7)
print("Training set group distribution:", train_data_gss['Group'].value_counts())
# Group
# Group2    34
# Group3    30
# Name: count, dtype: int64
print("Test set group distribution:", test_data_gss['Group'].value_counts())
# Group
# Group1    36
# Name: count, dtype: int64
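The result above puts 36 of 100 rows in the test set even though test_size=0.2. That is because GroupShuffleSplit applies test_size to the number of groups, not to the number of samples. A small sketch with hypothetical groups of unequal size (the sizes are made up for illustration) shows the effect:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 10 groups, one large (28 samples) and nine small (8 samples each)
sizes = [28] + [8] * 9
groups = np.repeat(np.arange(10), sizes)
X = np.zeros((len(groups), 1))  # placeholder features

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, groups=groups))

n_test_groups = len(np.unique(groups[test_idx]))
print("groups in test set:", n_test_groups)  # 2, i.e. 20% of the 10 groups
# The sample share is 0.16 or 0.36 depending on whether the large group was drawn
print("share of samples in test set:", len(test_idx) / len(groups))
```

When group sizes vary a lot, it is worth checking the resulting sample counts rather than relying on test_size alone.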

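Group K-fold cross-validation (GroupKFold)

Group K-fold cross-validation is a repeated splitting method: it divides the dataset into K folds, and each fold serves as the test set once while the rest form the training set. Unlike plain K-fold, it keeps every group whole, so each group appears in the test set of exactly one fold. As a consequence, n_splits cannot exceed the number of distinct groups.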

# Group K-fold cross-validation (GroupKFold)
from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(gkf.split(df.drop(columns=['Label', 'Group']), df['Label'], groups=df['Group'])):
    train_data_gkf = df.iloc[train_idx]
    test_data_gkf = df.iloc[test_idx]
    print(f"Fold {fold + 1}:")
    print("Training set size:", train_data_gkf.shape)
    print("Test set size:", test_data_gkf.shape)
    print("Training set group distribution:", train_data_gkf['Group'].value_counts())
    print("Test set group distribution:", test_data_gkf['Group'].value_counts())
# Fold 1:
# Training set size: (58, 7)
# Test set size: (42, 7)
# Training set group distribution: Group
# Group3    31
# Group2    27
# Name: count, dtype: int64
# Test set group distribution: Group
# Group1    42
# Name: count, dtype: int64
# Fold 2:
# Training set size: (69, 7)
# Test set size: (31, 7)
# Training set group distribution: Group
# Group1    42
# Group2    27
# Name: count, dtype: int64
# Test set group distribution: Group
# Group3    31
# Name: count, dtype: int64
# Fold 3:
# Training set size: (73, 7)
# Test set size: (27, 7)
# Training set group distribution: Group
# Group1    42
# Group3    31
# Name: count, dtype: int64
# Test set group distribution: Group
# Group2    27
# Name: count, dtype: int64
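With only three groups, each fold above is simply one group. The sketch below uses more groups than folds to check the property that matters, namely that no group ever appears on both sides of a split. The 10 hypothetical user ids and the random data are illustrative only:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
groups = np.repeat(np.arange(10), 10)  # 10 hypothetical users, 10 samples each
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

shared_per_fold = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=groups):
    # Groups present in both the training and the test indices would mean leakage
    shared = set(groups[train_idx]) & set(groups[test_idx])
    shared_per_fold.append(len(shared))

print(shared_per_fold)  # [0, 0, 0, 0, 0]
```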

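Stratified group K-fold cross-validation (StratifiedGroupKFold)

Stratified group K-fold cross-validation combines the ideas of grouping and stratification. It never splits a group across subsets, and within that constraint it tries to keep the target distribution of each fold as close as possible to that of the original dataset. An exact match is generally not achievable once groups must stay whole, as the label counts in the output below show. Note that the code below defines a simplified hand-written class named StratifiedGroupKFold; scikit-learn 1.0 and later also provide sklearn.model_selection.StratifiedGroupKFold.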

import numpy as np


class StratifiedGroupKFold:
    """Simplified hand-written splitter (not scikit-learn's class of the same name)."""

    def __init__(self, n_splits=5):
        self.n_splits = n_splits

    def split(self, X, y, groups):
        y = np.asarray(y)
        groups = np.asarray(groups)
        unique_groups = np.unique(groups)
        group_to_y = {group: y[groups == group] for group in unique_groups}
        group_to_idx = {group: np.where(groups == group)[0] for group in unique_groups}
        # Sort groups by the proportion of positive labels
        sorted_groups = sorted(unique_groups, key=lambda g: np.mean(group_to_y[g]))
        # Deal the sorted groups out to the folds in round-robin order, so that each fold
        # mixes groups with low and high positive rates (contiguous chunks of the sorted
        # list would put similar groups together and work against stratification)
        folds = [sorted_groups[i::self.n_splits] for i in range(self.n_splits)]
        for fold in folds:
            test_idx = np.concatenate([group_to_idx[group] for group in fold])
            train_idx = np.setdiff1d(np.arange(len(groups)), test_idx)
            yield train_idx, test_idx


# Stratified group K-fold cross-validation (StratifiedGroupKFold)
sgkf = StratifiedGroupKFold(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(sgkf.split(df.drop(columns=['Label', 'Group']), df['Label'], groups=df['Group'])):
    train_data_sgkf = df.iloc[train_idx]
    test_data_sgkf = df.iloc[test_idx]
    print(f"Fold {fold + 1}:")
    print("Training set size:", train_data_sgkf.shape)
    print("Test set size:", test_data_sgkf.shape)
    print("Training set group distribution:", train_data_sgkf['Group'].value_counts())
    print("Test set group distribution:", test_data_sgkf['Group'].value_counts())
    print("Training set label distribution:", train_data_sgkf['Label'].value_counts())
    print("Test set label distribution:", test_data_sgkf['Label'].value_counts())
# Fold 1:
# Training set size: (73, 7)
# Test set size: (27, 7)
# Training set group distribution: Group
# Group1    38
# Group3    35
# Name: count, dtype: int64
# Test set group distribution: Group
# Group2    27
# Name: count, dtype: int64
# Training set label distribution: Label
# 1    38
# 0    35
# Name: count, dtype: int64
# Test set label distribution: Label
# 0    15
# 1    12
# Name: count, dtype: int64
# Fold 2:
# Training set size: (65, 7)
# Test set size: (35, 7)
# Training set group distribution: Group
# Group1    38
# Group2    27
# Name: count, dtype: int64
# Test set group distribution: Group
# Group3    35
# Name: count, dtype: int64
# Training set label distribution: Label
# 1    34
# 0    31
# Name: count, dtype: int64
# Test set label distribution: Label
# 0    19
# 1    16
# Name: count, dtype: int64
# Fold 3:
# Training set size: (62, 7)
# Test set size: (38, 7)
# Training set group distribution: Group
# Group3    35
# Group2    27
# Name: count, dtype: int64
# Test set group distribution: Group
# Group1    38
# Name: count, dtype: int64
# Training set label distribution: Label
# 0    34
# 1    28
# Name: count, dtype: int64
# Test set label distribution: Label
# 1    22
# 0    16
# Name: count, dtype: int64

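Time-series split methods

Time-series data cannot simply be split at random, because the observations depend on time. The data is usually split in chronological order into training, validation, and test sets, so that everything in the training set is earlier than the validation and test data.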

## Time-series split methods
# Split in chronological order
df_sorted = df.sort_values(by='Timestamp')
train_size = int(len(df_sorted) * 0.8)
train_data = df_sorted.iloc[:train_size]
test_data = df_sorted.iloc[train_size:]

print("Chronological split result:")
print("Training set time range:", train_data['Timestamp'].min(), "to", train_data['Timestamp'].max())  # 2025-01-01 00:00:00 to 2025-03-21 00:00:00
print("Test set time range:", test_data['Timestamp'].min(), "to", test_data['Timestamp'].max())  # 2025-03-22 00:00:00 to 2025-04-10 00:00:00
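The split above produces a single train/test pair. For cross-validation on ordered data, scikit-learn offers TimeSeriesSplit, which the original post does not use. The sketch below shows how it builds expanding training windows that always precede the test window, assuming the rows are already sorted by time:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # 100 observations, assumed to be in time order

tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X))
for fold, (train_idx, test_idx) in enumerate(folds):
    print(f"Fold {fold + 1}: train 0..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")
# Fold 1: train 0..19, test 20..39
# Fold 2: train 0..39, test 40..59
# Fold 3: train 0..59, test 60..79
# Fold 4: train 0..79, test 80..99
```

Each fold trains only on the past and tests on the block that immediately follows it, so no future information reaches the model.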

Custom split methods

In some cases the split needs to follow specific business logic or particular characteristics of the data.

## Custom split
# Split by the value of the Group column
train_data = df[df['Group'] == 'Group1']
val_data = df[df['Group'] == 'Group2']
test_data = df[df['Group'] == 'Group3']

# Print the split result
print("Training set size:", train_data.shape)  # (40, 7)
print("Validation set size:", val_data.shape)  # (24, 7)
print("Test set size:", test_data.shape)  # (36, 7)

print("\nTraining set group distribution:")
print(train_data['Group'].value_counts())
# Group
# Group1    40
# Name: count, dtype: int64

print("\nValidation set group distribution:")
print(val_data['Group'].value_counts())
# Group
# Group2    24
# Name: count, dtype: int64

print("\nTest set group distribution:")
print(test_data['Group'].value_counts())
# Group
# Group3    36
# Name: count, dtype: int64
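A hand-made split like this can also be expressed as a scikit-learn splitter through PredefinedSplit, so that it can be passed as the cv argument of tools such as GridSearchCV. The sketch below is one possible way to do it. The groups array is a hypothetical stand-in that mirrors the group sizes printed above, and Group3 is set aside beforehand as the final test set:

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Hypothetical group labels mirroring the sizes printed above (40 / 24 / 36)
groups = np.array(['Group1'] * 40 + ['Group2'] * 24 + ['Group3'] * 36)

# Keep Group3 out as the final test set; split the rest into training and validation
dev_idx = np.where(groups != 'Group3')[0]

# In test_fold, -1 means "always in the training set" and 0 means "validation fold 0"
test_fold = np.where(groups[dev_idx] == 'Group2', 0, -1)
ps = PredefinedSplit(test_fold)

# Positions returned by the splitter are relative to the development subset
train_pos, val_pos = next(ps.split())
train_idx, val_idx = dev_idx[train_pos], dev_idx[val_pos]

print(ps.get_n_splits())              # 1
print(len(train_idx), len(val_idx))   # 40 24
```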