Mineral Classification System Development Notes (1): Data Preprocessing

Contents

1. Data Overview and Preprocessing Goals

2. Preprocessing Steps and Code Walkthrough

2.1 Data Loading and Initial Cleaning

2.2 Label Encoding

2.3 Missing-Value Handling

(1) Dropping samples with missing values

(2) Per-class mean filling

(3) Per-class median filling

(4) Per-class mode filling

(5) Linear regression filling

(6) Random forest filling

2.4 Feature Standardization

2.5 Dataset Splitting and Class Balancing

(1) Train/test split

(2) Handling class imbalance

2.6 Saving the Data

3. Full Code

4. Preprocessing Summary


1. Data Overview and Preprocessing Goals

This mineral classification system is built on 矿物数据.xlsx, which contains 1,044 samples covering 13 element features (chlorine, sodium, magnesium, etc.) and 4 mineral classes (A, B, C, D). Because the data is sensitive it cannot be shared; only the code is provided, for study purposes. The core goal of preprocessing is to produce reliable input for model training by normalizing formats, handling missing values, and balancing the classes.


2. Preprocessing Steps and Code Walkthrough

2.1 Data Loading and Initial Cleaning

First, load the data and strip out invalid records:

import pandas as pd

# Load the data, keeping only valid samples
data = pd.read_excel('矿物数据.xlsx', sheet_name='Sheet1')
# Drop the special class E (only 1 sample, no statistical value)
data = data[data['矿物类型'] != 'E']
# Convert dtypes: non-numeric symbols (e.g. "/", spaces) become NaN
for col in data.columns:
    if col not in ['序号', '矿物类型']:  # skip the ID and label columns
        # errors='coerce' turns non-numeric values into NaN
        data[col] = pd.to_numeric(data[col], errors='coerce')

This step resolves the inconsistent formatting of the raw data and ensures every feature column is numeric, laying the groundwork for everything that follows.

2.2 Label Encoding

Convert the character labels (A/B/C/D) into integers the model can work with:

# Define the label mapping
label_dict = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
# Convert the labels while keeping the DataFrame format
data['矿物类型'] = data['矿物类型'].map(label_dict)
# Separate features and labels
X = data.drop(['序号', '矿物类型'], axis=1)  # feature set
y = data['矿物类型']  # label set

After encoding, the labels fall in the range 0-3, matching the input format machine learning models expect.

2.3 Missing-Value Handling

Six strategies were designed for the missing values present in the data:

(1) Dropping samples with missing values

Suitable when the missing rate is very low (<1%); incomplete samples are simply discarded:

def drop_missing(train_data, train_label):
    # Concatenate features and labels so rows can be dropped together
    combined = pd.concat([train_data, train_label], axis=1)
    combined = combined.reset_index(drop=True)  # reset the index to avoid gaps after dropping
    cleaned = combined.dropna()  # drop rows containing missing values
    # Split features and labels again
    return cleaned.drop('矿物类型', axis=1), cleaned['矿物类型']

This method introduces no bias, but it can discard useful information when the missing rate is high.
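As a quick sanity check of the dropping behavior, here is a toy frame with invented values (not the real dataset); only fully observed rows survive:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the feature table; the values are made up for illustration
df = pd.DataFrame({
    'Na': [1.0, np.nan, 3.0, 4.0],
    'Mg': [0.5, 0.6, np.nan, 0.8],
    '矿物类型': [0, 1, 1, 2],
})
cleaned = df.dropna().reset_index(drop=True)
# Rows 1 and 2 each contain a NaN, so only 2 of the 4 rows remain
```

Note that both incomplete rows here belong to class 1, which illustrates the risk: dropping can thin out a class disproportionately.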

(2) Per-class mean filling

For numeric features, compute the mean within each mineral type and fill missing values with the group mean (reducing cross-class interference):

def mean_fill(train_data, train_label):
    combined = pd.concat([train_data, train_label], axis=1)
    combined = combined.reset_index(drop=True)
    # Fill group by group, one mineral type at a time
    filled_groups = []
    for type_id in combined['矿物类型'].unique():
        group = combined[combined['矿物类型'] == type_id]
        # Fill the group's missing values with its own per-feature means
        filled_group = group.fillna(group.mean())
        filled_groups.append(filled_group)
    # Re-combine the groups
    filled = pd.concat(filled_groups, axis=0).reset_index(drop=True)
    return filled.drop('矿物类型', axis=1), filled['矿物类型']

Suited to features with fairly even distributions; grouping avoids blending the means of different classes.
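The same per-class logic can be written more compactly with pandas groupby/transform. A minimal sketch on invented values, showing why the grouping matters:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Na': [1.0, np.nan, 10.0, np.nan],
    '矿物类型': [0, 0, 1, 1],
})
# Each NaN is replaced by the mean of its own class, not the global mean
df['Na'] = df.groupby('矿物类型')['Na'].transform(lambda s: s.fillna(s.mean()))
# Class 0's NaN becomes 1.0 and class 1's becomes 10.0; a global-mean fill
# would have used 5.5 for both, dragging each class toward the other
```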

(3) Per-class median filling

When a feature contains extreme values (e.g. a few samples with sodium far above the mean), the median is a more robust fill value:

def median_fill(train_data, train_label):
    combined = pd.concat([train_data, train_label], axis=1)
    combined = combined.reset_index(drop=True)
    filled_groups = []
    for type_id in combined['矿物类型'].unique():
        group = combined[combined['矿物类型'] == type_id]
        # The median is insensitive to extreme values
        filled_group = group.fillna(group.median())
        filled_groups.append(filled_group)
    filled = pd.concat(filled_groups, axis=0).reset_index(drop=True)
    return filled.drop('矿物类型', axis=1), filled['矿物类型']

(4) Per-class mode filling

For discrete features (e.g. element contents stored as integer codes), fill with the mode (the most frequent value):

def mode_fill(train_data, train_label):
    combined = pd.concat([train_data, train_label], axis=1)
    combined = combined.reset_index(drop=True)
    filled_groups = []
    for type_id in combined['矿物类型'].unique():
        group = combined[combined['矿物类型'] == type_id]
        # Take the mode of each column; None when a column has no mode
        fill_values = group.apply(lambda x: x.mode().iloc[0] if not x.mode().empty else None)
        filled_group = group.fillna(fill_values)
        filled_groups.append(filled_group)
    filled = pd.concat(filled_groups, axis=0).reset_index(drop=True)
    return filled.drop('矿物类型', axis=1), filled['矿物类型']

(5) Linear regression filling

Exploit linear correlations between features (e.g. the relationship between chlorine and sodium content) to predict missing values:

from sklearn.linear_model import LinearRegression

def linear_reg_fill(train_data, train_label):
    combined = pd.concat([train_data, train_label], axis=1).reset_index(drop=True)
    features = combined.drop('矿物类型', axis=1)
    # Process columns in ascending order of missing count, so the predictors
    # used for each fill are columns that are already complete
    null_counts = features.isnull().sum().sort_values()
    processed = []  # columns handled so far (complete originally or after filling)
    for col in null_counts.index:
        processed.append(col)
        if null_counts[col] == 0:
            continue  # nothing to fill in this column
        # Predictors: the previously processed (complete) columns, excluding the target
        X = features[processed].drop(col, axis=1)
        missing_rows = features[features[col].isnull()].index
        X_train = X.drop(missing_rows)              # rows where the target value is known
        y_train = features[col].drop(missing_rows)  # the known target values
        X_pred = X.loc[missing_rows]                # rows whose target value is missing
        # Fit a linear regression and fill the gaps with its predictions
        lr = LinearRegression()
        lr.fit(X_train, y_train)
        features.loc[missing_rows, col] = lr.predict(X_pred)
    return features, combined['矿物类型']

This method assumes some linear relationship between features, so it suits cases where element contents scale roughly in proportion to one another.

(6) Random forest filling

When the relationships between features are nonlinear, use a random forest model to predict the missing values:

from sklearn.ensemble import RandomForestRegressor

def rf_fill(train_data, train_label):
    combined = pd.concat([train_data, train_label], axis=1).reset_index(drop=True)
    features = combined.drop('矿物类型', axis=1)
    null_counts = features.isnull().sum().sort_values()
    processed = []  # columns handled so far (complete originally or after filling)
    for col in null_counts.index:
        processed.append(col)
        if null_counts[col] == 0:
            continue
        # Split into training samples and samples to fill, using only
        # the already-complete columns as predictors
        X = features[processed].drop(col, axis=1)
        missing_rows = features[features[col].isnull()].index
        X_train = X.drop(missing_rows)
        y_train = features[col].drop(missing_rows)
        X_pred = X.loc[missing_rows]
        # Random forest regressor (100 trees, fixed seed for reproducibility)
        rfr = RandomForestRegressor(n_estimators=100, random_state=10)
        rfr.fit(X_train, y_train)
        # Fill with the predictions
        features.loc[missing_rows, col] = rfr.predict(X_pred)
    return features, combined['矿物类型']

Random forests can capture complex feature interactions, so fill accuracy is usually higher than with linear methods, at a slightly higher computational cost.
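As an aside, scikit-learn ships a comparable model-based imputer that iterates over columns much like the hand-rolled functions above. This is only a hedged sketch of an alternative, not part of the original pipeline; note that IterativeImputer is still flagged experimental, hence the enabling import:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 — must precede the IterativeImputer import
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Toy matrix with one missing entry (invented values)
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [4.0, 8.0]])
imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=10, random_state=0))
X_filled = imputer.fit_transform(X)  # the NaN is replaced by a model prediction
```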

2.4 Feature Standardization

Element contents span very different ranges (sodium can reach the thousands while selenium is mostly 0-1), so the scale differences must be removed:

from sklearn.preprocessing import StandardScaler

def standardize_features(X_train, X_test):
    # Standardize with the training set's mean and standard deviation
    # (avoids leaking test-set information)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # fit on the training set, then transform it
    X_test_scaled = scaler.transform(X_test)  # transform the test set with the same parameters
    # Convert back to DataFrames, preserving the feature names
    return (pd.DataFrame(X_train_scaled, columns=X_train.columns),
            pd.DataFrame(X_test_scaled, columns=X_test.columns))

After standardization every feature has mean 0 and standard deviation 1, so the model is not swayed by raw magnitudes.
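A quick check on invented numbers confirms the mean-0 / std-1 claim:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns with very different scales (think Na vs Se)
X = np.array([[1000.0, 0.1],
              [2000.0, 0.3],
              [3000.0, 0.5]])
X_scaled = StandardScaler().fit_transform(X)
# Each column now has mean 0 and (population) standard deviation 1
```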

2.5 Dataset Splitting and Class Balancing

(1) Train/test split

Split 7:3, keeping the class distribution consistent:

from sklearn.model_selection import train_test_split

# stratify=y keeps the class ratios of the original data in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

(2) Handling class imbalance

Use the SMOTE algorithm to synthesize minority-class samples and balance the class counts:

from imblearn.over_sampling import SMOTE

# Oversample only the training set (the test set keeps its original distribution)
smote = SMOTE(k_neighbors=1, random_state=0)  # k_neighbors=1 to avoid introducing too much noise
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

2.6 Saving the Data

Store the preprocessed data for the upcoming model-training stage:

def save_processed_data(X_train, y_train, X_test, y_test, method):
    # Join features and labels
    train_df = pd.concat([X_train, pd.DataFrame(y_train, columns=['矿物类型'])], axis=1)
    test_df = pd.concat([X_test, pd.DataFrame(y_test, columns=['矿物类型'])], axis=1)
    # Save to Excel, tagging the file name with the preprocessing method
    train_df.to_excel(f'训练集_{method}.xlsx', index=False)
    test_df.to_excel(f'测试集_{method}.xlsx', index=False)

# Example: save the data filled by random forest and standardized
save_processed_data(X_train_balanced, y_train_balanced, X_test, y_test, 'rf_fill_standardized')


3. Full Code

数据预处理.py

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import fill_data

data = pd.read_excel('矿物数据.xlsx')
data = data[data['矿物类型'] != 'E']  # drop class E: the whole dataset contains only 1 such sample
null_num = data.isnull()
null_total = data.isnull().sum()

X_whole = data.drop('矿物类型', axis=1).drop('序号', axis=1)
y_whole = data.矿物类型

'''Map the character labels to integers'''
label_dict = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
encoded_labels = [label_dict[label] for label in y_whole]
y_whole = pd.DataFrame(encoded_labels, columns=['矿物类型'])

'''Convert string data to float; abnormal entries ('/' and spaces) become NaN'''
for column_name in X_whole.columns:
    X_whole[column_name] = pd.to_numeric(X_whole[column_name], errors='coerce')

'''Z-score standardization'''
scaler = StandardScaler()
X_whole_Z = scaler.fit_transform(X_whole)
X_whole = pd.DataFrame(X_whole_Z, columns=X_whole.columns)  # fit_transform returns a NumPy array; convert back to a DataFrame

'''Train/test split'''
X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(X_whole, y_whole, test_size=0.3, random_state=0)

'''Missing-value filling: 6 methods'''
# # 1. Drop rows with missing values
# X_train_fill, y_train_fill = fill_data.cca_train_fill(X_train_w, y_train_w)
# X_test_fill, y_test_fill = fill_data.cca_test_fill(X_test_w, y_test_w)
# os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
# fill_data.cca_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)

# # 2. Mean filling
# X_train_fill, y_train_fill = fill_data.mean_train_fill(X_train_w, y_train_w)
# X_test_fill, y_test_fill = fill_data.mean_test_fill(X_train_fill, y_train_fill, X_test_w, y_test_w)
# os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
# fill_data.mean_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)

# # 3. Median filling
# X_train_fill, y_train_fill = fill_data.median_train_fill(X_train_w, y_train_w)
# X_test_fill, y_test_fill = fill_data.median_test_fill(X_train_fill, y_train_fill, X_test_w, y_test_w)
# os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
# fill_data.median_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)

# # 4. Mode filling
# X_train_fill, y_train_fill = fill_data.mode_train_fill(X_train_w, y_train_w)
# X_test_fill, y_test_fill = fill_data.mode_test_fill(X_train_fill, y_train_fill, X_test_w, y_test_w)
# os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
# fill_data.mode_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)

# # 5. Linear regression filling
# X_train_fill, y_train_fill = fill_data.linear_train_fill(X_train_w, y_train_w)
# X_test_fill, y_test_fill = fill_data.linear_test_fill(X_train_fill, y_train_fill, X_test_w, y_test_w)
# os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
# fill_data.linear_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)

# 6. Random forest filling
X_train_fill, y_train_fill = fill_data.RandomForest_train_fill(X_train_w, y_train_w)
X_test_fill, y_test_fill = fill_data.RandomForest_test_fill(X_train_fill, y_train_fill, X_test_w, y_test_w)
os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
fill_data.RandomForest_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)

fill_data.py

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

'''Oversampling'''
def oversampling(train_data, train_label):
    oversampler = SMOTE(k_neighbors=1, random_state=0)
    os_x_train, os_y_train = oversampler.fit_resample(train_data, train_label)
    return os_x_train, os_y_train

'''1. Drop rows with missing values'''
def cca_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)  # reset the index
    df_filled = data.dropna()  # drop rows that contain missing values
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型

def cca_test_fill(test_data, test_label):
    data = pd.concat([test_data, test_label], axis=1)
    data = data.reset_index(drop=True)
    df_filled = data.dropna()
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型

def cca_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//训练数据集[删除空缺行].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//测试数据集[删除空缺行].xlsx', index=False)

'''2. Mean filling'''
def mean_train_method(data):
    fill_values = data.mean()
    return data.fillna(fill_values)

def mean_test_method(train_data, test_data):
    # Fill the test set with statistics computed on the training set
    fill_values = train_data.mean()
    return test_data.fillna(fill_values)

def mean_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)
    A = data[data['矿物类型'] == 0]
    B = data[data['矿物类型'] == 1]
    C = data[data['矿物类型'] == 2]
    D = data[data['矿物类型'] == 3]
    A = mean_train_method(A)
    B = mean_train_method(B)
    C = mean_train_method(C)
    D = mean_train_method(D)
    df_filled = pd.concat([A, B, C, D], axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型

def mean_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    A_train = train_data_all[train_data_all['矿物类型'] == 0]
    B_train = train_data_all[train_data_all['矿物类型'] == 1]
    C_train = train_data_all[train_data_all['矿物类型'] == 2]
    D_train = train_data_all[train_data_all['矿物类型'] == 3]
    A_test = test_data_all[test_data_all['矿物类型'] == 0]
    B_test = test_data_all[test_data_all['矿物类型'] == 1]
    C_test = test_data_all[test_data_all['矿物类型'] == 2]
    D_test = test_data_all[test_data_all['矿物类型'] == 3]
    A = mean_test_method(A_train, A_test)
    B = mean_test_method(B_train, B_test)
    C = mean_test_method(C_train, C_test)
    D = mean_test_method(D_train, D_test)
    df_filled = pd.concat([A, B, C, D], axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型

def mean_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//训练数据集[平均值填充].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//测试数据集[平均值填充].xlsx', index=False)

'''3. Median filling'''
def median_train_method(data):
    fill_values = data.median()
    return data.fillna(fill_values)

def median_test_method(train_data, test_data):
    # Fill the test set with statistics computed on the training set
    fill_values = train_data.median()
    return test_data.fillna(fill_values)

def median_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)
    A = data[data['矿物类型'] == 0]
    B = data[data['矿物类型'] == 1]
    C = data[data['矿物类型'] == 2]
    D = data[data['矿物类型'] == 3]
    A = median_train_method(A)
    B = median_train_method(B)
    C = median_train_method(C)
    D = median_train_method(D)
    df_filled = pd.concat([A, B, C, D], axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型

def median_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    A_train = train_data_all[train_data_all['矿物类型'] == 0]
    B_train = train_data_all[train_data_all['矿物类型'] == 1]
    C_train = train_data_all[train_data_all['矿物类型'] == 2]
    D_train = train_data_all[train_data_all['矿物类型'] == 3]
    A_test = test_data_all[test_data_all['矿物类型'] == 0]
    B_test = test_data_all[test_data_all['矿物类型'] == 1]
    C_test = test_data_all[test_data_all['矿物类型'] == 2]
    D_test = test_data_all[test_data_all['矿物类型'] == 3]
    A = median_test_method(A_train, A_test)
    B = median_test_method(B_train, B_test)
    C = median_test_method(C_train, C_test)
    D = median_test_method(D_train, D_test)
    df_filled = pd.concat([A, B, C, D], axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型

def median_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//训练数据集[中位数填充].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//测试数据集[中位数填充].xlsx', index=False)

'''4. Mode filling'''
def mode_train_method(data):
    # Take the per-column mode; None when a column has no mode
    fill_values = data.apply(lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else None)
    return data.fillna(fill_values)

def mode_test_method(train_data, test_data):
    # Fill the test set with modes computed on the training set
    fill_values = train_data.apply(lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else None)
    return test_data.fillna(fill_values)

def mode_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)
    A = data[data['矿物类型'] == 0]
    B = data[data['矿物类型'] == 1]
    C = data[data['矿物类型'] == 2]
    D = data[data['矿物类型'] == 3]
    A = mode_train_method(A)
    B = mode_train_method(B)
    C = mode_train_method(C)
    D = mode_train_method(D)
    df_filled = pd.concat([A, B, C, D], axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型

def mode_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    A_train = train_data_all[train_data_all['矿物类型'] == 0]
    B_train = train_data_all[train_data_all['矿物类型'] == 1]
    C_train = train_data_all[train_data_all['矿物类型'] == 2]
    D_train = train_data_all[train_data_all['矿物类型'] == 3]
    A_test = test_data_all[test_data_all['矿物类型'] == 0]
    B_test = test_data_all[test_data_all['矿物类型'] == 1]
    C_test = test_data_all[test_data_all['矿物类型'] == 2]
    D_test = test_data_all[test_data_all['矿物类型'] == 3]
    A = mode_test_method(A_train, A_test)
    B = mode_test_method(B_train, B_test)
    C = mode_test_method(C_train, C_test)
    D = mode_test_method(D_train, D_test)
    df_filled = pd.concat([A, B, C, D], axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型

def mode_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//训练数据集[众数填充].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//测试数据集[众数填充].xlsx', index=False)

'''5. Linear regression filling'''
def linear_train_fill(train_data, train_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    train_data_X = train_data_all.drop('矿物类型', axis=1)
    null_num = train_data_X.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []  # columns processed so far; earlier ones are already complete
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            X = train_data_X[filling_feature].drop(i, axis=1)
            y = train_data_X[i]
            row_numbers_mg_null = train_data_X[train_data_X[i].isnull()].index.tolist()
            X_train = X.drop(row_numbers_mg_null)
            y_train = y.drop(row_numbers_mg_null)
            X_test = X.iloc[row_numbers_mg_null]
            lr = LinearRegression()
            lr.fit(X_train, y_train)
            y_pred = lr.predict(X_test)
            train_data_X.loc[row_numbers_mg_null, i] = y_pred
            print(f'Filled column {i} of the training set')
    return train_data_X, train_data_all.矿物类型

def linear_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    train_data_X = train_data_all.drop('矿物类型', axis=1)
    test_data_X = test_data_all.drop('矿物类型', axis=1)
    null_num = test_data_X.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            X_train = train_data_X[filling_feature].drop(i, axis=1)
            y_train = train_data_X[i]
            X_test = test_data_X[filling_feature].drop(i, axis=1)
            row_numbers_mg_null = test_data_X[test_data_X[i].isnull()].index.tolist()
            X_test = X_test.iloc[row_numbers_mg_null]
            lr = LinearRegression()
            lr.fit(X_train, y_train)
            y_pred = lr.predict(X_test)
            test_data_X.loc[row_numbers_mg_null, i] = y_pred
            print(f'Filled column {i} of the test set')
    return test_data_X, test_data_all.矿物类型

def linear_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//训练数据集[线性回归填充].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//测试数据集[线性回归填充].xlsx', index=False)

'''6. Random forest filling'''
def RandomForest_train_fill(train_data, train_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    train_data_X = train_data_all.drop('矿物类型', axis=1)
    null_num = train_data_X.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []  # columns processed so far; earlier ones are already complete
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            X = train_data_X[filling_feature].drop(i, axis=1)
            y = train_data_X[i]
            row_numbers_mg_null = train_data_X[train_data_X[i].isnull()].index.tolist()
            X_train = X.drop(row_numbers_mg_null)
            y_train = y.drop(row_numbers_mg_null)
            X_test = X.iloc[row_numbers_mg_null]
            rfg = RandomForestRegressor(n_estimators=100, random_state=10)
            rfg.fit(X_train, y_train)
            y_pred = rfg.predict(X_test)
            train_data_X.loc[row_numbers_mg_null, i] = y_pred
            print(f'Filled column {i} of the training set')
    return train_data_X, train_data_all.矿物类型

def RandomForest_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    train_data_X = train_data_all.drop('矿物类型', axis=1)
    test_data_X = test_data_all.drop('矿物类型', axis=1)
    null_num = test_data_X.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            X_train = train_data_X[filling_feature].drop(i, axis=1)
            y_train = train_data_X[i]
            X_test = test_data_X[filling_feature].drop(i, axis=1)
            row_numbers_mg_null = test_data_X[test_data_X[i].isnull()].index.tolist()
            X_test = X_test.iloc[row_numbers_mg_null]
            rfg = RandomForestRegressor(n_estimators=100, random_state=10)
            rfg.fit(X_train, y_train)
            y_pred = rfg.predict(X_test)
            test_data_X.loc[row_numbers_mg_null, i] = y_pred
            print(f'Filled column {i} of the test set')
    return test_data_X, test_data_all.矿物类型

def RandomForest_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//训练数据集[随机森林填充].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//测试数据集[随机森林填充].xlsx', index=False)

4. Preprocessing Summary

The preprocessing stage accomplished the following:

  1. Cleaned invalid samples and abnormal symbols, unifying the data format;
  2. Handled missing values with several methods suited to different data characteristics;
  3. Standardized the features to remove scale differences;
  4. Split and balanced the dataset in preparation for model training.

The processed data now meets the models' input requirements. The next stage is model training, which will cover:

  • selecting classification algorithms such as random forest and SVM;
  • model evaluation and hyperparameter tuning;
  • comparing the classification performance of the different models.

The next post will walk through the model training process and result analysis, based on the data preprocessed here.

