Datawhale AI Summer Camp [Updating]

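About the Summer Camp
Large-Model Technology (Text) Track: Analyzing Product-Promotion Video Comments with AI
Machine Learning (Data Mining) Track: Predicting New Users with AI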

About the Summer Camp

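This AI Summer Camp is a large-scale AI learning program launched by Datawhale over the summer. It brings together resources from industry, academia, and research along with the open-source community to give learners hands-on projects and learning opportunities, strengthening their professional skills and employability. Partner organizations include iFLYTEK, Ant Group, the ModelScope community, Alibaba Cloud Tianchi, Intel, Inspur Information, and the Shanghai Academy of AI for Science. The camp is held online and is free of charge throughout.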

The first session offers three tracks: large-model technology, machine learning, and MCP Server development.

Large-Model Technology (Text) Track: Analyzing Product-Promotion Video Comments with AI

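The hands-on work in this track is based on the "User Insight Challenge Based on Product-Promotion Video Comments" event of the 2025 iFLYTEK AI Developer Competition, hosted by iFLYTEK, so that concepts are learned through practice.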

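Datawhale provides a Baseline so that beginners with no prior experience can get familiar with the environment and learn from scratch; Task 1 of the camp is simply to get the Baseline running. The code is written in Python and uses TF-IDF with a linear classifier and KMeans clustering to perform product identification, sentiment analysis, and comment clustering. It scores roughly 176 points. The camp provides a web-based programming environment built on ModelScope Notebook.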

The Baseline is analyzed below:

[1] Import the pandas library and read two CSV files into DataFrames. origin_videos_data.csv holds the video data, and origin_comments_data.csv holds the comment data.

# [1]
import pandas as pd
video_data = pd.read_csv("origin_videos_data.csv")
comments_data = pd.read_csv("origin_comments_data.csv")

[2] Randomly sample 10 rows from the video data.

# [2]
video_data.sample(10)

[3] Display the first five rows of the comment data (five is the default for head()).

# [3]
comments_data.head()

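[4] Combine the video description (video_desc) and the video tags (video_tags) into a new text feature (text). Missing values in either column are first filled with empty strings, so a missing description or tag does not turn the combined text into a missing value.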

# [4]
video_data["text"] = video_data["video_desc"].fillna("") + " " + video_data["video_tags"].fillna("")

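[5] This cell imports the libraries and tools used in the following steps:

jieba performs Chinese word segmentation.
TfidfVectorizer converts text into TF-IDF feature vectors.
SGDClassifier is a linear classifier trained with stochastic gradient descent.
LinearSVC is a linear support vector classifier.
KMeans performs cluster analysis.
make_pipeline builds a machine learning pipeline.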

# [5]
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
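
To make the TF-IDF step more concrete, here is a minimal pure-Python sketch of the weighting. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which is what the scikit-learn documentation describes as the default behavior. The function name tfidf_rows, the whitespace tokenizer, and the sample sentences are illustrative assumptions; the Baseline itself relies on TfidfVectorizer with jieba.

```python
import math
from collections import Counter


def tfidf_rows(docs, tokenizer=str.split):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [tokenizer(doc) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (assumed formula, see the lead-in).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


vocab, rows = tfidf_rows(["good price good quality", "bad quality"])
for row in rows:
    print({tok: round(w, 3) for tok, w in zip(vocab, row) if w > 0})
```

In the second toy document, "bad" receives a larger weight than "quality" because "quality" appears in both documents and is therefore less distinctive.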

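[6] This cell builds a model that predicts the product name:

TfidfVectorizer with jieba tokenization processes the text and keeps only 50 feature terms (max_features=50 retains the terms with the highest term frequency across the corpus).
SGDClassifier is trained as the classification model.
The product name is then predicted for every video, including those whose product name was originally missing. Note that this also overwrites the product names that were already labeled.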

# [6]
product_name_predictor = make_pipeline(
    TfidfVectorizer(tokenizer=jieba.lcut, max_features=50),
    SGDClassifier(),
)
# Train only on rows whose product name is labeled
product_name_predictor.fit(
    video_data[~video_data["product_name"].isnull()]["text"],
    video_data[~video_data["product_name"].isnull()]["product_name"],
)
# Predict the product name for every row
video_data["product_name"] = product_name_predictor.predict(video_data["text"])
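
As a rough illustration of what max_features=50 does: as far as the scikit-learn documentation describes it, the vocabulary is cut down to the terms with the highest term frequency across the corpus. The sketch below mimics that pruning in plain Python. top_k_vocabulary is a hypothetical helper that is not part of the Baseline, the sample texts are made up, and ties are broken alphabetically here, which may differ from the library.

```python
from collections import Counter


def top_k_vocabulary(docs, k, tokenizer=str.split):
    """Keep the k terms with the highest total count across all documents."""
    counts = Counter(tok for doc in docs for tok in tokenizer(doc))
    # Sort by descending count, then alphabetically for a deterministic order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [tok for tok, _ in ranked[:k]]


docs = ["phone case phone", "translator pen", "phone translator phone"]
print(top_k_vocabulary(docs, k=2))
```

With a small k, rarer words such as "case" and "pen" are dropped even though they may be the ones that distinguish one product from another.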

[7] Inspect the column names of the comment data.

# [7]
comments_data.columns

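[8] This cell runs predictions for several label columns of the comment data:

The four columns sentiment category, user scenario, user question, and user suggestion are predicted one after another.
For each column, a model is trained on the labeled rows and then used to predict that column for all comments.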

# [8]
for col in ['sentiment_category', 'user_scenario', 'user_question', 'user_suggestion']:
    predictor = make_pipeline(TfidfVectorizer(tokenizer=jieba.lcut), SGDClassifier())
    predictor.fit(
        comments_data[~comments_data[col].isnull()]["comment_text"],
        comments_data[~comments_data[col].isnull()][col],
    )
    comments_data[col] = predictor.predict(comments_data["comment_text"])

[9] Set the number of theme words extracted per cluster in the clustering steps that follow to 10.

# [9]
top_n_words = 10

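[10] This cell clusters the positive comments, i.e. those with sentiment category 1 or 3:

The K-means algorithm splits the comments into 2 clusters.
For each cluster, the 10 terms with the largest TF-IDF weights in the cluster center are taken as its theme words.
The cluster theme is written back to the comment data.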

# [10]
kmeans_predictor = make_pipeline(
    TfidfVectorizer(tokenizer=jieba.lcut), KMeans(n_clusters=2)
)
kmeans_predictor.fit(comments_data[comments_data["sentiment_category"].isin([1, 3])]["comment_text"])
kmeans_cluster_label = kmeans_predictor.predict(comments_data[comments_data["sentiment_category"].isin([1, 3])]["comment_text"])

kmeans_top_word = []
tfidf_vectorizer = kmeans_predictor.named_steps['tfidfvectorizer']
kmeans_model = kmeans_predictor.named_steps['kmeans']
feature_names = tfidf_vectorizer.get_feature_names_out()
cluster_centers = kmeans_model.cluster_centers_
for i in range(kmeans_model.n_clusters):
    top_feature_indices = cluster_centers[i].argsort()[::-1]
    top_word = ' '.join([feature_names[idx] for idx in top_feature_indices[:top_n_words]])
    kmeans_top_word.append(top_word)

comments_data.loc[comments_data["sentiment_category"].isin([1, 3]), "positive_cluster_theme"] = [kmeans_top_word[x] for x in kmeans_cluster_label]
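
The theme-word extraction in this cell boils down to sorting each cluster center by weight and keeping the leading terms. The following plain-Python sketch shows that step in isolation; it is meant to be equivalent in spirit to argsort()[::-1][:top_n_words]. The function name top_words_per_cluster and the toy centers and feature names are made up for illustration.

```python
def top_words_per_cluster(cluster_centers, feature_names, top_n):
    """For each cluster center, join the top_n highest-weighted terms into one string."""
    themes = []
    for center in cluster_centers:
        # Indices of the features, ordered from largest to smallest weight.
        ranked = sorted(range(len(center)), key=lambda j: center[j], reverse=True)
        themes.append(" ".join(feature_names[j] for j in ranked[:top_n]))
    return themes


feature_names = ["cheap", "fast", "broken", "slow"]
cluster_centers = [[0.9, 0.4, 0.0, 0.1], [0.0, 0.1, 0.8, 0.5]]
print(top_words_per_cluster(cluster_centers, feature_names, top_n=2))
```

Each comment is then tagged with the theme string of the cluster it was assigned to, which is what the final assignment in the cell does.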

[11] Cluster the negative comments, i.e. those with sentiment category 2 or 3, following the same procedure as for the positive comments.

# [11]
kmeans_predictor = make_pipeline(
    TfidfVectorizer(tokenizer=jieba.lcut), KMeans(n_clusters=2)
)
kmeans_predictor.fit(comments_data[comments_data["sentiment_category"].isin([2, 3])]["comment_text"])
kmeans_cluster_label = kmeans_predictor.predict(comments_data[comments_data["sentiment_category"].isin([2, 3])]["comment_text"])

kmeans_top_word = []
tfidf_vectorizer = kmeans_predictor.named_steps['tfidfvectorizer']
kmeans_model = kmeans_predictor.named_steps['kmeans']
feature_names = tfidf_vectorizer.get_feature_names_out()
cluster_centers = kmeans_model.cluster_centers_
for i in range(kmeans_model.n_clusters):
    top_feature_indices = cluster_centers[i].argsort()[::-1]
    top_word = ' '.join([feature_names[idx] for idx in top_feature_indices[:top_n_words]])
    kmeans_top_word.append(top_word)

comments_data.loc[comments_data["sentiment_category"].isin([2, 3]), "negative_cluster_theme"] = [kmeans_top_word[x] for x in kmeans_cluster_label]

[12-14] Cluster the user-scenario, user-question, and user-suggestion comments in turn, using the same procedure as the sentiment clustering above.

# [12]
kmeans_predictor = make_pipeline(
    TfidfVectorizer(tokenizer=jieba.lcut), KMeans(n_clusters=2)
)
kmeans_predictor.fit(comments_data[comments_data["user_scenario"].isin([1])]["comment_text"])
kmeans_cluster_label = kmeans_predictor.predict(comments_data[comments_data["user_scenario"].isin([1])]["comment_text"])

kmeans_top_word = []
tfidf_vectorizer = kmeans_predictor.named_steps['tfidfvectorizer']
kmeans_model = kmeans_predictor.named_steps['kmeans']
feature_names = tfidf_vectorizer.get_feature_names_out()
cluster_centers = kmeans_model.cluster_centers_
for i in range(kmeans_model.n_clusters):
    top_feature_indices = cluster_centers[i].argsort()[::-1]
    top_word = ' '.join([feature_names[idx] for idx in top_feature_indices[:top_n_words]])
    kmeans_top_word.append(top_word)

comments_data.loc[comments_data["user_scenario"].isin([1]), "scenario_cluster_theme"] = [kmeans_top_word[x] for x in kmeans_cluster_label]

# [13]
kmeans_predictor = make_pipeline(
    TfidfVectorizer(tokenizer=jieba.lcut), KMeans(n_clusters=2)
)
kmeans_predictor.fit(comments_data[comments_data["user_question"].isin([1])]["comment_text"])
kmeans_cluster_label = kmeans_predictor.predict(comments_data[comments_data["user_question"].isin([1])]["comment_text"])

kmeans_top_word = []
tfidf_vectorizer = kmeans_predictor.named_steps['tfidfvectorizer']
kmeans_model = kmeans_predictor.named_steps['kmeans']
feature_names = tfidf_vectorizer.get_feature_names_out()
cluster_centers = kmeans_model.cluster_centers_
for i in range(kmeans_model.n_clusters):
    top_feature_indices = cluster_centers[i].argsort()[::-1]
    top_word = ' '.join([feature_names[idx] for idx in top_feature_indices[:top_n_words]])
    kmeans_top_word.append(top_word)

comments_data.loc[comments_data["user_question"].isin([1]), "question_cluster_theme"] = [kmeans_top_word[x] for x in kmeans_cluster_label]

# [14]
kmeans_predictor = make_pipeline(
    TfidfVectorizer(tokenizer=jieba.lcut), KMeans(n_clusters=2)
)
kmeans_predictor.fit(comments_data[comments_data["user_suggestion"].isin([1])]["comment_text"])
kmeans_cluster_label = kmeans_predictor.predict(comments_data[comments_data["user_suggestion"].isin([1])]["comment_text"])

kmeans_top_word = []
tfidf_vectorizer = kmeans_predictor.named_steps['tfidfvectorizer']
kmeans_model = kmeans_predictor.named_steps['kmeans']
feature_names = tfidf_vectorizer.get_feature_names_out()
cluster_centers = kmeans_model.cluster_centers_
for i in range(kmeans_model.n_clusters):
    top_feature_indices = cluster_centers[i].argsort()[::-1]
    top_word = ' '.join([feature_names[idx] for idx in top_feature_indices[:top_n_words]])
    kmeans_top_word.append(top_word)

comments_data.loc[comments_data["user_suggestion"].isin([1]), "suggestion_cluster_theme"] = [kmeans_top_word[x] for x in kmeans_cluster_label]

[15] Use a shell command to create a folder named submit for the result files.

# [15]
!mkdir submit

[16] Save the processed video data and comment data as CSV files in the submit folder created above.

# [16]
video_data[["video_id", "product_name"]].to_csv("submit/submit_videos.csv", index=None)
comments_data[['video_id', 'comment_id', 'sentiment_category','user_scenario', 'user_question', 'user_suggestion','positive_cluster_theme', 'negative_cluster_theme','scenario_cluster_theme', 'question_cluster_theme','suggestion_cluster_theme']].to_csv("submit/submit_comments.csv", index=None)

[17] Use a shell command to compress the submit folder into submit.zip for easy submission.

# [17]
!zip -r submit.zip submit/

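Some parts of the Baseline code have shortcomings and could be improved.

[1] There is no check that the file paths are correct or that the files have the expected format.
[5] LinearSVC is imported but never used, so the import is redundant.
[6] max_features=50 may leave too few features and lower prediction accuracy.
[8] No hyperparameter tuning is performed, which may hurt prediction quality.
[10] n_clusters=2 is hard-coded and may not reflect the true cluster structure of the data.
[11] Category 3 is included in both the positive and the negative clustering, which may make the clustering results less accurate.
[12-14] These cells also use a fixed n_clusters=2, which may not suit the characteristics of the different kinds of data.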
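
For the first point, a small pre-flight check could fail early with a clear message. The sketch below uses only the standard library; check_csv is a hypothetical helper, and the required column names are taken from the columns the Baseline reads (video_desc, video_tags). The "utf-8-sig" encoding is an assumption chosen so that files with or without a byte-order mark are both read correctly.

```python
import csv
import tempfile
from pathlib import Path


def check_csv(path, required_columns):
    """Raise a descriptive error if the file is missing or lacks required columns."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Input file not found: {path}")
    with path.open(newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), [])
    missing = [col for col in required_columns if col not in header]
    if missing:
        raise ValueError(f"{path.name} is missing columns: {missing}")
    return header


# Demonstration on a throwaway file with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "origin_videos_data.csv"
    sample.write_text("video_id,video_desc,video_tags,product_name\n1,demo,tag,\n",
                      encoding="utf-8")
    print(check_csv(sample, ["video_desc", "video_tags"]))
```

Calling such a check before pd.read_csv turns a confusing KeyError deep in the notebook into an immediate, readable error.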

Machine Learning (Data Mining) Track: Predicting New Users with AI

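The Baseline is analyzed below:

[1] This cell installs the LightGBM library with pip. LightGBM is an efficient gradient-boosted decision tree framework commonly used for classification, regression, and similar machine learning tasks. The later cells use it for model training, so it has to be installed first.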

# [1]
!pip install lightgbm

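[2] This cell imports the core libraries and tools needed for the data processing and model training that follow:

pandas and numpy handle data loading, cleaning, and numerical computation;
json is available for handling data that may be in JSON format;
lightgbm is the gradient boosting library used to build the prediction model;
sklearn.model_selection.StratifiedKFold provides stratified cross-validation, keeping the class distribution consistent across folds;
sklearn.metrics.f1_score is used for model evaluation (computing the F1 score);
sklearn.preprocessing.LabelEncoder encodes categorical features;
finally, warnings.filterwarnings('ignore') suppresses warnings during execution so they do not clutter the output.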

# [2]
import pandas as pd
import numpy as np
import json
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')
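
To illustrate what "stratified" means for the cross-validation used later, here is a simplified pure-Python sketch: indices are grouped by label and dealt round-robin into folds, so every fold ends up with a similar label mix. This is only an illustration of the idea under stated assumptions (no shuffling, toy labels); stratified_folds is a hypothetical helper, and scikit-learn's StratifiedKFold uses its own algorithm.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits):
    """Assign each sample index to a fold so every fold gets a similar label mix."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_label.values():
        # Deal the indices of one class across the folds in turn.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


labels = [0] * 8 + [1] * 4  # made-up, imbalanced binary labels
for fold in stratified_folds(labels, n_splits=4):
    print(sorted(fold), Counter(labels[i] for i in fold))
```

Every fold here contains two samples of class 0 and one of class 1, mirroring the 2:1 ratio of the full label list.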

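[3] This cell loads the data and engineers the time features:

Load the training set (train.csv) and the test set (testA_data.csv) into DataFrames, and create the base DataFrame for the submission (submit);
Concatenate the training and test sets into full_df, for any global feature processing that may follow;
Extract time features for all three DataFrames (train_df, test_df, full_df):
- convert the raw millisecond timestamp (common_ts) to datetime format (ts);
- extract the day of the month (day), the day of the week (dayofweek), and the hour (hour) from ts as new features;
- drop the temporary ts column to avoid redundant data.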

# [3]
%%time
# 1. Load the data
train_df = pd.read_csv('./train.csv')
test_df = pd.read_csv('./testA_data.csv')
submit = test_df[['did']]
full_df = pd.concat([train_df, test_df], axis=0)

# 2. Time feature engineering
for df in [train_df, test_df, full_df]:
    # Convert the millisecond timestamp to datetime
    df['ts'] = pd.to_datetime(df['common_ts'], unit='ms')
    # Extract the time features
    df['day'] = df['ts'].dt.day
    df['dayofweek'] = df['ts'].dt.dayofweek
    df['hour'] = df['ts'].dt.hour
    # Drop the temporary datetime column
    df.drop(['ts'], axis=1, inplace=True)
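
The conversion in this cell can be checked by hand on a single value. The standard-library sketch below derives the same three features from one millisecond timestamp, assuming the timestamp is interpreted as UTC (pd.to_datetime with unit='ms' counts milliseconds from the Unix epoch) and that Monday is day 0, as with pandas' dayofweek. time_features is a hypothetical helper used only for illustration.

```python
from datetime import datetime, timezone


def time_features(ts_ms):
    """Derive day of month, day of week (Monday=0), and hour from a millisecond timestamp."""
    moment = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {"day": moment.day, "dayofweek": moment.weekday(), "hour": moment.hour}


print(time_features(0))                  # the Unix epoch, 1970-01-01 00:00 UTC
print(time_features(1_700_000_000_000))  # an arbitrary later timestamp
```

If the logged events actually reflect a local time zone such as UTC+8, the hour feature would be shifted accordingly, which may be worth checking before interpreting hourly patterns.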

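[4] This cell analyzes how many users appear in both the training set and the test set:

Extract the unique user identifiers (did) from the training set (train_df) and the test set (test_df) and convert them to sets;
Compute the intersection of the two sets (overlap_dids), i.e. the users present in both;
Count the overlapping users and compute their share of all training users and of all test users;
Print the results to assess how consistent the user distribution is between the two sets, as a reference when analyzing the model's ability to generalize.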

# [4]
%%time
############################### Simple analysis
# Unique did values in train and test
train_dids = set(train_df['did'].unique())
test_dids = set(test_df['did'].unique())

# Intersection
overlap_dids = train_dids & test_dids

# Counts
num_overlap = len(overlap_dids)
num_train = len(train_dids)
num_test = len(test_dids)

# Ratios
ratio_in_train = num_overlap / num_train if num_train > 0 else 0
ratio_in_test = num_overlap / num_test if num_test > 0 else 0

# Print the results
print(f"Number of overlapping did values: {num_overlap}")
print(f"Share of train: {ratio_in_train:.4f} ({num_overlap}/{num_train})")
print(f"Share of test: {ratio_in_test:.4f} ({num_overlap}/{num_test})")

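[5] This cell label-encodes the categorical features:

Define the list of categorical features to encode (cat_features), including device brand, network type, region, operating system, and others;
Initialize a dictionary (label_encoders) that stores the encoder for each feature;
For each categorical feature:
- use LabelEncoder to map the category values to integers starting at 0;
- fit the encoder on the values of the training and test sets combined, so that the encoding is consistent across both datasets;
- replace the original feature with its encoded values and keep the encoder for later use.
  The encoding turns non-numeric categorical features into a numeric form the model can process.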

# [5]
%%time
# Features to encode
cat_features = [
    'device_brand', 'ntt', 'operator', 'common_country',
    'common_province', 'common_city', 'appver', 'channel',
    'os_type', 'udmap'
]
# Dictionary of fitted encoders
label_encoders = {}

for feature in cat_features:
    # Create an encoder that maps categories to integers 0..N-1
    le = LabelEncoder()
    # Combine all categories from the training and test sets
    all_values = pd.concat([train_df[feature], test_df[feature]]).astype(str)
    # Fit the encoder on every value that can occur
    le.fit(all_values)
    # Keep the encoder
    label_encoders[feature] = le
    # Apply the encoding
    train_df[feature] = le.transform(train_df[feature].astype(str))
    test_df[feature] = le.transform(test_df[feature].astype(str))
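
The reason for fitting on train and test together can be seen in a small pure-Python sketch. It assumes, as LabelEncoder does, that codes are assigned in sorted order of the distinct values. fit_label_mapping is a hypothetical helper and the brand names are made up.

```python
def fit_label_mapping(*columns):
    """Map every distinct value (as str) across the given columns to 0..N-1 in sorted order."""
    values = sorted({str(v) for column in columns for v in column})
    return {value: code for code, value in enumerate(values)}


train_brand = ["huawei", "apple", "xiaomi"]
test_brand = ["apple", "oppo"]

mapping = fit_label_mapping(train_brand, test_brand)
print(mapping)
print([mapping[str(v)] for v in test_brand])
```

Had the mapping been built from train_brand alone, "oppo" would have no code and encoding the test set would fail, which is the situation the Baseline avoids by concatenating both sets before fitting.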

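[6] This cell prepares the input data for model training:

Define the list of features the model uses (features), covering the raw features (such as device and region information) and the time features (such as hour and day of the week);
Take the feature columns of the training set (train_df) as the model input (X_train) and the label column (is_new_did, the target to predict) as y_train;
Take the same feature columns from the test set (test_df) as X_test, ready for prediction later.
This step completes the selection and split of the model inputs.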

# [6]
%%time
# Raw features + time features (no target-encoding or aggregate features are built in this Baseline)
features = [
    # Raw features
    'mid', 'eid', 'device_brand', 'ntt', 'operator', 'common_country',
    'common_province', 'common_city', 'appver', 'channel', 'os_type', 'udmap',
    # Time features
    'hour', 'dayofweek', 'day', 'common_ts'
]

# Prepare the training and test data
X_train = train_df[features]
y_train = train_df['is_new_did']
X_test = test_df[features]

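[7] This cell trains the LightGBM model with cross-validation:

Define the find_optimal_threshold function, which searches the candidate thresholds 0.1 to 0.4 (in steps of 0.05) for the one that maximizes the F1 score (a binary classification metric);
Configure the LightGBM parameters (params), including the objective, tree depth, and learning rate, and set a time-based random seed, so results vary from run to run even though the fold split itself uses a fixed random_state=42;
Train the model with five-fold stratified cross-validation (StratifiedKFold):
- in each fold, split into training and validation data, train the model, and use early stopping (early_stopping) to guard against overfitting;
- predict probabilities on the validation data, search for the best threshold, compute the F1 score, and store each fold's model and results;
- predict on the test set and accumulate each fold's prediction divided by the number of folds, so the final test probability is the average over folds.
  This step completes model training, validation, and the preliminary test-set prediction.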

# [7]
%%time
# 6. F1 threshold optimization
def find_optimal_threshold(y_true, y_pred_proba):
    """Find the threshold that maximizes the F1 score."""
    best_threshold = 0.5
    best_f1 = 0
    for threshold in [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4]:
        y_pred = (y_pred_proba >= threshold).astype(int)
        f1 = f1_score(y_true, y_pred)
        if f1 > best_f1:
            best_f1 = f1
            best_threshold = threshold
    return best_threshold, best_f1

# 7. Model training and cross-validation
import time
# Generate the random seed dynamically from the current time
seed = int(time.time()) % 1000000  # take the timestamp modulo 1,000,000 to keep it small
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'max_depth': 12,
    'num_leaves': 63,
    'learning_rate': 0.1,
    'feature_fraction': 0.7,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'min_child_samples': 10,
    'verbose': -1,
    'n_jobs': 8,
    'seed': seed  # use the dynamically generated seed
}

# Five-fold cross-validation, using the same split rule as for five-fold feature construction so the splits stay consistent
n_folds = 5
kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
test_preds = np.zeros(len(X_test))
fold_thresholds = []
fold_f1_scores = []
models = []
oof_preds = np.zeros(len(X_train))
oof_probas = np.zeros(len(X_train))

print("\nStarting model training...")
for fold, (train_idx, val_idx) in enumerate(kf.split(X_train, y_train)):
    print(f"\n======= Fold {fold+1}/{n_folds} =======")
    X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
    y_tr, y_val = y_train.iloc[train_idx], y_train.iloc[val_idx]

    # Build the datasets (categorical features are not declared explicitly here)
    train_set = lgb.Dataset(X_tr, label=y_tr)
    val_set = lgb.Dataset(X_val, label=y_val)

    # Train the model
    model = lgb.train(
        params,
        train_set,
        num_boost_round=1000,
        valid_sets=[train_set, val_set],
        callbacks=[
            lgb.early_stopping(stopping_rounds=50, verbose=False),
            lgb.log_evaluation(period=100)
        ]
    )
    models.append(model)

    # Predict on the validation data
    val_pred_proba = model.predict(X_val)
    oof_probas[val_idx] = val_pred_proba

    # Optimize the threshold
    best_threshold, best_f1 = find_optimal_threshold(y_val, val_pred_proba)
    fold_thresholds.append(best_threshold)

    # Compute F1 with the optimized threshold
    val_pred_labels = (val_pred_proba >= best_threshold).astype(int)
    fold_f1 = f1_score(y_val, val_pred_labels)
    fold_f1_scores.append(fold_f1)
    oof_preds[val_idx] = val_pred_labels
    print(f"Fold {fold+1} Optimal Threshold: {best_threshold:.4f}")
    print(f"Fold {fold+1} F1 Score: {fold_f1:.5f}")

    # Predict on the test set and average over folds
    test_preds += model.predict(X_test) / n_folds
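
The threshold search can be followed by hand on a few values. The sketch below reimplements binary F1 and the grid search in plain Python so that it runs without scikit-learn. f1_binary and best_threshold are hypothetical names, and the labels and probabilities are made up. Like the Baseline function, it keeps the first threshold that reaches the best F1.

```python
def f1_binary(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for the positive class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, probas, candidates):
    """Return (threshold, f1) for the first candidate with the highest F1."""
    scored = [(f1_binary(y_true, [int(p >= t) for p in probas]), t) for t in candidates]
    best_f1, best_t = max(scored, key=lambda pair: pair[0])
    return best_t, best_f1


y_true = [1, 0, 1, 0, 0, 1]
probas = [0.45, 0.12, 0.22, 0.28, 0.05, 0.38]
candidates = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4]
print(best_threshold(y_true, probas, candidates))
```

On this toy data the threshold 0.15 gives F1 = 6/7: all three positives are caught and only one negative (0.28) is misclassified. Note that the candidate grid stops at 0.4, so the default 0.5 is only returned when every candidate yields an F1 of 0.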

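[8] This cell evaluates the model and produces the predictions:

Evaluate overall cross-validation performance: average the best thresholds of the folds, use that threshold to turn the out-of-fold (OOF) probabilities of the training set into labels, and compute the final OOF F1 score;
Print the cross-validation results, including the average threshold, the F1 score of each fold, the average F1 score, and the OOF F1 score, to assess model stability;
Generate the test-set predictions: convert the test probabilities to labels (is_new_did) with the average threshold and save them to the submission file (submit.csv);
Analyze feature importance: take the gain-based feature importance of the first fold's model (models[0]) and print the top 10 features as a reference for feature optimization.
This step covers the whole flow from model evaluation to generating the submission file.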

# [8]
# 8. Overall evaluation
# Use the average threshold from cross-validation
avg_threshold = np.mean(fold_thresholds)
final_oof_preds = (oof_probas >= avg_threshold).astype(int)
final_f1 = f1_score(y_train, final_oof_preds)

print("\n===== Final Results =====")
print(f"Average Optimal Threshold: {avg_threshold:.4f}")
print(f"Fold F1 Scores: {[f'{s:.5f}' for s in fold_f1_scores]}")
print(f"Average Fold F1: {np.mean(fold_f1_scores):.5f}")
print(f"OOF F1 Score: {final_f1:.5f}")

# 9. Test-set prediction and submission file
# Predict with the average threshold
test_pred_labels = (test_preds >= avg_threshold).astype(int)
submit['is_new_did'] = test_pred_labels

# Save the submission file
submit[['is_new_did']].to_csv('submit.csv', index=False)
print("\nSubmission file saved: submit.csv")
print(f"Predicted new user ratio: {test_pred_labels.mean():.4f}")
print(f"Test set size: {len(test_pred_labels)}")

# 10. Feature importance analysis
feature_importance = pd.DataFrame({
    'Feature': features,
    'Importance': models[0].feature_importance(importance_type='gain')
}).sort_values('Importance', ascending=False)

print("\nTop 10 Features:")
print(feature_importance.head(10))
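
The way the fold results are combined into final labels can be traced on a tiny example. The sketch below averages per-fold test probabilities and per-fold thresholds and then applies the averaged threshold, which is what the accumulation of test_preds and the use of avg_threshold amount to. ensemble_labels is a hypothetical helper and all numbers are made up.

```python
def ensemble_labels(fold_probas, fold_thresholds):
    """Average per-fold probabilities and thresholds, then threshold the averages."""
    n_folds = len(fold_probas)
    # Column-wise mean: one averaged probability per test sample.
    avg_proba = [sum(column) / n_folds for column in zip(*fold_probas)]
    avg_threshold = sum(fold_thresholds) / len(fold_thresholds)
    labels = [int(p >= avg_threshold) for p in avg_proba]
    return avg_proba, avg_threshold, labels


fold_probas = [
    [0.10, 0.40, 0.30],  # fold 1 predictions for three test samples
    [0.20, 0.50, 0.10],  # fold 2
    [0.00, 0.60, 0.20],  # fold 3
]
fold_thresholds = [0.25, 0.30, 0.35]

avg_proba, avg_threshold, labels = ensemble_labels(fold_probas, fold_thresholds)
print([round(p, 3) for p in avg_proba], round(avg_threshold, 3), labels)
```

Only the second sample, with an averaged probability of about 0.5, clears the averaged threshold of about 0.3 and is labeled as a new user.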