Python 爬虫教程 | 豆瓣 TOP250 数据抓取与分析实战

一、项目背景与数据价值

豆瓣TOP250是影视行业的重要榜单，具有以下数据价值：

评分与评价人数：衡量电影市场热度；
导演与演员信息：分析人才价值与影视趋势；
类型 / 地区 / 年份：洞察电影类型与年代变迁；
经典台词：可用于 NLP 情感分析或推荐系统训练数据。

二、技术栈与环境配置

安装核心 Python 库：

pip install requests beautifulsoup4 pandas numpy matplotlib seaborn fake_useragent

版本建议：

库	应用场景	版本要求
requests	网络请求	≥ 2.25
BeautifulSoup	HTML 解析	≥ 4.9
pandas	数据处理	≥ 1.2
fake_useragent	伪装 User‑Agent	≥ 1.1
time、random	请求延时	Python 标准库

三、网页结构深入解析

目标页面：https://movie.douban.com/top250。通过浏览器开发者工具（F12）分析页面结构存在以下关键点：

电影单元数据 位于 <div class="item"> 内，包含标题、导演、评分等；
翻页机制 使用参数 start=0, 25, … 225 控制页面切换。

四、反爬策略与突破技巧

豆瓣可能设置以下反爬措施：

User‑Agent 检测：使用 fake_useragent 随机生成请求头；
频率限制：结合 time.sleep() 与 random.uniform() 实现随机延迟；
IP 封锁防范：配置代理 IP 池，实现请求匿名化。

示例代码：

from fake_useragent import UserAgent
import random, timeheaders = {'User-Agent': UserAgent().random
}
time.sleep(random.uniform(1.5, 3.5))

五、完整代码实现

以下为完整爬虫流程，包括抓取、解析、异常处理和数据存储：

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
import re
from fake_useragent import UserAgentdef scrape_douban_top250():base_url = "https://movie.douban.com/top250?start={}"movies = []ua = UserAgent()for page in range(0, 250, 25):url = base_url.format(page)headers = {'User-Agent': ua.random,'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8','Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8','Connection': 'keep-alive','Referer': 'https://movie.douban.com/','Cookie': 'bid=123456789;'}try:print(f"正在抓取第{page // 25 + 1}页数据...")response = requests.get(url, headers=headers, timeout=15)response.encoding = 'utf-8'if response.status_code != 200:print(f"请求失败，状态码: {response.status_code}")continuesoup = BeautifulSoup(response.text, 'html.parser')grid_view = soup.find('ol', class_='grid_view')if not grid_view:print("未找到电影列表，可能页面结构有变化")continueitems = grid_view.find_all('li')for item in items:# 解析内容（排名、标题、导演、年份、地区、类型、评分、评价人数、经典台词）rank = item.find('em').text if item.find('em') else "N/A"title = item.find('span', class_='title').text.strip() if item.find('span', class_='title') else "未知标题"other_title = item.find('span', class_='other').text.strip() if item.find('span', class_='other') else ""bd = item.find('div', class_='bd')info_text = bd.find('p').get_text(strip=True).replace('\xa0', ' ') if bd and bd.find('p') else ""director = info_text.split("导演:")[1].split("主演")[0].strip().split('/')[0] if "导演:" in info_text else ""year = re.search(r'\d{4}', info_text).group(0) if re.search(r'\d{4}', info_text) else ""parts = info_text.split("/")region = parts[-2].strip() if len(parts) >= 2 else ""genre = parts[-1].strip() if len(parts) >= 1 else ""rating = item.find('span', class_='rating_num').text if item.find('span', class_='rating_num') else "0.0"star_div = item.find('div', class_='star')spans = star_div.find_all('span') if star_div else []votes = spans[3].text.replace('人评价', '') if len(spans) >= 4 else "0"quote = item.find('span', class_='inq').text if item.find('span', class_='inq') else ""movies.append({'排名': rank,'标题': title,'其他标题': other_title,'导演': director,'年份': year,'地区': region,'类型': genre,'评分': float(rating) if rating.replace('.', '', 1).isdigit() else 0.0,'评价人数': int(votes.replace(',', '')) if votes.isdigit() else 0,'经典台词': quote})print(f"成功抓取第{page // 25 + 1}页数据，共{len(movies)}条记录")time.sleep(random.uniform(3, 7))except Exception as e:print(f"第{page // 25 + 1}页抓取失败: {str(e)}")return pd.DataFrame(movies)df = scrape_douban_top250()
if not df.empty:df['年份'] = pd.to_numeric(df['年份'], errors='coerce').fillna(0).astype(int)df.to_csv('douban_top250.csv', index=False, encoding='utf-8-sig')print(f"数据抓取完成，共{len(df)}条记录，已保存至 douban_top250.csv")print("\n前 5 条数据预览：")print(df.head())
else:print("未抓取到任何数据，请检查网络或反爬策略是否生效")

代码亮点：

随机延时防止被封；
异常捕获增强稳定性；
数据清洗：评分转为 float，票数转为 int；
通过结构化字典构建 DataFrame 。

六、数据清洗与转换建议

建议在 DataFrame 构建后进一步处理：

df['year'] = df['year'].apply(lambda x: re.search(r'\d{4}', x).group() if re.search(r'\d{4}', x) else None)
df['director'] = df['director'].str.split('/').str[0]
df['quote'] = df['quote'].fillna('无')

这些能优化数据可用性，便于后续分析。

七、数据可视化分析范例

结合 Matplotlib 和 Seaborn，展示深度分析流程：

import matplotlib.pyplot as plt
import seaborn as snsplt.figure(figsize=(15,10))# 1. 评分分布直方图
plt.subplot(2,2,1)
sns.histplot(df['评分'], bins=20, kde=True)
plt.title('豆瓣TOP250评分分布')# 2. 年代趋势（TOP10）
plt.subplot(2,2,2)
sns.countplot(x='年份', data=df, order=df['年份'].value_counts().index[:10])
plt.xticks(rotation=45)
plt.title('电影上映年代分布TOP10')# 3. 导演上榜数TOP10
plt.subplot(2,2,3)
top_directors = df['导演'].value_counts().head(10)
sns.barplot(x=top_directors.values, y=top_directors.index)
plt.title('导演上榜作品数量TOP10')# 4. 评分与评价人数关系（对数坐标）
plt.subplot(2,2,4)
sns.scatterplot(x='评价人数', y='评分', data=df, hue='评分', palette='viridis')
plt.xscale('log')
plt.title('评分 vs 投票数（对数刻度）')plt.tight_layout()
plt.savefig('douban_analysis.png', dpi=300)

数据洞察：

评分偏左集中，均值约为 8.9，最低约 8.3；
1994–2004 年份涌现大量经典影片；
宫崎骏电影上榜数量最多，全胜导演之一；
投票数超过 150 万的影片评分均超过 9.0 。

八、常见问题与解决方案一览

问题现象	解决方案
返回 403 内	升级 UA 库、更换代理 IP
数据部分缺失	添加备用 CSS 选择路径
爬取速度过慢	缩短延时为 1–2 秒
pandas 中文乱码	使用 `encoding='utf-8-sig'` 保存
请求超时	增加重试机制或设置更高 `timeout`

九、项目拓展建议

推荐系统构建：使用 TF-IDF 分析台词文本，构建内容推荐模型；
自动定时更新：部署定时任务（如 crontab 或 Airflow）实现数据自动抓取与增量更新。

结语

本文全面涵盖了从豆瓣 TOP 250 数据 抓取 → 清洗 → 可视化 → 洞察 的完整流程，核心技巧包括：

动态 UA 和代理 IP 破解反爬；
稳定的异常处理机制；
深度数据清洗与类型转换；
多维数据分析洞察趋势。

本文来自互联网用户投稿，该文观点仅代表作者本人，不代表本站立场。本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。
如若转载，请注明出处：http://www.pswp.cn/pingmian/94223.shtml
繁体地址，请注明出处：http://hk.pswp.cn/pingmian/94223.shtml

如若内容造成侵权/违法违规/事实不符，请联系多彩编程网进行投诉反馈email:809451989@qq.com，一经查实，立即删除！