1. Technical Challenges of Scraping Zhihu Comments
Zhihu's comment data is loaded dynamically via Ajax, which means `requests` + `BeautifulSoup` alone cannot retrieve the full data. Zhihu also deploys several anti-scraping mechanisms, including:
- Request header validation (e.g. `User-Agent`, `Referer`)
- Cookie/Session checks (anonymous users only get partial data)
- Rate limiting (frequent requests can get an IP banned)
Therefore we need to:
- Simulate browser requests, carrying Headers and Cookies (see the sketch below)
- Parse the dynamic API endpoints instead of the static HTML
- Speed up crawling with multithreading or async IO
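As a minimal sketch of the first point, this is what a browser-like request carrying Headers and Cookies looks like (the cookie keys shown are placeholders for illustration, not verified values; the URL reuses the placeholder question link from the next section):

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://www.zhihu.com/",
    # Placeholder: copy the real Cookie string from your logged-in browser session
    "Cookie": "z_c0=...; d_c0=...",
}

resp = requests.get("https://www.zhihu.com/question/xxxxxx", headers=headers)
print(resp.status_code)
```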
2. Analyzing the Zhihu Comment API
(1) Locating the comment API
Open any Zhihu question (e.g. `https://www.zhihu.com/question/xxxxxx`), press `F12` to open the developer tools, switch to the `Network` tab, and filter for `XHR` requests.
(2) Parsing the comment data structure
The comments are nested under the `data` field, with a structure like this:

```json
{
  "data": [
    {
      "content": "comment text",
      "author": { "name": "username" },
      "created_time": 1620000000
    }
  ],
  "paging": { "is_end": false, "next": "URL of the next page" }
}
```
To collect all comments we follow `paging.next` page by page until `is_end` becomes `true`, as sketched below.
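A minimal pagination sketch under the structure above (field names taken from the sample JSON; error handling omitted):

```python
import requests

def fetch_all_comments(start_url, headers):
    """Follow paging.next until paging.is_end is true, collecting every comment."""
    comments = []
    url = start_url
    while url:
        data = requests.get(url, headers=headers).json()
        for item in data["data"]:
            comments.append(item["content"])
        paging = data.get("paging", {})
        # Stop when the API says we are on the last page
        url = None if paging.get("is_end") else paging.get("next")
    return comments
```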
3. Three Ways to Scrape Zhihu Comments in Python
(1) Single-threaded scraper (baseline)
Use the `requests` library to call the API directly, page by page:

```python
import requests
import time

def fetch_comments(question_id, max_pages=5):
    base_url = f"https://www.zhihu.com/api/v4/questions/{question_id}/answers"
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Cookie": "your Cookie here"  # copy it from the browser after logging in
    }
    comments = []
    for page in range(max_pages):
        url = f"{base_url}?offset={page * 10}&limit=10"
        resp = requests.get(url, headers=headers).json()
        for answer in resp["data"]:
            comments.append(answer["content"])
        time.sleep(1)  # avoid requesting too fast
    return comments

start_time = time.time()
comments = fetch_comments("12345678")  # replace with a real question ID
print(f"Single-threaded scrape finished in {time.time() - start_time:.2f}s")
```
Drawback: pages are requested one after another, so it is slow (at roughly 1 second per page, 10 pages take about 10 seconds).
(2) Multi-threaded scraper (ThreadPoolExecutor)
Use `concurrent.futures` to issue requests concurrently from a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(page, question_id):
    url = f"https://www.zhihu.com/api/v4/questions/{question_id}/answers?offset={page * 10}&limit=10"
    headers = {"User-Agent": "Mozilla/5.0"}
    resp = requests.get(url, headers=headers).json()
    return [answer["content"] for answer in resp["data"]]

def fetch_comments_multi(question_id, max_pages=5, threads=4):
    with ThreadPoolExecutor(max_workers=threads) as executor:
        futures = [executor.submit(fetch_page, page, question_id)
                   for page in range(max_pages)]
        comments = []
        for future in futures:
            comments.extend(future.result())
    return comments

start_time = time.time()
comments = fetch_comments_multi("12345678", threads=4)
print(f"Multi-threaded scrape finished in {time.time() - start_time:.2f}s")
```
What this buys us:
- The thread pool caps concurrency (reducing the risk of a ban)
- Roughly 3-4x faster than single-threaded (4 threads fetch 10 pages in about 2-3 seconds); a variant using `as_completed` is sketched below
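As a variation, `concurrent.futures.as_completed` processes each page as soon as its request returns rather than in submission order, which helps when some pages are slow (a sketch reusing `fetch_page` from above; note that comment order is no longer guaranteed):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_comments_as_completed(question_id, max_pages=5, threads=4):
    comments = []
    with ThreadPoolExecutor(max_workers=threads) as executor:
        futures = [executor.submit(fetch_page, page, question_id)
                   for page in range(max_pages)]
        for future in as_completed(futures):
            # Extend with each page's comments as soon as it finishes
            comments.extend(future.result())
    return comments
```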
(3) Asynchronous scraper (asyncio + aiohttp)
Use `aiohttp` to issue asynchronous HTTP requests and push throughput further. Note that `aiohttp` takes the proxy per request (`session.get(..., proxy=..., proxy_auth=...)`); `TCPConnector` does not accept proxy arguments:

```python
import aiohttp
import asyncio
import time

# Proxy configuration
proxyHost = "www.16yun.cn"
proxyPort = "5445"
proxyUser = "16QMSOML"
proxyPass = "280651"

PROXY_URL = f"http://{proxyHost}:{proxyPort}"
PROXY_AUTH = aiohttp.BasicAuth(proxyUser, proxyPass)

async def fetch_page_async(session, page, question_id):
    url = f"https://www.zhihu.com/api/v4/questions/{question_id}/answers?offset={page * 10}&limit=10"
    headers = {"User-Agent": "Mozilla/5.0"}
    # The proxy is passed per request in aiohttp
    async with session.get(url, headers=headers,
                           proxy=PROXY_URL, proxy_auth=PROXY_AUTH) as resp:
        data = await resp.json()
        return [answer["content"] for answer in data["data"]]

async def fetch_comments_async(question_id, max_pages=5):
    connector = aiohttp.TCPConnector(
        limit=20,               # cap on concurrent connections
        force_close=True,
        enable_cleanup_closed=True,
    )
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [fetch_page_async(session, page, question_id)
                 for page in range(max_pages)]
        pages = await asyncio.gather(*tasks)
        return [comment for page in pages for comment in page]

if __name__ == "__main__":
    start_time = time.time()
    comments = asyncio.run(fetch_comments_async("12345678"))  # replace with a real question ID
    print(f"Async scrape finished in {time.time() - start_time:.2f}s")
    print(f"Fetched {len(comments)} comments")
```
Advantages:
- A single event loop avoids per-thread overhead (the GIL is not a bottleneck for IO-bound work), so it often beats a thread pool
- Well suited to high-concurrency, IO-bound tasks such as scraping (see the semaphore sketch below for keeping that concurrency in check)
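Since high concurrency is exactly what trips the rate limits from section 1, a common pattern is to cap in-flight requests with `asyncio.Semaphore`; a sketch wrapping `fetch_page_async` from above (the limit of 10 is an assumption, tune it to what Zhihu tolerates):

```python
import asyncio

sem = asyncio.Semaphore(10)  # assumed cap: at most 10 requests in flight

async def fetch_page_limited(session, page, question_id):
    # Wait for a free slot before issuing the request
    async with sem:
        return await fetch_page_async(session, page, question_id)
```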
4. Performance Comparison and Optimization Tips
| Method | Time for 10 pages (s) | Best suited for |
| --- | --- | --- |
| Single-threaded | ~10 | Small jobs, simple scraping |
| Multi-threaded (4 threads) | ~2.5 | Medium scale, controlled concurrency |
| Async (asyncio) | ~1.8 | Large-scale, high-concurrency scraping |
Optimization tips
- Cap concurrency: stay below the anti-scraping threshold (10-20 concurrent requests is a reasonable range).
- Random delays: `time.sleep(random.uniform(0.5, 2))` to mimic human pacing (see the combined sketch below).
- Proxy IP pool: rotate IPs to avoid bans (e.g. `requests` + `ProxyPool`).
- Storage: write results to a database asynchronously (e.g. `MongoDB` or `MySQL`).
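A sketch combining the delay and proxy tips with `requests` (the proxy address is a placeholder; a real `ProxyPool` service would supply it dynamically):

```python
import random
import time
import requests

def polite_get(url, headers, proxy=None):
    """GET with a human-like random delay and an optional proxy."""
    time.sleep(random.uniform(0.5, 2))  # mimic human pacing
    proxies = {"http": proxy, "https": proxy} if proxy else None
    return requests.get(url, headers=headers, proxies=proxies, timeout=10)

# Usage with a placeholder proxy address:
# resp = polite_get(url, headers, proxy="http://user:pass@host:port")
```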